Tags: architecture, cloudflare, dns, redundancy, kubernetes, saas, uptime, infrastructure

Surviving the Cloudflare Outage: How a Thoughtfully Engineered Architecture Kept a SaaS Platform Online With Zero Downtime

Last week, a major Cloudflare outage swept across the internet, bringing down countless websites, SaaS platforms and online tools. Entire industries experienced cascading failures: sites unreachable, dashboards offline, uptime monitors dead, and even Cloudflare's own portal inaccessible to customers using Cloudflare DNS.

In the middle of this, a SaaS platform whose architecture I designed as a consultant experienced zero downtime.

This case study explains how — and why — deliberate architectural choices matter far more than throwing money at infrastructure.

About me — and why I was hired

Today I work as a system architect with strong DevOps and cloud infrastructure expertise, but I come from a professional background as a backend programmer. My strength has always been designing and building complex systems, where application architecture and infrastructure need to work together.

Over the past years, I've specialised in:

  • Cloud architecture
  • Redundancy engineering
  • Fault-tolerant systems
  • Scaling strategies
  • Kubernetes
  • High-availability design
  • Performance and infrastructure optimisation

Companies typically bring me in when their systems have outgrown their original design — when scaling, stability and reliability become business-critical.

This SaaS platform was one of those cases.

The SaaS: High uptime requirements with real-world budget constraints

The company operates a niche website-builder SaaS for a highly regulated sector. They serve hundreds of customer websites with clear uptime expectations.

But unlike enterprise web hosting, this market is price-sensitive. Customers are not paying thousands of euros a month, so the system infrastructure must remain:

  • Cost-efficient
  • Easy to maintain
  • Highly reliable
  • Secure
  • Redundant
  • Capable of absorbing traffic spikes

Most sites are static or semi-static public pages — no logged-in user logic — but they change frequently and must be delivered fast globally.

The original architecture — functional but fundamentally limited

When I was brought into the project, the original platform had several key limitations:

  • It was a traditional monolithic application, tightly coupled and difficult to scale horizontally
  • In-process session handling and in-memory state meant it could only run reliably on a single machine
  • The CDN in front was a legacy, commercial CDN product that had become unstable, slow and poorly supported
  • The system had no elegant way to scale out traffic or isolate workloads
  • Redundancy was minimal
  • Infrastructure costs increased rapidly as load grew

It wasn't built for a future where hundreds of customer websites were being served around the clock.

The company needed a full architectural redesign — not just infrastructure, but the application itself.

The redesign: Modern, distributed, resilient

Over the course of the project, I redesigned the entire architecture and implemented the new system end-to-end.

1. Breaking the monolith

The application was decomposed into dedicated services:

  • Website rendering
  • CMS operations
  • Static delivery layer
  • Integrations and background tasks

Each service could scale independently.

2. Migrating to Kubernetes

Deploying workloads into Kubernetes enabled:

  • Horizontal scaling
  • Rolling deployments
  • Resilience
  • Smart resource utilisation
  • Self-healing
  • Infrastructure consistency

3. Highly available database cluster

A proper HA database layer replaced the single-node model.

4. Moving to Cloudflare for edge delivery

Cloudflare was introduced for:

  • High-performance global caching
  • Traffic acceleration
  • DDoS protection
  • WAF
  • HTTP/3 and modern protocol support

Cache hit ratios quickly hit 90–98%, drastically reducing load on origin.
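
To put that range in perspective, here is a quick back-of-the-envelope calculation. The figure of 1,000,000 edge requests per day is a made-up number for illustration, not a measurement from this platform; the function simply computes how many requests still reach origin at a given hit ratio.

```python
def origin_requests(edge_requests: int, hit_ratio: float) -> int:
    """Requests that miss the edge cache and must be served by origin."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return round(edge_requests * (1.0 - hit_ratio))


# Hypothetical traffic figure, for illustration only.
DAILY_EDGE_REQUESTS = 1_000_000

for ratio in (0.90, 0.98):
    print(f"hit ratio {ratio:.0%}: {origin_requests(DAILY_EDGE_REQUESTS, ratio)} origin requests")
```

Under these assumptions, moving from a 90% to a 98% hit ratio cuts origin traffic from 100,000 to 20,000 requests, a factor of five.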

All good — but Cloudflare became a potential single point of failure.

And last week proved exactly why that matters.

The DNS problem: The hidden failure point that kills SaaS platforms

Most SaaS companies tell clients to point their domains directly to Cloudflare.

That creates an enormous risk:

If Cloudflare goes down while it also controls your DNS, you cannot change anything.

During the outage last week:

  • Many businesses' websites were offline
  • Their DNS was unreachable
  • They couldn't log in to Cloudflare
  • They couldn't add DNS records
  • They couldn't reroute traffic
  • They couldn't even switch to a backup CDN
  • They were stuck.

To avoid this trap, I designed a middle-layer DNS architecture.

Clients → Our DNS layer → Cloudflare → Origin

This gives the SaaS full control during outages.

Now, if Cloudflare has issues, we simply reroute traffic away from them.
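
One way to picture the middle layer is as a single record we control that decides where customer domains end up. The sketch below is an assumption about how such a layer could be modelled, not the platform's actual configuration: the hostnames, the CNAME-based indirection and the 60-second TTL are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsRecord:
    name: str
    rtype: str
    value: str
    ttl: int


# Hypothetical names, for illustration only.
EDGE_ALIAS = "edge.saas-platform.example."          # customers point their domains here
CLOUDFLARE_TARGET = "sites.cdn-primary.example."    # normal path through the CDN
FALLBACK_TARGET = "cache.fallback-cluster.example." # path that bypasses the CDN


def edge_record(primary_healthy: bool, ttl: int = 60) -> DnsRecord:
    """Return the record our own DNS layer should publish for the edge alias."""
    target = CLOUDFLARE_TARGET if primary_healthy else FALLBACK_TARGET
    return DnsRecord(EDGE_ALIAS, "CNAME", target, ttl)


print(edge_record(primary_healthy=True))
print(edge_record(primary_healthy=False))
```

The short TTL is the important design choice in this sketch: resolvers cache the answer for at most that long, so it bounds how quickly clients follow a switch.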

But our DNS becomes a potential single point of failure

To solve this, I implemented:

  • A primary DNS provider
  • A secondary DNS provider
  • Both authoritative.

If one fails, the other continues operating. DNS resilience solved.
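
Running two authoritative providers only helps if both serve the same records; otherwise resolvers get different answers depending on which provider they reach. How the zones are kept in sync is not covered here, so the following is just a minimal sketch of a drift check, assuming each provider's zone can be exported as a set of (name, type, value) tuples. The record values are hypothetical.

```python
from typing import Dict, Set, Tuple

Record = Tuple[str, str, str]  # (name, type, value)


def zone_drift(primary: Set[Record], secondary: Set[Record]) -> Dict[str, Set[Record]]:
    """Report records that exist at one provider but not at the other."""
    return {
        "missing_on_secondary": primary - secondary,
        "missing_on_primary": secondary - primary,
    }


# Hypothetical zone exports, for illustration only.
provider_a = {
    ("edge.saas-platform.example.", "CNAME", "sites.cdn-primary.example."),
    ("status.saas-platform.example.", "A", "192.0.2.10"),
}
provider_b = {
    ("edge.saas-platform.example.", "CNAME", "sites.cdn-primary.example."),
}

drift = zone_drift(provider_a, provider_b)
print(drift["missing_on_secondary"])
print(drift["missing_on_primary"])
```

A check like this could run after every record change, so that a failover switch is never applied at only one of the two providers.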

Multi-cloud redundancy? Not realistic for this SaaS

Yes, in theory we could add:

  • A second Kubernetes cluster in a different cloud
  • Full cross-provider data replication
  • Synchronised object storage
  • Multi-cloud deployments

But this:

  • Doubles the operational cost
  • Doubles the storage cost
  • Doubles the infrastructure complexity
  • Requires constant state syncing
  • Is unrealistic for a niche SaaS with strict budget constraints

We needed a smarter, more elegant solution.

The unconventional — but powerful — solution: A lightweight external nginx–varnish cluster

This turned out to be the single most important resilience mechanism in the entire system.

What it is

A tiny cluster running:

  • Nginx
  • Varnish (HTTP accelerator/cache)

The cluster is hosted outside the main cloud provider.

What it stores

A full, preloaded, always-updated cache of every customer website.

Whenever a customer updates content:

  • Cloudflare cache is purged
  • The origin updates
  • A "fire-and-forget" request instructs the varnish cluster to preload the updated page

This keeps varnish permanently warm with the latest version of every site.
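
The update flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: the three callables (update_origin, purge_edge, warm_fallback) are hypothetical stand-ins for the real integrations, and the ordering (origin first, then purge, then warm-up) is my assumption for the sketch. The point it shows is the "fire-and-forget" part: the warm-up runs in the background and its failure never blocks or breaks a publish.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Iterable, List

# Background pool so the warm-up never blocks the publish request.
_warm_pool = ThreadPoolExecutor(max_workers=4)


def _warm_quietly(warm: Callable[[str], None], url: str) -> None:
    """Run one warm-up call and swallow any error (fire-and-forget)."""
    try:
        warm(url)
    except Exception:
        # A failed warm-up must not affect publishing; a real system
        # would log this and retry later.
        pass


def publish_update(
    urls: Iterable[str],
    update_origin: Callable[[List[str]], None],  # hypothetical hook
    purge_edge: Callable[[List[str]], None],     # hypothetical hook
    warm_fallback: Callable[[str], None],        # hypothetical hook
) -> List[Future]:
    """Update origin, purge the CDN, then warm the fallback cache in the background."""
    url_list = list(urls)
    update_origin(url_list)
    purge_edge(url_list)
    # Submitted to the pool and not awaited by the caller.
    return [_warm_pool.submit(_warm_quietly, warm_fallback, u) for u in url_list]


# Usage with stub hooks that only record what was called.
log = []
futures = publish_update(
    ["https://customer-site.example/"],
    update_origin=lambda u: log.append(("origin", u)),
    purge_edge=lambda u: log.append(("purge", u)),
    warm_fallback=lambda url: log.append(("warm", url)),
)
for f in futures:
    f.result()  # only the demo waits; a real caller would not
print(log)
```

The futures are returned only so that tests or diagnostics can wait on them; the publishing path itself does not need to.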

Why this matters

This cluster:

  • Requires almost no compute
  • Requires no database
  • Doesn't need the CMS
  • Doesn't run business logic
  • Scales in seconds
  • Costs a fraction of full multi-cloud redundancy

It is essentially a "shadow delivery layer" that sits dormant — until it's needed.

What happened during the Cloudflare outage

Because of this architecture, here's what happened:

  1. Cloudflare failed
    DNS, CDN and even the Cloudflare dashboard were unreachable for many companies. Those using Cloudflare for DNS couldn't even log in to fix records.
  2. Our DNS layer stayed fully operational
    Because it uses two independent DNS providers.
  3. Traffic was rerouted away from Cloudflare
    Immediately, through our own DNS layer, with no dependency on Cloudflare being reachable.
  4. Traffic flowed to the external varnish cluster
    Which already had all pages preloaded.
  5. Auto-scaling kicked in
    The varnish cluster scaled up in ~4 seconds to handle the traffic.
  6. Fewer than 50 requests failed during the entire transition
    All other requests served successfully.
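
How the failure was detected and whether the switch was automated is not detailed in this write-up. As a purely hypothetical sketch of the decision logic behind a sequence like the one above, the class below switches to the fallback only after several consecutive failed health probes and switches back only after a longer run of successes, which avoids flapping on a single bad probe. The thresholds are assumptions.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    failure_threshold: int = 3    # assumed: failed probes before failing over
    recovery_threshold: int = 5   # assumed: good probes before failing back
    consecutive_failures: int = 0
    consecutive_successes: int = 0
    on_fallback: bool = False

    def observe(self, probe_ok: bool) -> bool:
        """Record one health probe; return True while traffic should use the fallback."""
        if probe_ok:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0

        if not self.on_fallback and self.consecutive_failures >= self.failure_threshold:
            self.on_fallback = True
        elif self.on_fallback and self.consecutive_successes >= self.recovery_threshold:
            self.on_fallback = False
        return self.on_fallback


state = FailoverState()
probes = [True, False, False, False, True, True, True, True, True]
print([state.observe(ok) for ok in probes])
```

The asymmetric thresholds are deliberate in this sketch: failing over quickly limits failed requests, while failing back slowly avoids returning traffic to a provider that is still unstable.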

Result: Zero downtime. Zero customer complaints. Zero impact.

Meanwhile:

  • Monitoring tools were down
  • Status dashboards were down
  • Many SaaS platforms were down
  • Developers couldn't even access Cloudflare
  • Even uptime-checking services had outages

But this platform remained completely online.

Why this architecture works — and why it's cost-effective

This system is resilient because it removes single points of failure at the most critical layers:

  • DNS (two authoritative providers)
  • CDN (Cloudflare + independent varnish)
  • Origin (Kubernetes + HA database)
  • Application (multi-service architecture)

It avoids duplicating the entire infrastructure stack while still maintaining:

  • Multi-layer failover paths
  • High cache coverage
  • Instant CDN bypass capability
  • Fast recoverability
  • Failover that is transparent to customers

This is what good architecture looks like: high reliability without high cost.

Conclusion: You don't need two clouds — you need the right design

The Cloudflare outage was a real-world stress test.

While a large part of the internet struggled — including DNS, dashboards, uptime monitors, and SaaS platforms — the system I architected stayed online.

Not because it had the biggest budget. Not because it replicated everything twice. But because it was designed thoughtfully, with:

  • Multi-layer redundancy
  • No single points of failure
  • Smart caching
  • DNS control
  • Application decomposition
  • Lightweight failover infrastructure

Resilience isn't bought — it's engineered.

And with the right architecture, even smaller SaaS companies can withstand the outages of major global providers.
