Surviving the Cloudflare Outage: How a Thoughtfully Engineered Architecture Kept a SaaS Platform Online With Zero Downtime

Last week, a major Cloudflare outage swept across the internet, bringing down countless websites, SaaS platforms and online tools. Entire industries experienced cascading failures: sites unreachable, dashboards offline, uptime monitors dead, and even Cloudflare's own portal inaccessible to customers using Cloudflare DNS.

In the middle of this, a SaaS platform whose architecture I designed as a consultant experienced zero downtime.

This case study explains how — and why — deliberate architectural choices matter far more than throwing money at infrastructure.

About me — and why I was hired

Today I work as a system architect with strong DevOps and cloud infrastructure expertise, but I come from a professional background as a backend programmer. My strength has always been designing and building complex systems, where application architecture and infrastructure need to work together.

Over the past few years, I've specialised in:

Cloud architecture
Redundancy engineering
Fault-tolerant systems
Scaling strategies
Kubernetes
High-availability design
Performance and infrastructure optimisation

Companies typically bring me in when their systems have outgrown their original design — when scaling, stability and reliability become business-critical.

This SaaS platform was one of those cases.

The SaaS: High uptime requirements with real-world budget constraints

The company operates a niche website-builder SaaS for a highly regulated sector. They serve hundreds of customer websites with clear uptime expectations.

But unlike enterprise web hosting, this market is price-sensitive. Customers are not paying thousands of euros a month, so the system infrastructure must remain:

Cost-efficient
Easy to maintain
Highly reliable
Secure
Redundant
Capable of absorbing traffic spikes

Most sites are static or semi-static public pages — no logged-in user logic — but they change frequently and must be delivered fast globally.

The original architecture — functional but fundamentally limited

When I was brought into the project, the original platform had several key limitations:

It was a traditional monolithic application, tightly coupled and difficult to scale horizontally
Internal session handling and memory-bound logic meant it could only run reliably on a single machine
The CDN in front was a legacy, commercial CDN product that had become unstable, slow and poorly supported
The system had no elegant way to scale out traffic or isolate workloads
Redundancy was minimal
Infrastructure costs increased rapidly as load grew

It wasn't built for a future where hundreds of customer websites were being served around the clock.

The company needed a full architectural redesign — not just infrastructure, but the application itself.

The redesign: Modern, distributed, resilient

Over the course of the project, I redesigned the entire architecture and implemented the new system end-to-end.

1. Breaking the monolith

The application was decomposed into dedicated services:

Website rendering
CMS operations
Static delivery layer
Integrations and background tasks

Each service could scale independently.

2. Migrating to Kubernetes

Deploying workloads into Kubernetes enabled:

Horizontal scaling
Rolling deployments
Resilience
Smart resource utilisation
Self-healing
Infrastructure consistency

3. Highly available database cluster

A proper HA database layer replaced the single-node model.

4. Moving to Cloudflare for edge delivery

Cloudflare was introduced for:

High-performance global caching
Traffic acceleration
DDoS protection
WAF
HTTP/3 and modern protocol support

Cache hit ratios quickly reached 90–98%, drastically reducing load on the origin.
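To make the effect of those hit ratios concrete, here is a small sketch of the arithmetic: whatever the edge cache does not serve reaches the origin. The request volume is purely illustrative, since the article gives no traffic figures, and the function name is hypothetical.

```python
def origin_requests(total_requests: int, hit_percent: int) -> int:
    """Requests that miss the edge cache and therefore reach the origin."""
    if not 0 <= hit_percent <= 100:
        raise ValueError("hit_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_requests * (100 - hit_percent) // 100


# Illustrative volume only; not a figure from the platform.
TOTAL = 1_000_000
for pct in (90, 98):
    print(f"{pct}% hit ratio -> {origin_requests(TOTAL, pct)} origin requests")
# 90% hit ratio -> 100000 origin requests
# 98% hit ratio -> 20000 origin requests
```

Moving from a 90% to a 98% hit ratio cuts origin traffic by a factor of five, which is why the top of that range matters more than the small difference in percentages suggests.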

All good — but Cloudflare became a potential single point of failure.

And last week proved exactly why that matters.

The DNS problem: The hidden failure point that kills SaaS platforms

Most SaaS companies tell clients to point their domains directly to Cloudflare.

That creates an enormous risk:

If Cloudflare goes down and they control your DNS, you cannot change anything.

During the outage last week:

Many businesses' websites were offline
Their DNS was unreachable
They couldn't log in to Cloudflare
They couldn't add DNS records
They couldn't reroute traffic
They couldn't even switch to a backup CDN
They were stuck.

To avoid this trap, I designed a middle-layer DNS architecture.

Clients → Our DNS layer → Cloudflare → Origin

This gives the SaaS full control during outages.

Now, if Cloudflare has issues, we simply reroute traffic away from them.
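A minimal sketch of how such a reroute decision could look. It assumes customer domains point, for example via CNAME, at one hostname in a zone the platform controls, so changing a single record moves every site. All hostnames, the failure threshold and the TTL are hypothetical; the article does not describe the actual mechanism.

```python
from dataclasses import dataclass

# Hypothetical targets; real values depend on the platform's own setup.
CDN_TARGET = "sites.cdn-proxied.example."
FALLBACK_TARGET = "fallback.cache.example."


@dataclass(frozen=True)
class CnameRecord:
    name: str
    target: str
    ttl: int


def select_target(recent_checks: list[bool], max_failures: int = 3) -> str:
    """Pick the fallback once the last `max_failures` health checks all failed."""
    tail = recent_checks[-max_failures:]
    cdn_down = len(tail) == max_failures and not any(tail)
    return FALLBACK_TARGET if cdn_down else CDN_TARGET


def build_record(recent_checks: list[bool], ttl: int = 60) -> CnameRecord:
    """Build the record the platform-controlled zone should serve.

    Assumption: customer domains resolve through `edge.platform.example.`,
    so this one record decides where all traffic goes.
    """
    return CnameRecord("edge.platform.example.", select_target(recent_checks), ttl)


print(build_record([True, False, False, False]).target)  # fallback.cache.example.
```

Requiring several consecutive failures avoids flapping on a single bad probe, and a short TTL limits how long resolvers keep serving the old answer after the switch.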

But our DNS becomes a potential single point of failure

To solve this, I implemented:

A primary DNS provider
A secondary DNS provider
Both authoritative.

If one fails, the other continues operating. DNS resilience solved.
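Two authoritative providers only help if both serve the same records, so a periodic consistency check is a natural companion. The sketch below compares two record sets; how the records are fetched from each provider (API, zone transfer) is left out, and the names and sample records are hypothetical.

```python
Record = tuple[str, str, str]  # (name, type, value)


def zone_drift(primary: set[Record], secondary: set[Record]) -> dict[str, set[Record]]:
    """Return the records present at one provider but not at the other."""
    return {
        "missing_on_secondary": primary - secondary,
        "missing_on_primary": secondary - primary,
    }


# Sample data for illustration only.
primary = {
    ("edge.platform.example.", "CNAME", "sites.cdn-proxied.example."),
    ("www.platform.example.", "A", "192.0.2.10"),
}
secondary = {
    ("edge.platform.example.", "CNAME", "sites.cdn-proxied.example."),
}

drift = zone_drift(primary, secondary)
print(drift["missing_on_secondary"])
# {('www.platform.example.', 'A', '192.0.2.10')}
```

An empty result on both sides means either provider can answer for the zone without clients seeing a difference.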

Multi-cloud redundancy? Not realistic for this SaaS

Yes, in theory we could add:

A second Kubernetes cluster in a different cloud
Full cross-provider data replication
Synchronised object storage
Multi-cloud deployments

But this:

Doubles the operational cost
Doubles the storage cost
Doubles the infrastructure complexity
Requires constant state syncing
Is unrealistic for a niche SaaS with strict budget constraints

We needed a smarter, more elegant solution.

The unconventional but powerful solution: A lightweight external Nginx–Varnish cluster

This turned out to be the single most important resilience mechanism in the entire system.

What it is

A tiny cluster, hosted outside the main cloud provider, running:

Nginx
Varnish (HTTP accelerator/cache)

What it stores

A full, preloaded, always-updated cache of every customer website.

Whenever a customer updates content:

The origin updates
Cloudflare cache is purged, so the edge cannot re-cache the old version
A "fire-and-forget" request instructs the Varnish cluster to preload the updated page

This keeps Varnish permanently warm with the latest version of every site.
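The "fire-and-forget" preload can be sketched roughly as follows. This is an illustration under assumptions, not the platform's actual code: it assumes that requesting a page through the external cache makes it fetch and store the current version, and it leaves out how an already-cached page is refreshed (for example a purge or a forced miss configured on the cache side). The function names are hypothetical.

```python
import threading
import urllib.request
from typing import Callable, Iterable


def http_get(url: str, timeout: float = 5.0) -> None:
    """Request a URL and discard the body; the cache keeps its own copy."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def warm(urls: Iterable[str], fetch: Callable[[str], None] = http_get) -> list[str]:
    """Fetch every URL once and return the ones that failed."""
    failed = []
    for url in urls:
        try:
            fetch(url)
        except Exception:
            # A failed warm-up must never break publishing; just record it.
            failed.append(url)
    return failed


def warm_in_background(
    urls: Iterable[str], fetch: Callable[[str], None] = http_get
) -> threading.Thread:
    """Start the warm-up without blocking the caller ("fire-and-forget")."""
    thread = threading.Thread(target=warm, args=(list(urls), fetch), daemon=True)
    thread.start()
    return thread
```

The publish path would call `warm_in_background(changed_urls)` and return immediately, so a slow or unreachable fallback cluster never delays a customer's save. In a real system the failed URLs would likely go to a retry queue, so that the cache does not silently miss an update.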

Why this matters

This cluster:

Requires almost no compute
Requires no database
Doesn't need the CMS
Doesn't run business logic
Scales in seconds
Costs a fraction of full multi-cloud redundancy

It is essentially a "shadow delivery layer" that sits dormant — until it's needed.

What happened during the Cloudflare outage

Because of this architecture, here's what happened:

Cloudflare failed: DNS, CDN and even the Cloudflare dashboard were unreachable for many companies. Those using Cloudflare for DNS couldn't even log in to fix records.
Our DNS layer stayed fully operational, because it uses two independent DNS providers.
Traffic was rerouted away from Cloudflare instantly.
Traffic flowed to the external Varnish cluster, which already had all pages preloaded.
Auto-scaling kicked in: the Varnish cluster scaled up in ~4 seconds to handle the traffic.
Fewer than 50 requests failed during the entire transition; all other requests were served successfully.

Result: Zero downtime. Zero customer complaints. Zero impact.

Meanwhile:

Monitoring tools were down
Status dashboards were down
Many SaaS platforms were down
Developers couldn't even access Cloudflare
Even uptime-checking services had outages

But this platform remained completely online.

Why this architecture works — and why it's cost-effective

This system is resilient because it removes single points of failure at the most critical layers:

DNS (two authoritative providers)
CDN (Cloudflare + independent Varnish)
Origin (Kubernetes + HA database)
Application (multi-service architecture)

It avoids duplicating the entire infrastructure stack while still maintaining:

Multi-layer failover paths
High cache coverage
Instant CDN bypass capability
Fast recoverability
Failover that is transparent to customers

This is what good architecture looks like: high reliability without high cost.

Conclusion: You don't need two clouds — you need the right design

The Cloudflare outage was a real-world stress test.

While a large part of the internet struggled — including DNS, dashboards, uptime monitors, and SaaS platforms — the system I architected stayed online.

Not because it had the biggest budget. Not because it replicated everything twice. But because it was designed thoughtfully, with:

Multi-layer redundancy
No single points of failure
Smart caching
DNS control
Application decomposition
Lightweight failover infrastructure

Resilience isn't bought — it's engineered.

And with the right architecture, even smaller SaaS companies can withstand the outages of major global providers.