Surviving the Cloudflare Outage: How a Thoughtfully Engineered Architecture Kept a SaaS Platform Online With Zero Downtime
Last week, a major Cloudflare outage swept across the internet, bringing down countless websites, SaaS platforms and online tools. Entire industries experienced cascading failures: sites unreachable, dashboards offline, uptime monitors dead, and even Cloudflare's own portal inaccessible to customers using Cloudflare DNS.
In the middle of this, a SaaS platform whose architecture I designed as a consultant experienced zero downtime.
This case study explains how — and why — deliberate architectural choices matter far more than throwing money at infrastructure.
About me — and why I was hired
Today I work as a system architect with strong DevOps and cloud infrastructure expertise, but I come from a professional background as a backend programmer. My strength has always been designing and building complex systems, where application architecture and infrastructure need to work together.
Over the past years, I've specialised in:
- Cloud architecture
- Redundancy engineering
- Fault-tolerant systems
- Scaling strategies
- Kubernetes
- High-availability design
- Performance and infrastructure optimisation
Companies typically bring me in when their systems have outgrown their original design — when scaling, stability and reliability become business-critical.
This SaaS platform was one of those cases.
The SaaS: High uptime requirements with real-world budget constraints
The company operates a niche website-builder SaaS for a highly regulated sector. They serve hundreds of customer websites with clear uptime expectations.
But unlike enterprise web hosting, this market is price-sensitive. Customers are not paying thousands of euros a month, so the system infrastructure must remain:
- Cost-efficient
- Easy to maintain
- Highly reliable
- Secure
- Redundant
- Capable of absorbing traffic spikes
Most sites are static or semi-static public pages — no logged-in user logic — but they change frequently and must be delivered fast globally.
The original architecture — functional but fundamentally limited
When I was brought into the project, the original platform had several key limitations:
- It was a traditional monolithic application, tightly coupled and difficult to scale horizontally
- Internal session handling and memory-bound logic meant it could only run reliably on a single machine
- The CDN in front was a legacy, commercial CDN product that had become unstable, slow and poorly supported
- The system had no elegant way to scale out traffic or isolate workloads
- Redundancy was minimal
- Infrastructure costs increased rapidly as load grew
It wasn't built for a future where hundreds of customer websites were being served around the clock.
The company needed a full architectural redesign — not just infrastructure, but the application itself.
The redesign: Modern, distributed, resilient
Over the course of the project, I redesigned the entire architecture and implemented the new system end-to-end.
1. Breaking the monolith
The application was decomposed into dedicated services:
- Website rendering
- CMS operations
- Static delivery layer
- Integrations and background tasks
Each service could scale independently.
2. Migrating to Kubernetes
Deploying workloads into Kubernetes enabled:
- Horizontal scaling
- Rolling deployments
- Resilience
- Smart resource utilisation
- Self-healing
- Infrastructure consistency
3. Highly available database cluster
A proper HA database layer replaced the single-node model.
4. Moving to Cloudflare for edge delivery
Cloudflare was introduced for:
- High-performance global caching
- Traffic acceleration
- DDoS protection
- WAF
- HTTP/3 and modern protocol support
Cache hit ratios quickly climbed to 90–98%, leaving only 2–10% of requests to reach the origin.
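To make the effect of those hit ratios concrete, here is a small arithmetic sketch. The request volume of one million is a hypothetical figure chosen for illustration, not a number from the platform; only the 90–98% range comes from the text above.

```python
def origin_requests(total_requests: int, hit_ratio: float) -> int:
    """Estimate how many requests miss the edge cache and reach the origin."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return round(total_requests * (1.0 - hit_ratio))


# Hypothetical volume: 1,000,000 requests (illustrative only).
for ratio in (0.90, 0.98):
    reached = origin_requests(1_000_000, ratio)
    print(f"{ratio:.0%} hit ratio -> {reached:,} requests reach the origin")
```

Going from a 90% to a 98% hit ratio cuts origin traffic by a factor of five, which is why the last few percentage points of cache coverage matter so much.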
All good — but Cloudflare became a potential single point of failure.
And last week proved exactly why that matters.
The DNS problem: The hidden failure point that kills SaaS platforms
Most SaaS companies tell clients to point their domains directly to Cloudflare.
That creates an enormous risk: if Cloudflare goes down and they control your DNS, you cannot change anything.
During the outage last week:
- Many businesses' websites were offline
- Their DNS was unreachable
- They couldn't log in to Cloudflare
- They couldn't add DNS records
- They couldn't reroute traffic
- They couldn't even switch to a backup CDN
They were stuck.
To avoid this trap, I designed a middle-layer DNS architecture.
Clients → Our DNS layer → Cloudflare → Origin
This gives the SaaS full control during outages.
Now, if Cloudflare has issues, we simply reroute traffic away from them.
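The rerouting decision can be sketched as a small function that picks which target the SaaS-controlled DNS record should point to. This is a minimal sketch under stated assumptions: the article does not say how the switch is triggered, so the probe list, the consecutive-failure threshold and both hostnames are hypothetical.

```python
# Hypothetical hostnames; the real ones are not given in the article.
PRIMARY_TARGET = "sites.cdn-fronted.example.net"     # normally fronted by Cloudflare
FALLBACK_TARGET = "sites.shadow-cache.example.net"   # independent delivery path


def choose_target(probe_results: list[bool], threshold: int = 3) -> str:
    """Return the target the DNS record should point to.

    probe_results holds health-probe outcomes, oldest first (True = healthy).
    The fallback is chosen only after `threshold` consecutive failures, so a
    single failed probe does not move all traffic (an assumed policy).
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    recent = probe_results[-threshold:]
    if len(recent) == threshold and not any(recent):
        return FALLBACK_TARGET
    return PRIMARY_TARGET


print(choose_target([True, True, False]))          # sites.cdn-fronted.example.net
print(choose_target([True, False, False, False]))  # sites.shadow-cache.example.net
```

How quickly clients follow such a change depends on the TTL of the record, so a design like this generally pairs the switch with short TTLs.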
But our DNS becomes a potential single point of failure
To solve this, I implemented:
- A primary DNS provider
- A secondary DNS provider
Both are authoritative.
If one fails, the other continues operating. DNS resilience solved.
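Running two authoritative providers only helps if both serve the same records. The sketch below shows one way to detect drift between them. It assumes the record sets have already been exported from each provider as (name, type, value) tuples; the export step, the function names and the sample records are all hypothetical.

```python
Record = tuple[str, str, str]  # (name, record type, value)


def find_drift(primary: set[Record], secondary: set[Record]) -> dict[str, set[Record]]:
    """Report records that exist on one provider but not on the other."""
    return {
        "missing_on_secondary": primary - secondary,
        "missing_on_primary": secondary - primary,
    }


def in_sync(primary: set[Record], secondary: set[Record]) -> bool:
    """True when both providers would answer with identical record sets."""
    return primary == secondary


# Hypothetical record sets using documentation-reserved names and addresses.
primary_records = {
    ("www.example.com", "CNAME", "sites.example.net"),
    ("example.com", "A", "192.0.2.10"),
}
secondary_records = {
    ("www.example.com", "CNAME", "sites.example.net"),
}

drift = find_drift(primary_records, secondary_records)
print(in_sync(primary_records, secondary_records))  # False
print(drift["missing_on_secondary"])  # {('example.com', 'A', '192.0.2.10')}
```

A check like this could run after every record change, so that a failover to the second provider never serves stale answers.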
Multi-cloud redundancy? Not realistic for this SaaS
Yes, in theory we could add:
- A second Kubernetes cluster in a different cloud
- Full cross-provider data replication
- Synchronised object storage
- Multi-cloud deployments
But this:
- Doubles the operational cost
- Doubles the storage cost
- Doubles the infrastructure complexity
- Requires constant state syncing
- Is unrealistic for a niche SaaS with strict budget constraints
We needed a smarter, more elegant solution.
The unconventional — but powerful — solution: A lightweight external nginx–varnish cluster
This turned out to be the single most important resilience mechanism in the entire system.
What it is
A tiny cluster running:
- Nginx
- Varnish (HTTP accelerator/cache)
The cluster is hosted outside the main cloud provider.
What it stores
A full, preloaded, always-updated cache of every customer website.
Whenever a customer updates content:
- Cloudflare cache is purged
- The origin updates
- A "fire-and-forget" request instructs the varnish cluster to preload the updated page
This keeps varnish permanently warm with the latest version of every site.
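The update flow above can be sketched as follows. This is a hedged illustration, not the platform's actual code: the three callables are hypothetical stand-ins for the real origin update, Cloudflare purge and varnish preload calls. The article lists the steps without saying whether their order matters, so this sketch assumes the origin is updated first, so that later fetches see the new content.

```python
import threading
from typing import Callable


def publish_update(
    url: str,
    update_origin: Callable[[str], None],
    purge_cdn: Callable[[str], None],
    warm_fallback: Callable[[str], None],
) -> threading.Thread:
    """Apply a content update and start a fire-and-forget cache warm-up.

    The origin update and CDN purge run synchronously. The warm-up runs in a
    background thread, and its failures are logged rather than raised, so a
    slow or unreachable fallback cluster never blocks publishing.
    """
    update_origin(url)
    purge_cdn(url)

    def _warm() -> None:
        try:
            warm_fallback(url)
        except Exception as exc:  # fire-and-forget: never propagate
            print(f"warm-up failed for {url}: {exc}")

    worker = threading.Thread(target=_warm, daemon=True)
    worker.start()
    return worker  # returned only so callers (and tests) can wait if they wish


# Usage with hypothetical stand-ins that just record what was called.
calls: list[str] = []
worker = publish_update(
    "https://customer.example/page",
    update_origin=lambda u: calls.append("origin"),
    purge_cdn=lambda u: calls.append("purge"),
    warm_fallback=lambda u: calls.append("warm"),
)
worker.join()
print(calls)  # ['origin', 'purge', 'warm']
```

Because the warm-up is decoupled from publishing, a production version would likely also want a periodic full re-warm to catch any preload requests that were lost.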
Why this matters
This cluster:
- Requires almost no compute
- Requires no database
- Doesn't need the CMS
- Doesn't run business logic
- Scales in seconds
- Costs a fraction of full multi-cloud redundancy
It is essentially a "shadow delivery layer" that sits dormant — until it's needed.
What happened during the Cloudflare outage
Because of this architecture, here's what happened:
- Cloudflare failed. DNS, CDN and even the Cloudflare dashboard were unreachable for many companies. Those using Cloudflare for DNS couldn't even log in to fix records.
- Our DNS layer stayed fully operational, because it uses two independent DNS providers.
- Traffic was rerouted away from Cloudflare instantly.
- Traffic flowed to the external varnish cluster, which already had all pages preloaded.
- Auto-scaling kicked in. The varnish cluster scaled up in ~4 seconds to handle the traffic.
- Fewer than 50 requests failed during the entire transition. All other requests were served successfully.
Result: Zero downtime. Zero customer complaints. Zero impact.
Meanwhile:
- Monitoring tools were down
- Status dashboards were down
- Many SaaS platforms were down
- Developers couldn't even access Cloudflare
- Even uptime-checking services had outages
But this platform remained completely online.
Why this architecture works — and why it's cost-effective
This system is resilient because it removes single points of failure at the most critical layers:
- DNS (two authoritative providers)
- CDN (Cloudflare + independent varnish)
- Origin (Kubernetes + HA database)
- Application (multi-service architecture)
It avoids duplicating the entire infrastructure stack while still maintaining:
- Multi-layer failover paths
- High cache coverage
- Instant CDN bypass capability
- Fast recoverability
- Failover that is fully transparent to customers
This is what good architecture looks like: high reliability without high cost.
Conclusion: You don't need two clouds — you need the right design
The Cloudflare outage was a real-world stress test.
While a large part of the internet struggled — including DNS, dashboards, uptime monitors, and SaaS platforms — the system I architected stayed online.
Not because it had the biggest budget. Not because it replicated everything twice. But because it was designed thoughtfully, with:
- Multi-layer redundancy
- No single points of failure
- Smart caching
- DNS control
- Application decomposition
- Lightweight failover infrastructure
Resilience isn't bought — it's engineered.
And with the right architecture, even smaller SaaS companies can withstand the outages of major global providers.