Cloud-Native Architecture

Design for the cloud, not in spite of it.

The hook

"Lift and shift" to the cloud means running your old monolith on EC2 instead of a server in your closet. It works. The lights stay on. The bill arrives.

And the bill is bigger than the closet ever cost you.

You're paying cloud prices for on-prem architecture. Static instances, manual deploys, a database you SSH into when something breaks. Nothing about how you built the system changed — only where it runs.

Cloud-native means designing for what the cloud actually does well: elastic scale, managed services, ephemeral compute, declarative infrastructure. Different posture, different wins.

The concept

Cloud-native is a set of architecture principles, not a stack. You can be cloud-native on AWS, GCP, Azure, or a private Kubernetes cluster. Five core ideas:

Microservices — small services, independent deploys. One team can ship without waiting on a release of the whole monolith.
Containers — packaged, portable workloads. Same image runs on your laptop, in CI, and in prod.
Declarative infrastructure — describe the desired state in code (Terraform, k8s manifests), let tools converge to it. No more clicking around the AWS console.
Immutable infrastructure — never SSH into a box to patch it. Rebuild the image, redeploy, throw away the old one.
Managed services > self-hosted — use the cloud's databases, queues, and caches. Don't run Postgres on an EC2 you forgot about.
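To make "describe the desired state, let tools converge to it" concrete, here is a minimal sketch of a reconcile step. It is not how Terraform or Kubernetes are implemented; the `Action`, `plan`, and `converge` names and the in-memory "cloud" are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    verb: str  # "create", "update", or "delete"
    name: str


def plan(desired: dict, actual: dict) -> list:
    """Compare desired state with actual state and list the actions that close the gap."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(Action("create", name))
        elif actual[name] != spec:
            actions.append(Action("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(Action("delete", name))
    return actions


def converge(desired: dict, actual: dict) -> dict:
    """Apply the plan to an in-memory stand-in for the cloud and return the new state."""
    new_state = dict(actual)
    for action in plan(desired, actual):
        if action.verb == "delete":
            del new_state[action.name]
        else:
            new_state[action.name] = dict(desired[action.name])
    return new_state


# Hypothetical resources: what the repo says should exist vs. what is running.
desired = {"web": {"replicas": 3}, "queue": {"size": "small"}}
actual = {"web": {"replicas": 2}, "legacy-box": {"size": "large"}}

for action in plan(desired, actual):
    print(action.verb, action.name)

converged = converge(desired, actual)
```

Running `plan` again on the converged state returns an empty list. That idempotence is the point: you edit the description, not the servers, and the tool works out the difference.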

The thread tying these together is failure tolerance. Cloud-native assumes hardware fails, networks partition, and services go down. The system heals because it was designed to.

Diagram

flowchart TB
    subgraph IaaS["IaaS — you manage the OS"]
        EC2[EC2 / GCE / Azure VM]
    end
    subgraph PaaS["PaaS / CaaS — you manage the app"]
        CR[Cloud Run / Heroku / App Engine]
    end
    subgraph Serverless["Serverless — you manage the function"]
        L[Lambda / Cloudflare Workers]
    end
    EC2 --> CR --> L
    style EC2 fill:#fee
    style CR fill:#ffd
    style L fill:#dfd

Going up the abstraction ladder (downward in the diagram): less control, less ops, more vendor coupling. EC2 hands you a Linux box and walks away: you patch it, scale it, monitor it. Cloud Run takes a container and runs it for you. Lambda takes a function and runs it on demand. Pick the highest level whose lock-in you can tolerate.

Example — Netflix's cloud-native journey

Netflix is the canonical case, and the story isn't "they moved to AWS." It's why they moved and what they built once they got there.

2008 — the breaking point

In August 2008, a major database corruption left Netflix unable to ship DVDs to members for three days. They were running their own data center. The decision afterward was blunt: leave the data center entirely. Not because the cloud was magic, but because their own infra had become a liability they couldn't engineer around.

2010–2015 — the rewrite

Netflix rebuilt as microservices on AWS. They open-sourced the stack as they went:

Hystrix — circuit breakers, so one slow service doesn't cascade
Eureka — service discovery, so services find each other dynamically
Zuul — API gateway at the edge
Ribbon — client-side load balancing for internal calls

Each tool exists because Netflix hit a problem at scale and built the answer.
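The circuit-breaker idea is small enough to sketch. What follows is a simplified illustration of the pattern, not Hystrix's API: the `CircuitBreaker` class, its parameters, and the `flaky` function are all assumptions invented for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; downstream is marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    # Hypothetical downstream call that always times out.
    raise TimeoutError("recommendations service timed out")


now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])

outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("fast-fail")
    except TimeoutError:
        outcomes.append("timeout")

now[0] = 11.0  # cool-down has passed, so a healthy call closes the circuit
recovered = breaker.call(lambda: "ok")
```

After two timeouts the third call never reaches the slow service; it fails immediately. That is how one slow dependency stops tying up every caller's threads.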

Today

1,000+ microservices. Multi-region active-active. A culture of chaos engineering — they run Chaos Monkey, which kills random production instances during business hours. On purpose. Because if you can't survive a node dying, you'll find out the hard way eventually; they prefer to find out at 2pm on a Tuesday with engineers watching.

The real lesson

The shift wasn't "we use AWS now." It was designing for failure as the default. Hardware fails. Networks partition. Services go down. Cloud-native assumes this and builds accordingly. The industry shorthand is cattle, not pets: any node can die without notice, and the system heals. That mindset is the actual win, and you can apply it whether you're running on AWS, k8s, or a Raspberry Pi cluster.
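Here is a toy version of "any node can die and the system heals", assuming a fleet of interchangeable instances with a target size. `InstancePool` and its methods are hypothetical names; this is neither Chaos Monkey nor any cloud provider's autoscaling API.

```python
import random
from itertools import count


class InstancePool:
    """A toy fleet of interchangeable instances with a target size."""

    def __init__(self, target_size):
        self.target_size = target_size
        self._ids = count(1)
        self.instances = set()
        self.heal()

    def heal(self):
        """Launch replacements until the pool is back at its target size."""
        launched = []
        while len(self.instances) < self.target_size:
            instance_id = f"i-{next(self._ids):04d}"
            self.instances.add(instance_id)
            launched.append(instance_id)
        return launched

    def kill_random(self, rng):
        """Terminate one instance at random, the way a chaos tool would."""
        victim = rng.choice(sorted(self.instances))
        self.instances.remove(victim)
        return victim


rng = random.Random(42)  # seeded so the run is repeatable
pool = InstancePool(target_size=3)

for _ in range(5):
    victim = pool.kill_random(rng)
    replacements = pool.heal()
    print(f"killed {victim}, launched {replacements[0]}")
```

Nothing in the loop cares which instance died. No instance carries state or a name anyone depends on, so replacement is routine rather than an incident.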

Mechanics

IaaS / PaaS / Serverless

Level	You manage	Cloud manages	Examples	Cost model	Lock-in
IaaS	OS, runtime, app, scaling rules	Hardware, network, hypervisor	EC2, GCE, Azure VM	Per-second or per-hour, per instance	Low — VMs are portable
PaaS / CaaS	App code, container image	OS, runtime, scaling	Cloud Run, Heroku, App Engine, Fargate	Per-request or per-container-second	Medium — config is provider-shaped
Serverless	Function code	Everything else	Lambda, Cloudflare Workers, Vercel Functions	Per-invocation, sub-second billing	High — code shape ties to the platform

Default rule: start at the highest level whose lock-in you can tolerate, and drop down only when you hit a hard ceiling (cold starts, runtime limits, niche dependencies).
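Cost is one more ceiling worth checking, and the arithmetic is quick. The sketch below compares a per-invocation bill with one always-on instance. Every price in it is a made-up round number, not any provider's rate, and it ignores the fact that a single VM eventually runs out of capacity.

```python
# All prices are illustrative assumptions, not real provider pricing.
VM_MONTHLY_COST = 60.00            # one always-on instance, dollars per month
PRICE_PER_MILLION_REQUESTS = 0.50  # per-invocation fee
PRICE_PER_GB_SECOND = 0.00002      # compute fee while a function runs


def serverless_monthly_cost(requests, avg_seconds=0.2, memory_gb=0.5):
    """Monthly bill for a function invoked `requests` times."""
    invocation_fee = requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    compute_fee = requests * avg_seconds * memory_gb * PRICE_PER_GB_SECOND
    return invocation_fee + compute_fee


def break_even_requests(avg_seconds=0.2, memory_gb=0.5):
    """Monthly request count at which the function costs as much as the VM."""
    cost_per_request = (PRICE_PER_MILLION_REQUESTS / 1_000_000
                        + avg_seconds * memory_gb * PRICE_PER_GB_SECOND)
    return VM_MONTHLY_COST / cost_per_request


for monthly_requests in (1_000_000, 10_000_000, 100_000_000):
    cost = serverless_monthly_cost(monthly_requests)
    print(f"{monthly_requests:>11,} requests: ${cost:,.2f} vs ${VM_MONTHLY_COST:,.2f} for the VM")

print(f"break-even near {break_even_requests():,.0f} requests per month")
```

With these assumed numbers the crossover sits at roughly 24 million requests a month. Below that the function is cheaper; well above it, steady traffic favors the always-on option. Plug in your own rates before drawing conclusions.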

Anti-patterns and hidden costs

Trap	What it looks like	Why it hurts
Lift-and-shift on EC2	Old monolith running on a big VM	Paying cloud prices for on-prem patterns. No elasticity, no managed services.
Egress fees	"Why is our bill $40k this month?"	Data leaving the cloud is expensive. Cross-region, cross-AZ, and to-internet transfers add up fast. The surprise bill.
N+1 cloud services	Three different queue products in one architecture	Pick one queue, one cache, one DB pattern. Diversity for its own sake means more SDKs, more IAM, more bills.
Snowflake servers	"Don't restart that one — we manually fixed something on it in 2022"	Defeats immutability. Nothing is reproducible, nothing is in code, deploys are scary.
Mutable infrastructure	SSH-ing into prod to patch a config	Configuration drift. What runs in prod no longer matches what you tested in staging.
Stateful pets	Single Postgres instance no one is allowed to touch	Single point of failure that the cloud was supposed to remove. Use a managed DB.

The honest summary: cloud bills get big because of patterns, not raw compute. Egress and idle resources can quietly cost more than the CPU you actually use.
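A back-of-the-envelope estimate shows how transfer charges pile up. The per-GB rates and traffic volumes below are hypothetical; real rates vary by provider, region, and volume tier.

```python
# Hypothetical per-GB transfer rates in dollars; check your provider's price list.
RATES_PER_GB = {
    "same_az": 0.00,
    "cross_az": 0.01,
    "cross_region": 0.02,
    "internet": 0.09,
}


def transfer_cost(flows):
    """Sum the monthly transfer cost for (description, path, gigabytes) flows."""
    total = 0.0
    breakdown = {}
    for description, path, gigabytes in flows:
        cost = gigabytes * RATES_PER_GB[path]
        breakdown[description] = cost
        total += cost
    return total, breakdown


# Made-up monthly traffic for an imaginary service.
flows = [
    ("api responses to users", "internet", 50_000),
    ("replica sync to second region", "cross_region", 200_000),
    ("service-to-service chatter", "cross_az", 400_000),
    ("cache reads in one zone", "same_az", 900_000),
]

total, breakdown = transfer_cost(flows)
for description, cost in sorted(breakdown.items(), key=lambda item: -item[1]):
    print(f"{description:32s} ${cost:>10,.2f}")
print(f"{'total':32s} ${total:>10,.2f}")
```

In this made-up month the largest flow by volume costs nothing because it stays inside one zone. The internal cross-AZ and cross-region traffic together cost more than everything served to users. Where the bytes travel matters as much as how many there are.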

Concept	What it is	How it relates
Microservices	Small, independently deployed services	One of the five core cloud-native principles. Hard to do right — easy to over-fragment.
Containers (Docker)	Packaged, portable runtime images	The unit of deploy in most cloud-native systems. Same image, every environment.
Kubernetes	Container orchestrator	The default for declarative, self-healing container deployment at scale.
Observability	Logs, metrics, traces, tied together	Cloud-native systems are much harder to debug without it. Distributed traces are non-optional once you have 20+ services.
CI/CD	Automated build, test, deploy pipelines	Immutable infra only works if rebuilds are cheap and automatic. CI/CD is the engine.
Infrastructure as Code (IaC)	Terraform, Pulumi, CDK	Declarative infrastructure in practice. Your cloud, in a Git repo.
Serverless	Functions/containers that scale to zero	The far end of the abstraction ladder. Great for spiky workloads, often pricier than always-on capacity for steady ones.
Multi-region	Active-active or active-passive across regions	The pattern that justifies the cloud's price tag — survive entire region outages.

When (and when not) to go cloud-native

Go cloud-native when:

Traffic is variable — peaks and valleys justify elastic compute. You don't want to pay for peak capacity 24/7.
You'd benefit from managed services — managed Postgres, managed Redis, managed queues. The undifferentiated heavy lifting goes away.
Your team is comfortable with the abstraction — managed services are simpler to use and harder to debug. Skill matters.
You need multi-region or fast geographic scaling — the cloud makes new regional capacity a config change instead of a procurement project (replicating state across regions is still real engineering work).
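The "variable traffic" case can be put in numbers. This sketch compares instance-hours for a fleet sized for peak all day with one that scales hourly. The traffic profile and the per-instance capacity are assumptions chosen for illustration.

```python
import math

RPS_PER_INSTANCE = 100  # assumed capacity of one instance


def instances_needed(requests_per_second, minimum=2):
    """Instances required for a given load, never dropping below a safety floor."""
    return max(minimum, math.ceil(requests_per_second / RPS_PER_INSTANCE))


# Hypothetical traffic for one day, one sample per hour (requests per second).
hourly_rps = [80, 60, 50, 50, 60, 120, 300, 700, 900, 950, 1000, 1200,
              1100, 1000, 900, 850, 900, 1400, 1800, 1500, 900, 500, 250, 120]

# Elastic: resize every hour. Static: provision for the peak around the clock.
elastic_hours = sum(instances_needed(rps) for rps in hourly_rps)
static_hours = instances_needed(max(hourly_rps)) * len(hourly_rps)

savings = 1 - elastic_hours / static_hours
print(f"elastic: {elastic_hours} instance-hours, static: {static_hours}, saved: {savings:.0%}")
```

For this profile the elastic fleet uses 177 instance-hours against 432 for peak provisioning, about 59% fewer. Flatten the curve and the gap closes, which is why steady workloads appear under "Skip it when".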

Skip it when:

The workload is steady and you can run it cheaper on dedicated hardware. Bare metal often still beats the cloud on raw cost-per-CPU when usage is flat.
Compliance or data-locality forces on-prem — some regulated workloads (certain healthcare, government, finance) can't leave specific data centers.
You have niche requirements that don't fit cloud abstractions — specialized hardware, kernel-level customizations, deterministic latency.

The honest answer: most modern apps benefit from cloud-native, but lift-and-shift is the wrong way to do it. If you're going to pay cloud prices, design for cloud wins. Otherwise, stay where you are until the architecture is ready.

Key takeaway

Cloud-native is a posture, not a stack — design for failure, embrace managed services, treat infrastructure as code.
Lift-and-shift is the trap. You'll likely pay more than you did on-prem without gaining elasticity or managed services.
Cattle, not pets. Any instance can die. The system heals because you designed it to.
Pick the highest abstraction you can stomach. Serverless > managed containers > VMs, in that order, until lock-in or limits push you down.
Egress fees are the surprise bill. Architect to minimize cross-region and out-of-cloud data transfer.
