Skip to main content

Command Palette

Search for a command to run...

Scaling: Before You Buy More Servers, Read This

Published
9 min read
Scaling: Before You Buy More Servers, Read This
A
Real-world engineering insights from 20+ years building scalable systems. Focused on AI, RAG architectures, and production-ready system design.

Series: Backend Engineering Fundamentals · Post 06 of 07 Level: Beginner-friendly · Read time: ~8 min


"We need to scale" is one of the most expensive sentences in engineering.

It triggers infrastructure discussions, migration projects, and architectural rewrites — often before anyone has looked at whether the current system is actually running at capacity.

Before scaling your infrastructure, understand what you're actually scaling for. Most systems that feel slow are bottlenecked by code problems (N+1 queries, missing indexes, synchronous calls that should be async) — not infrastructure capacity. Scaling a slow system gives you a more expensive slow system.

This post covers the actual mechanics of scaling, the tradeoffs between approaches, and how to think about it before opening a cloud console.


Vertical vs Horizontal Scaling

Vertical scaling (Scale Up): Add more resources to existing servers — bigger CPU, more RAM, faster disk.

Horizontal scaling (Scale Out): Add more servers and distribute the load across them.

Vertical Scaling                    Horizontal Scaling

[Server: 8 CPU, 32GB]      →       [Server: 4 CPU, 16GB] ×3
         ↓                                   ↓
[Server: 32 CPU, 128GB]             [Server: 4 CPU, 16GB] ×10
(one big machine)                   (many smaller machines)
Vertical Horizontal
Simplicity Simple — no code changes Complex — requires stateless design
Cost Expensive at high end (premium hardware) Cheaper per unit at scale
Failure impact Single point of failure Redundant — one server failure is minor
Ceiling Hard limit on available hardware Theoretically unlimited
Database Works well (most DBs scale vertically first) Sharding required for DBs

In practice: Start with vertical scaling. It's simpler, faster, and often sufficient. Switch to horizontal when you hit the vertical ceiling or need high availability.


The Stateless Requirement for Horizontal Scaling

Horizontal scaling only works if your application is stateless — each request can be handled by any server, with no local state that makes one server "special."

❌ Stateful — Can't Scale Horizontally

Server 1: User session in memory → [Request for user A] works
Server 2: No session for user A  → [Request for user A] fails

✅ Stateless — Scales Horizontally

Server 1: No local state → reads session from Redis
Server 2: No local state → reads session from Redis
Server 3: No local state → reads session from Redis

Any server can handle any request.
Load balancer distributes freely.

The rule: Move all state out of your application servers and into shared storage (Redis for sessions, S3 for files, your database for persistent data). Your servers should be interchangeable.

# ❌ Stateful — in-memory session
app.sessions[user_id] = {"cart": items}  # Lives on one server only

# ✅ Stateless — session in Redis
redis.setex(f"session:{session_id}", 3600, json.dumps({"cart": items}))

Load Balancers — The Front Door to Your Scaled System

A load balancer distributes incoming requests across your pool of servers.

Internet
   ↓
[Load Balancer]
   ├── Server 1
   ├── Server 2
   └── Server 3

Load balancing algorithms:

Algorithm How it works Use when
Round Robin Requests distributed in sequence (1→2→3→1→2→3) Servers have equal capacity and similar request costs
Least Connections Routes to server with fewest active connections Requests have variable processing time
IP Hash Routes same client IP to same server You need session stickiness and can't use a shared session store
Weighted Servers get traffic proportional to weight Servers have different capacities
Random Random server selection Surprisingly effective at scale; simple to implement

Layer 4 vs Layer 7:

  • L4 (TCP/UDP): Routes based on IP address and port. Extremely fast, no content inspection. AWS NLB, HAProxy in TCP mode.
  • L7 (HTTP): Routes based on HTTP content (URL, headers, cookies). More flexible — route /api to one pool, /static to another. AWS ALB, NGINX, Traefik.
# NGINX: Layer 7 load balancing with upstream pools
upstream api_servers {
    least_conn;  # Least connections algorithm
    server app1.internal:8080 weight=3;
    server app2.internal:8080 weight=3;
    server app3.internal:8080 weight=1;  # Lower weight = less traffic
    
    keepalive 32;  # Connection pool to upstream servers
}

upstream static_servers {
    server static1.internal:8080;
    server static2.internal:8080;
}

server {
    location /api/ {
        proxy_pass http://api_servers;
    }
    location /static/ {
        proxy_pass http://static_servers;
    }
}

Database Scaling — Where It Gets Hard

Application servers are stateless and easy to scale. Databases are stateful and hard.

Read Replicas — The First Move

Most applications are read-heavy. Add read replicas and route SELECT queries there.

Primary DB (writes)
    ↓ replication
Replica 1 (reads)
Replica 2 (reads)
Replica 3 (reads)

Application:
  - INSERT / UPDATE / DELETE → Primary
  - SELECT → Random replica
# Connection routing example
def get_db_connection(read_only: bool = False):
    if read_only:
        return random.choice(replica_connections)
    return primary_connection

Limitation: Replication lag. Replicas are slightly behind the primary (usually milliseconds, but can grow under load). Don't read from a replica immediately after a write if you need the result.


Connection Pooling — Before You Add Replicas

Before adding replicas, make sure you're not wasting connections. Databases have a hard limit on concurrent connections. Without pooling, a spike in traffic can exhaust connections instantly.

# SQLAlchemy connection pool
engine = create_engine(
    DATABASE_URL,
    pool_size=20,          # Normal pool size
    max_overflow=30,       # Extra connections under load
    pool_timeout=30,       # Wait up to 30s for a connection before error
    pool_recycle=3600      # Recycle connections after 1 hour
)

For PostgreSQL at scale, use PgBouncer — a lightweight connection pooler that sits between your app and the database, multiplexing thousands of application connections onto a smaller number of actual DB connections.


Sharding — The Last Resort

When a single primary + replicas isn't enough, you shard: split your data across multiple databases.

User IDs 1–1M   → Database Shard 1
User IDs 1M–2M  → Database Shard 2
User IDs 2M–3M  → Database Shard 3

The costs are real:

  • Cross-shard queries (JOINs across shards) become application logic
  • Transactions across shards require distributed transaction handling
  • Resharding (when a shard gets too large) is painful
  • Every query needs shard-routing logic

Sharding adds enormous operational complexity. Exhaust all other options first: indexing, query optimization, read replicas, caching, connection pooling, vertical scaling.


Auto-Scaling — Elasticity, Not Magic

Auto-scaling adds or removes servers based on load. This is valuable for variable traffic patterns (traffic spikes on product launches, Black Friday, etc.).

# AWS Auto Scaling Group (simplified)
AutoScalingGroup:
  MinSize: 2          # Always at least 2 servers
  MaxSize: 20         # Never exceed 20 servers
  DesiredCapacity: 4  # Start with 4

ScalingPolicy:
  ScaleOut:
    Trigger: CPUUtilization > 70% for 2 minutes
    Action: Add 2 instances
  ScaleIn:
    Trigger: CPUUtilization < 30% for 10 minutes
    Action: Remove 1 instance

Auto-scaling pitfalls:

  1. Cold start time: If spinning up a new instance takes 3 minutes, it won't help with a traffic spike that peaks in 1 minute. Pre-warm with a higher minimum capacity.

  2. Scale-in aggressiveness: Removing servers too aggressively causes thrashing (scale up, scale down, scale up again). Add a cooldown period.

  3. Database doesn't scale automatically: Auto-scaling your app tier is useless if your database becomes the bottleneck. Ensure your DB can handle the connection surge from new instances.

  4. Stateful sessions: If you forgot the stateless requirement, auto-scaling will cause session loss when a server is removed.


CDN for Static Assets — The Easiest Win

Before spending time on application scaling, ask: how much of your traffic is serving static files (JS, CSS, images)?

A CDN serves these from edge locations close to users, eliminating the load from your application servers entirely.

Without CDN:
User (Tokyo) → [Internet] → App Server (US East) → serve image (300ms)

With CDN:
User (Tokyo) → CDN Edge (Tokyo) → serve cached image (8ms)

This also reduces bandwidth costs, since CDN egress is typically cheaper than cloud server egress.

What to cache on CDN:

  • All static assets with content-hash filenames (infinite TTL, cache-busted on deploy)
  • API responses that are public and change infrequently (product catalog, pricing)
  • Rendered HTML pages for anonymous users (massive scale lever for content sites)

Scaling Checklist — Before Adding Servers

Run through this before any infrastructure change:

  • Are queries using indexes? (EXPLAIN ANALYZE your slow queries)
  • Is there N+1 query behavior in the application?
  • Is connection pooling configured? (PgBouncer, HikariCP, SQLAlchemy pool)
  • Are static assets served via CDN?
  • Is read traffic separated to replicas?
  • Are expensive computations cached?
  • Are long-running operations async (queues) instead of blocking request threads?
  • Is the application stateless (sessions in Redis, files in S3)?

Tick all of these before scaling horizontally. You'll likely find the bottleneck isn't what you thought.


Key Takeaways

  • Scale vertically first — it's simpler and often enough
  • Stateless design is the prerequisite for horizontal scaling — move all state to shared storage
  • Load balancers distribute traffic; Layer 7 gives you routing flexibility
  • Read replicas are the first database scaling move — they solve most read-heavy bottlenecks
  • Connection pooling (PgBouncer) often eliminates "database can't scale" problems cheaply
  • Sharding is a last resort — the complexity cost is real
  • CDN and query optimization have better ROI than new servers in most systems
  • Profile first. Most slow systems are code problems, not infrastructure problems.

What bottleneck surprised you most when your system first started struggling under load — was it what you expected?


Next in the series → Post 07: You Can't Manage What You Can't See — The Three Pillars of Observability

You've built and scaled your system. Now: how do you know it's working?

Backend Engineering Fundamentals

Part 2 of 6

Backend systems don't fail because of bad code alone — they fail because of bad decisions. This series breaks down the foundational concepts every developer, architect, and engineer needs to build systems that scale, stay secure, and survive production: APIs, caching, security, databases, message queues, scalability, and observability. No fluff, no vendor pitches — just the tradeoffs that actually matter.

Up next

When to Stop Calling APIs and Start Publishing Events

Series: Backend Engineering Fundamentals · Post 05 of 07 Level: Advanced · Read time: ~10 min Picture a simple checkout flow: user places an order → charge the card → update inventory → send a confi