Cache Invalidation: The Problem That Humbles Every Engineer

Series: Backend Engineering Fundamentals · Post 02 of 07 · Level: Intermediate · Read time: ~9 min

Phil Karlton famously said there are only two hard problems in computer science: cache invalidation and naming things.

He was joking. But he wasn't wrong.

Caching seems simple. You store a result and serve the stored version next time. The hard part isn't storing data. It's knowing when the stored version is no longer valid, and handling that correctly at scale without bringing your database to its knees in the process.

This post covers the caching concepts that matter in production: where to cache, what to cache, how to invalidate it, and the failure modes that catch teams off guard.

Why Caching Matters (Beyond "It Makes Things Fast")

Before diving into mechanisms, let's be clear about what caching actually protects:

Database load — Every cache hit is a DB query that didn't happen
Latency — Memory reads are ~100x faster than a network round-trip to a DB
Cost — Fewer DB operations = smaller instance sizes = real money at scale
Resilience — A warm cache can serve traffic even when the DB is degraded

But caching introduces its own risks: stale data, cache stampedes, memory pressure, and invalidation bugs that surface as subtle data inconsistencies. Understanding these tradeoffs is what separates a senior engineer from someone who just adds Redis to every problem.

The Caching Layers

Modern systems have caching at multiple levels, and understanding each layer helps you place data in the right one.

Client Request
     ↓
[Browser Cache]        ← Layer 1: HTTP Cache-Control headers
     ↓
[CDN / Edge Cache]     ← Layer 2: Cloudflare, Fastly, CloudFront
     ↓
[API Gateway Cache]    ← Layer 3: Optional, for high-traffic APIs
     ↓
[Application Cache]    ← Layer 4: Redis, Memcached (your code controls this)
     ↓
[Database Buffer Pool] ← Layer 5: MySQL/Postgres keeps hot pages in memory
     ↓
[Disk]

Most teams operate actively at Layers 2 and 4. The decisions you make there have the biggest impact.

Redis vs Memcached — The Honest Comparison

Both are in-memory key-value stores. Most teams should just use Redis. Here's why:

Feature	Redis	Memcached
Data structures	Strings, hashes, lists, sets, sorted sets, streams	Strings only
Persistence	Optional (RDB snapshots, AOF logs)	None
Replication	Built-in primary/replica	None (third-party)
Clustering	Redis Cluster (built-in)	Client-side sharding
Pub/Sub	Yes	No
Lua scripting	Yes	No
Memory efficiency	Good	Slightly better for simple strings
Multithreading	Single-threaded command execution (optional I/O threads since Redis 6)	Multi-threaded

Use Memcached when: You have a very specific use case — pure string caching at enormous scale — and you've benchmarked that Memcached's multi-threaded architecture genuinely outperforms Redis for your workload. This is rare.

Use Redis for everything else. The richer data structures alone (sorted sets for leaderboards, streams for queues) make it the practical default.

Caching Strategies

Cache-Aside (Lazy Loading)

The most common pattern. Your application manages the cache explicitly.

from typing import Optional

def get_user(user_id: str) -> Optional[User]:
    key = f"user:{user_id}"

    # 1. Check cache (compare against None so falsy values still count as hits)
    cached = redis.get(key)
    if cached is not None:
        return User.from_json(cached)

    # 2. Cache miss: fetch from DB
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if user is None:
        return None  # Unknown user: nothing to cache

    # 3. Populate cache for next time
    redis.setex(key, 3600, user.to_json())  # TTL: 1 hour

    return user

Pros: Only caches data that's actually requested. Simple to reason about.
Cons: First request always hits the DB (cold cache). Race condition possible if multiple requests miss simultaneously.

Write-Through

Write to the cache and DB simultaneously on every write.

def update_user(user_id: str, data: dict) -> User:
    user = db.update("UPDATE users SET ... WHERE id = %s", user_id, data)
    redis.setex(f"user:{user_id}", 3600, user.to_json())  # Sync write to cache
    return user

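Pros: Every successful write refreshes the cache, so reads that follow a write see the new value. (If the cache write fails after the DB commit, the two can still drift, so keep a TTL as a safety net.)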
Cons: Write latency increases. Cache fills with data that might never be read.

Write-Behind (Write-Back)

Write to cache immediately, write to DB asynchronously.

Pros: Extremely fast writes.
Cons: Risk of data loss if cache fails before async write completes. Complex error handling. Use only when you fully understand the durability tradeoff.
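A rough sketch of the idea, assuming a plain dict stands in for the database and a background thread drains a queue of pending writes. `WriteBehindCache` and its methods are hypothetical names for this illustration, not a library API.

```python
import queue
import threading


class WriteBehindCache:
    """Writes land in the cache at once; a worker thread persists them later."""

    def __init__(self, db: dict) -> None:
        self.cache = {}
        self.db = db  # Stand-in for the real database
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._flush_loop, daemon=True)
        worker.start()

    def set(self, key: str, value: str) -> None:
        self.cache[key] = value          # Fast path: memory only
        self._pending.put((key, value))  # The durable write is deferred

    def get(self, key: str):
        return self.cache.get(key, self.db.get(key))

    def _flush_loop(self) -> None:
        while True:
            key, value = self._pending.get()
            try:
                self.db[key] = value     # A real system needs retries here
            finally:
                self._pending.task_done()

    def flush(self) -> None:
        """Block until every queued write has reached the DB."""
        self._pending.join()
```

Anything still sitting in the queue when the process dies is lost, which is exactly the durability tradeoff described above.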

Read-Through

The cache layer itself fetches from DB on a miss — your application always talks to the cache.

# Cache library handles DB fallback automatically
user = cache.get(f"user:{user_id}", loader=lambda: db.find_user(user_id))

Pros: Application code stays clean. Cache and DB logic are centralized.
Cons: Requires a cache library or proxy that supports this pattern.
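If no such library is at hand, the pattern is small enough to sketch yourself. The `ReadThroughCache` class and the `load_user` loader below are hypothetical, and TTL handling is left out to keep the example short.

```python
from typing import Callable, Optional


class ReadThroughCache:
    """Callers only talk to the cache; it calls the loader on a miss."""

    def __init__(self, loader: Callable[[str], Optional[str]]) -> None:
        self._loader = loader
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        if key in self._data:
            return self._data[key]
        value = self._loader(key)  # Miss: the cache itself goes to the source
        if value is not None:
            self._data[key] = value
        return value

    def invalidate(self, key: str) -> None:
        self._data.pop(key, None)


loads = []  # Records which keys reached the loader


def load_user(key: str) -> Optional[str]:
    loads.append(key)
    return {"user:1": "Ada"}.get(key)  # Stand-in for a DB lookup


users = ReadThroughCache(load_user)
```

The calling code never mentions the database: `users.get("user:1")` is the whole read path, and the loader runs only on a miss.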

Cache Invalidation — The Hard Part

There are three approaches, each with different tradeoffs:

1. TTL (Time-To-Live) — Simplest

Set an expiry time. The data becomes stale after that window.

redis.setex("product:456:price", 300, "29.99")  # Expires in 5 minutes

Works well for: Data that can tolerate slight staleness — product listings, user profile data, search results.
Fails for: Anything that needs immediate consistency after a write — account balances, inventory levels, permissions.

2. Event-Driven Invalidation — Most Correct

When data changes, explicitly invalidate or update the cached version.

def update_product_price(product_id: str, new_price: float):
    db.update("UPDATE products SET price = %s WHERE id = %s", new_price, product_id)
    redis.delete(f"product:{product_id}:price")  # Explicit invalidation
    # Or: redis.set(...) to update immediately rather than wait for next read

Works well for: Data that must be fresh after writes.
Fails for: Systems with complex invalidation logic across many cache keys — one update triggers a cascade of invalidations that's hard to track.
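One way to keep the cascade trackable is to put the "which keys depend on this entity" knowledge in a single function. The sketch below assumes a dict as the cache, and the key names and helper functions are hypothetical examples rather than a fixed scheme.

```python
def keys_for_product(product_id: str, category: str) -> list:
    """Every cache key that must go when this product changes."""
    return [
        f"product:{product_id}:price",
        f"product:{product_id}:detail",
        f"category:{category}:listing",
    ]


def invalidate_product(cache: dict, product_id: str, category: str) -> int:
    """Delete all dependent keys and report how many were actually present."""
    removed = 0
    for key in keys_for_product(product_id, category):
        if cache.pop(key, None) is not None:
            removed += 1
    return removed
```

When a new cached view of a product is added, its key goes into `keys_for_product`, and every write path that calls `invalidate_product` picks it up.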

3. Cache Tags / Dependency Tracking — Advanced

Group related cache entries under a tag. Invalidate the tag, and all entries under it expire.

# Pseudo-code: Redis has no built-in tags; some cache libraries implement them on top of it
cache.set("user:123:orders", data, tags=["user:123", "orders"])
cache.invalidate_tag("user:123")  # Clears user:123:orders and all other tagged entries

Works well for: Complex, nested data that comes from a single entity.
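Requires: A cache library or framework that supports this pattern (Symfony Cache, Laravel's cache tags, etc.). Django's built-in cache framework has no tag support, so there you would need a third-party package or your own tag index.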
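Under the hood, tagging usually comes down to a reverse index from tag to keys. Here is a minimal in-memory sketch of that idea; `TaggedCache` is a hypothetical class written for this post, not the API of any of the libraries named above.

```python
from collections import defaultdict


class TaggedCache:
    """In-memory cache with a tag -> keys index for group invalidation."""

    def __init__(self) -> None:
        self._values = {}
        self._keys_by_tag = defaultdict(set)

    def set(self, key, value, tags=()):
        self._values[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key):
        return self._values.get(key)

    def invalidate_tag(self, tag):
        # Drop every key recorded under the tag, then forget the tag itself.
        for key in self._keys_by_tag.pop(tag, set()):
            self._values.pop(key, None)
```

With Redis as the backing store, the same index is often kept as one Redis set per tag, but that is a design choice rather than something Redis does for you.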

The Cache Stampede Problem

Imagine 10,000 concurrent users hit your app. A popular cache key expires. All 10,000 requests miss the cache simultaneously and hammer your database at once.

This is a cache stampede (also called dogpiling). It can bring down a database that was otherwise healthy.

T=0: Cache key expires
T=0.001: 10,000 requests arrive, all miss cache
T=0.001: 10,000 DB queries fire simultaneously
T=0.5: Database CPU spikes to 100%
T=1.0: DB starts timing out requests
T=1.5: Your PagerDuty alert fires

Solutions:

Mutex / Locking — Only one request rebuilds the cache. Others wait.

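import time

def get_with_lock(key: str, loader_fn, max_attempts: int = 50):
    lock_key = f"lock:{key}"
    for _ in range(max_attempts):  # Loop instead of unbounded recursion
        value = redis.get(key)
        if value is not None:  # Empty strings are valid cached values
            return value

        # SET NX EX: only one caller gets the lock, and it self-expires after
        # 10s so a crashed holder cannot block rebuilds forever.
        if redis.set(lock_key, "1", nx=True, ex=10):
            try:
                value = loader_fn()
                redis.setex(key, 3600, value)
                return value
            finally:
                # Caveat: if loader_fn outlives the 10s lock TTL, this may
                # delete a lock that another caller has since acquired.
                redis.delete(lock_key)

        time.sleep(0.1)  # Another request is rebuilding: wait, then re-check

    raise TimeoutError(f"Timed out waiting for cache key {key!r}")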

Probabilistic Early Expiration — Start refreshing the cache before it expires, with a small random probability as TTL approaches.
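One common formulation of this idea (often called XFetch) gives each request a random "head start" that scales with how long the value takes to rebuild. The sketch below assumes you store the expiry timestamp and the last rebuild duration next to the cached value; the function name and parameters are hypothetical.

```python
import math
import random
import time
from typing import Callable, Optional


def should_refresh_early(
    expires_at: float,
    rebuild_seconds: float,
    beta: float = 1.0,
    now: Optional[float] = None,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Return True if this request should rebuild the value before it expires.

    expires_at:      Unix timestamp at which the cached value expires.
    rebuild_seconds: how long the last rebuild took.
    beta:            values above 1 refresh earlier, below 1 later.
    """
    if now is None:
        now = time.time()
    # 1 - rng() lies in (0, 1], so log() is defined and never positive.
    head_start = -rebuild_seconds * beta * math.log(1.0 - rng())
    return now + head_start >= expires_at
```

Far from expiry the function almost always returns False; as the expiry approaches, a growing share of requests return True, so one of them tends to rebuild the value before everyone misses at once.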

Stale-While-Revalidate — Serve the stale value immediately, refresh in the background. The user gets a fast (slightly stale) response while the next request will get fresh data.
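A minimal in-process sketch of stale-while-revalidate, assuming a thread per background refresh is acceptable. `SWRCache` and its `loader` callback are hypothetical names for this illustration.

```python
import threading
import time


class SWRCache:
    """Serve cached values at once; refresh stale ones in the background."""

    def __init__(self, loader, fresh_seconds: float) -> None:
        self._loader = loader
        self._fresh_seconds = fresh_seconds
        self._data = {}           # key -> (value, fetched_at)
        self._refreshing = set()  # Keys with a refresh in flight
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._data.get(key)
        if entry is None:
            return self._refresh(key)         # Nothing to serve: load inline
        value, fetched_at = entry
        if time.monotonic() - fetched_at > self._fresh_seconds:
            self._refresh_in_background(key)  # Stale: serve it, refresh async
        return value

    def _refresh(self, key):
        try:
            value = self._loader(key)
            with self._lock:
                self._data[key] = (value, time.monotonic())
            return value
        finally:
            with self._lock:
                self._refreshing.discard(key)

    def _refresh_in_background(self, key) -> None:
        with self._lock:
            if key in self._refreshing:
                return                        # One refresh per key at a time
            self._refreshing.add(key)
        threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
```

Only the very first request for a key waits on the loader. After that, callers get whatever is cached, and at most one background refresh runs per key.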

CDN Caching — Don't Forget the Edge

For static assets, API responses, and server-rendered pages, CDN-level caching is often more impactful than application caching.

# Response headers that control CDN behavior
Cache-Control: public, max-age=3600, s-maxage=86400
# public = CDN can cache this
# max-age = browser TTL (1 hour)
# s-maxage = CDN TTL (1 day)

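Cache-Control: private, no-store
# private = only the end user's browser may store this; shared caches (CDNs, proxies) must not
# no-store = no cache may store the response at all (for sensitive data); when both are present, no-store is the stricter rule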

Surrogate-Key: product-456 category-shoes
# Fastly/Varnish: tag-based purging at the CDN edge

Cache-busting for static assets: Use content hashes in filenames so you can set long TTLs without worrying about stale JS/CSS.

# Build output
app.js  → app.a3f9c2d1.js   ← Hash changes when content changes
app.css → app.b8e4d6a2.css
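Bundlers normally generate these names for you, but the mechanism is simple enough to sketch. The helper below is a hypothetical illustration that assumes SHA-256 and an 8-character prefix; real build tools pick their own hash and length.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_filename(name: str, content: bytes, digest_chars: int = 8) -> str:
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    path = PurePosixPath(name)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))
```

The same bytes always produce the same name, and any change to the content produces a new one, which is what makes a long `max-age` safe for these files.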

What NOT to Cache

Caching everything is an anti-pattern. Some things should never be cached:

User-specific sensitive data (auth tokens, payment info) — unless isolated per-user with short TTLs
Write-heavy data — cache churn (constant invalidations) adds overhead with no benefit
Uniqueness checks — "is this username taken?" must always hit the source of truth
Random or time-sensitive outputs — NOW(), UUID(), anything that must be unique per request

Quick Reference: Eviction Policies

When Redis reaches its configured maxmemory limit, it evicts keys according to the maxmemory-policy setting:

Policy	Behavior	Use When
`noeviction`	Returns error on write when full	You need strict control
`allkeys-lru`	Evicts least recently used keys	General-purpose cache
`volatile-lru`	LRU eviction only for keys with TTL	You have a mix of TTL and permanent keys
`allkeys-lfu`	Evicts least frequently used (Redis 4+)	Access patterns are skewed
`volatile-ttl`	Evicts keys closest to expiry	You want to preserve recently-refreshed data

For a pure cache workload, allkeys-lru or allkeys-lfu is usually the right default.
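To make the LRU idea concrete, here is a small exact-LRU sketch built on `OrderedDict`. `LRUCache` is a hypothetical class for illustration only; Redis itself approximates LRU by sampling keys rather than keeping an exact ordering like this.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # Mark as most recently used
        return self._data[key]

    def set(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # Evict the least recently used key
```

With a capacity of 2, writing `a` and `b`, reading `a`, then writing `c` evicts `b`, because `a` was touched more recently.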

Key Takeaways

Redis is the practical default — richer data structures, replication, and pub/sub make it worth the marginal overhead over Memcached
TTL-based expiration is simple and works well for data that tolerates some staleness
Event-driven invalidation is correct but requires discipline to maintain as systems evolve
Cache stampedes are real — use locks, early expiration, or stale-while-revalidate for high-traffic keys
CDN caching is often more impactful than application-level caching for read-heavy, public data
Don't cache everything — cache what's expensive to recompute and safe to serve slightly stale

Have you been bitten by a cache invalidation bug in production? What was the data inconsistency and how long did it take to find it?

Those are the stories the comments were made for.

Next in the series → Post 03: Auth Is Not Security — A Guide for Teams Who Ship Fast

You've cached your data efficiently. Now: who's allowed to see it?
