Cache Invalidation: The Problem That Humbles Every Engineer

Series: Backend Engineering Fundamentals · Post 02 of 07 Level: Intermediate · Read time: ~9 min
Phil Karlton famously said there are only two hard problems in computer science: cache invalidation and naming things.
He was joking. But he wasn't wrong.
Caching seems simple. You store a result and serve the stored version next time. The hard part isn't storing data. It's knowing when the stored version is no longer valid, and handling that correctly at scale without bringing your database to its knees in the process.
This post covers the caching concepts that matter in production: where to cache, what to cache, how to invalidate it, and the failure modes that catch teams off guard.
Why Caching Matters (Beyond "It Makes Things Fast")
Before diving into mechanisms, let's be clear about what caching actually protects:
- Database load — Every cache hit is a DB query that didn't happen
- Latency — Memory reads are ~100x faster than a network round-trip to a DB
- Cost — Fewer DB operations = smaller instance sizes = real money at scale
- Resilience — A warm cache can serve traffic even when the DB is degraded
But caching introduces its own risks: stale data, cache stampedes, memory pressure, and invalidation bugs that surface as subtle data inconsistencies. Understanding these tradeoffs is what separates a senior engineer from someone who just adds Redis to every problem.
The Caching Layers
Modern systems have caching at multiple levels, and understanding each layer helps you place data in the right one.
Client Request
↓
[Browser Cache] ← Layer 1: HTTP Cache-Control headers
↓
[CDN / Edge Cache] ← Layer 2: Cloudflare, Fastly, CloudFront
↓
[API Gateway Cache] ← Layer 3: Optional, for high-traffic APIs
↓
[Application Cache] ← Layer 4: Redis, Memcached (your code controls this)
↓
[Database Buffer Pool] ← Layer 5: MySQL/Postgres keeps hot pages in memory
↓
[Disk]
Most teams operate actively at Layers 2 and 4. The decisions you make there have the biggest impact.
Redis vs Memcached — The Honest Comparison
Both are in-memory key-value stores. Most teams should just use Redis. Here's why:
| Feature | Redis | Memcached |
|---|---|---|
| Data structures | Strings, hashes, lists, sets, sorted sets, streams | Strings only |
| Persistence | Optional (RDB snapshots, AOF logs) | None |
| Replication | Built-in primary/replica | None (third-party) |
| Clustering | Redis Cluster (built-in) | Client-side sharding |
| Pub/Sub | Yes | No |
| Lua scripting | Yes | No |
| Memory efficiency | Good | Slightly better for simple strings |
| Multithreading | Single-threaded (I/O event loop) | Multi-threaded |
Use Memcached when: You have a very specific use case — pure string caching at enormous scale — and you've benchmarked that Memcached's multi-threaded architecture genuinely outperforms Redis for your workload. This is rare.
Use Redis for everything else. The richer data structures alone (sorted sets for leaderboards, streams for queues) make it the practical default.
Caching Strategies
Cache-Aside (Lazy Loading)
The most common pattern. Your application manages the cache explicitly.
def get_user(user_id: str) -> User:
# 1. Check cache
cached = redis.get(f"user:{user_id}")
if cached:
return User.from_json(cached)
# 2. Cache miss — fetch from DB
user = db.query("SELECT * FROM users WHERE id = %s", user_id)
# 3. Populate cache for next time
redis.setex(f"user:{user_id}", 3600, user.to_json()) # TTL: 1 hour
return user
Pros: Only caches data that's actually requested. Simple to reason about.
Cons: First request always hits the DB (cold cache). Race condition possible if multiple requests miss simultaneously.
Write-Through
Write to the cache and DB simultaneously on every write.
def update_user(user_id: str, data: dict) -> User:
user = db.update("UPDATE users SET ... WHERE id = %s", user_id, data)
redis.setex(f"user:{user_id}", 3600, user.to_json()) # Sync write to cache
return user
Pros: Cache is always consistent with DB. No stale reads after writes.
Cons: Write latency increases. Cache fills with data that might never be read.
Write-Behind (Write-Back)
Write to cache immediately, write to DB asynchronously.
Pros: Extremely fast writes.
Cons: Risk of data loss if cache fails before async write completes. Complex error handling. Use only when you fully understand the durability tradeoff.
Read-Through
The cache layer itself fetches from DB on a miss — your application always talks to the cache.
# Cache library handles DB fallback automatically
user = cache.get(f"user:{user_id}", loader=lambda: db.find_user(user_id))
Pros: Application code stays clean. Cache and DB logic are centralized.
Cons: Requires a cache library or proxy that supports this pattern.
Cache Invalidation — The Hard Part
There are three approaches, each with different tradeoffs:
1. TTL (Time-To-Live) — Simplest
Set an expiry time. The data becomes stale after that window.
redis.setex("product:456:price", 300, "29.99") # Expires in 5 minutes
Works well for: Data that can tolerate slight staleness — product listings, user profile data, search results.
Fails for: Anything that needs immediate consistency after a write — account balances, inventory levels, permissions.
2. Event-Driven Invalidation — Most Correct
When data changes, explicitly invalidate or update the cached version.
def update_product_price(product_id: str, new_price: float):
db.update("UPDATE products SET price = %s WHERE id = %s", new_price, product_id)
redis.delete(f"product:{product_id}:price") # Explicit invalidation
# Or: redis.set(...) to update immediately rather than wait for next read
Works well for: Data that must be fresh after writes.
Fails for: Systems with complex invalidation logic across many cache keys — one update triggers a cascade of invalidations that's hard to track.
3. Cache Tags / Dependency Tracking — Advanced
Group related cache entries under a tag. Invalidate the tag, and all entries under it expire.
# Pseudo-code — some Redis libraries support this natively
cache.set("user:123:orders", data, tags=["user:123", "orders"])
cache.invalidate_tag("user:123") # Clears user:123:orders and all other tagged entries
Works well for: Complex, nested data that comes from a single entity.
Requires: A cache library or framework that supports this pattern (Symfony Cache, Django's cache framework, etc.)
The Cache Stampede Problem
Imagine 10,000 concurrent users hit your app. A popular cache key expires. All 10,000 requests miss the cache simultaneously and hammer your database at once.
This is a cache stampede (also called dogpiling). It can bring down a database that was otherwise healthy.
T=0: Cache key expires
T=0.001: 10,000 requests arrive, all miss cache
T=0.001: 10,000 DB queries fire simultaneously
T=0.5: Database CPU spikes to 100%
T=1.0: DB starts timing out requests
T=1.5: Your PagerDuty alert fires
Solutions:
Mutex / Locking — Only one request rebuilds the cache. Others wait.
def get_with_lock(key: str, loader_fn):
value = redis.get(key)
if value:
return value
lock_key = f"lock:{key}"
if redis.set(lock_key, "1", nx=True, ex=10): # Acquire lock
try:
value = loader_fn()
redis.setex(key, 3600, value)
return value
finally:
redis.delete(lock_key)
else:
time.sleep(0.1) # Wait and retry
return get_with_lock(key, loader_fn)
Probabilistic Early Expiration — Start refreshing the cache before it expires, with a small random probability as TTL approaches.
Stale-While-Revalidate — Serve the stale value immediately, refresh in the background. The user gets a fast (slightly stale) response while the next request will get fresh data.
CDN Caching — Don't Forget the Edge
For static assets, API responses, and server-rendered pages, CDN-level caching is often more impactful than application caching.
# Response headers that control CDN behavior
Cache-Control: public, max-age=3600, s-maxage=86400
# public = CDN can cache this
# max-age = browser TTL (1 hour)
# s-maxage = CDN TTL (1 day)
Cache-Control: private, no-store
# private = only the browser caches this, not CDNs
# no-store = don't cache anywhere (for sensitive data)
Surrogate-Key: product-456 category-shoes
# Fastly/Varnish: tag-based purging at the CDN edge
Cache-busting for static assets: Use content hashes in filenames so you can set long TTLs without worrying about stale JS/CSS.
# Build output
app.a3f9c2d1.js ← Hash changes when content changes
app.css → app.b8e4d6a2.css
What NOT to Cache
Caching everything is an anti-pattern. Some things should never be cached:
- User-specific sensitive data (auth tokens, payment info) — unless isolated per-user with short TTLs
- Write-heavy data — cache churn (constant invalidations) adds overhead with no benefit
- Uniqueness checks — "is this username taken?" must always hit the source of truth
- Random or time-sensitive outputs —
NOW(),UUID(), anything that must be unique per request
Quick Reference: Eviction Policies
When Redis runs out of memory, it evicts keys based on its configured policy:
| Policy | Behavior | Use When |
|---|---|---|
noeviction |
Returns error on write when full | You need strict control |
allkeys-lru |
Evicts least recently used keys | General-purpose cache |
volatile-lru |
LRU eviction only for keys with TTL | You have a mix of TTL and permanent keys |
allkeys-lfu |
Evicts least frequently used (Redis 4+) | Access patterns are skewed |
volatile-ttl |
Evicts keys closest to expiry | You want to preserve recently-refreshed data |
For a pure cache workload, allkeys-lru or allkeys-lfu are usually the right defaults.
Key Takeaways
- Redis is the practical default — richer data structures, replication, and pub/sub make it worth the marginal overhead over Memcached
- TTL-based expiration is simple and works well for data that tolerates some staleness
- Event-driven invalidation is correct but requires discipline to maintain as systems evolve
- Cache stampedes are real — use locks, early expiration, or stale-while-revalidate for high-traffic keys
- CDN caching is often more impactful than application-level caching for read-heavy, public data
- Don't cache everything — cache what's expensive to recompute and safe to serve slightly stale
Have you been bitten by a cache invalidation bug in production? What was the data inconsistency and how long did it take to find it?
Those are the stories the comments were made for.
Next in the series → Post 03: Auth Is Not Security — A Guide for Teams Who Ship Fast
You've cached your data efficiently. Now: who's allowed to see it?




