<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ajitabh Singh]]></title><description><![CDATA[Ajitabh Singh]]></description><link>https://ajitabh.net</link><generator>RSS for Node</generator><lastBuildDate>Fri, 10 Apr 2026 06:59:36 GMT</lastBuildDate><atom:link href="https://ajitabh.net/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[You Can't Manage What You Can't See: The Three Pillars of Observability]]></title><description><![CDATA[Series: Backend Engineering Fundamentals · Post 07 of 07
Level: Intermediate · Read time: ~9 min


It's 2am. An alert fires. Your service is down.
You open your dashboard: CPU is fine, memory is fine.]]></description><link>https://ajitabh.net/you-cant-manage-what-you-cant-see-three-pillars-of-observability</link><guid isPermaLink="true">https://ajitabh.net/you-cant-manage-what-you-cant-see-three-pillars-of-observability</guid><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[#prometheus]]></category><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[Backend Engineering]]></category><category><![CDATA[distributed tracing]]></category><category><![CDATA[logging]]></category><category><![CDATA[System Design]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Ajitabh Singh]]></dc:creator><pubDate>Thu, 26 Mar 2026 17:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c0b381d9da55a9a5203e04/d784a5ef-db44-4648-9a61-4e99d6f2744a.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>Series:</strong> Backend Engineering Fundamentals · Post 07 of 07
<strong>Level:</strong> Intermediate · <strong>Read time:</strong> ~9 min</p>
</blockquote>
<hr />
<p>It's 2am. An alert fires. Your service is down.</p>
<p>You open your dashboard: CPU is fine, memory is fine. You check the logs: thousands of lines of <code>INFO</code> messages and a few <code>ERROR</code> lines with stack traces — none of them obviously the root cause. You open your tracing tool and realize you only instrumented the main API service, not the three downstream services it calls.</p>
<p>Forty-five minutes later, you find it: a database connection pool exhausted in a service that has no alerting on it, caused by a slow query introduced in yesterday's deployment. You had no visibility into that service, no alert on the metric that would have told you, and no trace that showed where the latency was accumulating.</p>
<p>This is what poor observability looks like. You're debugging in the dark.</p>
<p>Observability isn't a feature you add later. It's the practice of making your system <strong>understandable from the outside</strong> — so that when something goes wrong, you can ask questions of your system and get useful answers.</p>
<hr />
<h2>The Three Pillars</h2>
<p>Observability is commonly structured around three signal types:</p>
<table>
<thead>
<tr>
<th>Pillar</th>
<th>Answers</th>
<th>Examples</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Logs</strong></td>
<td>What happened?</td>
<td>Error traces, audit events, debug output</td>
</tr>
<tr>
<td><strong>Metrics</strong></td>
<td>How is it behaving over time?</td>
<td>Request rate, error %, CPU, latency histograms</td>
</tr>
<tr>
<td><strong>Traces</strong></td>
<td>Where did this request go and how long did each step take?</td>
<td>Distributed request spans across services</td>
</tr>
</tbody></table>
<p>Each pillar answers different questions. All three are needed for a complete picture.</p>
<hr />
<h2>Pillar 1: Logs</h2>
<p>Logs are the most familiar observability tool — and the most frequently done wrong.</p>
<h3>Structured Logging</h3>
<p>The difference between logs that help you and logs that don't is <strong>structure</strong>.</p>
<pre><code class="language-python"># ❌ Unstructured — fast to write, painful to query at scale
print(f"Processing order {order_id} for user {user_id} failed: {error}")
# Output: "Processing order 789 for user 123 failed: Connection timeout"

# ✅ Structured — queryable, filterable, alertable
import structlog

logger = structlog.get_logger()
logger.error(
    "order_processing_failed",
    order_id=order_id,
    user_id=user_id,
    error_type="ConnectionTimeout",
    service="payment-service",
    duration_ms=3240,
    retry_count=3
)
# Output: {"event": "order_processing_failed", "order_id": "789", "user_id": "123",
#          "error_type": "ConnectionTimeout", "duration_ms": 3240, ...}
</code></pre>
<p>With structured logs, you can query: <em>show me all orders that failed with ConnectionTimeout in the last hour where retry_count &gt; 2</em>. With unstructured logs, you're writing regex.</p>
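<p>As a toy illustration (real systems run this query in the log aggregator, not in application code), that query becomes a simple filter over parsed JSON instead of a regex:</p>
<pre><code class="language-python">import json

lines = [
    '{"event": "order_processing_failed", "error_type": "ConnectionTimeout", "retry_count": 3, "order_id": "789"}',
    '{"event": "order_processing_failed", "error_type": "ValidationError", "retry_count": 0, "order_id": "790"}',
]

# "All orders that failed with ConnectionTimeout where retry_count > 2"
matches = [
    log for log in map(json.loads, lines)
    if log["event"] == "order_processing_failed"
    and log["error_type"] == "ConnectionTimeout"
    and log["retry_count"] > 2
]
# matches contains only order 789
</code></pre>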
<h3>Log Levels — Use Them Correctly</h3>
<pre><code class="language-python">logger.debug(...)    # Detailed diagnostic, disabled in production
logger.info(...)     # Normal operation milestones (request received, order placed)
logger.warning(...)  # Unexpected but handled (retried 2x, using fallback)
logger.error(...)    # Failed operation, requires attention
logger.critical(...) # System cannot continue, immediate action required
</code></pre>
<p><strong>Common mistake:</strong> Using <code>INFO</code> for everything. At scale, an INFO log for every request is millions of log entries per hour — expensive to store and slow to search. Log meaningful state changes and errors, not every heartbeat.</p>
<h3>Correlation IDs</h3>
<p>In a distributed system, a single user action triggers logs across multiple services. Without a correlation ID, you can't connect them.</p>
<pre><code class="language-python">import uuid
from contextvars import ContextVar

request_id: ContextVar[str] = ContextVar('request_id', default='')

# Middleware: Generate ID at the edge
@app.middleware("http")
async def add_correlation_id(request: Request, call_next):
    req_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    request_id.set(req_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = req_id
    return response

# Every log includes it automatically
logger.info("payment_initiated", request_id=request_id.get(), ...)

# Pass it downstream
httpx.post(payment_service_url, headers={"X-Request-ID": request_id.get()})
</code></pre>
<p>Now you can search your log aggregator for a single <code>request_id</code> and see every log line across every service for that user's request.</p>
<hr />
<h2>Pillar 2: Metrics</h2>
<p>Metrics are <strong>aggregated numerical measurements over time</strong>. They answer: is the system healthy right now, and how does that compare to last week?</p>
<h3>The Four Golden Signals</h3>
<p>Google's SRE book identified four metrics that, together, give you a complete picture of service health:</p>
<pre><code>1. Latency      — How long are requests taking?
                  (Distinguish: successful requests vs error requests)

2. Traffic      — How many requests per second?
                  (Understand normal baselines)

3. Errors       — What percentage of requests are failing?
                  (Both 5xx errors and application-level failures)

4. Saturation   — How "full" is the service?
                  (CPU, memory, queue depth, connection pool usage)
</code></pre>
<p><strong>If you only instrument one thing, instrument these four for every service.</strong></p>
<h3>Prometheus — The De-Facto Standard</h3>
<pre><code class="language-python">from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Counter: always increasing (requests, errors)
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])

# Histogram: distribution of values (latency, request size)
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'Request latency',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]  # Bucket boundaries in seconds
)

# Gauge: current value (active connections, queue depth)
ACTIVE_CONNECTIONS = Gauge('db_active_connections', 'Active database connections')

# Instrument your endpoints
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    
    REQUEST_LATENCY.labels(endpoint=request.url.path).observe(duration)
    
    return response
</code></pre>
<h3>Percentiles Beat Averages</h3>
<p>Average latency hides your users' actual experience. If 95% of requests are fast but 5% are extremely slow, the average looks fine while 5% of your users are having a terrible time.</p>
<pre><code>Average latency: 120ms  ← Looks fine
P50 (median):   80ms    ← Most users are fine
P95:            450ms   ← 5% of users waiting 450ms
P99:            2,100ms ← 1% of users waiting 2+ seconds
</code></pre>
<p>Always alert on <strong>P95 and P99 latency</strong>, not averages.</p>
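<p>The gap is easy to reproduce with synthetic latencies (a sketch using Python's <code>statistics</code> module):</p>
<pre><code class="language-python">import statistics

# 95 fast requests, 5 very slow ones (milliseconds)
latencies = [80] * 95 + [2000] * 5

avg = statistics.mean(latencies)               # 176 (the average looks fine)
cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]   # p50 is 80; p95 and p99 are in the thousands

# The average sits near the fast majority while 5% of users wait 2 seconds.
</code></pre>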
<hr />
<h2>Pillar 3: Distributed Tracing</h2>
<p>Logs tell you what happened. Metrics tell you how things are trending. Traces tell you <strong>where time is actually going</strong> across your services for a specific request.</p>
<pre><code>Trace for request_id: abc-123 (total: 1,240ms)

├── API Gateway                              [10ms]
│   └── OrderService.createOrder()          [1,210ms]
│       ├── validateUser() → UserService     [15ms]
│       ├── checkInventory() → InventoryService [45ms]
│       ├── processPayment() → PaymentService  [980ms]  ← HERE
│       │   ├── validateCard()               [12ms]
│       │   ├── chargeCard() → Stripe API    [952ms]  ← External call slow
│       │   └── recordTransaction() → DB     [16ms]
│       └── sendNotification() → EmailService [35ms]
</code></pre>
<p>Without tracing, you'd see "order creation is slow (1,240ms)" in your metrics. With tracing, you see "Stripe API is taking 952ms." Two very different problems to solve.</p>
<h3>OpenTelemetry — The Standard</h3>
<p>OpenTelemetry (OTel) is the vendor-neutral instrumentation standard for traces, metrics, and logs.</p>
<pre><code class="language-python">from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup (once at application start)
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Instrument a function
def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)
        
        try:
            result = stripe.charge(amount)
            span.set_attribute("stripe.charge_id", result.id)
            return result
        except StripeError as e:
            span.record_exception(e)
            span.set_status(StatusCode.ERROR, str(e))
            raise
</code></pre>
<p>OTel data can be exported to Jaeger, Zipkin, Datadog, Honeycomb, Grafana Tempo — your choice of backend.</p>
<hr />
<h2>SLOs, SLAs, and Error Budgets</h2>
<p>Metrics are more useful when tied to explicit reliability targets.</p>
<ul>
<li><strong>SLA (Service Level Agreement):</strong> A <em>contract</em> with users or customers. "We guarantee 99.9% uptime." Breaking this has business consequences.</li>
<li><strong>SLO (Service Level Objective):</strong> An internal <em>target</em> for reliability. "We aim for P99 latency &lt; 500ms, measured over 30 days."</li>
<li><strong>Error Budget:</strong> The amount of unreliability you're <em>allowed</em> before you break your SLO.</li>
</ul>
<pre><code>SLO: 99.9% availability over 30 days

Total minutes in 30 days: 43,200
Allowed downtime (0.1%): 43.2 minutes

Error budget: 43.2 minutes

If you've used 40 minutes this month:
→ Feature freezes, focus on reliability
→ Any risky deployments wait until next month's budget resets

If you've used 5 minutes:
→ You have headroom for risky changes, experiments
</code></pre>
<p>Error budgets create a shared language between engineering and product: reliability isn't free, it consumes budget, and you have to choose how to spend it.</p>
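<p>The arithmetic above generalizes to a one-liner (a hypothetical helper, not a standard API):</p>
<pre><code class="language-python">def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

error_budget_minutes(0.999)   # 43.2 minutes, matching the example above
error_budget_minutes(0.9999)  # 4.32 minutes (each extra nine is 10x harder)
</code></pre>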
<hr />
<h2>Alerting — Noise Is the Enemy</h2>
<p>A team that receives 50 alerts per day learns to ignore alerts. The goal is <strong>high signal, low noise</strong>.</p>
<pre><code class="language-yaml"># ❌ Alert on symptoms, not causes — creates noise
- alert: CpuHigh
  expr: cpu_usage &gt; 80
  for: 5m
  # High CPU can be the cause of 100 different problems — what do you do with this?

# ✅ Alert on user-visible impact — actionable
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) &gt; 0.05
  for: 2m
  annotations:
    summary: "Error rate above 5% for 2 minutes"
    runbook: "https://wiki/runbooks/high-error-rate"

- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) &gt; 2
  for: 3m
  annotations:
    summary: "P99 latency above 2 seconds"
</code></pre>
<p><strong>Runbooks matter:</strong> Every alert should link to a runbook — a documented set of steps for investigating and resolving that specific alert. Runbooks reduce MTTR (mean time to resolve) and mean the on-call engineer doesn't have to improvise at 2am.</p>
<hr />
<h2>Observability Stack — Common Combinations</h2>
<table>
<thead>
<tr>
<th>Stack</th>
<th>Use Case</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Prometheus + Grafana + Jaeger</strong></td>
<td>Open source, self-hosted, full control</td>
</tr>
<tr>
<td><strong>Datadog</strong></td>
<td>Managed, all-in-one, expensive but powerful</td>
</tr>
<tr>
<td><strong>Grafana Cloud (Loki + Tempo + Mimir)</strong></td>
<td>Open source stack, managed hosting</td>
</tr>
<tr>
<td><strong>AWS CloudWatch</strong></td>
<td>Good enough for AWS-native teams; be aware of the vendor lock-in</td>
</tr>
<tr>
<td><strong>Honeycomb</strong></td>
<td>Best-in-class for traces and exploratory analysis</td>
</tr>
<tr>
<td><strong>OpenTelemetry → Any backend</strong></td>
<td>Instrument once, switch backends freely</td>
</tr>
</tbody></table>
<blockquote>
<p>💡 <strong>Start with OpenTelemetry instrumentation</strong> regardless of which backend you choose. OTel is vendor-neutral — instrument your code with OTel, export to wherever makes sense today, and migrate backends later without changing application code.</p>
</blockquote>
<hr />
<h2>Key Takeaways</h2>
<ul>
<li><strong>Logs, metrics, and traces</strong> each answer different questions — you need all three</li>
<li><strong>Structured logging</strong> (JSON, key-value) makes logs queryable; unstructured logs don't scale</li>
<li><strong>Correlation IDs</strong> connect a user action across every service in your system</li>
<li><strong>The four golden signals</strong> (latency, traffic, errors, saturation) are the minimum metrics for every service</li>
<li><strong>Alert on P95/P99 latency</strong>, not averages — averages hide the tail experience</li>
<li><strong>Distributed tracing</strong> shows you where time actually goes across service boundaries</li>
<li><strong>OpenTelemetry</strong> is the vendor-neutral standard — instrument with it, export anywhere</li>
<li><strong>SLOs and error budgets</strong> give reliability a shared language across engineering and product</li>
<li><strong>Alert on user-visible symptoms</strong> — too many alerts = all alerts get ignored</li>
</ul>
<hr />
<p><strong>What's the metric or log line you wish you'd added <em>before</em> your first major incident? What would have cut your MTTR in half?</strong></p>
<hr />
<h2>Wrapping Up the Series</h2>
<p>This was Post 7 of 7 in <strong>Backend Engineering Fundamentals</strong>. Here's where we've been:</p>
<ol>
<li><strong>APIs</strong> — Choosing the right communication paradigm</li>
<li><strong>Caching</strong> — What to cache, how to invalidate, what can go wrong</li>
<li><strong>Security</strong> — Auth patterns and the vulnerabilities that actually cause breaches</li>
<li><strong>Databases</strong> — Access patterns, CAP theorem, when to use what</li>
<li><strong>Message Queues</strong> — Decoupling services with events</li>
<li><strong>Scalability</strong> — Scaling strategies before and after you need them</li>
<li><strong>Observability</strong> — Making your system understandable from the outside <em>(you are here)</em></li>
</ol>
<hr />
<p><em>If this series was useful, share it with your team or anyone who'd benefit. And if there's a topic you'd like covered next — drop it in the comments.</em></p>
]]></content:encoded></item><item><title><![CDATA[Scaling: Before You Buy More Servers, Read This]]></title><description><![CDATA[Series: Backend Engineering Fundamentals · Post 06 of 07
Level: Beginner-friendly · Read time: ~8 min


"We need to scale" is one of the most expensive sentences in engineering.
It triggers infrastruc]]></description><link>https://ajitabh.net/scaling-before-you-buy-more-servers-read-this</link><guid isPermaLink="true">https://ajitabh.net/scaling-before-you-buy-more-servers-read-this</guid><category><![CDATA[scalability]]></category><category><![CDATA[Load Balancing]]></category><category><![CDATA[Backend Engineering]]></category><category><![CDATA[System Design]]></category><category><![CDATA[Redis]]></category><category><![CDATA[horizontal scaling]]></category><category><![CDATA[auto scaling]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Ajitabh Singh]]></dc:creator><pubDate>Thu, 26 Mar 2026 16:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c0b381d9da55a9a5203e04/70c7141b-81ca-4b2a-bb57-96fa13f09005.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>Series:</strong> Backend Engineering Fundamentals · Post 06 of 07
<strong>Level:</strong> Beginner-friendly · <strong>Read time:</strong> ~8 min</p>
</blockquote>
<hr />
<p>"We need to scale" is one of the most expensive sentences in engineering.</p>
<p>It triggers infrastructure discussions, migration projects, and architectural rewrites — often before anyone has looked at whether the current system is actually running at capacity.</p>
<p>Before scaling your infrastructure, understand what you're actually scaling <em>for</em>. Most systems that feel slow are bottlenecked by code problems (N+1 queries, missing indexes, synchronous calls that should be async) — not infrastructure capacity. Scaling a slow system gives you a more expensive slow system.</p>
<p>This post covers the actual mechanics of scaling, the tradeoffs between approaches, and how to think about it before opening a cloud console.</p>
<hr />
<h2>Vertical vs Horizontal Scaling</h2>
<p><strong>Vertical scaling (Scale Up):</strong> Add more resources to existing servers — bigger CPU, more RAM, faster disk.</p>
<p><strong>Horizontal scaling (Scale Out):</strong> Add more servers and distribute the load across them.</p>
<pre><code>Vertical Scaling                    Horizontal Scaling

[Server: 8 CPU, 32GB]      →       [Server: 4 CPU, 16GB] ×3
         ↓                                   ↓
[Server: 32 CPU, 128GB]             [Server: 4 CPU, 16GB] ×10
(one big machine)                   (many smaller machines)
</code></pre>
<table>
<thead>
<tr>
<th></th>
<th>Vertical</th>
<th>Horizontal</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Simplicity</strong></td>
<td>Simple — no code changes</td>
<td>Complex — requires stateless design</td>
</tr>
<tr>
<td><strong>Cost</strong></td>
<td>Expensive at high end (premium hardware)</td>
<td>Cheaper per unit at scale</td>
</tr>
<tr>
<td><strong>Failure impact</strong></td>
<td>Single point of failure</td>
<td>Redundant — one server failure is minor</td>
</tr>
<tr>
<td><strong>Ceiling</strong></td>
<td>Hard limit on available hardware</td>
<td>Theoretically unlimited</td>
</tr>
<tr>
<td><strong>Database</strong></td>
<td>Works well (most DBs scale vertically first)</td>
<td>Sharding required for DBs</td>
</tr>
</tbody></table>
<p><strong>In practice:</strong> Start with vertical scaling. It's simpler, faster, and often sufficient. Switch to horizontal when you hit the vertical ceiling or need high availability.</p>
<hr />
<h2>The Stateless Requirement for Horizontal Scaling</h2>
<p>Horizontal scaling only works if your application is <strong>stateless</strong> — each request can be handled by any server, with no local state that makes one server "special."</p>
<pre><code>❌ Stateful — Can't Scale Horizontally

Server 1: User session in memory → [Request for user A] works
Server 2: No session for user A  → [Request for user A] fails

✅ Stateless — Scales Horizontally

Server 1: No local state → reads session from Redis
Server 2: No local state → reads session from Redis
Server 3: No local state → reads session from Redis

Any server can handle any request.
Load balancer distributes freely.
</code></pre>
<p><strong>The rule:</strong> Move all state out of your application servers and into shared storage (Redis for sessions, S3 for files, your database for persistent data). Your servers should be interchangeable.</p>
<pre><code class="language-python"># ❌ Stateful — in-memory session
app.sessions[user_id] = {"cart": items}  # Lives on one server only

# ✅ Stateless — session in Redis
redis.setex(f"session:{session_id}", 3600, json.dumps({"cart": items}))
</code></pre>
<hr />
<h2>Load Balancers — The Front Door to Your Scaled System</h2>
<p>A load balancer distributes incoming requests across your pool of servers.</p>
<pre><code>Internet
   ↓
[Load Balancer]
   ├── Server 1
   ├── Server 2
   └── Server 3
</code></pre>
<p><strong>Load balancing algorithms:</strong></p>
<table>
<thead>
<tr>
<th>Algorithm</th>
<th>How it works</th>
<th>Use when</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Round Robin</strong></td>
<td>Requests distributed in sequence (1→2→3→1→2→3)</td>
<td>Servers have equal capacity and similar request costs</td>
</tr>
<tr>
<td><strong>Least Connections</strong></td>
<td>Routes to server with fewest active connections</td>
<td>Requests have variable processing time</td>
</tr>
<tr>
<td><strong>IP Hash</strong></td>
<td>Routes same client IP to same server</td>
<td>You need session stickiness and can't use a shared session store</td>
</tr>
<tr>
<td><strong>Weighted</strong></td>
<td>Servers get traffic proportional to weight</td>
<td>Servers have different capacities</td>
</tr>
<tr>
<td><strong>Random</strong></td>
<td>Random server selection</td>
<td>Surprisingly effective at scale; simple to implement</td>
</tr>
</tbody></table>
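<p>The first two algorithms reduce to a few lines each (toy sketches, not a real load balancer; server names are illustrative):</p>
<pre><code class="language-python">import itertools

servers = ["app1", "app2", "app3"]

# Round robin: hand out servers in a fixed cycle
rr = itertools.cycle(servers)
picks = [next(rr) for _ in range(5)]   # app1, app2, app3, app1, app2

# Least connections: pick the server with the fewest active connections
active = {"app1": 12, "app2": 3, "app3": 7}
target = min(active, key=active.get)   # app2
</code></pre>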
<p><strong>Layer 4 vs Layer 7:</strong></p>
<ul>
<li><strong>L4 (TCP/UDP):</strong> Routes based on IP address and port. Extremely fast, no content inspection. AWS NLB, HAProxy in TCP mode.</li>
<li><strong>L7 (HTTP):</strong> Routes based on HTTP content (URL, headers, cookies). More flexible — route <code>/api</code> to one pool, <code>/static</code> to another. AWS ALB, NGINX, Traefik.</li>
</ul>
<pre><code class="language-nginx"># NGINX: Layer 7 load balancing with upstream pools
upstream api_servers {
    least_conn;  # Least connections algorithm
    server app1.internal:8080 weight=3;
    server app2.internal:8080 weight=3;
    server app3.internal:8080 weight=1;  # Lower weight = less traffic
    
    keepalive 32;  # Connection pool to upstream servers
}

upstream static_servers {
    server static1.internal:8080;
    server static2.internal:8080;
}

server {
    location /api/ {
        proxy_pass http://api_servers;
    }
    location /static/ {
        proxy_pass http://static_servers;
    }
}
</code></pre>
<hr />
<h2>Database Scaling — Where It Gets Hard</h2>
<p>Application servers are stateless and easy to scale. Databases are stateful and hard.</p>
<h3>Read Replicas — The First Move</h3>
<p>Most applications are read-heavy. Add read replicas and route SELECT queries there.</p>
<pre><code>Primary DB (writes)
    ↓ replication
Replica 1 (reads)
Replica 2 (reads)
Replica 3 (reads)

Application:
  - INSERT / UPDATE / DELETE → Primary
  - SELECT → Random replica
</code></pre>
<pre><code class="language-python"># Connection routing example
def get_db_connection(read_only: bool = False):
    if read_only:
        return random.choice(replica_connections)
    return primary_connection
</code></pre>
<p><strong>Limitation:</strong> Replication lag. Replicas are slightly behind the primary (usually milliseconds, but can grow under load). Don't read from a replica immediately after a write if you need the result.</p>
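<p>One common mitigation is "read-your-own-writes" routing: send a user's reads to the primary for a short window after that user writes. A sketch (the window length is an assumption you tune to your observed lag):</p>
<pre><code class="language-python">import time

RECENT_WRITE_WINDOW = 5.0   # seconds; tune to your replication lag
last_write_at: dict = {}    # user_id -> timestamp of that user's last write

def record_write(user_id: str) -> None:
    last_write_at[user_id] = time.time()

def choose_target(user_id: str, read_only: bool) -> str:
    age = time.time() - last_write_at.get(user_id, 0.0)
    if read_only and age > RECENT_WRITE_WINDOW:
        return "replica"    # safe: any recent write has replicated by now
    return "primary"        # writes, and reads just after a write
</code></pre>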
<hr />
<h3>Connection Pooling — Before You Add Replicas</h3>
<p>Before adding replicas, make sure you're not wasting connections. Databases have a hard limit on concurrent connections. Without pooling, a spike in traffic can exhaust connections instantly.</p>
<pre><code class="language-python"># SQLAlchemy connection pool
engine = create_engine(
    DATABASE_URL,
    pool_size=20,          # Normal pool size
    max_overflow=30,       # Extra connections under load
    pool_timeout=30,       # Wait up to 30s for a connection before error
    pool_recycle=3600      # Recycle connections after 1 hour
)
</code></pre>
<p>For PostgreSQL at scale, use <strong>PgBouncer</strong> — a lightweight connection pooler that sits between your app and the database, multiplexing thousands of application connections onto a smaller number of actual DB connections.</p>
<hr />
<h3>Sharding — The Last Resort</h3>
<p>When a single primary + replicas isn't enough, you shard: split your data across multiple databases.</p>
<pre><code>User IDs         1 – 1,000,000  → Database Shard 1
User IDs 1,000,001 – 2,000,000  → Database Shard 2
User IDs 2,000,001 – 3,000,000  → Database Shard 3
</code></pre>
<p><strong>The costs are real:</strong></p>
<ul>
<li>Cross-shard queries (JOINs across shards) become application logic</li>
<li>Transactions across shards require distributed transaction handling</li>
<li>Resharding (when a shard gets too large) is painful</li>
<li>Every query needs shard-routing logic</li>
</ul>
<p>Sharding adds enormous operational complexity. Exhaust all other options first: indexing, query optimization, read replicas, caching, connection pooling, vertical scaling.</p>
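<p>The routing logic that every query now needs looks something like this (a hash-based sketch; the range-based scheme above works too, and the shard names are illustrative):</p>
<pre><code class="language-python">import hashlib

SHARDS = ["orders_shard_1", "orders_shard_2", "orders_shard_3"]

def shard_for(user_id: int) -> str:
    # Hash-based routing spreads users evenly; range-based routing keeps
    # ranges contiguous but can concentrate hot users on one shard.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every data-access call now goes through shard_for(user_id) first,
# and a JOIN across two users may span two databases.
</code></pre>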
<hr />
<h2>Auto-Scaling — Elasticity, Not Magic</h2>
<p>Auto-scaling adds or removes servers based on load. This is valuable for variable traffic patterns (traffic spikes on product launches, Black Friday, etc.).</p>
<pre><code class="language-yaml"># AWS Auto Scaling Group (simplified)
AutoScalingGroup:
  MinSize: 2          # Always at least 2 servers
  MaxSize: 20         # Never exceed 20 servers
  DesiredCapacity: 4  # Start with 4

ScalingPolicy:
  ScaleOut:
    Trigger: CPUUtilization &gt; 70% for 2 minutes
    Action: Add 2 instances
  ScaleIn:
    Trigger: CPUUtilization &lt; 30% for 10 minutes
    Action: Remove 1 instance
</code></pre>
<p><strong>Auto-scaling pitfalls:</strong></p>
<ol>
<li><p><strong>Cold start time:</strong> If spinning up a new instance takes 3 minutes, it won't help with a traffic spike that peaks in 1 minute. Pre-warm with a higher minimum capacity.</p>
</li>
<li><p><strong>Scale-in aggressiveness:</strong> Removing servers too aggressively causes thrashing (scale up, scale down, scale up again). Add a cooldown period.</p>
</li>
<li><p><strong>Database doesn't scale automatically:</strong> Auto-scaling your app tier is useless if your database becomes the bottleneck. Ensure your DB can handle the connection surge from new instances.</p>
</li>
<li><p><strong>Stateful sessions:</strong> If you forgot the stateless requirement, auto-scaling will cause session loss when a server is removed.</p>
</li>
</ol>
<hr />
<h2>CDN for Static Assets — The Easiest Win</h2>
<p>Before spending time on application scaling, ask: how much of your traffic is serving static files (JS, CSS, images)?</p>
<p>A CDN serves these from edge locations close to users, eliminating the load from your application servers entirely.</p>
<pre><code>Without CDN:
User (Tokyo) → [Internet] → App Server (US East) → serve image (300ms)

With CDN:
User (Tokyo) → CDN Edge (Tokyo) → serve cached image (8ms)
</code></pre>
<p>This also reduces bandwidth costs, since CDN egress is typically cheaper than cloud server egress.</p>
<p><strong>What to cache on CDN:</strong></p>
<ul>
<li>All static assets with content-hash filenames (infinite TTL, cache-busted on deploy)</li>
<li>API responses that are public and change infrequently (product catalog, pricing)</li>
<li>Rendered HTML pages for anonymous users (massive scale lever for content sites)</li>
</ul>
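<p>Those three cases map to different <code>Cache-Control</code> headers (the header values are standard HTTP; the mapping and asset-type names are a sketch):</p>
<pre><code class="language-python">def cache_control(asset_type: str) -> str:
    if asset_type == "hashed_static":    # e.g. app.3f9a1c.js, content never changes
        return "public, max-age=31536000, immutable"
    if asset_type == "public_api":       # product catalog, pricing
        return "public, max-age=300, stale-while-revalidate=60"
    if asset_type == "anonymous_html":   # rendered pages for anonymous users
        return "public, max-age=60"
    return "private, no-store"           # everything user-specific
</code></pre>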
<hr />
<h2>Scaling Checklist — Before Adding Servers</h2>
<p>Run through this before any infrastructure change:</p>
<ul>
<li> Are queries using indexes? (<code>EXPLAIN ANALYZE</code> your slow queries)</li>
<li> Is there N+1 query behavior in the application?</li>
<li> Is connection pooling configured? (PgBouncer, HikariCP, SQLAlchemy pool)</li>
<li> Are static assets served via CDN?</li>
<li> Is read traffic separated to replicas?</li>
<li> Are expensive computations cached?</li>
<li> Are long-running operations async (queues) instead of blocking request threads?</li>
<li> Is the application stateless (sessions in Redis, files in S3)?</li>
</ul>
<p>Tick all of these before scaling horizontally. You'll likely find the bottleneck isn't what you thought.</p>
<hr />
<h2>Key Takeaways</h2>
<ul>
<li><strong>Scale vertically first</strong> — it's simpler and often enough</li>
<li><strong>Stateless design is the prerequisite</strong> for horizontal scaling — move all state to shared storage</li>
<li><strong>Load balancers</strong> distribute traffic; Layer 7 gives you routing flexibility</li>
<li><strong>Read replicas</strong> are the first database scaling move — they solve most read-heavy bottlenecks</li>
<li><strong>Connection pooling</strong> (PgBouncer) often eliminates "database can't scale" problems cheaply</li>
<li><strong>Sharding is a last resort</strong> — the complexity cost is real</li>
<li><strong>CDN and query optimization</strong> have better ROI than new servers in most systems</li>
<li>Profile first. Most slow systems are code problems, not infrastructure problems.</li>
</ul>
<hr />
<p><strong>What bottleneck surprised you most when your system first started struggling under load — was it what you expected?</strong></p>
<hr />
<p><em>Next in the series → <strong>Post 07: You Can't Manage What You Can't See — The Three Pillars of Observability</strong></em></p>
<p><em>You've built and scaled your system. Now: how do you know it's working?</em></p>
]]></content:encoded></item><item><title><![CDATA[When to Stop Calling APIs and Start Publishing Events]]></title><description><![CDATA[Series: Backend Engineering Fundamentals · Post 05 of 07
Level: Advanced · Read time: ~10 min


Picture a simple checkout flow: user places an order → charge the card → update inventory → send a confi]]></description><link>https://ajitabh.net/when-to-stop-calling-apis-and-start-publishing-events</link><guid isPermaLink="true">https://ajitabh.net/when-to-stop-calling-apis-and-start-publishing-events</guid><category><![CDATA[kafka]]></category><category><![CDATA[message queue]]></category><category><![CDATA[rabbitmq]]></category><category><![CDATA[event-driven-architecture]]></category><category><![CDATA[Backend Engineering]]></category><category><![CDATA[System Design]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Ajitabh Singh]]></dc:creator><pubDate>Thu, 26 Mar 2026 15:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c0b381d9da55a9a5203e04/ff10ffe0-8ae5-4cde-9269-b043d805dd27.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>Series:</strong> Backend Engineering Fundamentals · Post 05 of 07
<strong>Level:</strong> Advanced · <strong>Read time:</strong> ~10 min</p>
</blockquote>
<hr />
<p>Picture a simple checkout flow: user places an order → charge the card → update inventory → send a confirmation email → notify the warehouse → update analytics.</p>
<p>In a synchronous world, your checkout endpoint calls each of those services in sequence. If the email service is slow, the checkout is slow. If the warehouse notification times out, do you roll back the charge? If analytics is down, does checkout fail?</p>
<p>Synchronous chains are brittle. They couple your system's availability to the availability of every downstream service. At small scale, this is manageable. At scale, it becomes the source of cascading failures, long tail latencies, and 3am incidents.</p>
<p>Message queues and event streaming are how you break these chains.</p>
<hr />
<h2>The Core Idea: Decouple Producers from Consumers</h2>
<p>Instead of Service A calling Service B directly, A <strong>publishes an event</strong> to a queue or topic. B (and C, and D) <strong>subscribe</strong> and process that event independently, at their own pace.</p>
<pre><code>❌ Synchronous — Tightly Coupled

OrderService → [HTTP] → PaymentService → [HTTP] → EmailService → [HTTP] → WarehouseService
  (if any step fails, the whole chain fails)


✅ Event-Driven — Loosely Coupled

OrderService → [Publish: order.placed] → Message Broker
                                              ↓
                              ┌───────────────┼────────────────┐
                              ↓               ↓                ↓
                        PaymentService   EmailService   WarehouseService
                    (processes when     (processes      (processes when
                       ready)           independently)    ready)
</code></pre>
<p>This shift — from calling to publishing — fundamentally changes how your system scales and fails.</p>
<hr />
<h2>Message Queues vs Event Streaming</h2>
<p>These are related but distinct concepts. Getting the distinction right matters for choosing the right tool.</p>
<table>
<thead>
<tr>
<th></th>
<th>Message Queue</th>
<th>Event Stream</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Model</strong></td>
<td>Work distribution — each message consumed by one consumer</td>
<td>Log — multiple consumers read the full stream independently</td>
</tr>
<tr>
<td><strong>After consumption</strong></td>
<td>Message is deleted</td>
<td>Message is retained (configurable duration)</td>
</tr>
<tr>
<td><strong>Replay</strong></td>
<td>Not supported</td>
<td>Supported — reprocess from any point</td>
</tr>
<tr>
<td><strong>Ordering</strong></td>
<td>Per-queue FIFO</td>
<td>Ordered within a partition</td>
</tr>
<tr>
<td><strong>Best for</strong></td>
<td>Task distribution, job queues</td>
<td>Event sourcing, audit logs, real-time pipelines</td>
</tr>
<tr>
<td><strong>Tools</strong></td>
<td>RabbitMQ, Amazon SQS, ActiveMQ</td>
<td>Kafka, Amazon Kinesis, Pulsar</td>
</tr>
</tbody></table>
<hr />
<h2>RabbitMQ — The Message Queue Standard</h2>
<p>RabbitMQ is a mature, AMQP-based message broker. The mental model: producers send messages to <strong>exchanges</strong>, exchanges route them to <strong>queues</strong>, consumers read from queues.</p>
<pre><code class="language-python">import json
import pika

# Producer: Publishing a task
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

channel.queue_declare(queue='email_notifications', durable=True)
# durable=True: queue survives broker restart

channel.basic_publish(
    exchange='',
    routing_key='email_notifications',
    body='{"type": "order_confirmation", "orderId": "789", "userId": "123"}',
    properties=pika.BasicProperties(delivery_mode=2)  # 2 = persistent message
)

# Consumer: Processing tasks
def process_email(ch, method, properties, body):
    data = json.loads(body)
    send_confirmation_email(data['userId'], data['orderId'])
    ch.basic_ack(delivery_tag=method.delivery_tag)  # Acknowledge success

channel.basic_qos(prefetch_count=1)  # Process one message at a time
channel.basic_consume(queue='email_notifications', on_message_callback=process_email)
channel.start_consuming()
</code></pre>
<p><strong>Key RabbitMQ concepts:</strong></p>
<ul>
<li><strong>Acknowledgments (ack/nack):</strong> Consumer explicitly confirms it processed the message. If it crashes before acking, the message is redelivered. If it nacks, it can be requeued or sent to a dead-letter exchange.</li>
<li><strong>Dead Letter Exchange (DLX):</strong> Messages that fail processing (after retry limits) are routed here. Critical for debugging and not silently dropping failures.</li>
<li><strong>Exchange types:</strong> Direct (exact routing key match), Topic (wildcard routing), Fanout (broadcast to all bound queues).</li>
</ul>
<pre><code class="language-python"># Dead Letter Queue setup
channel.exchange_declare(exchange='dlx', exchange_type='fanout')
channel.queue_declare(queue='email_notifications_dlq', durable=True)
channel.queue_bind(queue='email_notifications_dlq', exchange='dlx')

channel.queue_declare(
    queue='email_notifications',
    durable=True,
    arguments={
        'x-dead-letter-exchange': 'dlx',
        'x-message-ttl': 60000   # Messages expire after 60s if not consumed
    }
)
# RabbitMQ has no built-in retry limit. To cap retries, inspect the
# 'x-death' header the broker adds each time a message is dead-lettered.
</code></pre>
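<p>The topic-exchange matching rules are easy to internalize from a small sketch. This is an illustration of the semantics (<code>*</code> matches exactly one word, <code>#</code> matches zero or more), written in plain Python, not pika or broker code:</p>
<pre><code class="language-python">def topic_matches(binding_key, routing_key):
    """Mimic RabbitMQ topic matching: '*' is one word, '#' is zero or more words."""
    def match(bind, route):
        if not bind:
            return not route
        if bind[0] == '#':
            # '#' can absorb any number of words, including none
            return any(match(bind[1:], route[i:]) for i in range(len(route) + 1))
        if route and bind[0] in ('*', route[0]):
            return match(bind[1:], route[1:])
        return False
    return match(binding_key.split('.'), routing_key.split('.'))

# A queue bound with 'order.*' receives order.placed but not order.eu.placed;
# a queue bound with 'order.#' receives both.
</code></pre>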
<hr />
<h2>Apache Kafka — Event Streaming at Scale</h2>
<p>Kafka is fundamentally different from RabbitMQ. It's a <strong>distributed log</strong>: events are appended to topics (partitioned, replicated logs), and consumers read from those logs at their own offset.</p>
<pre><code>Topic: order-events (3 partitions)

Partition 0: [order.placed, order.placed, order.cancelled]
Partition 1: [order.placed, order.shipped, order.delivered]
Partition 2: [order.placed, order.paid]

Consumer Group A (Order Fulfillment): reads all partitions, tracks offset
Consumer Group B (Analytics): reads all partitions, independent offset
Consumer Group C (Fraud Detection): reads all partitions, independent offset

Each group processes the FULL stream independently.
Adding a new consumer group doesn't affect existing ones.
</code></pre>
<pre><code class="language-python">import json

from confluent_kafka import Producer, Consumer

# Producer
producer = Producer({'bootstrap.servers': 'kafka:9092'})

def delivery_report(err, msg):
    if err is not None:
        handle_failed_delivery(err, msg)  # Don't silently drop failed publishes

def publish_order_event(order_id: str, event_type: str, data: dict):
    producer.produce(
        topic='order-events',
        key=order_id,          # Same key → same partition → ordered for this order
        value=json.dumps({"type": event_type, "orderId": order_id, **data}),
        callback=delivery_report
    )
    producer.flush()  # Fine at low volume; flush in batches for high throughput

# Consumer
consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'order-fulfillment-service',
    'auto.offset.reset': 'earliest'  # Start from beginning if no committed offset
})

consumer.subscribe(['order-events'])

while True:
    msg = consumer.poll(1.0)
    if msg and not msg.error():
        event = json.loads(msg.value())
        process_order_event(event)
        consumer.commit()  # Commit offset after successful processing
</code></pre>
<p><strong>Kafka's superpower — replay:</strong> Because events are retained in the log, you can:</p>
<ul>
<li>Replay events to rebuild a corrupted database</li>
<li>Add a new downstream service and backfill it from the beginning of time</li>
<li>Debug production issues by replaying the exact event sequence</li>
</ul>
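<p>The model that makes replay possible (a retained log plus per-group offsets) fits in a short sketch. This is an in-memory stand-in for illustration, not the Kafka client API:</p>
<pre><code class="language-python">class ToyLog:
    """Append-only log: events are retained; each consumer group has its own offset."""
    def __init__(self):
        self.events = []
        self.offsets = {}                      # group_id mapped to committed offset

    def append(self, event):
        self.events.append(event)              # nothing is ever deleted

    def poll(self, group_id):
        offset = self.offsets.get(group_id, 0)
        batch = self.events[offset:]
        self.offsets[group_id] = len(self.events)   # commit
        return batch

    def seek_to_beginning(self, group_id):
        self.offsets[group_id] = 0             # replay the full history

log = ToyLog()
log.append('order.placed')
log.append('order.paid')

fulfillment = log.poll('fulfillment')      # both events
analytics = log.poll('analytics')          # same events, independent offset
log.seek_to_beginning('fulfillment')
replayed = log.poll('fulfillment')         # full history again
</code></pre>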
<hr />
<h2>Kafka vs RabbitMQ — Choosing the Right Tool</h2>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Use</th>
</tr>
</thead>
<tbody><tr>
<td>Background job processing (email, notifications, PDF generation)</td>
<td><strong>RabbitMQ / SQS</strong></td>
</tr>
<tr>
<td>Multiple services need to react to the same event independently</td>
<td><strong>Kafka</strong></td>
</tr>
<tr>
<td>You need to replay or audit events</td>
<td><strong>Kafka</strong></td>
</tr>
<tr>
<td>Simple task queue, low throughput</td>
<td><strong>RabbitMQ / SQS</strong></td>
</tr>
<tr>
<td>Real-time data pipelines, event sourcing</td>
<td><strong>Kafka</strong></td>
</tr>
<tr>
<td>You want managed, minimal ops overhead</td>
<td><strong>Amazon SQS</strong> or <strong>Google Pub/Sub</strong></td>
</tr>
<tr>
<td>Microservices with complex routing rules</td>
<td><strong>RabbitMQ</strong></td>
</tr>
<tr>
<td>&gt;100k events/second</td>
<td><strong>Kafka</strong></td>
</tr>
</tbody></table>
<blockquote>
<p>💡 <strong>Amazon SQS</strong> is the "just works" option for AWS shops. No broker to manage, virtually unlimited scale, pay-per-use. For most task queue use cases, it's the practical default.</p>
</blockquote>
<hr />
<h2>Delivery Guarantees — This Matters More Than Most Teams Realize</h2>
<p>Not all message systems deliver the same guarantee:</p>
<table>
<thead>
<tr>
<th>Guarantee</th>
<th>Meaning</th>
<th>Risk</th>
</tr>
</thead>
<tbody><tr>
<td><strong>At-most-once</strong></td>
<td>Message delivered 0 or 1 times</td>
<td>Messages can be lost</td>
</tr>
<tr>
<td><strong>At-least-once</strong></td>
<td>Message delivered 1 or more times</td>
<td>Duplicate processing possible</td>
</tr>
<tr>
<td><strong>Exactly-once</strong></td>
<td>Message delivered exactly once</td>
<td>Hard to guarantee end-to-end; Kafka transactions support this</td>
</tr>
</tbody></table>
<p><strong>Most systems use at-least-once delivery.</strong> This means your consumers must be <strong>idempotent</strong> — processing the same message twice must produce the same result as processing it once.</p>
<pre><code class="language-python"># ❌ NOT idempotent — charges twice if the message is redelivered
def process_payment(payment_id: str, amount: float):
    charge_card(payment_id, amount)

# ✅ Idempotent — payment_id doubles as an idempotency key
def process_payment_idempotent(payment_id: str, amount: float):
    if db.payment_exists(payment_id):
        return  # Already processed, safe to skip

    with db.transaction():
        charge_card(payment_id, amount)
        db.record_payment(payment_id, amount)
        # A unique constraint on payment_id closes the check-then-act race
</code></pre>
<hr />
<h2>Common Patterns</h2>
<h3>Fan-Out</h3>
<p>One event triggers multiple independent consumers:</p>
<pre><code>order.placed
    ├── EmailService (send confirmation)
    ├── InventoryService (reserve stock)
    ├── AnalyticsService (track purchase)
    └── LoyaltyService (award points)
</code></pre>
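<p>In-process, fan-out is just "every subscriber gets its own copy of the event." A minimal sketch with callbacks standing in for the services above (the handler logic is hypothetical):</p>
<pre><code class="language-python">subscribers = {}

def subscribe(topic, handler):
    subscribers.setdefault(topic, []).append(handler)

def publish(topic, event):
    for handler in subscribers.get(topic, []):
        handler(event)          # each consumer reacts independently

emails, reservations = [], []
subscribe('order.placed', lambda e: emails.append(e['orderId']))        # EmailService
subscribe('order.placed', lambda e: reservations.append(e['orderId']))  # InventoryService

publish('order.placed', {'orderId': '789'})
</code></pre>
<p>A real broker adds durability, retries, and independent pacing on top of this shape, but the routing idea is the same.</p>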
<h3>Saga Pattern — Distributed Transactions</h3>
<p>When you need a transaction across multiple services without a distributed lock:</p>
<pre><code>Choreography-based Saga:

1. OrderService publishes: order.created
2. PaymentService consumes, processes payment, publishes: payment.completed
3. InventoryService consumes, reserves stock, publishes: inventory.reserved
4. FulfillmentService consumes, ships order, publishes: order.fulfilled

On failure at step 3:
3b. InventoryService publishes: inventory.failed
4b. PaymentService consumes inventory.failed, issues refund, publishes: payment.refunded
</code></pre>
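<p>The happy path and the compensating path above can be traced with a toy in-memory bus. The event names match the diagram; the handlers and the <code>stock_available</code> flag are hypothetical:</p>
<pre><code class="language-python">handlers = {}
trace = []

def on(event_type, fn):
    handlers.setdefault(event_type, []).append(fn)

def emit(event_type):
    trace.append(event_type)
    for fn in handlers.get(event_type, []):
        fn()

stock_available = False   # force the failure path for this demo

on('order.created', lambda: emit('payment.completed'))
on('payment.completed', lambda: emit('inventory.reserved') if stock_available
                                else emit('inventory.failed'))
on('inventory.failed', lambda: emit('payment.refunded'))   # compensating action

emit('order.created')
# trace: order.created, payment.completed, inventory.failed, payment.refunded
</code></pre>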
<h3>Outbox Pattern — Reliable Event Publishing</h3>
<p>The classic dual-write problem: how do you update the database AND publish an event atomically?</p>
<pre><code class="language-python"># ❌ WRONG — race condition
def place_order(order: Order):
    db.save(order)              # Succeeds
    kafka.publish(order_event)  # Fails → event never published, DB inconsistent

# ✅ CORRECT — Transactional Outbox Pattern
def place_order(order: Order):
    with db.transaction():
        db.save(order)
        db.outbox.insert({       # Write event to outbox table in same transaction
            "topic": "order-events",
            "payload": order_event_json,
            "published": False
        })
    # Separate process polls outbox and publishes to Kafka reliably
</code></pre>
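<p>The "separate process" at the end is a small poll-and-mark loop. A sketch with <code>sqlite3</code> standing in for the application database and a list standing in for the broker (all names are illustrative):</p>
<pre><code class="language-python">import sqlite3

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, "
           "payload TEXT, published INTEGER DEFAULT 0)")
db.execute("INSERT INTO outbox (topic, payload) VALUES "
           "('order-events', '{\"orderId\": \"789\"}')")
db.commit()

broker = []   # stand-in for kafka.publish

def relay_once():
    rows = db.execute("SELECT id, topic, payload FROM outbox "
                      "WHERE published = 0 ORDER BY id").fetchall()
    for row_id, topic, payload in rows:
        broker.append((topic, payload))   # publish; at-least-once if a crash
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()                           # happens between these two steps

relay_once()
</code></pre>
<p>If the relay crashes after publishing but before marking, the event is published again on the next pass, which is exactly why consumers must be idempotent.</p>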
<hr />
<h2>Key Takeaways</h2>
<ul>
<li><strong>Message queues decouple services</strong> — a slow downstream service no longer blocks your upstream caller</li>
<li><strong>RabbitMQ</strong> is the right choice for task distribution, complex routing, and lower-throughput workloads</li>
<li><strong>Kafka</strong> is for high-throughput event streaming, replay, audit, and fan-out at scale</li>
<li><strong>SQS / Pub/Sub</strong> for managed simplicity with minimal operational overhead</li>
<li><strong>Idempotency is mandatory</strong> with at-least-once delivery — design your consumers to handle duplicates safely</li>
<li><strong>The Outbox Pattern</strong> solves reliable event publishing without distributed transactions</li>
<li>Don't go event-driven prematurely — if your system has 3 services, synchronous calls are probably fine</li>
</ul>
<hr />
<p><strong>Have you dealt with a cascade failure in a synchronous service chain that made you switch to async? What was the tipping point?</strong></p>
<hr />
<p><em>Next in the series → <strong>Post 06: Scaling — Before You Buy More Servers, Read This</strong></em></p>
<p><em>You've decoupled your services with events. Now: how do you scale the services themselves?</em></p>
]]></content:encoded></item><item><title><![CDATA[SQL or NoSQL? Wrong Question. Here's the Right One.]]></title><description><![CDATA[Series: Backend Engineering Fundamentals · Post 04 of 07
Level: Intermediate · Read time: ~9 min


Every few years the industry declares SQL dead, or NoSQL dead, or NewSQL the future. Meanwhile, produ]]></description><link>https://ajitabh.net/sql-or-nosql-wrong-question-heres-the-right-one</link><guid isPermaLink="true">https://ajitabh.net/sql-or-nosql-wrong-question-heres-the-right-one</guid><category><![CDATA[Databases]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[MongoDB]]></category><category><![CDATA[NoSQL]]></category><category><![CDATA[SQL]]></category><category><![CDATA[Backend Engineering]]></category><category><![CDATA[backend developments]]></category><category><![CDATA[System Design]]></category><category><![CDATA[CAP-Theorem]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Ajitabh Singh]]></dc:creator><pubDate>Thu, 26 Mar 2026 14:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c0b381d9da55a9a5203e04/f7c7e59e-78de-4fc8-a19c-a0d57b889196.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>Series:</strong> Backend Engineering Fundamentals · Post 04 of 07
<strong>Level:</strong> Intermediate · <strong>Read time:</strong> ~9 min</p>
</blockquote>
<hr />
<p>Every few years the industry declares SQL dead, or NoSQL dead, or NewSQL the future. Meanwhile, production systems quietly keep running on PostgreSQL, with a Redis cache, a MongoDB collection for one specific use case, and an Elasticsearch index for search.</p>
<p>The SQL vs NoSQL debate is the wrong frame. The right question is: <strong>what are your data access patterns, consistency requirements, and team capabilities?</strong></p>
<p>Answer those, and the database choice usually becomes obvious.</p>
<hr />
<h2>What SQL Actually Gives You (That's Often Taken for Granted)</h2>
<p>Relational databases aren't just "tables with foreign keys." The guarantees they provide are hard to replicate:</p>
<p><strong>ACID Transactions</strong></p>
<pre><code class="language-sql">BEGIN;
  UPDATE accounts SET balance = balance - 500 WHERE id = 'alice';
  UPDATE accounts SET balance = balance + 500 WHERE id = 'bob';
COMMIT;
-- Either both updates happen, or neither does. No partial state.
</code></pre>
<p>You don't appreciate ACID until you've debugged a distributed system where you transferred $500, debited Alice, and then the network failed before crediting Bob.</p>
<p><strong>Joins — Relationship Integrity Without Application Logic</strong></p>
<pre><code class="language-sql">SELECT o.id, o.total, u.name, u.email
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status = 'pending'
  AND o.created_at &gt; NOW() - INTERVAL '24 hours';
</code></pre>
<p>In a document database, this query becomes application code — multiple fetches, assembled in memory, with no guarantee of consistency.</p>
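<p>Concretely, the "join in application code" looks something like this (plain dicts standing in for collections; no real driver API here):</p>
<pre><code class="language-python"># In-memory stand-ins for two collections
users = {'u1': {'name': 'Alice', 'email': 'alice@example.com'}}
orders = [
    {'id': 'o1', 'user_id': 'u1', 'total': 120.0, 'status': 'pending'},
    {'id': 'o2', 'user_id': 'u1', 'total': 35.0, 'status': 'shipped'},
]

# The SQL join, hand-rolled: fetch orders, then fetch each user, assemble in memory.
# Between the two reads the user document may change; nothing guarantees consistency.
result = []
for order in orders:
    if order['status'] == 'pending':
        user = users[order['user_id']]   # an extra round-trip per order
        result.append({'orderId': order['id'], 'total': order['total'],
                       'name': user['name'], 'email': user['email']})
</code></pre>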
<p><strong>Schema Enforcement</strong>
The database rejects data that doesn't fit the schema. This feels restrictive early in development; it becomes invaluable when your system is running 24/7 and a bug tries to write malformed data.</p>
<hr />
<h2>The CAP Theorem — A Useful Mental Model</h2>
<p>Distributed systems can guarantee at most two of three properties:</p>
<pre><code>            Consistency
       (every read returns
        the latest write)
              /\
             /  \
         CA /    \ CP
           /      \
          /___AP___\
  Availability      Partition Tolerance
  (every request    (system keeps working
  gets a response)  despite network failures)
</code></pre>
<p><strong>CP systems</strong> (Consistency + Partition Tolerance): Choose correctness over availability. HBase, MongoDB (with certain write concerns), etcd.</p>
<p><strong>AP systems</strong> (Availability + Partition Tolerance): Choose availability over strict consistency. Cassandra, CouchDB, DynamoDB (by default).</p>
<p><strong>CA systems</strong>: Only possible without network partitions — i.e., single-node systems or systems within a trusted network. Most traditional relational databases in non-distributed setups.</p>
<blockquote>
<p>⚠️ In practice, network partitions always <em>can</em> happen. The real choice is between <strong>consistency and availability</strong> when a partition occurs. Choose based on your domain: banking needs consistency; social media can tolerate eventual consistency.</p>
</blockquote>
<hr />
<h2>NoSQL Data Models — Picking the Right Tool</h2>
<p>"NoSQL" is not one thing. There are four fundamentally different data models:</p>
<h3>1. Document Stores (MongoDB, Firestore, CouchDB)</h3>
<p>Store data as JSON/BSON documents. Schema is flexible per document.</p>
<pre><code class="language-json">{
  "_id": "order_789",
  "userId": "user_123",
  "status": "shipped",
  "items": [
    {"productId": "prod_45", "name": "Keyboard", "qty": 1, "price": 79.99},
    {"productId": "prod_46", "name": "Mouse", "qty": 2, "price": 29.99}
  ],
  "shippingAddress": {
    "street": "123 Main St",
    "city": "New York"
  }
}
</code></pre>
<p><strong>Use when:</strong> Your data naturally fits a hierarchical, self-contained document. The order example above is a perfect fit — you almost always want the full order with its items, not a joined result.</p>
<p><strong>Avoid when:</strong> You need to query across relationships frequently, or your schema is highly relational.</p>
<hr />
<h3>2. Key-Value Stores (Redis, DynamoDB, Riak)</h3>
<p>The simplest model: a key maps to a value. Lightning-fast lookups.</p>
<pre><code class="language-python"># Redis: O(1) lookup by key
redis.set("session:abc123", json.dumps({"userId": "123", "role": "admin"}), ex=3600)
session = redis.get("session:abc123")

# DynamoDB: partition key + optional sort key
table.get_item(Key={"userId": "123", "orderId": "order_789"})
</code></pre>
<p><strong>Use when:</strong> You need ultra-fast single-key lookups, session storage, caching, or counters.</p>
<p><strong>Avoid when:</strong> You need complex queries, filtering, or joins.</p>
<hr />
<h3>3. Column-Family Stores (Cassandra, HBase, ScyllaDB)</h3>
<p>Data is stored in column families, optimized for time-series, write-heavy workloads.</p>
<pre><code class="language-sql">-- Cassandra: Schema designed around query patterns, not data normalization
CREATE TABLE sensor_readings (
  device_id UUID,
  timestamp TIMESTAMP,
  temperature FLOAT,
  humidity FLOAT,
  PRIMARY KEY (device_id, timestamp)  -- Partition by device, sort by time
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- A single-partition read — it maps directly to the storage layout
SELECT * FROM sensor_readings WHERE device_id = ? LIMIT 100;
</code></pre>
<p><strong>Use when:</strong> You have massive write volumes, time-series data, or IoT workloads. Cassandra can handle millions of writes per second.</p>
<p><strong>Avoid when:</strong> You need complex queries that don't match your partition key, or ACID transactions.</p>
<hr />
<h3>4. Graph Databases (Neo4j, Amazon Neptune)</h3>
<p>Data is modeled as nodes and edges. Relationships are first-class citizens.</p>
<pre><code class="language-cypher">// Neo4j: Find all friends of Alice who also like "Distributed Systems"
MATCH (alice:User {name: "Alice"})-[:FRIENDS_WITH]-&gt;(friend:User)
WHERE (friend)-[:LIKES]-&gt;(:Topic {name: "Distributed Systems"})
RETURN friend.name
</code></pre>
<p><strong>Use when:</strong> Your domain is fundamentally relational in a graph sense — social networks, recommendation engines, fraud detection, knowledge graphs.</p>
<p><strong>Avoid when:</strong> Most other use cases. Graph databases are powerful but operationally complex.</p>
<hr />
<h2>PostgreSQL — Why It Often Wins Even Against NoSQL</h2>
<p>PostgreSQL has quietly absorbed many NoSQL use cases:</p>
<pre><code class="language-sql">-- JSONB column — document storage with SQL query capabilities
CREATE TABLE events (
  id UUID PRIMARY KEY,
  type VARCHAR(50),
  payload JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- GIN index on JSONB — fast document queries
CREATE INDEX idx_events_payload ON events USING GIN (payload);

-- Query inside JSON
SELECT * FROM events
WHERE payload-&gt;&gt;'userId' = '123'
  AND type = 'purchase';

-- Full-text search (no Elasticsearch for basic cases)
CREATE INDEX idx_products_search ON products USING GIN (to_tsvector('english', name || ' ' || description));

SELECT * FROM products
WHERE to_tsvector('english', name || ' ' || description) @@ to_tsquery('mechanical &amp; keyboard');

-- Time-series with partitioning (comparable to Cassandra for many workloads)
CREATE TABLE metrics (
  time TIMESTAMPTZ NOT NULL,
  device_id UUID NOT NULL,
  value FLOAT
) PARTITION BY RANGE (time);
</code></pre>
<p>Before adding a new database to your stack, check if PostgreSQL already handles it. Adding a database means another system to operate, monitor, backup, and train your team on.</p>
<hr />
<h2>Indexing — The Most Impactful Optimization Most Teams Underuse</h2>
<p>A missing index is the most common cause of a slow query. An unnecessary index slows down every write.</p>
<pre><code class="language-sql">-- EXPLAIN ANALYZE: your best friend for query performance
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = '123'
  AND status = 'pending'
ORDER BY created_at DESC;

-- If you see "Seq Scan" on a large table, you're missing an index
-- Seq Scan  (cost=0.00..45000.00 rows=5 width=200) -- ❌ scanning every row

-- Add a composite index matching your query
CREATE INDEX idx_orders_user_status_created
ON orders (user_id, status, created_at DESC);

-- Now: Index Scan — fast
-- Index Scan using idx_orders_user_status_created  (cost=0.42..8.50 rows=5) -- ✅
</code></pre>
<p><strong>Composite index rule:</strong> Column order matters. Put equality conditions first (user_id, status), range/sort columns last (created_at).</p>
<hr />
<h2>The Decision Framework</h2>
<table>
<thead>
<tr>
<th>Your primary need</th>
<th>Consider</th>
</tr>
</thead>
<tbody><tr>
<td>ACID transactions, complex queries, relational data</td>
<td><strong>PostgreSQL / MySQL</strong></td>
</tr>
<tr>
<td>Document storage, flexible schema, hierarchical data</td>
<td><strong>MongoDB</strong> (or PostgreSQL JSONB)</td>
</tr>
<tr>
<td>Ultra-fast key lookups, sessions, caching</td>
<td><strong>Redis</strong></td>
</tr>
<tr>
<td>Massive write throughput, time-series, IoT</td>
<td><strong>Cassandra / ScyllaDB</strong> (or Timescale on PG)</td>
</tr>
<tr>
<td>Full-text search, faceted search</td>
<td><strong>Elasticsearch / OpenSearch</strong> (or PG full-text for simpler cases)</td>
</tr>
<tr>
<td>Graph traversals, social networks</td>
<td><strong>Neo4j / Neptune</strong></td>
</tr>
<tr>
<td>Analytical queries over large datasets</td>
<td><strong>BigQuery / Redshift / ClickHouse</strong></td>
</tr>
</tbody></table>
<hr />
<h2>Polyglot Persistence — When Multiple Databases Make Sense</h2>
<p>Large systems often use multiple databases, each for a specific purpose:</p>
<pre><code>User Service      → PostgreSQL (relational, ACID, user accounts/billing)
Product Catalog   → Elasticsearch (full-text search, faceted filtering)
Session Store     → Redis (fast key-value, TTL-based expiry)
Activity Feed     → Cassandra (high write throughput, time-ordered)
Recommendations   → Neo4j (graph traversals)
Analytics         → BigQuery (analytical, columnar, petabyte-scale)
</code></pre>
<p><strong>The warning:</strong> Each database you add is a system you must operate. Start with the minimum. Introduce a new store only when you have a concrete, measurable pain point that your current database can't address.</p>
<hr />
<h2>Key Takeaways</h2>
<ul>
<li><strong>ACID transactions</strong> are invaluable — don't give them up unless you have a compelling reason</li>
<li><strong>CAP theorem</strong> is a useful frame: in a partition, choose consistency (banking) or availability (social feeds) based on your domain</li>
<li><strong>NoSQL solves specific problems</strong> — document stores, column families, key-value, graphs are each optimized for different access patterns</li>
<li><strong>PostgreSQL can handle more than you think</strong> — JSONB, full-text search, and partitioning cover many NoSQL use cases</li>
<li><strong>Indexing is the highest-ROI database optimization</strong> — understand your query patterns before adding hardware</li>
<li><strong>Polyglot persistence is real in large systems</strong> — but each database added is operational overhead</li>
</ul>
<hr />
<p><strong>What's the most painful database migration you've been through — either choosing the wrong one initially, or scaling beyond what it could handle?</strong></p>
<hr />
<p><em>Next in the series → <strong>Post 05: When to Stop Calling APIs and Start Publishing Events</strong></em></p>
<p><em>You've got your data store figured out. The next scaling inflection point is usually: synchronous calls don't compose well at scale.</em></p>
]]></content:encoded></item><item><title><![CDATA[Auth Is Not Security: What Engineers Get Wrong About Protecting APIs]]></title><description><![CDATA[Series: Backend Engineering Fundamentals · Post 03 of 07
Level: Advanced · Read time: ~10 min


Most API security bugs aren't cryptography failures. They're design failures.
The OWASP API Security Top]]></description><link>https://ajitabh.net/auth-is-not-security-what-engineers-get-wrong-about-protecting-apis</link><guid isPermaLink="true">https://ajitabh.net/auth-is-not-security-what-engineers-get-wrong-about-protecting-apis</guid><category><![CDATA[api security]]></category><category><![CDATA[authentication]]></category><category><![CDATA[authorization]]></category><category><![CDATA[JWT]]></category><category><![CDATA[oauth]]></category><category><![CDATA[Backend Engineering]]></category><category><![CDATA[System Design]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[Web Security]]></category><category><![CDATA[backend developments]]></category><dc:creator><![CDATA[Ajitabh Singh]]></dc:creator><pubDate>Thu, 26 Mar 2026 13:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c0b381d9da55a9a5203e04/dd18142b-9ea5-4a03-9354-3184a100ca44.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>Series:</strong> Backend Engineering Fundamentals · Post 03 of 07
<strong>Level:</strong> Advanced · <strong>Read time:</strong> ~10 min</p>
</blockquote>
<hr />
<p>Most API security bugs aren't cryptography failures. They're design failures.</p>
<p>The OWASP API Security Top 10 is the most authoritative list of real-world API vulnerabilities. It is dominated by problems like broken object-level authorization, excessive data exposure, and lack of rate limiting. Not broken TLS. Not weak encryption algorithms.</p>
<p>Engineers tend to conflate authentication ("who are you?") with security ("what can you actually do and what can go wrong?"). This post covers both: the auth patterns engineers deal with daily, and the security concerns that don't get enough attention until after the breach.</p>
<hr />
<h2>Authentication vs Authorization — Get This Right First</h2>
<p>These terms are often used interchangeably. They shouldn't be.</p>
<table>
<thead>
<tr>
<th>Concept</th>
<th>Question</th>
<th>Example</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Authentication (AuthN)</strong></td>
<td>Who are you?</td>
<td>Verifying a JWT token is valid</td>
</tr>
<tr>
<td><strong>Authorization (AuthZ)</strong></td>
<td>What are you allowed to do?</td>
<td>Checking if user can access <code>/orders/456</code></td>
</tr>
<tr>
<td><strong>Accounting</strong></td>
<td>What did you do?</td>
<td>Audit logs of actions taken</td>
</tr>
</tbody></table>
<p>Most auth bugs are authorization bugs. The token is valid — the user is who they say they are — but they can see data they shouldn't.</p>
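<p>The canonical example is broken object-level authorization (BOLA): the token verifies, but nobody checks ownership. A minimal sketch (the order store and exception are hypothetical):</p>
<pre><code class="language-python">ORDERS = {'456': {'owner': 'user_123', 'total': 99.0}}

class Forbidden(Exception):
    pass

def get_order(order_id, authenticated_user_id):
    order = ORDERS[order_id]                       # AuthN already happened upstream
    if order['owner'] != authenticated_user_id:    # AuthZ: the check teams forget
        raise Forbidden(order_id)
    return order
</code></pre>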
<hr />
<h2>The Three Main Auth Patterns</h2>
<h3>1. API Keys — Simple, Durable, Underrated</h3>
<p>A random string issued to a client, sent with every request.</p>
<pre><code class="language-http">GET /api/v1/orders
Authorization: Bearer sk_live_a8f3j2k9...
# or
X-API-Key: sk_live_a8f3j2k9...
</code></pre>
<p><strong>Best for:</strong> Server-to-server communication, developer-facing public APIs, internal service authentication.</p>
<p><strong>Key implementation details:</strong></p>
<ul>
<li>Store only the <strong>hash</strong> of the key in your database, never the plaintext (same principle as passwords)</li>
<li>Use a prefix that identifies the key type: <code>sk_live_</code>, <code>pk_test_</code>, <code>svc_</code> — makes secret scanning easier</li>
<li>Support key rotation without downtime: allow two active keys per client during a rotation window</li>
<li>Log key usage for anomaly detection</li>
</ul>
<pre><code class="language-python">import hashlib, secrets

def create_api_key() -&gt; tuple[str, str]:
    """Returns (plaintext_key_shown_once, hash_stored_in_db)"""
    key = f"sk_live_{secrets.token_urlsafe(32)}"
    key_hash = hashlib.sha256(key.encode()).hexdigest()
    return key, key_hash

def verify_api_key(provided_key: str, stored_hash: str) -&gt; bool:
    provided_hash = hashlib.sha256(provided_key.encode()).hexdigest()
    return secrets.compare_digest(provided_hash, stored_hash)
    # Use compare_digest to prevent timing attacks
</code></pre>
<hr />
<h3>2. JWT (JSON Web Tokens) — Powerful but Frequently Misused</h3>
<p>A JWT is a self-contained token with three parts: header, payload, signature.</p>
<pre><code>eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9   ← Header (alg + type)
.eyJ1c2VySWQiOiIxMjMiLCJyb2xlIjoiYWRtaW4iLCJleHAiOjE3MDAwMDAwMDB9   ← Payload
.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c   ← Signature
</code></pre>
<p>The server can verify the signature without a database lookup — this is why JWTs are popular in distributed systems and microservices.</p>
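<p>That database-free check is just an HMAC over the first two segments. A stdlib sketch of HS256 signing and verification, for illustration only (use a maintained JWT library in production):</p>
<pre><code class="language-python">import base64
import hashlib
import hmac
import json

def b64url(data):
    return base64.urlsafe_b64encode(data).rstrip(b'=').decode()

def sign_hs256(payload, secret):
    header = b64url(json.dumps({'alg': 'HS256', 'typ': 'JWT'}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = b64url(hmac.new(secret, f'{header}.{body}'.encode(), hashlib.sha256).digest())
    return f'{header}.{body}.{sig}'

def verify_hs256(token, secret):
    header, body, sig = token.split('.')
    expected = b64url(hmac.new(secret, f'{header}.{body}'.encode(),
                               hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)   # no database lookup involved

token = sign_hs256({'userId': '123', 'role': 'admin'}, b'server-secret')
</code></pre>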
<p><strong>Common JWT pitfalls:</strong></p>
<pre><code class="language-python"># ❌ WRONG: Disabling signature verification
# Equivalent to accepting the "none" algorithm — any forged token passes
jwt.decode(token, options={"verify_signature": False})  # Never do this

# ❌ WRONG: Using the algorithm from the token header
# Attacker changes alg to "none" or "HS256" with your public key as secret
algorithm = jwt.get_unverified_header(token)['alg']  # Never trust this

# ✅ CORRECT: Always specify the expected algorithm explicitly
jwt.decode(token, public_key, algorithms=["RS256"])  # Pin the algorithm; RS256 verifies with the public key

# ❌ WRONG: Storing sensitive data in the payload
# JWT payload is base64-encoded, not encrypted — anyone can read it
{"userId": "123", "creditCardNumber": "4111..."}  # Don't do this

# ✅ CORRECT: Store only what's needed for authorization
{"userId": "123", "role": "admin", "exp": 1700000000}
</code></pre>
<p><strong>JWT vs Sessions trade-off:</strong></p>
<table>
<thead>
<tr>
<th></th>
<th>JWT (Stateless)</th>
<th>Sessions (Stateful)</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Revocation</strong></td>
<td>Hard — must wait for expiry or maintain a blocklist</td>
<td>Easy — delete from session store</td>
</tr>
<tr>
<td><strong>Scalability</strong></td>
<td>Any server can verify without coordination</td>
<td>Session store must be shared (Redis)</td>
</tr>
<tr>
<td><strong>Token size</strong></td>
<td>Larger (full claims in payload)</td>
<td>Smaller (just a session ID)</td>
</tr>
<tr>
<td><strong>Suitable for</strong></td>
<td>Microservices, mobile APIs</td>
<td>Traditional web apps</td>
</tr>
</tbody></table>
<blockquote>
<p>⚠️ <strong>The revocation problem is real.</strong> If you issue a JWT with a 24-hour expiry and a user changes their password or is suspended, that JWT is still valid until it expires. If revocation matters to you (it usually does), maintain a JWT blocklist in Redis or use short expiry times (5–15 minutes) with refresh tokens.</p>
</blockquote>
<hr />
<h3>3. OAuth 2.0 — Delegated Authorization Done Right</h3>
<p>OAuth 2.0 is not an authentication protocol (that's OpenID Connect on top of OAuth). It's a framework for <strong>delegated authorization</strong> — letting users grant third-party apps access to their data without sharing their password.</p>
<p>The four flows, matched to use cases:</p>
<pre><code>Authorization Code Flow
├── With PKCE (for SPAs, mobile apps)
└── Without PKCE (server-side web apps only — never expose client_secret in browser)

Client Credentials Flow
└── Machine-to-machine (no user involved)

Device Authorization Flow
└── Smart TVs, CLIs, IoT devices

Implicit Flow
└── ⚠️ Deprecated — never use for new implementations
</code></pre>
<p><strong>Most teams only need two:</strong></p>
<pre><code>User-facing apps → Authorization Code + PKCE
Service-to-service → Client Credentials
</code></pre>
<pre><code class="language-python"># Client Credentials — Service authenticating to another service
import httpx

def get_service_token(client_id: str, client_secret: str, token_url: str) -&gt; str:
    response = httpx.post(token_url, data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "orders:read inventory:write"
    })
    return response.json()["access_token"]
</code></pre>
<hr />
<h2>OWASP API Security Top 10 — What Actually Gets APIs Breached</h2>
<p>Authentication is one piece. Here are a few vulnerabilities that show up in real incidents:</p>
<h3>Broken Object-Level Authorization (BOLA) (Most Common)</h3>
<p>A user can access objects (records) they shouldn't by manipulating IDs.</p>
<pre><code class="language-http">GET /api/orders/12345   ← User's own order
GET /api/orders/12346   ← Another user's order — does your API check ownership?
</code></pre>
<pre><code class="language-python"># ❌ WRONG — only checks authentication, not authorization
@app.get("/orders/{order_id}")
def get_order(order_id: str, current_user: User = Depends(get_current_user)):
    return db.get_order(order_id)  # Returns ANY order if user is authenticated

# ✅ CORRECT — checks that the order belongs to the requesting user
@app.get("/orders/{order_id}")
def get_order(order_id: str, current_user: User = Depends(get_current_user)):
    order = db.get_order(order_id)
    if order.user_id != current_user.id:
        raise HTTPException(status_code=403, detail="Forbidden")
    return order
</code></pre>
<h3>Excessive Data Exposure</h3>
<p>Returning more data than the client needs, relying on them to filter it.</p>
<pre><code class="language-python"># ❌ WRONG — serializes the full User model
return db.get_user(user_id)
# Includes: password_hash, internal_notes, admin_flags, ...

# ✅ CORRECT — explicit response schema
class UserPublicResponse(BaseModel):
    id: str
    name: str
    email: str
    # Nothing else
</code></pre>
<h3>Lack of Rate Limiting</h3>
<p>Without rate limiting, your API is vulnerable to brute-force, credential stuffing, and scraping.</p>
<pre><code class="language-python"># Fixed-window counter with Redis
def check_rate_limit(client_id: str, limit: int = 100, window: int = 60) -&gt; bool:
    key = f"rate_limit:{client_id}"
    count = redis.incr(key)
    if count == 1:
        # Set the TTL only when the window opens; resetting it on every
        # request would let a steady stream of traffic keep the key alive forever
        redis.expire(key, window)
    return count &lt;= limit
</code></pre>
<p>Or at the infrastructure level with NGINX:</p>
<pre><code class="language-nginx"># Limit to 10 requests/second per IP
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    location /api/ {
        limit_req zone=api burst=20 nodelay;
    }
}
</code></pre>
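<p>For smoother limiting than a per-window counter, a token bucket refills capacity continuously and absorbs short bursts. A single-process sketch of the algorithm (a real deployment would keep the bucket state in a shared store such as Redis):</p>

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend one token for this request
            return True
        return False
```

<p>This is the same family of algorithm NGINX's <code>limit_req</code> implements: <code>rate</code> maps to <code>10r/s</code> and <code>capacity</code> plays the role of <code>burst</code>.</p>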
<hr />
<h2>HTTPS, HSTS, and Transport Security</h2>
<p>HTTPS should be non-negotiable. But securing transport is not just about turning on TLS.
HSTS tells the browser to always use HTTPS and never fall back to HTTP.
A few headers help close common gaps:</p>
<pre><code class="language-http"># Force HTTPS for your domain + subdomains, 1 year, include in preload list
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload

# Prevent MIME type sniffing
X-Content-Type-Options: nosniff

# Control what info leaks in the Referer header
Referrer-Policy: strict-origin-when-cross-origin

# Disable browser features you don't need
Permissions-Policy: geolocation=(), camera=(), microphone=()
</code></pre>
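<p>These headers are best set once in middleware rather than per endpoint. A framework-agnostic sketch of the idea (the merge is the whole trick; wire it into whatever response hook your framework exposes):</p>

```python
SECURITY_HEADERS = {
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains; preload",
    "X-Content-Type-Options": "nosniff",
    "Referrer-Policy": "strict-origin-when-cross-origin",
    "Permissions-Policy": "geolocation=(), camera=(), microphone=()",
}

def with_security_headers(response_headers: dict) -> dict:
    # Endpoint-supplied headers win, so a route can deliberately
    # override a default policy when it has a reason to
    return {**SECURITY_HEADERS, **response_headers}
```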
<hr />
<h3>Secrets Management — Where Most Teams Cut Corners</h3>
<p>Hardcoded secrets are the most preventable security vulnerability.</p>
<pre><code class="language-bash"># ❌ Hardcoded in code — will end up in git history eventually
DATABASE_URL = "postgresql://admin:mypassword@prod-db:5432/app"

# ❌ In .env committed to repo
echo ".env" &gt;&gt; .gitignore   # This gets forgotten

# ✅ Fetched from a secrets manager at runtime
import boto3

def get_secret(secret_name: str) -&gt; str:
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_name)["SecretString"]

DATABASE_URL = get_secret("prod/database/url")
</code></pre>
<p>Use AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, or Azure Key Vault. The investment is low; the blast radius of a leaked secret can be catastrophic.</p>
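<p>One practical note: fetching from the secrets manager on every call adds latency and cost, so teams typically memoize per process. A sketch with an injectable fetcher (names here are illustrative); keep in mind a cached secret won't reflect rotation until the process restarts or the cache is cleared:</p>

```python
import functools
from typing import Callable

def make_secret_getter(fetch: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap any backend fetcher (AWS, Vault, ...) with per-process memoization."""
    @functools.lru_cache(maxsize=128)
    def get_secret(name: str) -> str:
        return fetch(name)  # hits the backend only on the first call per name
    return get_secret
```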
<hr />
<h2>Security Checklist for APIs</h2>
<p>Before shipping an API endpoint, run through this:</p>
<ul>
<li> Authentication required on all non-public routes</li>
<li> Object-level authorization: does the user own this resource?</li>
<li> Response schema is explicit — no extra fields leaking</li>
<li> Rate limiting on auth endpoints (login, token issuance)</li>
<li> Rate limiting on resource endpoints</li>
<li> Input validation on all parameters (types, lengths, allowed values)</li>
<li> No sensitive data in JWT payload</li>
<li> API keys hashed in storage, never in logs</li>
<li> HTTPS enforced with HSTS header</li>
<li> Secrets loaded from a secrets manager, not hardcoded or committed in <code>.env</code> files</li>
</ul>
<hr />
<h2>Key Takeaways</h2>
<ul>
<li><strong>Authentication ≠ Authorization</strong> — most breaches happen when you verify identity but don't verify permission</li>
<li><strong>API Keys</strong> are underrated for server-to-server auth — hash them, support rotation, prefix for scanning</li>
<li><strong>JWT pitfalls</strong> (none algorithm, payload exposure, no revocation) are more common than you'd think</li>
<li><strong>OAuth 2.0:</strong> Authorization Code + PKCE for users, Client Credentials for services — that's most of what you need</li>
<li><strong>BOLA</strong> (broken object-level auth) is the #1 real-world API vulnerability — always check resource ownership</li>
<li><strong>Rate limiting and secrets management</strong> are table stakes, not nice-to-haves</li>
</ul>
<hr />
<p><strong>What's the most memorable security incident you've seen or heard about that started with an API design mistake — not a cryptography failure?</strong></p>
<hr />
<p><em>Next in the series → <strong>Post 04: SQL or NoSQL? Wrong Question. Here's the Right One.</strong></em></p>
<p><em>You know who's talking to your API and what they're allowed to do. Now: where does the data actually live?</em></p>
]]></content:encoded></item><item><title><![CDATA[Cache Invalidation: The Problem That Humbles Every Engineer]]></title><description><![CDATA[Series: Backend Engineering Fundamentals · Post 02 of 07
Level: Intermediate · Read time: ~9 min


Phil Karlton famously said there are only two hard problems in computer science: cache invalidation a]]></description><link>https://ajitabh.net/cache-invalidation-the-problem-that-humbles-every-engineer</link><guid isPermaLink="true">https://ajitabh.net/cache-invalidation-the-problem-that-humbles-every-engineer</guid><category><![CDATA[caching]]></category><category><![CDATA[Redis]]></category><category><![CDATA[System Design]]></category><category><![CDATA[memcached]]></category><category><![CDATA[Cache Invalidation]]></category><category><![CDATA[backend developments]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[web performance]]></category><dc:creator><![CDATA[Ajitabh Singh]]></dc:creator><pubDate>Wed, 25 Mar 2026 12:31:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c0b381d9da55a9a5203e04/88c1dc88-7096-4f3d-b1f0-b896663ee9c0.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>Series:</strong> Backend Engineering Fundamentals · Post 02 of 07
<strong>Level:</strong> Intermediate · <strong>Read time:</strong> ~9 min</p>
</blockquote>
<hr />
<p>Phil Karlton famously said there are only two hard problems in computer science: cache invalidation and naming things.</p>
<p>He was joking. But he wasn't wrong.</p>
<p>Caching seems simple. You store a result and serve the stored version next time. The hard part isn't storing data. It's knowing <em>when the stored version is no longer valid</em>, and handling that correctly at scale without bringing your database to its knees in the process.</p>
<p>This post covers the caching concepts that matter in production: where to cache, what to cache, how to invalidate it, and the failure modes that catch teams off guard.</p>
<hr />
<h3>Why Caching Matters (Beyond "It Makes Things Fast")</h3>
<p>Before diving into mechanisms, let's be clear about what caching actually protects:</p>
<ul>
<li><strong>Database load</strong> — Every cache hit is a DB query that didn't happen</li>
<li><strong>Latency</strong> — Memory reads are ~100x faster than a network round-trip to a DB</li>
<li><strong>Cost</strong> — Fewer DB operations = smaller instance sizes = real money at scale</li>
<li><strong>Resilience</strong> — A warm cache can serve traffic even when the DB is degraded</li>
</ul>
<p>But caching introduces its own risks: <strong>stale data</strong>, <strong>cache stampedes</strong>, <strong>memory pressure</strong>, and <strong>invalidation bugs</strong> that surface as subtle data inconsistencies. Understanding these tradeoffs is what separates a senior engineer from someone who just adds Redis to every problem.</p>
<hr />
<h2>The Caching Layers</h2>
<p>Modern systems have caching at multiple levels, and understanding each layer helps you place data in the right one.</p>
<pre><code>Client Request
     ↓
[Browser Cache]        ← Layer 1: HTTP Cache-Control headers
     ↓
[CDN / Edge Cache]     ← Layer 2: Cloudflare, Fastly, CloudFront
     ↓
[API Gateway Cache]    ← Layer 3: Optional, for high-traffic APIs
     ↓
[Application Cache]    ← Layer 4: Redis, Memcached (your code controls this)
     ↓
[Database Buffer Pool] ← Layer 5: MySQL/Postgres keeps hot pages in memory
     ↓
[Disk]
</code></pre>
<p>Most teams operate actively at Layers 2 and 4. The decisions you make there have the biggest impact.</p>
<hr />
<h2>Redis vs Memcached — The Honest Comparison</h2>
<p>Both are in-memory key-value stores. Most teams should just use <strong>Redis</strong>. Here's why:</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Redis</th>
<th>Memcached</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Data structures</strong></td>
<td>Strings, hashes, lists, sets, sorted sets, streams</td>
<td>Strings only</td>
</tr>
<tr>
<td><strong>Persistence</strong></td>
<td>Optional (RDB snapshots, AOF logs)</td>
<td>None</td>
</tr>
<tr>
<td><strong>Replication</strong></td>
<td>Built-in primary/replica</td>
<td>None (third-party)</td>
</tr>
<tr>
<td><strong>Clustering</strong></td>
<td>Redis Cluster (built-in)</td>
<td>Client-side sharding</td>
</tr>
<tr>
<td><strong>Pub/Sub</strong></td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td><strong>Lua scripting</strong></td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td><strong>Memory efficiency</strong></td>
<td>Good</td>
<td>Slightly better for simple strings</td>
</tr>
<tr>
<td><strong>Multithreading</strong></td>
<td>Single-threaded (I/O event loop)</td>
<td>Multi-threaded</td>
</tr>
</tbody></table>
<p><strong>Use Memcached when:</strong> You have a very specific use case — pure string caching at enormous scale — and you've benchmarked that Memcached's multi-threaded architecture genuinely outperforms Redis for your workload. This is rare.</p>
<p><strong>Use Redis for everything else.</strong> The richer data structures alone (sorted sets for leaderboards, streams for queues) make it the practical default.</p>
<hr />
<h2>Caching Strategies</h2>
<h3>Cache-Aside (Lazy Loading)</h3>
<p>The most common pattern. Your application manages the cache explicitly.</p>
<pre><code class="language-python">def get_user(user_id: str) -&gt; User:
    # 1. Check cache
    cached = redis.get(f"user:{user_id}")
    if cached:
        return User.from_json(cached)
    
    # 2. Cache miss — fetch from DB
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    
    # 3. Populate cache for next time
    redis.setex(f"user:{user_id}", 3600, user.to_json())  # TTL: 1 hour
    
    return user
</code></pre>
<p><strong>Pros:</strong> Only caches data that's actually requested. Simple to reason about.<br /><strong>Cons:</strong> First request always hits the DB (cold cache). Race condition possible if multiple requests miss simultaneously.</p>
<hr />
<h3>Write-Through</h3>
<p>Write to the cache and DB simultaneously on every write.</p>
<pre><code class="language-python">def update_user(user_id: str, data: dict) -&gt; User:
    user = db.update("UPDATE users SET ... WHERE id = %s", user_id, data)
    redis.setex(f"user:{user_id}", 3600, user.to_json())  # Sync write to cache
    return user
</code></pre>
<p><strong>Pros:</strong> Cache is always consistent with DB. No stale reads after writes.<br /><strong>Cons:</strong> Write latency increases. Cache fills with data that might never be read.</p>
<hr />
<h3>Write-Behind (Write-Back)</h3>
<p>Write to cache immediately, write to DB asynchronously.</p>
<p><strong>Pros:</strong> Extremely fast writes.<br /><strong>Cons:</strong> Risk of data loss if cache fails before async write completes. Complex error handling. Use only when you fully understand the durability tradeoff.</p>
<hr />
<h3>Read-Through</h3>
<p>The cache layer itself fetches from DB on a miss — your application always talks to the cache.</p>
<pre><code class="language-python"># Cache library handles DB fallback automatically
user = cache.get(f"user:{user_id}", loader=lambda: db.find_user(user_id))
</code></pre>
<p><strong>Pros:</strong> Application code stays clean. Cache and DB logic are centralized.<br /><strong>Cons:</strong> Requires a cache library or proxy that supports this pattern.</p>
<hr />
<h2>Cache Invalidation — The Hard Part</h2>
<p>There are three approaches, each with different tradeoffs:</p>
<h3>1. TTL (Time-To-Live) — Simplest</h3>
<p>Set an expiry time. The data becomes stale after that window.</p>
<pre><code class="language-python">redis.setex("product:456:price", 300, "29.99")  # Expires in 5 minutes
</code></pre>
<p><strong>Works well for:</strong> Data that can tolerate slight staleness — product listings, user profile data, search results.<br /><strong>Fails for:</strong> Anything that needs immediate consistency after a write — account balances, inventory levels, permissions.</p>
<hr />
<h3>2. Event-Driven Invalidation — Most Correct</h3>
<p>When data changes, explicitly invalidate or update the cached version.</p>
<pre><code class="language-python">def update_product_price(product_id: str, new_price: float):
    db.update("UPDATE products SET price = %s WHERE id = %s", new_price, product_id)
    redis.delete(f"product:{product_id}:price")  # Explicit invalidation
    # Or: redis.set(...) to update immediately rather than wait for next read
</code></pre>
<p><strong>Works well for:</strong> Data that must be fresh after writes.<br /><strong>Fails for:</strong> Systems with complex invalidation logic across many cache keys — one update triggers a cascade of invalidations that's hard to track.</p>
<hr />
<h3>3. Cache Tags / Dependency Tracking — Advanced</h3>
<p>Group related cache entries under a tag. Invalidate the tag, and all entries under it expire.</p>
<pre><code class="language-python"># Pseudo-code — some Redis libraries support this natively
cache.set("user:123:orders", data, tags=["user:123", "orders"])
cache.invalidate_tag("user:123")  # Clears user:123:orders and all other tagged entries
</code></pre>
<p><strong>Works well for:</strong> Complex, nested data that comes from a single entity.<br /><strong>Requires:</strong> A cache library or framework that supports this pattern (Symfony Cache, Django's cache framework, etc.)</p>
<hr />
<h2>The Cache Stampede Problem</h2>
<p>Imagine 10,000 concurrent users hit your app. A popular cache key expires. All 10,000 requests miss the cache simultaneously and hammer your database at once.</p>
<p>This is a <strong>cache stampede</strong> (also called dogpiling). It can bring down a database that was otherwise healthy.</p>
<pre><code>T=0: Cache key expires
T=0.001: 10,000 requests arrive, all miss cache
T=0.001: 10,000 DB queries fire simultaneously
T=0.5: Database CPU spikes to 100%
T=1.0: DB starts timing out requests
T=1.5: Your PagerDuty alert fires
</code></pre>
<p><strong>Solutions:</strong></p>
<p><strong>Mutex / Locking</strong> — Only one request rebuilds the cache. Others wait.</p>
<pre><code class="language-python">def get_with_lock(key: str, loader_fn, retries: int = 50):
    for _ in range(retries):
        value = redis.get(key)
        if value:
            return value

        lock_key = f"lock:{key}"
        if redis.set(lock_key, "1", nx=True, ex=10):  # Acquire lock (auto-expires)
            try:
                value = loader_fn()
                redis.setex(key, 3600, value)
                return value
            finally:
                redis.delete(lock_key)
        time.sleep(0.1)  # Someone else is rebuilding; wait and re-check the cache
    raise TimeoutError(f"cache rebuild for {key} did not complete")
</code></pre>
<p><strong>Probabilistic Early Expiration</strong> — Start refreshing the cache <em>before</em> it expires, with a small random probability as TTL approaches.</p>
<p><strong>Stale-While-Revalidate</strong> — Serve the stale value immediately, refresh in the background. The user gets a fast (slightly stale) response while the next request will get fresh data.</p>
<hr />
<h2>CDN Caching — Don't Forget the Edge</h2>
<p>For static assets, API responses, and server-rendered pages, CDN-level caching is often more impactful than application caching.</p>
<pre><code class="language-http"># Response headers that control CDN behavior
Cache-Control: public, max-age=3600, s-maxage=86400
# public = CDN can cache this
# max-age = browser TTL (1 hour)
# s-maxage = CDN TTL (1 day)

Cache-Control: private, no-store
# private = only the browser caches this, not CDNs
# no-store = don't cache anywhere (for sensitive data)

Surrogate-Key: product-456 category-shoes
# Fastly/Varnish: tag-based purging at the CDN edge
</code></pre>
<p><strong>Cache-busting for static assets:</strong> Use content hashes in filenames so you can set long TTLs without worrying about stale JS/CSS.</p>
<pre><code># Build output
app.js  → app.a3f9c2d1.js    ← Hash changes when content changes
app.css → app.b8e4d6a2.css
</code></pre>
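<p>The hash is just a digest of the file contents, so the name changes only when the bytes do. A sketch of what a bundler computes:</p>

```python
import hashlib
from pathlib import PurePosixPath

def hashed_filename(name: str, content: bytes, length: int = 8) -> str:
    # Same content -> same name (cacheable forever); new content -> new name
    digest = hashlib.sha256(content).hexdigest()[:length]
    p = PurePosixPath(name)
    return f"{p.stem}.{digest}{p.suffix}"
```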
<hr />
<h2>What NOT to Cache</h2>
<p>Caching everything is an anti-pattern. Some things should never be cached:</p>
<ul>
<li><strong>User-specific sensitive data</strong> (auth tokens, payment info) — unless isolated per-user with short TTLs</li>
<li><strong>Write-heavy data</strong> — cache churn (constant invalidations) adds overhead with no benefit</li>
<li><strong>Uniqueness checks</strong> — "is this username taken?" must always hit the source of truth</li>
<li><strong>Random or time-sensitive outputs</strong> — <code>NOW()</code>, <code>UUID()</code>, anything that must be unique per request</li>
</ul>
<hr />
<h2>Quick Reference: Eviction Policies</h2>
<p>When Redis runs out of memory, it evicts keys based on its configured policy:</p>
<table>
<thead>
<tr>
<th>Policy</th>
<th>Behavior</th>
<th>Use When</th>
</tr>
</thead>
<tbody><tr>
<td><code>noeviction</code></td>
<td>Returns error on write when full</td>
<td>You need strict control</td>
</tr>
<tr>
<td><code>allkeys-lru</code></td>
<td>Evicts least recently used keys</td>
<td>General-purpose cache</td>
</tr>
<tr>
<td><code>volatile-lru</code></td>
<td>LRU eviction only for keys with TTL</td>
<td>You have a mix of TTL and permanent keys</td>
</tr>
<tr>
<td><code>allkeys-lfu</code></td>
<td>Evicts least <em>frequently</em> used (Redis 4+)</td>
<td>Access patterns are skewed</td>
</tr>
<tr>
<td><code>volatile-ttl</code></td>
<td>Evicts keys closest to expiry</td>
<td>You want to preserve recently-refreshed data</td>
</tr>
</tbody></table>
<p>For a pure cache workload, <strong><code>allkeys-lru</code></strong> or <strong><code>allkeys-lfu</code></strong> are usually the right defaults.</p>
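<p>LRU itself is simple enough to sketch: an ordered map where reads move a key to the "recent" end and inserts evict from the "old" end. (This shows the idea; Redis actually uses an approximated, sampled LRU rather than exact ordering.)</p>

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used key
```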
<hr />
<h2>Key Takeaways</h2>
<ul>
<li><strong>Redis is the practical default</strong> — richer data structures, replication, and pub/sub make it worth the marginal overhead over Memcached</li>
<li><strong>TTL-based expiration</strong> is simple and works well for data that tolerates some staleness</li>
<li><strong>Event-driven invalidation</strong> is correct but requires discipline to maintain as systems evolve</li>
<li><strong>Cache stampedes are real</strong> — use locks, early expiration, or stale-while-revalidate for high-traffic keys</li>
<li><strong>CDN caching</strong> is often more impactful than application-level caching for read-heavy, public data</li>
<li><strong>Don't cache everything</strong> — cache what's expensive to recompute and safe to serve slightly stale</li>
</ul>
<hr />
<p><strong>Have you been bitten by a cache invalidation bug in production? What was the data inconsistency and how long did it take to find it?</strong></p>
<p>Those are the stories the comments were made for.</p>
<hr />
<p><em>Next in the series → <strong>Post 03: Auth Is Not Security — A Guide for Teams Who Ship Fast</strong></em></p>
<p><em>You've cached your data efficiently. Now: who's allowed to see it?</em></p>
]]></content:encoded></item><item><title><![CDATA[The API Decision That Haunts Your Architecture]]></title><description><![CDATA[Series: Backend Engineering Fundamentals · Post 01 of 07
Level: Intermediate · Read time: ~8 min


A team I know spent nine months migrating their mobile backend from REST to GraphQL. Two engineers de]]></description><link>https://ajitabh.net/rest-vs-soap-vs-graphql-vs-grpc</link><guid isPermaLink="true">https://ajitabh.net/rest-vs-soap-vs-graphql-vs-grpc</guid><category><![CDATA[api]]></category><category><![CDATA[REST API]]></category><category><![CDATA[GraphQL]]></category><category><![CDATA[gRPC]]></category><category><![CDATA[System Design]]></category><category><![CDATA[backend]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[Web Development]]></category><dc:creator><![CDATA[Ajitabh Singh]]></dc:creator><pubDate>Tue, 24 Mar 2026 13:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c0b381d9da55a9a5203e04/fd0877ff-c501-47fb-a314-3eb5d537a894.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>Series:</strong> Backend Engineering Fundamentals · Post 01 of 07
<strong>Level:</strong> Intermediate · <strong>Read time:</strong> ~8 min</p>
</blockquote>
<hr />
<p>A team I know spent nine months migrating their mobile backend from REST to GraphQL. Two engineers dedicated full-time. At the end of it, their core performance problem of slow dashboard loads was unchanged. The culprit was N+1 queries in the database layer they had never touched.</p>
<p>API decisions feel reversible. They rarely are. By the time you have built client SDKs, versioning contracts, and downstream integrations, switching paradigms is a full rewrite. That is why your API style is an architectural decision and not an implementation detail.</p>
<p>Let us break down the four major API paradigms honestly so you can choose based on your actual constraints and not the current hype cycle.</p>
<hr />
<h2>The Four Paradigms at a Glance</h2>
<table>
<thead>
<tr>
<th>Aspect</th>
<th>REST</th>
<th>SOAP</th>
<th>GraphQL</th>
<th>gRPC</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Protocol</strong></td>
<td>HTTP</td>
<td>HTTP, SMTP</td>
<td>HTTP</td>
<td>HTTP/2</td>
</tr>
<tr>
<td><strong>Data Format</strong></td>
<td>JSON, XML</td>
<td>XML (strict)</td>
<td>JSON</td>
<td>Protobuf (binary)</td>
</tr>
<tr>
<td><strong>Flexibility</strong></td>
<td>Medium</td>
<td>Low and rigid</td>
<td>High and client-driven</td>
<td>Medium with strong contract</td>
</tr>
<tr>
<td><strong>Performance</strong></td>
<td>Moderate</td>
<td>Heavy due to XML overhead</td>
<td>Good as it avoids over-fetching</td>
<td>Excellent with binary and streaming</td>
</tr>
<tr>
<td><strong>Team Overhead</strong></td>
<td>Low</td>
<td>High</td>
<td>Medium</td>
<td>Medium</td>
</tr>
<tr>
<td><strong>Best For</strong></td>
<td>Web and mobile APIs</td>
<td>Enterprise and legacy</td>
<td>Complex UI data needs</td>
<td>Internal microservices</td>
</tr>
</tbody></table>
<hr />
<h2>REST — The Default, For Good Reason</h2>
<p>REST (Representational State Transfer) is stateless, resource-based, and uses standard HTTP methods: GET, POST, PUT, DELETE. JSON is the lingua franca. Almost every developer knows it, every framework supports it, and tooling is mature.</p>
<p><strong>The real strength:</strong> REST's simplicity is the feature itself. Low cognitive overhead means faster onboarding, easier debugging, and predictable behavior in production.</p>
<p><strong>The honest weakness:</strong> REST can lead to over-fetching where the response contains more fields than the client needs, or under-fetching where the client needs multiple round trips to assemble a view. For most teams this is manageable. For teams with high-traffic and data-heavy mobile apps it becomes real latency.</p>
<pre><code class="language-http">GET /api/v1/users/123
GET /api/v1/users/123/orders
GET /api/v1/users/123/preferences
# Three requests to build one profile page
</code></pre>
<p><strong>Use REST when:</strong></p>
<ul>
<li>You are building a public-facing API</li>
<li>Your team is mixed seniority or onboarding quickly</li>
<li>You need broad tooling, documentation, and ecosystem support</li>
<li>You do not yet know all the ways your data will be consumed</li>
</ul>
<hr />
<h2>SOAP — Not Dead, Just Misunderstood</h2>
<p>SOAP has a deserved reputation: it is verbose, WSDLs are painful, and XML parsing is heavy. And yet it still runs banking systems, healthcare integrations, and government infrastructure worldwide.</p>
<p>Why? Because SOAP has built-in standards for things REST leaves entirely to you:</p>
<ul>
<li><strong>WS-Security</strong> for message-level encryption and signing</li>
<li><strong>WS-AtomicTransaction</strong> for distributed transaction support</li>
<li><strong>WSDL contracts</strong> that are machine-readable, strongly typed, and version-controlled</li>
</ul>
<pre><code class="language-xml">&lt;soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"&gt;
  &lt;soapenv:Body&gt;
    &lt;pay:ProcessPayment&gt;
      &lt;pay:AccountId&gt;ACC-9821&lt;/pay:AccountId&gt;
      &lt;pay:Amount&gt;500.00&lt;/pay:Amount&gt;
    &lt;/pay:ProcessPayment&gt;
  &lt;/soapenv:Body&gt;
&lt;/soapenv:Envelope&gt;
</code></pre>
<p>If you are integrating with a payment processor, a hospital EMR system, or any legacy enterprise platform built before 2010, you are using SOAP whether you planned to or not.</p>
<p><strong>Use SOAP when:</strong></p>
<ul>
<li>You are in a compliance-heavy domain such as finance, healthcare, or government</li>
<li>You are integrating with enterprise systems that mandate it</li>
<li>You need formal, auditable contracts and built-in security standards</li>
</ul>
<hr />
<h2>GraphQL — Powerful, But Only With Discipline</h2>
<p>GraphQL flips the data fetching model. Instead of the server defining what data an endpoint returns, the client declares exactly what it needs. One request, precisely the data you asked for, nothing more.</p>
<pre><code class="language-graphql">query {
  user(id: "123") {
    name
    email
    orders(last: 5) {
      id
      total
      status
    }
  }
}
</code></pre>
<p>This is genuinely powerful for complex UIs such as dashboards, news feeds, and mobile apps where different views need different data shapes.</p>
<p><strong>The honest tradeoffs:</strong></p>
<ol>
<li><strong>N+1 query problem:</strong> naive GraphQL resolvers can fire a database query per item in a list. You need DataLoader or a similar batching pattern to fix this.</li>
<li><strong>Schema governance:</strong> as your graph grows, schema discipline becomes a team-wide practice, not a one-time setup.</li>
<li><strong>Authorization complexity:</strong> REST's approach of protecting the endpoint is simpler than protecting each individual field on the graph.</li>
<li><strong>Caching:</strong> HTTP-level caching is trivial with REST, but GraphQL's POST-based queries need persisted queries or a custom caching layer.</li>
</ol>
<pre><code class="language-graphql"># Without DataLoader: N+1 queries for 100 orders
type Query {
  orders: [Order]  # 1 query
}
type Order {
  customer: User  # 100 queries, one per order
}
</code></pre>
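<p>The fix is batching: collect the keys requested during one resolution pass, then load them all with a single query. A minimal DataLoader-style sketch (real libraries add per-request caching and event-loop scheduling on top of this):</p>

```python
class BatchLoader:
    """Collects keys during a resolution pass, then resolves them with one batched call."""

    def __init__(self, batch_fn):
        self.batch_fn = batch_fn  # list of keys -> list of values, same order
        self.keys: list = []

    def want(self, key) -> None:
        if key not in self.keys:  # deduplicate repeated keys
            self.keys.append(key)

    def dispatch(self) -> dict:
        values = self.batch_fn(self.keys)  # one query instead of len(keys) queries
        return dict(zip(self.keys, values))
```

<p>Resolvers call <code>want()</code> while walking the order list, and <code>dispatch()</code> fires once per level of the query: 100 orders, one customer query.</p>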
<p><strong>Use GraphQL when:</strong></p>
<ul>
<li>Your frontend teams are strong and own the client-side queries</li>
<li>You have multiple clients such as mobile, web, and partner apps needing different data shapes</li>
<li>Your backend team can invest in schema governance and query depth limiting</li>
<li>You are building a product and not a generic public API</li>
</ul>
<hr />
<h2>gRPC — The Microservices Workhorse</h2>
<p>gRPC uses Protocol Buffers (Protobuf), a binary serialization format, over HTTP/2. The result is significantly faster than JSON over HTTP/1.1, with native support for streaming in both directions.</p>
<pre><code class="language-protobuf">// Define your contract in .proto
service OrderService {
  rpc GetOrder (OrderRequest) returns (OrderResponse);
  rpc StreamOrders (Empty) returns (stream Order);
}

message OrderRequest {
  string order_id = 1;
}
</code></pre>
<p>The contract is defined in <code>.proto</code> files that generate client and server code in multiple languages. This makes gRPC exceptional in polyglot environments where your Go service and your Python service speak the same typed contract.</p>
<p><strong>The tradeoffs:</strong></p>
<ul>
<li>Browser support is limited and requires a gRPC-Web proxy for browser clients</li>
<li>Protobuf binary is not human-readable and is harder to debug without proper tooling</li>
<li>Managing <code>.proto</code> files adds workflow overhead for smaller teams</li>
</ul>
<p><strong>Use gRPC when:</strong></p>
<ul>
<li>You are building internal service-to-service communication</li>
<li>Performance and latency are critical</li>
<li>You have or expect a polyglot architecture</li>
<li>You need bidirectional streaming for real-time data or event feeds</li>
</ul>
<hr />
<h2>How to Actually Choose</h2>
<p>Stop asking which API is best. Start asking what your team needs to operate reliably at your current scale.</p>
<table>
<thead>
<tr>
<th>Your situation</th>
<th>Recommended</th>
</tr>
</thead>
<tbody><tr>
<td>Public API, general purpose, mixed team</td>
<td><strong>REST</strong></td>
</tr>
<tr>
<td>Complex UI, multiple client types, strong frontend</td>
<td><strong>GraphQL</strong></td>
</tr>
<tr>
<td>Internal microservices, high throughput, polyglot</td>
<td><strong>gRPC</strong></td>
</tr>
<tr>
<td>Enterprise integration, compliance-driven, legacy</td>
<td><strong>SOAP</strong></td>
</tr>
<tr>
<td>Mobile app with internal services</td>
<td><strong>REST externally, gRPC internally</strong></td>
</tr>
<tr>
<td>Startup moving fast with a small team</td>
<td><strong>REST until it hurts</strong></td>
</tr>
</tbody></table>
<blockquote>
<p>Most mature systems use more than one paradigm. REST for the public API, gRPC for the service mesh, webhooks for async consumers. There is no rule against mixing paradigms, just the cost of maintaining each one.</p>
</blockquote>
<hr />
<h2>Three Pitfalls to Avoid</h2>
<p><strong>1. Migrating for hype and not for pain</strong></p>
<p>GraphQL and gRPC are excellent, but adopting them before you have the problem they solve is expensive. Slow queries? Fix the database. Mobile over-fetching on five endpoints? REST versioning or sparse fieldsets might be cheaper than a full migration.</p>
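<p>As a concrete example of the cheaper option, sparse fieldsets are a small change in most REST stacks: let the client name the fields it wants and filter the response. A minimal sketch, with the resource shape and parameter name invented for illustration:</p>
<pre><code class="language-python">def sparse_fields(resource, fields_param):
    """Return only the fields the client asked for, e.g. ?fields=id,status."""
    if not fields_param:
        return resource                         # no filter: full resource
    wanted = {f.strip() for f in fields_param.split(",")}
    return {k: v for k, v in resource.items() if k in wanted}

order = {"id": 42, "status": "shipped", "items": ["sku-1"], "customer": {"id": 7}}
slim = sparse_fields(order, "id,status")        # a mobile client asks for two fields
</code></pre>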
<p><strong>2. Treating API design as a junior task</strong></p>
<p>API contracts outlive the engineers who write them. Resource naming, versioning strategy, and error envelope structure all become your team's debt or foundation depending on the care taken in the first sprint.</p>
<p><strong>3. One style for everything</strong></p>
<p>A GraphQL API that also powers your internal service mesh is the wrong tool in the wrong place. Match the paradigm to the use case, even if that means maintaining two styles.</p>
<hr />
<h2>Key Takeaways</h2>
<ul>
<li><strong>REST</strong> is still the right default and you should not fix what is not broken</li>
<li><strong>GraphQL</strong> pays off when UI needs are complex and your team has schema discipline</li>
<li><strong>gRPC</strong> wins for internal microservice communication at scale</li>
<li><strong>SOAP</strong> survives because enterprise compliance demands it and that reality deserves respect</li>
<li>Most production systems run multiple API styles so match the tool to the context</li>
<li>The best API is the one your team can build, operate, and debug at 2am</li>
</ul>
<hr />
<p><strong>What is an API migration you have lived through? Did switching paradigms solve the original problem or did it just reveal a different one?</strong></p>
<p>Drop your thoughts in the comments. The best real-world examples will make it into Part 2.</p>
<hr />
<p><em>Next in the series: <strong>Post 02 — Cache Invalidation, The Problem That Humbles Everyone</strong></em></p>
<p><em>After you have decided how clients talk to your system, the next question is what do you do when those requests are expensive?</em></p>
]]></content:encoded></item><item><title><![CDATA[RAG vs Vectorless RAG: How AI Systems Retrieve Knowledge]]></title><description><![CDATA[How AI finds answers — and why the next generation is rethinking the approach.


Introduction
LLMs are powerful, but they only know what they were trained on — once training ends, new documents, compa]]></description><link>https://ajitabh.net/rag-vs-vectorless-rag</link><guid isPermaLink="true">https://ajitabh.net/rag-vs-vectorless-rag</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[llm]]></category><category><![CDATA[vector database]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Ajitabh Singh]]></dc:creator><pubDate>Mon, 23 Mar 2026 05:22:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69c0b381d9da55a9a5203e04/caa03ab6-195e-448f-b878-43af57bc820f.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>How AI finds answers — and why the next generation is rethinking the approach.</p>
</blockquote>
<hr />
<h2>Introduction</h2>
<p>LLMs are powerful, but they only know what they were trained on — once training ends, new documents, company updates, and recently uploaded PDFs are invisible to them. RAG solves this.</p>
<p><strong>Retrieval-Augmented Generation (RAG)</strong> lets the AI search for relevant information <em>before</em> answering. It uses what it finds to write accurate, grounded responses — no retraining required.</p>
<p>💡 <strong>In short: RAG = Search first, then generate.</strong></p>
<hr />
<h2>What is RAG?</h2>
<p>RAG makes AI "up-to-date" without retraining it constantly. It works in five steps:</p>
<h3>1. Indexing: Prepare Your Documents</h3>
<p>All documents (PDFs, web pages, text files) are organised into a searchable index — like building a card catalogue in a library. It happens once upfront and updates whenever new documents arrive.</p>
<h3>2. Chunking: Split Documents into Pieces</h3>
<p>LLMs can only process a limited amount of text at once. So documents are split into <strong>chunks</strong> — paragraphs or sections — to fit the AI's context window.</p>
<ul>
<li><strong>Too small</strong> → loses surrounding context</li>
<li><strong>Too large</strong> → wastes the AI's limited memory</li>
</ul>
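<p>A minimal sliding-window chunker, character-based with overlap, which is the simplest common scheme (production pipelines usually split on tokens or sentence boundaries instead):</p>
<pre><code class="language-python">def chunk_text(text, size=200, overlap=40):
    """Split text into overlapping windows so sentences near a boundary
    appear in two chunks instead of being cut in half."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "RAG pipelines retrieve relevant text before generating an answer. " * 20
pieces = chunk_text(doc, size=120, overlap=30)
</code></pre>
<p>Tuning <code>size</code> and <code>overlap</code> is exactly the tradeoff described above: a bigger window preserves more context but spends more of the model's limited budget per chunk.</p>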
<h3>3. Embeddings: Turn Text into Numbers</h3>
<p>To find <em>meaning</em>, not just keywords, each chunk is converted into a <strong>vector</strong> — a list of numbers that represents its meaning. Similar concepts produce similar vectors, even when the words are completely different.</p>
<p><em>Example:</em> "The cat sat on the mat" and "A feline rested on the rug" → nearly identical vectors.</p>
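<p>Similarity between vectors is usually measured with cosine similarity, the cosine of the angle between them. A toy sketch with invented 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):</p>
<pre><code class="language-python">import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-d vectors standing in for real embeddings.
cat_mat = [0.90, 0.10, 0.30]   # "The cat sat on the mat"
feline = [0.85, 0.15, 0.35]    # "A feline rested on the rug"
weather = [0.10, 0.90, 0.20]   # "It rained all day"

cosine_similarity(cat_mat, feline)    # high: similar meaning
cosine_similarity(cat_mat, weather)   # low: different meaning
</code></pre>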
<h3>4. Vector Database: Store the Meaning</h3>
<p>Vectors are stored in a <strong>vector database</strong> (Pinecone, Weaviate, Qdrant, FAISS) that enables fast semantic search across thousands — or millions — of chunks.</p>
<h3>5. Query Time: Answering Questions</h3>
<ol>
<li>User asks a question</li>
<li>Question is converted into a vector</li>
<li>Semantic search finds the <strong>top-k most similar chunks</strong></li>
<li>Chunks + question are combined into a prompt</li>
<li>LLM generates a grounded answer</li>
</ol>
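<p>The five steps above can be sketched end to end. This toy version uses a bag-of-words count as the "embedding" so it runs without a model; a real system would call an embedding model and a vector database instead, and the chunks here are invented:</p>
<pre><code class="language-python">import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm = norm * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 1-4: index the chunks (this list stands in for a vector database).
chunks = [
    "Employees accrue 20 days of annual leave per year.",
    "The cafeteria opens at 8am on weekdays.",
    "Sick leave requires a doctor's note after three days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Step 5: embed the question, rank by similarity, keep the top-k chunks.
question = "How many days of annual leave do employees get?"
q_vec = embed(question)
ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
top_k = [chunk for chunk, _ in ranked[:2]]   # these plus the question go to the LLM
</code></pre>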
<p>✅ Works well for many applications — but it has real limitations.</p>
<hr />
<h2>Problems with Traditional RAG</h2>
<p>The entire RAG pipeline is only as good as its weakest link — <strong>retrieval</strong>. Here's where it breaks down:</p>
<ul>
<li><strong>Chunks lose context</strong> — fragments miss the surrounding meaning that gives them significance</li>
<li><strong>Semantic search isn't perfect</strong> — embeddings can miss relevant sections, especially in specialised domains</li>
<li><strong>Information spans multiple chunks</strong> — answers often need several sections combined, but RAG treats each chunk independently</li>
<li><strong>Chunking is tricky</strong> — too big, too small, or overlapping chunks all introduce errors</li>
<li><strong>Vector databases need maintenance</strong> — updating, deleting, and re-indexing adds operational complexity over time</li>
<li><strong>Confident mistakes</strong> — AI writes fluent, authoritative answers even when the retrieved chunks are slightly off-topic</li>
</ul>
<blockquote>
<p><em>"The weakness of RAG is not the generation — it is the retrieval. If the right information was never found, the best AI in the world cannot save you."</em></p>
</blockquote>
<hr />
<h2>Vectorless RAG: A Different Approach</h2>
<p>Vectorless RAG skips vectors entirely. Instead of searching by similarity, it <strong>reasons</strong> through documents to find answers — like a detective working a case, not a search engine matching keywords.</p>
<p>💡 <strong>Core idea:</strong> Break the question into sub-questions, navigate to the exact document sections, read them in full, then combine everything into one complete answer.</p>
<h3>How It Works</h3>
<p>Think of how a doctor diagnoses a patient:</p>
<p><em>Fever → infection? → what type? → check bloodwork → treat accordingly</em></p>
<p>Each step guides the next. Vectorless RAG applies this same logic to documents:</p>
<pre><code>Question: What is our employee leave policy?
├── Sick days?          → HR Manual, Section 3.2
├── Annual leave?       → HR Manual, Section 4.1
└── Approval process?   → Policy Doc, Approval Workflow

              ↓
   Read each section in full
              ↓
   Synthesise one complete, context-rich answer
</code></pre>
<p>Because each section is read in full, the final answer is grounded in complete context — no guessing from fragments.</p>
<p>No embeddings. No vector database. The "index" is simply a clear, hierarchical map of your documents — easy to read, easy to update.</p>
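<p>The navigation idea can be sketched with the index as a plain nested dict. In a real system an LLM generates the sub-questions and picks the sections; here that routing is hard-coded so the sketch stays self-contained, and the documents and section names are invented:</p>
<pre><code class="language-python"># The whole "index" is a human-readable map of the documents.
doc_map = {
    "HR Manual": {
        "3.2 Sick days": "Employees get 10 paid sick days per year.",
        "4.1 Annual leave": "Employees accrue 20 days of annual leave.",
    },
    "Policy Doc": {
        "Approval workflow": "Leave requests are approved by the line manager.",
    },
}

# Each sub-question is routed to the section that answers it.
sub_questions = [
    ("Sick days?", ("HR Manual", "3.2 Sick days")),
    ("Annual leave?", ("HR Manual", "4.1 Annual leave")),
    ("Approval process?", ("Policy Doc", "Approval workflow")),
]

# Read each routed section in full, then synthesise one answer.
findings = [f"{q} {doc_map[doc][section]}" for q, (doc, section) in sub_questions]
answer = " ".join(findings)
</code></pre>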
<p>✅ <strong>Accurate, context-rich, low maintenance</strong>
⚠️ Works best with well-structured documents</p>
<hr />
<h2>RAG vs Vectorless RAG: At a Glance</h2>
<table>
<thead>
<tr>
<th>Factor</th>
<th>Traditional RAG</th>
<th>Vectorless RAG</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Speed</strong></td>
<td>✅ Fast (1–3 sec)</td>
<td>⚠️ Slower (10–30 sec)</td>
</tr>
<tr>
<td><strong>Accuracy</strong></td>
<td>⚠️ Moderate</td>
<td>✅ High</td>
</tr>
<tr>
<td><strong>Infrastructure</strong></td>
<td>❌ Complex (vector DB)</td>
<td>✅ Simple</td>
</tr>
<tr>
<td><strong>Context quality</strong></td>
<td>❌ Fragmented chunks</td>
<td>✅ Full sections</td>
</tr>
<tr>
<td><strong>Document types</strong></td>
<td>✅ Any format</td>
<td>⚠️ Structured docs work best</td>
</tr>
<tr>
<td><strong>Multi-step reasoning</strong></td>
<td>❌ Not supported</td>
<td>✅ Built-in</td>
</tr>
</tbody></table>
<h3>Rule of Thumb</h3>
<ul>
<li><strong>Fast &amp; large-scale?</strong> → Traditional RAG</li>
<li><strong>Accurate &amp; structured?</strong> → Vectorless RAG</li>
</ul>
<p>A slightly slower, accurate answer beats a fast, wrong one — especially in legal, medical, compliance, or technical domains.</p>
<hr />
<h2>Conclusion</h2>
<p>RAG opened the door for AI to answer questions about new information. But chunking and vector search create real challenges that limit accuracy in high-stakes situations.</p>
<p>Vectorless RAG bets on reasoning over retrieval — and for structured documents, that bet pays off. It delivers full-context answers with simpler infrastructure and less ongoing maintenance.</p>
<p>The future of AI retrieval may not be in bigger vector databases — <strong>it may be in smarter navigation and reasoning.</strong></p>
<hr />
<p><em>Found this helpful? Share it with someone building AI systems. Questions or thoughts? Drop a comment below.</em></p>
]]></content:encoded></item></channel></rss>