The Three Pillars of Observability Every Engineer Must Know

Series: Backend Engineering Fundamentals · Post 07 of 07 · Level: Intermediate · Read time: ~9 min

It's 2am. An alert fires. Your service is down.

You open your dashboard: CPU is fine, memory is fine. You check the logs: thousands of lines of INFO messages and a few ERROR lines with stack traces — none of them obviously the root cause. You open your tracing tool and realize you only instrumented the main API service, not the three downstream services it calls.

Forty-five minutes later, you find it: a database connection pool exhausted in a service that has no alerting on it, caused by a slow query introduced in yesterday's deployment. You had no visibility into that service, no alert on the metric that would have told you, and no trace that showed where the latency was accumulating.

This is what poor observability looks like. You're debugging in the dark.

Observability isn't a feature you add later. It's the practice of making your system understandable from the outside — so that when something goes wrong, you can ask questions of your system and get useful answers.

The Three Pillars

Observability is commonly structured around three signal types:

Pillar	Answers	Examples
Logs	What happened?	Error traces, audit events, debug output
Metrics	How is it behaving over time?	Request rate, error %, CPU, latency histograms
Traces	Where did this request go and how long did each step take?	Distributed request spans across services

Each pillar answers different questions. All three are needed for a complete picture.

Pillar 1: Logs

Logs are the most familiar observability tool — and the most frequently done wrong.

Structured Logging

The difference between logs that help you and logs that don't is structure.

# ❌ Unstructured — fast to write, painful to query at scale
print(f"Processing order {order_id} for user {user_id} failed: {error}")
# Output: "Processing order 789 for user 123 failed: Connection timeout"

# ✅ Structured — queryable, filterable, alertable
import structlog

logger = structlog.get_logger()
logger.error(
    "order_processing_failed",
    order_id=order_id,
    user_id=user_id,
    error_type="ConnectionTimeout",
    service="payment-service",
    duration_ms=3240,
    retry_count=3
)
# Output (with structlog configured to render JSON; fields abbreviated):
# {"event": "order_processing_failed", "order_id": "789", "user_id": "123",
#  "error_type": "ConnectionTimeout", "duration_ms": 3240, ...}

With structured logs, you can query: show me all orders that failed with ConnectionTimeout in the last hour where retry_count > 2. With unstructured logs, you're writing regex.
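To make that query concrete, here is a minimal sketch that filters JSON log lines in plain Python. The sample lines and the `failed_orders` helper are hypothetical; in practice your log aggregator's query language does this work for you.

```python
import json

# Hypothetical JSON log lines, one event per line, as a log shipper might store them.
LOG_LINES = [
    '{"event": "order_processing_failed", "order_id": "789", "error_type": "ConnectionTimeout", "retry_count": 3}',
    '{"event": "order_processing_failed", "order_id": "790", "error_type": "ConnectionTimeout", "retry_count": 1}',
    '{"event": "order_processing_failed", "order_id": "791", "error_type": "ValidationError", "retry_count": 0}',
    '{"event": "order_placed", "order_id": "792"}',
]


def failed_orders(lines, error_type, min_retries):
    """Return order IDs that failed with error_type after more than min_retries retries."""
    matches = []
    for line in lines:
        record = json.loads(line)
        if (
            record.get("event") == "order_processing_failed"
            and record.get("error_type") == error_type
            and record.get("retry_count", 0) > min_retries
        ):
            matches.append(record["order_id"])
    return matches


print(failed_orders(LOG_LINES, "ConnectionTimeout", 2))  # ['789']
```

Every condition is a plain field comparison. The unstructured equivalent would need a regex per message format.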

Log Levels — Use Them Correctly

logger.debug(...)    # Detailed diagnostic, disabled in production
logger.info(...)     # Normal operation milestones (request received, order placed)
logger.warning(...)  # Unexpected but handled (retried 2x, using fallback)
logger.error(...)    # Failed operation, requires attention
logger.critical(...) # System cannot continue, immediate action required

Common mistake: Using INFO for everything. At scale, an INFO log for every request is millions of log entries per hour — expensive to store and slow to search. Log meaningful state changes and errors, not every heartbeat.
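As a rough illustration of how a level threshold cuts volume, the sketch below uses the standard library's `logging` module (not structlog) with the threshold set to INFO, so DEBUG output is dropped. The `build_logger` helper and the logger name are made up for this example.

```python
import io
import logging


def build_logger(level_name: str, stream) -> logging.Logger:
    """Create an isolated logger that writes 'LEVEL message' lines to stream."""
    logger = logging.getLogger(f"levels_demo.{level_name}")
    logger.setLevel(getattr(logging, level_name))
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.handlers = [handler]
    return logger


buffer = io.StringIO()
log = build_logger("INFO", buffer)  # a production-style threshold

log.debug("cache lookup details")            # dropped: below the threshold
log.info("order placed")                     # kept
log.warning("retrying payment, attempt 2")   # kept
log.error("payment failed")                  # kept

emitted = buffer.getvalue().splitlines()
print(emitted)
```

Raising the threshold to WARNING for a particularly chatty service is a one-line change, which is the main reason to assign levels carefully in the first place.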

Correlation IDs

In a distributed system, a single user action triggers logs across multiple services. Without a correlation ID, you can't connect them.

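import uuid
from contextvars import ContextVar

import httpx
from fastapi import FastAPI, Request  # assumes FastAPI; the middleware below uses its API

app = FastAPI()
request_id: ContextVar[str] = ContextVar('request_id', default='')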

# Middleware: Generate ID at the edge
@app.middleware("http")
async def add_correlation_id(request: Request, call_next):
    req_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    request_id.set(req_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = req_id
    return response

# Include it on every log call (structlog's contextvars helpers can bind it once per request instead)
logger.info("payment_initiated", request_id=request_id.get(), ...)

# Pass it downstream
httpx.post(payment_service_url, headers={"X-Request-ID": request_id.get()})

Now you can search your log aggregator for a single request_id and see every log line across every service for that user's request.
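If you would rather not pass the ID to every log call by hand, one option is to let the logging layer read it from the context variable. The sketch below does this with the standard library's `logging.Filter`. The `RequestIdFilter` class and the logger name are illustrative, and structlog users would reach for its own contextvars support instead.

```python
import io
import logging
from contextvars import ContextVar

request_id: ContextVar[str] = ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record so the formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id.get()
        return True  # never drop the record, only annotate it


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("correlation_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

token = request_id.set("abc-123")      # what the middleware would do per request
logger.info("payment_initiated")
request_id.reset(token)                # request finished
logger.info("background_job_started")  # outside any request: falls back to "-"

lines = stream.getvalue().splitlines()
print(lines)
```

Because the ID lives in a `ContextVar`, concurrent asyncio tasks each see their own value, so requests handled in parallel do not overwrite each other's IDs.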

Pillar 2: Metrics

Metrics are aggregated numerical measurements over time. They answer: is the system healthy right now, and how does that compare to last week?

The Four Golden Signals

Google's SRE book identifies four signals that, together, give you a solid baseline picture of a user-facing service's health:

1. Latency      — How long are requests taking?
                  (Distinguish: successful requests vs error requests)

2. Traffic      — How many requests per second?
                  (Understand normal baselines)

3. Errors       — What percentage of requests are failing?
                  (Both 5xx errors and application-level failures)

4. Saturation   — How "full" is the service?
                  (CPU, memory, queue depth, connection pool usage)

If you instrument nothing else, instrument these four for every service.
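To show how the four signals relate to raw request data, here is a small sketch that derives them from a window of request records. `RequestRecord`, `golden_signals`, and the connection-pool numbers are all hypothetical; a metrics system computes these continuously rather than from a list in memory.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    duration_s: float
    status: int


def golden_signals(records, window_s, pool_in_use, pool_size):
    """Summarize one observation window into the four golden signals."""
    total = len(records)
    errors = sum(1 for r in records if r.status >= 500)
    ok_durations = [r.duration_s for r in records if r.status < 500]
    return {
        # Latency: successful requests only, so fast failures don't flatter the number
        "median_ok_latency_s": median(ok_durations) if ok_durations else None,
        # Traffic: requests per second over the window
        "traffic_rps": total / window_s,
        # Errors: share of requests that failed
        "error_ratio": errors / total if total else 0.0,
        # Saturation: how full the connection pool is
        "saturation": pool_in_use / pool_size,
    }


records = [
    RequestRecord(0.08, 200),
    RequestRecord(0.12, 200),
    RequestRecord(0.10, 200),
    RequestRecord(2.00, 503),
]
signals = golden_signals(records, window_s=2, pool_in_use=18, pool_size=20)
print(signals)
```

A pool at 90% saturation, as in this sample, is the kind of signal that would have flagged the connection-pool exhaustion in the opening story before it caused an outage.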

Prometheus — The De-Facto Standard

import time

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Counter: always increasing (requests, errors)
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])

# Histogram: distribution of values (latency, request size)
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'Request latency',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]  # Bucket boundaries in seconds
)

# Gauge: current value (active connections, queue depth)
ACTIVE_CONNECTIONS = Gauge('db_active_connections', 'Active database connections')

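# Instrument your endpoints
# Caution: labeling by raw path creates one time series per distinct URL
# (/orders/789, /orders/790, ...); prefer the route template where you can.
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    response = await call_next(request)
    duration = time.perf_counter() - start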
    
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    
    REQUEST_LATENCY.labels(endpoint=request.url.path).observe(duration)
    
    return response

Percentiles Beat Averages

Average latency hides your users' actual experience. If 95% of requests are fast but 5% are extremely slow, the average looks fine while 5% of your users are having a terrible time.

Average latency: 120ms  ← Looks fine
P50 (median):   80ms    ← Most users are fine
P95:            450ms   ← 5% of requests take 450ms or longer
P99:            2,100ms ← 1% of requests take 2.1 seconds or longer

Always alert on P95 and P99 latency, not averages.
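The gap between the average and the tail is easy to reproduce. The sketch below uses synthetic latencies and a nearest-rank percentile, which is one of several common definitions. Prometheus instead estimates quantiles from histogram buckets, so its numbers will differ slightly.

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


# 100 synthetic request latencies in milliseconds: mostly fast, with a slow tail
latencies_ms = [80] * 94 + [450] * 4 + [2100] * 2

average = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"avg={average}ms p50={p50}ms p95={p95}ms p99={p99}ms")
# avg=135.2ms p50=80ms p95=450ms p99=2100ms
```

An average of about 135ms gives no hint that some requests take more than two seconds.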

Pillar 3: Distributed Tracing

Logs tell you what happened. Metrics tell you how things are trending. Traces tell you where time is actually going across your services for a specific request.

Trace for request_id: abc-123 (total: 1,240ms)

├── API Gateway                              [10ms]
│   └── OrderService.createOrder()          [1,210ms]
│       ├── validateUser() → UserService     [15ms]
│       ├── checkInventory() → InventoryService [45ms]
│       ├── processPayment() → PaymentService  [980ms]  ← HERE
│       │   ├── validateCard()               [12ms]
│       │   ├── chargeCard() → Stripe API    [952ms]  ← External call slow
│       │   └── recordTransaction() → DB     [16ms]
│       └── sendNotification() → EmailService [35ms]

Without tracing, you'd see "order creation is slow (1,240ms)" in your metrics. With tracing, you see "Stripe API is taking 952ms." Two very different problems to solve.
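A tracing UI does this visually, but the underlying question ("which span spends the most time in its own work?") can be sketched in a few lines. The `Span` class and helpers below are illustrative and are not an OpenTelemetry API. The sketch also assumes child spans run one after another, which makes "self time" simply duration minus children.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    duration_ms: int
    children: List["Span"] = field(default_factory=list)


def self_time_ms(span: Span) -> int:
    """Time spent in the span itself, excluding its (sequential) children."""
    return span.duration_ms - sum(child.duration_ms for child in span.children)


def slowest_by_self_time(root: Span) -> Span:
    """Walk the span tree and return the span with the largest self time."""
    best = root
    stack = [root]
    while stack:
        span = stack.pop()
        if self_time_ms(span) > self_time_ms(best):
            best = span
        stack.extend(span.children)
    return best


# The OrderService subtree from the trace above
trace_root = Span("OrderService.createOrder", 1210, [
    Span("validateUser", 15),
    Span("checkInventory", 45),
    Span("processPayment", 980, [
        Span("validateCard", 12),
        Span("chargeCard", 952),
        Span("recordTransaction", 16),
    ]),
    Span("sendNotification", 35),
])

culprit = slowest_by_self_time(trace_root)
print(culprit.name, self_time_ms(culprit))  # chargeCard 952
```

Ranking by self time rather than total duration is what keeps `processPayment` (980ms total, almost all of it spent waiting on children) from being blamed for the Stripe call.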

OpenTelemetry — The Standard

OpenTelemetry (OTel) is the vendor-neutral instrumentation standard for traces, metrics, and logs.

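from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.trace import StatusCode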

# Setup (once at application start)
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Instrument a function
def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)
        
        try:
            # Placeholder call: substitute your payment client's charge method and error type
            result = stripe.charge(amount)
            span.set_attribute("stripe.charge_id", result.id)
            return result
        except StripeError as e:
            span.record_exception(e)
            span.set_status(StatusCode.ERROR, str(e))
            raise

OTel data can be exported to Jaeger, Zipkin, Datadog, Honeycomb, Grafana Tempo — your choice of backend.

SLOs, SLAs, and Error Budgets

Metrics are more useful when tied to explicit reliability targets.

SLA (Service Level Agreement): A contract with users or customers. "We guarantee 99.9% uptime." Breaking this has business consequences.
SLO (Service Level Objective): An internal target for reliability. "We aim for P99 latency < 500ms, measured over 30 days."
Error Budget: The amount of unreliability you're allowed before you're breaking your SLO.

SLO: 99.9% availability over 30 days

Total minutes in 30 days: 43,200
Allowed downtime (0.1%): 43.2 minutes

Error budget: 43.2 minutes

If you've used 40 minutes this month:
→ Feature freezes, focus on reliability
→ Any risky deployments wait until next month's budget resets

If you've used 5 minutes:
→ You have headroom for risky changes, experiments

Error budgets create a shared language between engineering and product: reliability isn't free, it consumes budget, and you have to choose how to spend it.
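The arithmetic above is simple enough to automate. Here is a minimal sketch, where the function names and the idea of reporting the share of budget left are assumptions for illustration, not a standard API.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of allowed downtime for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining_fraction(budget_minutes: float, used_minutes: float) -> float:
    """Share of the error budget still unspent (0.0 means fully used)."""
    return max(0.0, (budget_minutes - used_minutes) / budget_minutes)


budget = error_budget_minutes(99.9, 30)
print(f"budget: {budget:.1f} min")  # budget: 43.2 min

for used in (40, 5):
    left = budget_remaining_fraction(budget, used)
    print(f"used {used} min -> {left:.0%} of budget left")
```

With 40 minutes used, about 7% of the budget remains. With 5 minutes used, about 88% remains. Those are the two situations described above.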

Alerting — Noise Is the Enemy

A team that receives 50 alerts per day learns to ignore alerts. The goal is high signal, low noise.

# ❌ Alerting on a cause (resource usage): noisy and hard to act on
- alert: CpuHigh
  expr: cpu_usage > 80
  for: 5m
  # Could be any of 100 different things, and users may not be affected at all

# ✅ Alert on user-visible impact — actionable
- alert: HighErrorRate
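  # sum() both sides: without it, each 5xx series is divided by itself and the ratio is always 1
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05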
  for: 2m
  annotations:
    summary: "Error rate above 5% for 2 minutes"
    runbook: "https://wiki/runbooks/high-error-rate"

- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 3m
  annotations:
    summary: "P99 latency above 2 seconds"

Runbooks matter: Every alert should link to a runbook — a documented set of steps for investigating and resolving that specific alert. Runbooks reduce MTTR (mean time to resolve) and mean the on-call engineer doesn't have to improvise at 2am.
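The `for:` clauses in the rules above are a large part of noise control: the condition has to hold across consecutive evaluations before anyone is paged. The sketch below is a simplified model of that behavior, not Prometheus's actual implementation, and the `should_fire` name and sample ratios are made up.

```python
def should_fire(error_ratios, threshold, required_consecutive):
    """True once the ratio has exceeded threshold for required_consecutive evaluations in a row."""
    streak = 0
    for ratio in error_ratios:
        streak = streak + 1 if ratio > threshold else 0  # any healthy sample resets the streak
        if streak >= required_consecutive:
            return True
    return False


# One evaluation per interval; a 5% threshold must hold for 3 evaluations in a row
brief_spike = [0.01, 0.09, 0.02, 0.01, 0.01]
sustained = [0.02, 0.06, 0.07, 0.08, 0.09]

print(should_fire(brief_spike, 0.05, 3))  # False
print(should_fire(sustained, 0.05, 3))    # True
```

The brief spike never pages anyone. The sustained failure does, a couple of evaluation intervals after it starts. That delay is the price of the lower noise.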

Observability Stack — Common Combinations

Stack	Use Case
Prometheus + Grafana + Jaeger	Open source, self-hosted, full control
Datadog	Managed, all-in-one, expensive but powerful
Grafana Cloud (Loki + Tempo + Mimir)	Open source stack, managed hosting
AWS CloudWatch	Good enough for AWS-native teams, but ties you to AWS (vendor lock-in)
Honeycomb	Best-in-class for traces and exploratory analysis
OpenTelemetry → Any backend	Instrument once, switch backends freely

💡 Start with OpenTelemetry instrumentation regardless of which backend you choose. OTel is vendor-neutral — instrument your code with OTel, export to wherever makes sense today, and migrate backends later without changing application code.

Key Takeaways

Logs, metrics, and traces each answer different questions — you need all three
Structured logging (JSON, key-value) makes logs queryable; unstructured logs don't scale
Correlation IDs connect a user action across every service in your system
The four golden signals (latency, traffic, errors, saturation) are the minimum metrics for every service
Alert on P95/P99 latency, not averages — averages hide the tail experience
Distributed tracing shows you where time actually goes across service boundaries
OpenTelemetry is the vendor-neutral standard — instrument with it, export anywhere
SLOs and error budgets give reliability a shared language across engineering and product
Alert on user-visible symptoms; when there are too many alerts, all of them get ignored

What's the metric or log line you wish you'd added before your first major incident? What would have cut your MTTR in half?

Wrapping Up the Series

This was Post 7 of 7 in Backend Engineering Fundamentals. Here's where we've been:

APIs — Choosing the right communication paradigm
Caching — What to cache, how to invalidate, what can go wrong
Security — Auth patterns and the vulnerabilities that actually cause breaches
Databases — Access patterns, CAP theorem, when to use what
Message Queues — Decoupling services with events
Scalability — Scaling strategies before and after you need them
Observability — Making your system understandable from the outside (you are here)

If this series was useful, share it with your team or anyone who'd benefit. And if there's a topic you'd like covered next — drop it in the comments.
