You Can't Manage What You Can't See: The Three Pillars of Observability

Real-world engineering insights from 20+ years building scalable systems. Focused on AI, RAG architectures, and production-ready system design.

Series: Backend Engineering Fundamentals · Post 07 of 07 · Level: Intermediate · Read time: ~9 min


It's 2am. An alert fires. Your service is down.

You open your dashboard: CPU is fine, memory is fine. You check the logs: thousands of lines of INFO messages and a few ERROR lines with stack traces — none of them obviously the root cause. You open your tracing tool and realize you only instrumented the main API service, not the three downstream services it calls.

Forty-five minutes later, you find it: a database connection pool exhausted in a service that has no alerting on it, caused by a slow query introduced in yesterday's deployment. You had no visibility into that service, no alert on the metric that would have told you, and no trace that showed where the latency was accumulating.

This is what poor observability looks like. You're debugging in the dark.

Observability isn't a feature you add later. It's the practice of making your system understandable from the outside — so that when something goes wrong, you can ask questions of your system and get useful answers.


The Three Pillars

Observability is commonly structured around three signal types:

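| Pillar  | Answers                                                     | Examples                                       |
|---------|-------------------------------------------------------------|------------------------------------------------|
| Logs    | What happened?                                              | Error traces, audit events, debug output       |
| Metrics | How is it behaving over time?                               | Request rate, error %, CPU, latency histograms |
| Traces  | Where did this request go and how long did each step take? | Distributed request spans across services      |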

Each pillar answers different questions. All three are needed for a complete picture.


Pillar 1: Logs

Logs are the most familiar observability tool — and the most frequently done wrong.

Structured Logging

The difference between logs that help you and logs that don't is structure.

# ❌ Unstructured — fast to write, painful to query at scale
print(f"Processing order {order_id} for user {user_id} failed: {error}")
# Output: "Processing order 789 for user 123 failed: Connection timeout"

# ✅ Structured — queryable, filterable, alertable
import structlog

logger = structlog.get_logger()
logger.error(
    "order_processing_failed",
    order_id=order_id,
    user_id=user_id,
    error_type="ConnectionTimeout",
    service="payment-service",
    duration_ms=3240,
    retry_count=3
)
# Output: {"event": "order_processing_failed", "order_id": "789", "user_id": "123",
#          "error_type": "ConnectionTimeout", "duration_ms": 3240, ...}

With structured logs, you can query: show me all orders that failed with ConnectionTimeout in the last hour where retry_count > 2. With unstructured logs, you're writing regex.
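To make "queryable" concrete, here is a minimal sketch that runs that query over JSON log lines in plain Python. The sample lines and the `query_failed_orders` helper are invented for illustration, and the "last hour" time filter is left out to keep it short. In practice your log aggregator's query language does this work for you.

```python
import json

# Hypothetical sample of JSON log lines, shaped like the structlog output above.
LOG_LINES = [
    '{"event": "order_processing_failed", "order_id": "789", "error_type": "ConnectionTimeout", "retry_count": 3}',
    '{"event": "order_processing_failed", "order_id": "790", "error_type": "ConnectionTimeout", "retry_count": 1}',
    '{"event": "order_processing_failed", "order_id": "791", "error_type": "ValidationError", "retry_count": 3}',
    '{"event": "order_placed", "order_id": "792"}',
]

def query_failed_orders(lines, error_type, min_retries):
    """Return order IDs that failed with `error_type` after more than `min_retries` retries."""
    matches = []
    for line in lines:
        entry = json.loads(line)  # each line is one self-describing JSON object
        if (
            entry.get("event") == "order_processing_failed"
            and entry.get("error_type") == error_type
            and entry.get("retry_count", 0) > min_retries
        ):
            matches.append(entry["order_id"])
    return matches

print(query_failed_orders(LOG_LINES, "ConnectionTimeout", 2))  # ['789']
```

Every condition is a plain field comparison. The unstructured version of the same question needs a regex per message format.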

Log Levels — Use Them Correctly

logger.debug(...)    # Detailed diagnostic, disabled in production
logger.info(...)     # Normal operation milestones (request received, order placed)
logger.warning(...)  # Unexpected but handled (retried 2x, using fallback)
logger.error(...)    # Failed operation, requires attention
logger.critical(...) # System cannot continue, immediate action required

Common mistake: Using INFO for everything. At scale, an INFO log for every request is millions of log entries per hour — expensive to store and slow to search. Log meaningful state changes and errors, not every heartbeat.
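One common way to keep DEBUG out of production without touching code is to read the level from the environment. The sketch below uses the standard library's `logging` module rather than structlog; the `LOG_LEVEL` variable name and the `orders` logger name are assumptions for this example.

```python
import logging
import os

# Accepted level names; anything else falls back to INFO.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}

def configure_logging(default_level: str = "INFO") -> logging.Logger:
    """Set the log level from the (hypothetical) LOG_LEVEL environment variable."""
    level_name = os.environ.get("LOG_LEVEL", default_level).upper()
    logger = logging.getLogger("orders")
    logger.setLevel(LEVELS.get(level_name, logging.INFO))
    return logger

logger = configure_logging()
logger.debug("cache lookup details")                       # dropped unless LOG_LEVEL=DEBUG
logger.warning("payment retry, using fallback provider")   # emitted at WARNING and below
```

With this in place, turning on verbose output for one troubleshooting session is a configuration change, not a deployment.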

Correlation IDs

In a distributed system, a single user action triggers logs across multiple services. Without a correlation ID, you can't connect them.

import uuid
from contextvars import ContextVar

import httpx
import structlog
from fastapi import FastAPI, Request  # assumes a FastAPI/Starlette-style app

app = FastAPI()
logger = structlog.get_logger()
request_id: ContextVar[str] = ContextVar('request_id', default='')

# Middleware: reuse the caller's ID if present, otherwise generate one at the edge
@app.middleware("http")
async def add_correlation_id(request: Request, call_next):
    req_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
    request_id.set(req_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = req_id
    return response

# Include it in every log entry (other fields omitted here)
logger.info("payment_initiated", request_id=request_id.get())

# Pass it downstream
httpx.post(payment_service_url, headers={"X-Request-ID": request_id.get()})

Now you can search your log aggregator for a single request_id and see every log line across every service for that user's request.
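Passing `request_id` by hand on every log call is easy to forget. One way to attach it automatically is a logging filter that reads the context variable. This is a sketch using only the standard library's `logging` module, not structlog; the `RequestIdFilter` class and the `payment` logger name are made up for the example.

```python
import io
import logging
from contextvars import ContextVar

request_id: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id.get()
        return True  # never drop the record, only enrich it

stream = io.StringIO()  # stands in for stdout so the output is easy to inspect
handler = logging.StreamHandler(stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
)

log = logging.getLogger("payment")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

request_id.set("abc-123")         # normally done by the middleware
log.info("payment_initiated")     # no request_id argument needed
print(stream.getvalue(), end="")  # INFO request_id=abc-123 payment_initiated
```

Because the ID lives in a `ContextVar`, concurrent requests handled by the same process each see their own value.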


Pillar 2: Metrics

Metrics are aggregated numerical measurements over time. They answer: is the system healthy right now, and how does that compare to last week?

The Four Golden Signals

Google's SRE book identifies four signals that, taken together, cover most of what you need to know about the health of a user-facing service:

1. Latency      — How long are requests taking?
                  (Distinguish: successful requests vs error requests)

2. Traffic      — How many requests per second?
                  (Understand normal baselines)

3. Errors       — What percentage of requests are failing?
                  (Both 5xx errors and application-level failures)

4. Saturation   — How "full" is the service?
                  (CPU, memory, queue depth, connection pool usage)

If you instrument nothing else, instrument these four for every service.
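As a rough illustration of what the four signals look like as numbers, the sketch below summarises one time window of request records. The `RequestRecord` type, the `golden_signals` function, and the use of connection pool usage as the saturation measure are all assumptions for this example; a metrics system computes these for you from counters, histograms, and gauges.

```python
import statistics
from dataclasses import dataclass

@dataclass
class RequestRecord:
    duration_s: float
    status: int

def golden_signals(records, window_s, pool_in_use, pool_size):
    """Summarise one time window of traffic into the four golden signals."""
    ok = [r.duration_s for r in records if r.status < 500]
    failed = [r for r in records if r.status >= 500]
    return {
        # Latency of successful requests only, so fast failures don't flatter it
        "latency_median_s": statistics.median(ok) if ok else None,
        "traffic_rps": len(records) / window_s,
        "error_ratio": len(failed) / len(records) if records else 0.0,
        "saturation": pool_in_use / pool_size,
    }

records = [
    RequestRecord(0.08, 200),
    RequestRecord(0.12, 200),
    RequestRecord(0.10, 200),
    RequestRecord(0.02, 503),  # a fast failure
]
signals = golden_signals(records, window_s=2, pool_in_use=18, pool_size=20)
print(signals)
```

Note that the failed request is the fastest one in the sample. Mixing it into the latency figure would make the service look quicker than it is, which is why the signal separates successful and failed requests.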

Prometheus — The De-Facto Standard

import time

from fastapi import FastAPI, Request  # assumes a FastAPI/Starlette-style app
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = FastAPI()

# Counter: always increasing (requests, errors)
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])

# Histogram: distribution of values (latency, request size)
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'Request latency',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]  # Bucket boundaries in seconds
)

# Gauge: current value (active connections, queue depth)
ACTIVE_CONNECTIONS = Gauge('db_active_connections', 'Active database connections')

# Expose /metrics for Prometheus to scrape (the port is an assumption)
start_http_server(8000)

# Instrument your endpoints
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    response = await call_next(request)
    duration = time.perf_counter() - start

    # Caution: raw paths like /orders/789 create one time series per ID.
    # Prefer the route template (e.g. /orders/{id}) as the label value.
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    REQUEST_LATENCY.labels(endpoint=request.url.path).observe(duration)

    return response
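It helps to know what a histogram actually stores: not individual durations, but a cumulative count per bucket. Percentiles are then estimated from those counts. The sketch below is a simplified model of that idea in plain Python, using linear interpolation inside the bucket that contains the target rank. It is an approximation of how bucket-based quantile estimation works in general, not Prometheus's exact implementation, and the sample durations are made up.

```python
import math

# Same boundaries as the Histogram above, plus the implicit +Inf bucket
BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, math.inf]

def bucket_counts(durations, bounds=BUCKETS):
    """Cumulative counts: each bucket counts observations <= its upper bound."""
    return [sum(1 for d in durations if d <= b) for b in bounds]

def estimate_quantile(q, counts, bounds=BUCKETS):
    """Estimate quantile q (0 < q <= 1) by interpolating inside the matching bucket."""
    total = counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, cumulative in enumerate(counts):
        if cumulative >= rank:
            break
    if math.isinf(bounds[i]):
        return bounds[i - 1]  # beyond the last finite bucket: report its bound
    lower = bounds[i - 1] if i > 0 else 0.0
    below = counts[i - 1] if i > 0 else 0
    in_bucket = cumulative - below
    if in_bucket == 0:
        return lower
    return lower + (bounds[i] - lower) * (rank - below) / in_bucket

# Hypothetical sample: 100 requests, the slowest five took 0.8s
durations = [0.02] * 50 + [0.08] * 30 + [0.2] * 15 + [0.8] * 5
counts = bucket_counts(durations)
print(counts)                                    # [0, 50, 80, 95, 95, 100, 100, 100, 100]
print(round(estimate_quantile(0.99, counts), 3)) # about 0.9, though the true P99 is 0.8
```

The estimate lands at roughly 0.9s even though no request took longer than 0.8s, because the histogram only knows the value fell somewhere between 0.5 and 1.0. Bucket boundaries set the precision of your percentiles, so place them around the latencies you care about.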

Percentiles Beat Averages

Average latency hides your users' actual experience. If 95% of requests are fast but 5% are extremely slow, the average looks fine while 5% of your users are having a terrible time.

Average latency: 120ms  ← Looks fine
P50 (median):   80ms    ← Most users are fine
P95:            450ms   ← 5% of requests take 450ms or longer
P99:            2,100ms ← 1% of requests take 2+ seconds

Always alert on P95 and P99 latency, not averages.
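A small worked example shows how far the average can drift from what anyone actually experiences. The sample data and the nearest-rank `percentile` helper below are illustrative assumptions, not numbers from a real service.

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample: 95 fast requests and 5 very slow ones (milliseconds)
latencies_ms = [80] * 95 + [2100] * 5

print(statistics.mean(latencies_ms))  # 181
print(percentile(latencies_ms, 50))   # 80
print(percentile(latencies_ms, 99))   # 2100
```

No request in this sample took 181ms. The average describes neither the 95 fast requests nor the 5 slow ones, while P50 and P99 describe both groups accurately.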


Pillar 3: Distributed Tracing

Logs tell you what happened. Metrics tell you how things are trending. Traces tell you where time is actually going across your services for a specific request.

Trace for request_id: abc-123 (total: 1,240ms)

├── API Gateway                              [10ms]
│   └── OrderService.createOrder()          [1,210ms]
│       ├── validateUser() → UserService     [15ms]
│       ├── checkInventory() → InventoryService [45ms]
│       ├── processPayment() → PaymentService  [980ms]  ← HERE
│       │   ├── validateCard()               [12ms]
│       │   ├── chargeCard() → Stripe API    [952ms]  ← External call slow
│       │   └── recordTransaction() → DB     [16ms]
│       └── sendNotification() → EmailService [35ms]

Without tracing, you'd see "order creation is slow (1,240ms)" in your metrics. With tracing, you see "Stripe API is taking 952ms." Two very different problems to solve.
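Tracing UIs do this visually, but the underlying idea is simple enough to sketch: treat the trace as a tree of spans and look for the span with the most time not accounted for by its children. The `Span` class and helper functions here are hypothetical, and the calculation assumes child spans run one after another rather than in parallel. The durations mirror the `OrderService.createOrder()` subtree above.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)

def self_time(span):
    """Time spent in the span itself, excluding its children (assumes sequential children)."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)

def slowest_span(root):
    """Walk the tree and return the span with the largest self time."""
    best = root
    stack = [root]
    while stack:
        span = stack.pop()
        if self_time(span) > self_time(best):
            best = span
        stack.extend(span.children)
    return best

trace_root = Span("OrderService.createOrder", 1210, [
    Span("validateUser", 15),
    Span("checkInventory", 45),
    Span("processPayment", 980, [
        Span("validateCard", 12),
        Span("chargeCard", 952),
        Span("recordTransaction", 16),
    ]),
    Span("sendNotification", 35),
])

culprit = slowest_span(trace_root)
print(culprit.name, self_time(culprit))  # chargeCard 952
```

`processPayment` has the largest total duration of any child, but its self time is zero: all 980ms is spent in its own children. Looking at self time rather than total duration is what points at `chargeCard` instead of its parent.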

OpenTelemetry — The Standard

OpenTelemetry (OTel) is the vendor-neutral instrumentation standard for traces, metrics, and logs.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.trace import Status, StatusCode

# Setup (once at application start)
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Instrument a function
# Note: stripe.charge() and StripeError are placeholders for your payment client and its error type
def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)

        try:
            result = stripe.charge(amount)
            span.set_attribute("stripe.charge_id", result.id)
            return result
        except StripeError as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise

OTel data can be exported to Jaeger, Zipkin, Datadog, Honeycomb, Grafana Tempo — your choice of backend.


SLOs, SLAs, and Error Budgets

Metrics are more useful when tied to explicit reliability targets.

  • SLA (Service Level Agreement): A contract with users or customers. "We guarantee 99.9% uptime." Breaking this has business consequences.
  • SLO (Service Level Objective): An internal target for reliability. "We aim for P99 latency < 500ms, measured over 30 days."
  • Error Budget: The amount of unreliability you're allowed before you're breaking your SLO.

SLO: 99.9% availability over 30 days

Total minutes in 30 days: 43,200
Allowed downtime (0.1%): 43.2 minutes

Error budget: 43.2 minutes

If you've used 40 minutes this month:
→ Feature freezes, focus on reliability
→ Any risky deployments wait until next month's budget resets

If you've used 5 minutes:
→ You have headroom for risky changes, experiments

Error budgets create a shared language between engineering and product: reliability isn't free, it consumes budget, and you have to choose how to spend it.
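The arithmetic above is easy to put into a small helper so the budget can be tracked rather than recalculated by hand. This is a minimal sketch; the function names are made up, and it assumes availability is measured in minutes of downtime over a fixed window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is broken)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

print(f"{error_budget_minutes(0.999):.1f} minutes")   # 43.2 minutes
print(f"{budget_remaining(0.999, 40):.0%} remaining")  # 7% remaining
print(f"{budget_remaining(0.999, 5):.0%} remaining")   # 88% remaining
```

Expressed as a percentage, the two scenarios above are hard to argue with: at 7% remaining a risky deployment is a gamble, at 88% it is an affordable experiment.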


Alerting — Noise Is the Enemy

A team that receives 50 alerts per day learns to ignore alerts. The goal is high signal, low noise.

# ❌ Alert on a cause, not a symptom: creates noise
- alert: CpuHigh
  expr: cpu_usage > 80
  for: 5m
  # High CPU can come from 100 different things, and users may not even notice. What do you do with this?

# ✅ Alert on user-visible impact: actionable
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 2m
  annotations:
    summary: "Error rate above 5% for 2 minutes"
    runbook: "https://wiki/runbooks/high-error-rate"

- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 3m
  annotations:
    summary: "P99 latency above 2 seconds"

Runbooks matter: Every alert should link to a runbook — a documented set of steps for investigating and resolving that specific alert. Runbooks reduce MTTR (mean time to resolve) and mean the on-call engineer doesn't have to improvise at 2am.
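The `for:` clause in the rules above is one of the cheapest noise filters available: the condition has to stay true for the whole duration before anyone is paged. The sketch below is a simplified model of that behaviour, assuming one rule evaluation per minute; the `alert_states` function and the sample error ratios are invented for illustration.

```python
def alert_states(error_ratios, threshold=0.05, for_evaluations=2):
    """Return 'inactive', 'pending' or 'firing' for each successive evaluation.

    The alert fires only once the condition has held for `for_evaluations`
    evaluation intervals after it first became true.
    """
    consecutive = 0
    states = []
    for ratio in error_ratios:
        if ratio > threshold:
            consecutive += 1
            states.append("firing" if consecutive > for_evaluations else "pending")
        else:
            consecutive = 0  # any healthy evaluation resets the timer
            states.append("inactive")
    return states

# One evaluation per minute: a one-minute blip, then a sustained problem
ratios = [0.01, 0.08, 0.02, 0.07, 0.09, 0.12, 0.01]
print(alert_states(ratios))
# ['inactive', 'pending', 'inactive', 'pending', 'pending', 'firing', 'inactive']
```

The blip in the second minute never pages anyone. Only the sustained breach does, two minutes after it starts.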


Observability Stack — Common Combinations

| Stack                                | Use Case                                                         |
|--------------------------------------|------------------------------------------------------------------|
| Prometheus + Grafana + Jaeger        | Open source, self-hosted, full control                           |
| Datadog                              | Managed, all-in-one, expensive but powerful                      |
| Grafana Cloud (Loki + Tempo + Mimir) | Open source stack, managed hosting                               |
| AWS CloudWatch                       | Good enough for AWS-native teams, but watch for vendor lock-in   |
| Honeycomb                            | Best-in-class for traces and exploratory analysis                |
| OpenTelemetry → Any backend          | Instrument once, switch backends freely                          |

💡 Start with OpenTelemetry instrumentation regardless of which backend you choose. OTel is vendor-neutral — instrument your code with OTel, export to wherever makes sense today, and migrate backends later without changing application code.


Key Takeaways

  • Logs, metrics, and traces each answer different questions — you need all three
  • Structured logging (JSON, key-value) makes logs queryable; unstructured logs don't scale
  • Correlation IDs connect a user action across every service in your system
  • The four golden signals (latency, traffic, errors, saturation) are the minimum metrics for every service
  • Alert on P95/P99 latency, not averages — averages hide the tail experience
  • Distributed tracing shows you where time actually goes across service boundaries
  • OpenTelemetry is the vendor-neutral standard — instrument with it, export anywhere
  • SLOs and error budgets give reliability a shared language across engineering and product
  • Alert on user-visible symptoms — too many alerts = all alerts get ignored

What's the metric or log line you wish you'd added before your first major incident? What would have cut your MTTR in half?


Wrapping Up the Series

This was Post 7 of 7 in Backend Engineering Fundamentals. Here's where we've been:

  1. APIs — Choosing the right communication paradigm
  2. Caching — What to cache, how to invalidate, what can go wrong
  3. Security — Auth patterns and the vulnerabilities that actually cause breaches
  4. Databases — Access patterns, CAP theorem, when to use what
  5. Message Queues — Decoupling services with events
  6. Scalability — Scaling strategies before and after you need them
  7. Observability — Making your system understandable from the outside (you are here)

If this series was useful, share it with your team or anyone who'd benefit. And if there's a topic you'd like covered next — drop it in the comments.
