Observability
Metrics, tracing, alerting, and health check standards.
Logging alone is not enough. A production system needs metrics, tracing, and alerting to be truly observable.
See also: logging.md for structured logging standards.
Three Pillars
| Pillar | What It Tells You | Tools |
|---|---|---|
| Logs | What happened (discrete events) | Seq, ELK, Grafana Loki |
| Metrics | How the system is performing (aggregated numbers) | Prometheus, Grafana, InfluxDB |
| Traces | How a request flows across services (distributed) | Jaeger, Zipkin, OpenTelemetry |
Health Checks
Every service must expose health endpoints.
Endpoints
GET /api/health → 200 { "status": "healthy" }
GET /api/health/ready → 200 or 503 (checks dependencies)
GET /api/health/live → 200 (process is running, used by k8s liveness probe)

What Readiness Checks
| Dependency | Check |
|---|---|
| Database | Can connect and run a simple query |
| Redis/Cache | Can connect and ping |
| External API | Can reach the endpoint (with timeout) |
| Message Queue | Can connect to broker |
| File Storage | Can list/read from bucket |
// .NET (requires the AspNetCore.HealthChecks.NpgSql, .Redis, and .Uris packages)
builder.Services.AddHealthChecks()
    .AddNpgSql(connectionString, name: "postgres")
    .AddRedis(redisConnection, name: "redis")
    .AddUrlGroup(new Uri("https://external-api/health"), name: "external-api");

// Next.js / Express
app.get('/api/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    storage: await checkStorage(),
  };
  const healthy = Object.values(checks).every(Boolean);
  res.status(healthy ? 200 : 503).json({ status: healthy ? 'healthy' : 'unhealthy', checks });
});
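The check helpers above are assumed rather than shown. A minimal sketch of checkDatabase using node-postgres (checkRedis and checkStorage follow the same pattern):

// Hypothetical helper for the readiness endpoint above: verifies the
// database dependency with a trivial round-trip query.
import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from PG* env vars

async function checkDatabase(): Promise<boolean> {
  try {
    await pool.query('SELECT 1');
    return true;
  } catch {
    return false;
  }
}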
Metrics

What to Measure
| Metric | Type | Example |
|---|---|---|
| Request rate | Counter | http_requests_total |
| Request duration | Histogram | http_request_duration_seconds |
| Error rate | Counter | http_errors_total (by status code) |
| Active connections | Gauge | http_active_connections |
| Queue depth | Gauge | job_queue_size |
| Business metrics | Counter | enrollments_total, courses_created_total |
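As a concrete sketch, the metric types above map onto the prom-client library in Node.js; the library choice is illustrative rather than mandated, and the names follow the table and the naming convention below:

// Sketch: the three metric types from the table, declared with prom-client.
import { Counter, Histogram, Gauge, register } from 'prom-client';

export const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests received',
  labelNames: ['method', 'status'],
});

export const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5], // tune to your latency profile
});

export const activeConnections = new Gauge({
  name: 'http_active_connections',
  help: 'HTTP connections currently open',
});

// Prometheus scrapes these via an endpoint that serves register.metrics()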
RED Method (for services)
Every service should track:
- Rate — requests per second
- Errors — failed requests per second
- Duration — time per request (p50, p95, p99)
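A minimal sketch of RED instrumentation as Express middleware, reusing the counter and histogram declared above (the ./metrics module is hypothetical):

// Sketch: Express middleware covering Rate, Errors, and Duration.
import express from 'express';
import { Counter } from 'prom-client';
import { httpRequests, httpDuration } from './metrics'; // hypothetical module holding the declarations above

const httpErrors = new Counter({
  name: 'http_errors_total',
  help: 'Total failed HTTP requests',
  labelNames: ['status'],
});

const app = express();

app.use((req, res, next) => {
  const stopTimer = httpDuration.startTimer(); // Duration (p50/p95/p99 come from the histogram at query time)
  res.on('finish', () => {
    httpRequests.inc({ method: req.method, status: String(res.statusCode) }); // Rate
    if (res.statusCode >= 500) httpErrors.inc({ status: String(res.statusCode) }); // Errors
    stopTimer();
  });
  next();
});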
USE Method (for resources)
Every resource (CPU, memory, disk, connections) should track:
- Utilization — percentage in use
- Saturation — queue depth / backlog
- Errors — error count
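Saturation is typically a gauge sampled on an interval. A sketch, assuming a hypothetical queue object that can report its backlog:

// Sketch: USE-style saturation metric for a job queue backlog.
import { Gauge } from 'prom-client';

const queueBacklog = new Gauge({
  name: 'job_queue_size',
  help: 'Jobs waiting to be processed (saturation)',
});

// `queue` is a stand-in for whatever backs your jobs (BullMQ, SQS, a table).
const queue = { count: async (): Promise<number> => 0 };

setInterval(async () => {
  queueBacklog.set(await queue.count()); // sampled every 5s
}, 5_000);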
Naming Convention
<namespace>_<name>_<unit>
# Examples
http_request_duration_seconds
db_query_duration_seconds
cache_hits_total
job_processing_errors_total

Distributed Tracing
When You Need It
- Multiple services handle a single request
- You need to find where latency comes from
- You need to debug cross-service failures
Implementation
Use OpenTelemetry as the standard — it's vendor-neutral and supports all major backends.
// .NET (requires OpenTelemetry.Extensions.Hosting plus the instrumentation packages)
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddOtlpExporter());

# Python
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Auto-instruments all routes; assumes a TracerProvider has been configured
FastAPIInstrumentor.instrument_app(app)

Propagation
- Pass the traceparent header between services (W3C Trace Context standard; see the sketch after this list)
- Include a CorrelationId in all log entries for manual correlation
- Never create custom trace ID formats — use the standard
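OpenTelemetry's HTTP instrumentation injects traceparent into outgoing requests automatically; a manual sketch with the JS API, for hand-built calls, might look like this:

// Sketch: manually propagating W3C Trace Context on an outgoing call.
// Assumes the OpenTelemetry Node SDK is already configured; it registers
// the W3C trace-context propagator by default.
import { context, propagation } from '@opentelemetry/api';

async function callDownstream(url: string): Promise<Response> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers); // writes traceparent (and tracestate)
  return fetch(url, { headers });
}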
Alerting
Alert Rules
| Severity | Condition | Action |
|---|---|---|
| Critical | Service down, error rate > 50%, data loss risk | Page on-call immediately |
| Warning | Error rate > 5%, latency p99 > 2s, disk > 80% | Notify in Slack, investigate within 1 hour |
| Info | Deployment completed, cert expiring in 30 days | Log for awareness |
Alert Best Practices
- Alert on symptoms, not causes — "API latency > 2s" not "CPU > 80%"
- Include runbook links in alert messages
- Set appropriate thresholds — avoid alert fatigue from noisy alerts
- Use alert grouping — don't fire 100 alerts for the same incident
- Test alerts — regularly verify they fire when expected
Alert Template
alert: HighErrorRate
expr: rate(http_errors_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 5m
labels:
  severity: warning
annotations:
  summary: "High error rate on {{ $labels.service }}"
  description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
  runbook: "https://wiki.internal/runbooks/high-error-rate"

Dashboards
Every Service Dashboard Should Have
- Request rate — traffic over time
- Error rate — 4xx and 5xx separately
- Latency — p50, p95, p99 over time
- Saturation — CPU, memory, connections, queue depth
- Business metrics — relevant domain-specific counters
Dashboard Rules
- One dashboard per service, not one giant dashboard
- Use consistent time ranges and refresh intervals
- Include links to related dashboards and runbooks
- Mark deployment events on graphs for correlation
Anti-Patterns
- Logging as the only observability — Logs alone can't show trends or distributions
- No health checks — You can't know if a service is healthy without asking it
- Alerting on every metric — Alert fatigue leads to ignored alerts
- Custom metrics libraries — Use OpenTelemetry, don't reinvent the wheel
- Dashboard without context — Every graph should have a title that explains what "bad" looks like