Alright, folks. Chris Wade here, back in the digital trenches with you at agntlog.com. Today, we’re not just kicking the tires; we’re getting under the hood and maybe even getting a little grease on our hands. The topic? Alerts. But not just any alerts. We’re talking about the art and science of building an alert system that actually serves you, instead of drowning you in a sea of meaningless pings. The kind that tells you something’s genuinely busted, not just that a server sneezed.
It’s March 2026, and I’m still seeing way too many teams struggle with alert fatigue. It’s a real problem, and it’s not just about annoyance. When every minor fluctuation triggers a notification, your team starts to ignore everything. And that, my friends, is how a real incident slips through the cracks. It’s like the boy who cried wolf, but instead of a wolf, it’s a critical customer-facing service going offline at 3 AM because no one checked the 1,000th “CPU usage slightly elevated” email.
So, let’s talk about moving past the noise and building an alert strategy that actually works. We’re aiming for precision, relevance, and actionability. No more “everything is an emergency.”
The Problem with “Everything is an Alert”
I remember this one time, early in my career, we had a pretty basic monitoring setup for a new e-commerce platform. Every metric we could scrape, we alerted on. Disk usage, memory, CPU, network I/O, database connection pool, even specific HTTP status codes – if it moved, we had an alert for it. The result? Our ops team’s Slack channel was a constant waterfall of red emojis and flashing sirens. Most of it was benign. A brief spike in CPU during a scheduled backup? Alert. A cache clear causing a temporary dip in response times? Alert. We were effectively deafening ourselves.
The worst part was when a real issue hit – a misconfigured load balancer that was silently dropping requests. It was buried under a hundred other “informational” alerts about a slight increase in database connections. It took us far longer than it should have to identify and fix, all because our signal-to-noise ratio was abysmal.
This isn’t just about reducing notification volume; it’s about preserving cognitive load and ensuring that when an alert does come in, it carries weight. It means something.
Shifting Focus: From Symptoms to Impact
The biggest mistake I see teams make is alerting on symptoms rather than impact. High CPU utilization is a symptom. A customer experiencing slow page loads or failed transactions is an impact. While symptoms can lead to impact, not all symptoms immediately translate to a problem that requires urgent human intervention.
Think about it like this: if your car’s engine light comes on, that’s a symptom. If your car is actively sputtering and losing power, that’s an impact. You want to be alerted when the car is sputtering, not just for every tiny sensor reading that’s slightly out of spec.
Key Principles for Effective Alerting
- Alert on SLOs/SLIs: The gold standard. If you have Service Level Objectives (SLOs) and Service Level Indicators (SLIs) defined for your services, alert when these are violated or are trending towards violation. For example, if your SLO is 99.9% availability for API requests, alert when your error rate crosses a predefined threshold that makes achieving that SLO unlikely.
- Focus on User Experience: Ultimately, what matters most is the end-user experience. Are they able to use your service? Are transactions completing successfully? Are pages loading quickly? If not, that’s an alert-worthy event.
- Actionability: Every alert should ideally point to an actionable task or provide enough context for an engineer to start debugging. If an alert just says “Error,” it’s not very useful. “Database connection pool exhausted on service X impacting transaction processing” is much better.
- Context is King: An alert without context is just noise. Include relevant metrics, links to dashboards, runbooks, and even historical data if possible.
Crafting Smarter Alert Conditions
Let’s get practical. How do we translate these principles into actual alert configurations? It’s about being deliberate with your thresholds and conditions.
Example 1: The “Bursty” Workload Problem
Imagine you have a background processing service that handles image uploads. It’s designed to be bursty – sometimes it’s idle, sometimes it’s hammering the CPU at 90% for a few minutes as it processes a batch, then it goes back to idle. If you set a static alert for “CPU > 80% for 5 minutes,” you’re going to get an alert every time it does its job. That’s a bad alert.
Instead, consider alerting on things like:
- Job backlog growth: If the queue of images to process is consistently growing, indicating the workers can’t keep up.
- Processing time exceeding SLA: If the average time to process an image goes above, say, 10 seconds, and your SLA is 15 seconds, that’s a leading indicator of trouble.
- Error rate during processing: If image processing failures spike.
Let’s say you’re using Prometheus and Alertmanager. For the job backlog, you might have something like this:
```yaml
groups:
  - name: image_processor_alerts
    rules:
      - alert: ImageProcessorBacklogGrowing
        expr: |
          sum(image_processor_queue_size) by (instance) > 1000
          and
          sum(rate(image_processor_processed_total[5m])) by (instance)
            < 0.1 * sum(image_processor_queue_size) by (instance)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Image processor backlog is growing for instance {{ $labels.instance }}"
          description: "The image processing queue has over 1000 items and processing rate is too low. This indicates workers are not keeping up. Check worker health and resource utilization."
          runbook: "https://my-runbooks.com/image-processor-backlog"
```
This alert fires only if the queue size is significant AND the processing rate isn't keeping up. That's much smarter than a raw CPU alert.
Example 2: Latency & Error Budget
For user-facing services, latency and error rates are often the most critical metrics. Instead of alerting on "P99 latency > 500ms for 1 minute," which might be a temporary hiccup, consider an alert that consumes your error budget.
If your SLO allows for 0.1% errors over a 30-day period, you can estimate an hourly or daily error budget. When you're burning through that budget too quickly, that's an alert.
Let's say you're tracking successful requests (`http_requests_total{status_code="2xx"}`) and all requests (`http_requests_total`). An alert on the error rate over a 5-minute window could look like this:
```yaml
groups:
  - name: api_service_alerts
    rules:
      - alert: ApiServiceErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{job="api-service", status_code=~"5xx|4xx"}[5m]))
            /
            sum(rate(http_requests_total{job="api-service"}[5m]))
          ) > 0.005
        for: 2m
        labels:
          severity: major
        annotations:
          summary: "API Service is burning error budget too quickly"
          description: "The API service error rate has exceeded 0.5% over the last 5 minutes. This is burning through our error budget faster than expected. Investigate recent deployments or upstream dependencies."
          dashboard: "https://grafana.example.com/d/api-service-overview"
```
This alert fires if the 5-minute error rate exceeds 0.5%. The threshold (0.005) is derived from your SLO: if your SLO allows 0.1% errors over a month, then a sustained 0.5% over even a few minutes is a red flag that you're on track to blow past your budget. The `for: 2m` clause ensures it's not just a single-point spike.
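If you want to derive thresholds like that 0.005 from first principles rather than eyeballing them, the multi-window burn-rate approach gives you a formula: pick how much of your budget you're willing to spend in a given window, and solve for the error rate. Here's a minimal Python sketch of that arithmetic (the function name and defaults are mine, not from any library):

```python
def burn_rate_threshold(slo: float, budget_fraction: float,
                        window_hours: float, period_hours: float = 30 * 24) -> float:
    """Error-rate threshold that spends `budget_fraction` of the error
    budget within `window_hours`, for an SLO measured over `period_hours`."""
    error_budget = 1.0 - slo                                  # e.g. 0.001 for a 99.9% SLO
    burn_rate = budget_fraction * period_hours / window_hours # how many times faster than "even" spend
    return error_budget * burn_rate

# Spending 2% of a 30-day, 99.9% budget within one hour implies a
# burn rate of 14.4x, i.e. an error-rate threshold of ~1.44%.
print(round(burn_rate_threshold(0.999, 0.02, 1), 4))  # 0.0144
```

Plugging in different windows gives you a tiered alerting scheme: a high threshold on a short window pages someone, a lower threshold on a long window files a ticket.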
The Human Element: Runbooks and Communication
An alert is only as good as the action it prompts. This is where runbooks come in. Every critical alert should have a corresponding runbook – a documented set of steps an engineer can follow to investigate, diagnose, and potentially remediate the issue.
I've seen so many teams build these amazing monitoring systems, only to fall flat when an alert fires because no one knows what to do. The on-call engineer gets pinged, stares at the alert, and then starts frantically searching internal wikis or pinging senior engineers. That's wasted time, increased mean time to recovery (MTTR), and unnecessary stress.
What makes a good runbook?
- Clear Title & ID: Easy to find and reference.
- Alert Context: Reiterate what the alert means and why it fired.
- Initial Triage Steps: "Check service X logs," "Verify health endpoint," "Look at dashboard Y."
- Common Causes: List known reasons for the alert (e.g., recent deploy, dependent service outage).
- Remediation Steps: "Restart service Z," "Scale up instances," "Rollback deployment."
- Escalation Path: Who to contact if the issue can't be resolved with the documented steps.
- Links to Dashboards/Logs: Direct links to relevant tools.
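Put together, a skeleton for such a runbook might look like this (the ID, links, and commands are all placeholders for illustration):

```markdown
# RB-042: ImageProcessorBacklogGrowing

**Alert meaning:** The image-processing queue has >1000 items and workers
are not draining it. Users will see delayed uploads.

## Triage
1. Check worker health: <dashboard link>
2. Tail worker logs for errors: <logs link>

## Common causes
- Recent deploy of the worker service
- Upstream object store slow or unavailable

## Remediation
- Scale workers: `kubectl scale deploy image-workers --replicas=8`
- If a recent deploy is suspected, roll it back.

## Escalation
Page the platform on-call if not resolved within 30 minutes.
```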
And beyond runbooks, think about communication. When an alert fires, who needs to know? Is it just the on-call engineer, or does a critical customer-facing issue warrant a wider notification to product managers or even executives? Use different notification channels (Slack for informational, PagerDuty for critical) and escalation policies to manage this effectively.
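If you're on the Prometheus stack, that channel split maps naturally onto Alertmanager's routing tree. A sketch of severity-based routing (receiver names, channels, and the PagerDuty key are placeholders):

```yaml
route:
  receiver: slack-informational      # default: low-stakes channel
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall     # page a human
      continue: true                 # also match the Slack route below
    - matchers:
        - severity=~"critical|major"
      receiver: slack-incidents

receivers:
  - name: slack-informational
    slack_configs:
      - channel: "#alerts-info"
  - name: slack-incidents
    slack_configs:
      - channel: "#incidents"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
```

The `continue: true` is the detail people miss: without it, a critical alert pages on-call but never lands in the team channel.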
Actionable Takeaways for Your Alert Strategy
Okay, so you’ve stuck with me through my rants and examples. Here’s what I want you to walk away with today:
- Audit Your Existing Alerts: Seriously, go through them. For each one, ask: "Does this alert require immediate human action? If not, can it be reconfigured as a dashboard metric, a lower-severity notification, or removed entirely?" Be ruthless.
- Define SLOs/SLIs (If You Haven't Already): This is foundational. Once you know what truly matters for your service's reliability and user experience, your alerting strategy will naturally fall into place.
- Shift from Symptoms to Impact: Reconfigure alerts to focus on when user experience or service reliability is genuinely degraded, not just when a system metric crosses an arbitrary threshold.
- Implement "Burn Rate" or Budget-Based Alerting: Especially for critical services, move beyond static thresholds and alert when you're consuming your error budget too quickly.
- Develop and Link Runbooks: Every critical alert needs a clear, actionable runbook. Make it easy for your on-call team to find and follow them. This drastically reduces MTTR and stress.
- Test Your Alerts (Regularly!): Don't just set them and forget them. Periodically simulate failures or manually trigger alerts to ensure they fire correctly, notify the right people, and that your runbooks are still accurate.
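On the Prometheus side, you don't even need to simulate a real failure: `promtool test rules` can unit-test alert rules against synthetic series. A minimal sketch against the backlog alert from earlier (file names and the exact series values are assumptions):

```yaml
# image_processor_alerts_test.yml — run with:
#   promtool test rules image_processor_alerts_test.yml
rule_files:
  - image_processor_alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Queue stuck at 2000 items while the worker processes ~1 item/min
      - series: 'image_processor_queue_size{instance="worker-1"}'
        values: '2000+0x15'
      - series: 'image_processor_processed_total{instance="worker-1"}'
        values: '0+1x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: ImageProcessorBacklogGrowing
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: worker-1
```

Wire this into CI and a refactored recording rule or a renamed label can never silently kill your paging again.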
Building an effective alerting system is an ongoing process, not a one-time setup. It requires continuous refinement, listening to your on-call team's feedback, and adapting as your systems evolve. But by focusing on impact, actionability, and providing context, you can transform your alerts from a source of fatigue into a powerful tool for maintaining service reliability. Keep your ears open for the real alarms, and let the rest fade into the background.
That's all for now. Until next time, keep those agents monitored, and those alerts meaningful.
Originally published: March 12, 2026