Alright, folks. Chris Wade here, back at agntlog.com, and today we’re diving headfirst into something that’s probably keeping more than a few of you up at night: alerting. Specifically, how to make your alerts actually useful instead of just another source of noise. Because let’s be real, in 2026, if your phone is buzzing with alerts every ten minutes, you’re not monitoring anything effectively; you’re just getting spammed.
I’ve been there. My first real job out of college, back in the day when ‘cloud’ was still a buzzword people thought might blow over, I was on a small ops team for an e-commerce platform. We had alerts for everything: disk space, CPU usage, memory, database connections, HTTP 5xx errors, even whether a specific cron job took more than 30 seconds to run. Sounds good on paper, right? Thorough coverage. The reality? My pager (yes, a literal pager) would go off so often I started leaving it in a drawer. The team developed a kind of alert fatigue that was almost impressive. We’d triage based on the ‘urgency’ of the sound the pager made, which, spoiler alert, isn’t a great system.
The problem wasn’t a lack of alerts; it was a lack of *intelligent* alerts. We were drowning in data points without context. Fast forward to today, with microservices, serverless functions, and distributed systems being the norm, that problem is amplified a thousandfold. The sheer volume of potential things to alert on is mind-boggling. So, how do we cut through the noise and get to the signal?
Beyond Thresholds: The Art of Actionable Alerting
My core belief about alerting, refined over years of sleepless nights and too many false positives, is this: an alert should tell you something you need to act on, right now, or something that will soon require your immediate attention. If it doesn’t meet that bar, it’s probably not an alert; it’s a metric you should be observing on a dashboard, or a log entry you should be aggregating.
This might sound obvious, but I see so many teams still falling into the trap of alerting on every single metric that dips or spikes. Let’s break down how to get smarter about this.
1. Focus on Business Impact, Not Just Technical Metrics
This is probably the most significant shift. Instead of just alerting on a server hitting 90% CPU, consider what that 90% CPU actually *means* for your users. Is it causing slow page loads? Are API calls timing out? If not, maybe that server is just working hard, doing what it’s supposed to. High CPU on a batch processing server might be normal during its run cycle, but high CPU on your main API gateway during peak hours is a different story.
My current setup for a client involves a multi-tiered alerting strategy. At the top, we have alerts directly tied to user experience. For example:
- “Conversion Rate Drop Alert”: If the checkout conversion rate drops by X% compared to the 24-hour average for more than 5 minutes. This is a P1 alert. It means money is being lost.
- “Critical API Latency Spike”: If the average response time for /api/v1/checkout exceeds 500ms for 3 consecutive minutes. This is also a P1.
- “Login Failures Anomaly”: If the rate of failed login attempts suddenly spikes outside of its historical pattern, indicating a potential brute-force attack or a sudden auth system failure.
Notice how these aren’t about a single server. They’re about the *outcome* users are experiencing or the *security* of the platform. Technical metrics are often the *symptoms*, not the *illness* itself. Alert on the illness.
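To make the first of those alerts concrete, here’s a minimal sketch in Python of what a conversion-rate-drop check could look like. It assumes you can already pull per-minute conversion rates from your metrics store; the function name, thresholds, and payload fields are illustrative, not any particular tool’s API:

```python
from statistics import mean

def check_conversion_drop(recent_rates, baseline_rates, drop_pct=20.0):
    """Compare the last few minutes of checkout conversion rate against a
    24-hour baseline; return an alert dict if the drop exceeds drop_pct."""
    baseline = mean(baseline_rates)
    current = mean(recent_rates)
    if baseline == 0:
        return None  # no baseline traffic; nothing meaningful to compare
    drop = (baseline - current) / baseline * 100
    if drop > drop_pct:
        return {
            "alert_name": "Conversion Rate Drop Alert",
            "severity": "P1",  # money is being lost
            "baseline_pct": round(baseline, 2),
            "current_pct": round(current, 2),
            "drop_pct": round(drop, 2),
        }
    return None

# Baseline hovers around 3.0% conversion; the last 5 minutes average 1.8%,
# a 40% relative drop, so this fires.
alert = check_conversion_drop([1.8, 1.9, 1.7, 1.8, 1.8], [3.0, 3.1, 2.9, 3.0])
```

The key design choice is comparing against a *relative* baseline rather than a fixed number, so the same rule works whether your normal conversion rate is 2% or 8%.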
2. Differentiate Between Warning and Critical
Not all problems are equal. A warning alert should give your team a heads-up that something is trending in the wrong direction, allowing proactive intervention. A critical alert means “drop everything and fix this now.”
For instance, let’s take disk space. An old classic. Instead of just alerting at 90% usage (which might be too late to safely clean up), consider:
- Warning: Disk usage on critical data store reaches 80%. (Triggers an email to the relevant team, opens a low-priority ticket, but no immediate pager duty.)
- Critical: Disk usage on critical data store reaches 95%. (Triggers a P1 pager alert, wakes up the on-call engineer.)
This gives your team a chance to react before a full outage occurs. It’s about building in grace periods and multiple levels of escalation. You might even have a third level – “Informational” – for things that just need to be logged or aggregated for later review, but don’t require human intervention.
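That tiering logic is simple enough to sketch in a few lines. This uses the example thresholds from above; the routing labels (“page”, “ticket”) are placeholders for whatever your notification system actually does at each tier:

```python
def disk_alert_tier(usage_pct, warn_at=80.0, crit_at=95.0):
    """Map disk usage on a critical data store to an alert tier.
    Thresholds are the example values from the text; tune per data store."""
    if usage_pct >= crit_at:
        return ("CRITICAL", "page")    # P1: wake the on-call engineer
    if usage_pct >= warn_at:
        return ("WARNING", "ticket")   # email + low-priority ticket
    return ("OK", None)                # informational at most; no human needed

# 85% lands in the warning band: time to clean up before it's an outage.
tier, channel = disk_alert_tier(85.0)
```

Checking the critical threshold first matters: a disk at 96% satisfies both conditions, and you want the more severe tier to win.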
3. Context is King: Enrich Your Alerts
An alert that just says “Service X is down” is… less than helpful. An alert that says “Service X is down in us-east-1, affecting 15% of users, average response time for /api/v1/orders is now 5000ms, and the last 5 restarts failed with ‘database connection refused’” – now *that’s* an alert you can work with.
Modern monitoring tools allow you to pull in a wealth of related data. When an alert fires, try to include:
- Relevant metrics: What was the CPU, memory, error rate just before the alert?
- Recent logs: The last 10-20 error logs from the affected service.
- Links to dashboards: A direct link to the relevant Grafana/Prometheus/Datadog dashboard for deeper investigation.
- Runbook links: A link to a specific runbook or troubleshooting guide for that alert type.
Here’s a simplified example of an alert message payload (often sent to Slack or PagerDuty):
```json
{
  "alert_name": "API Gateway High Error Rate (5xx)",
  "severity": "CRITICAL",
  "timestamp": "2026-03-18T14:30:00Z",
  "service": "api-gateway-service",
  "region": "us-west-2",
  "impact": "5xx errors affecting ~10% of API requests.",
  "trigger_condition": "Average 5xx rate > 5% over 5 minutes",
  "current_value": "7.2% (average over last 5 min)",
  "details": [
    "Recent log snippet (last 30s): [ERROR] Request failed: Connection refused for DB 'orders_db'",
    "Affected instances: ip-10-0-0-10, ip-10-0-0-11",
    "Recommended action: Check 'orders_db' health and API gateway logs.",
    "Dashboard link: https://dashboards.example.com/api-gateway-overview",
    "Runbook: https://runbooks.example.com/api-gateway-5xx"
  ]
}
```
This isn’t just an alert; it’s a miniature incident report, giving the on-call engineer a massive head start.
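If you’re assembling payloads like that yourself, a small builder function keeps the shape consistent across every alert you emit. This is a sketch that mirrors the field names of the example above; the helpers you’d call to actually fetch logs, dashboards, and runbook links depend entirely on your own tooling:

```python
import json
from datetime import datetime, timezone

def build_alert_payload(alert_name, severity, service, region,
                        impact, trigger_condition, current_value, details):
    """Assemble an enriched alert payload. Field names mirror the JSON
    example above; `details` is a list of context strings (log snippets,
    affected instances, dashboard and runbook links) gathered elsewhere."""
    return {
        "alert_name": alert_name,
        "severity": severity,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "region": region,
        "impact": impact,
        "trigger_condition": trigger_condition,
        "current_value": current_value,
        "details": details,
    }

payload = build_alert_payload(
    "API Gateway High Error Rate (5xx)", "CRITICAL",
    "api-gateway-service", "us-west-2",
    "5xx errors affecting ~10% of API requests.",
    "Average 5xx rate > 5% over 5 minutes", "7.2%",
    ["Runbook: https://runbooks.example.com/api-gateway-5xx"],
)
print(json.dumps(payload, indent=2))  # ready to POST to Slack/PagerDuty
```

Centralizing the payload shape like this means every alert, from every service, arrives with the same fields in the same places, which is exactly what a half-asleep on-call engineer needs.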
4. Alert on Rate of Change and Anomalies, Not Just Static Thresholds
Static thresholds are easy to set up, but they’re often brittle. What’s a normal CPU usage at 3 AM might be critical at 3 PM. What if your traffic profile changes? You’ll be constantly tweaking thresholds or suffering from false positives/negatives.
This is where things like “rate of change” and “anomaly detection” become powerful. Instead of “alert if 5xx count > 100”, try:
- “Alert if 5xx rate increases by 3 standard deviations from the 1-hour rolling average.”
- “Alert if the number of active users drops by 20% compared to the same time yesterday.”
Many modern monitoring platforms (Datadog, New Relic, Grafana with Prometheus and ML plugins) offer built-in anomaly detection. It takes a bit more effort to configure and fine-tune, but the reduction in false positives is often worth it. It learns what “normal” looks like for your systems and alerts when things deviate significantly.
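Before reaching for a platform feature, you can prototype the “3 standard deviations from the rolling average” idea yourself. This sketch assumes you can pull a window of recent per-minute 5xx counts from your metrics store; it’s a toy z-score check, not a substitute for a real anomaly-detection backend:

```python
from statistics import mean, stdev

def is_anomalous(history, current, n_sigma=3.0):
    """Flag `current` when it deviates from the rolling history by more
    than n_sigma standard deviations (a simple z-score test)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # perfectly flat history: any change is a deviation
    return abs(current - mu) > n_sigma * sigma

# One hour of per-minute 5xx counts hovering around 10.
history = [9, 11, 10, 12, 8, 10, 11, 9, 10, 10]
is_anomalous(history, 11)   # within normal variation: no alert
is_anomalous(history, 45)   # far outside 3 sigma: alert
```

Even this toy version demonstrates the win over static thresholds: a count of 45 fires because it’s abnormal *for this service*, while a noisier service with a wider history simply gets a wider tolerance band, no threshold-tweaking required.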
For example, using Prometheus and Alertmanager, you might define a rule like this (simplified):
```yaml
groups:
  - name: critical_alerts
    rules:
      - alert: HighAPILatencyAnomaly
        expr: |
          (
            rate(http_request_duration_seconds_bucket{le="0.5", job="api-service"}[5m] offset 1h)
            /
            rate(http_request_duration_seconds_count{job="api-service"}[5m] offset 1h)
          )
          -
          (
            rate(http_request_duration_seconds_bucket{le="0.5", job="api-service"}[5m])
            /
            rate(http_request_duration_seconds_count{job="api-service"}[5m])
          )
          > 0.1  # fires if the share of requests under 500ms drops more than 10 points vs. 1h ago
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API service latency has significantly increased compared to 1 hour ago."
          description: "The share of requests completing within 500ms has dropped by more than 10 percentage points."
```

(Note that in PromQL, `offset` attaches to each vector selector inside `rate()`, not to the expression as a whole.)
This kind of rule is looking for a *change* in behavior, which is often a much stronger signal for an actual problem than a static threshold.
Actionable Takeaways for Your Alerting Strategy
Okay, so we’ve talked theory. What can you actually do right now to improve your alerts?
- Audit Your Existing Alerts: Go through every single alert you have. For each one, ask: “What action do I take when this fires? Is it immediate? Is it critical? Does it provide enough context?” If you can’t answer definitively, it needs refinement.
- Prioritize Business Impact: Identify your most critical business flows (login, checkout, search, content delivery). Build alerts around the health of those flows, focusing on user-facing metrics first.
- Implement Multi-Tiered Alerting: Don’t treat all problems equally. Define warning, critical, and perhaps informational tiers for your alerts. Different tiers should have different notification methods (email, Slack, pager duty).
- Enrich Your Alert Payloads: Make sure your alerts deliver as much context as possible: relevant metrics, logs, links to dashboards, and runbooks. The less time an on-call engineer spends hunting for information, the faster they can resolve an issue.
- Experiment with Anomaly Detection: If your platform supports it, start small by applying anomaly detection to a few key metrics that are prone to false positives with static thresholds. Learn how it behaves for your specific workload.
- Regularly Review and Tune: Your systems evolve, traffic patterns change, and new services come online. Your alerts need to evolve too. Schedule quarterly reviews with your team to discuss alert effectiveness, false positives, and missed incidents.
Building a truly effective alerting system is an ongoing process, not a one-time setup. It requires understanding your systems, your users, and your team’s operational needs. But trust me, investing the time now will save you countless headaches, missed sleep, and actual revenue down the line. Stop drowning in notifications and start getting signals that truly matter. Your future self (and your on-call team) will thank you.
Originally published: March 18, 2026