Alright, folks, Chris Wade here, back in your inbox and on your screens at agntlog.com. Today, we’re diving deep into a topic that, honestly, keeps me up at night more often than I’d care to admit: alerts. Not just any alerts, mind you, but the kind that actually matter, the kind that don’t send you spiraling into existential dread every Tuesday at 3 AM because some cron job on a dev server sneezed.
The year is 2026. If your alert strategy still looks like a firehose pointed directly at your PagerDuty, we need to talk. Seriously. I’ve seen this movie too many times. The one where the on-call engineer, bleary-eyed and fueled by lukewarm coffee, stares at a screen full of red only to discover that 90% of it is noise. Or worse, the truly critical alert gets buried under a pile of “info” level warnings that were never really urgent. We’re talking alert fatigue, a silent killer of productivity, morale, and frankly, my sanity.
My own journey with alerts has been a rollercoaster. Early in my career, I was a firm believer in “alert on everything.” If a CPU spiked, alert! If disk usage hit 80%, alert! If a log line had the word “error” in it, alert! It felt proactive, like I was truly on top of things. In reality, I was perpetually exhausted. I remember one particularly brutal week where I was woken up four nights in a row by an alert about a database connection pool nearly hitting its limit, only to find it was a transient spike during an hourly report generation. It was an issue, sure, but not “wake up the human” urgent. The real problem? We didn’t have a clear definition of what “urgent” meant.
So, today, we’re not just talking about alerts. We’re talking about building an intelligent alerting system. One that respects your sleep, focuses on impact, and actually helps you fix problems faster, rather than just pointing out symptoms.
The Problem with “Alert on Everything”
Let’s be blunt: most alerting systems are broken. They generate too many alerts, too few actionable alerts, or they alert on the wrong things. The default templates from your monitoring solution are often a good starting point, but they rarely translate directly to your specific business logic or user experience.
Think about it: a CPU spike on a backend service might be concerning, but if that service has auto-scaling enabled and gracefully handles the load, is it truly an incident requiring immediate human intervention? Probably not. An HTTP 500 error on your login endpoint, however, absolutely warrants a wake-up call. The difference? Impact.
The goal isn’t to eliminate alerts entirely. It’s to ensure that when an alert fires, it means something truly important has happened or is about to happen that impacts your users or your business, and that a human needs to intervene.
The Four Pillars of Intelligent Alerting
After years of trial and error, late-night debugging sessions, and more than a few frustrated yells at my monitor, I’ve distilled intelligent alerting down to four key pillars:
- Focus on Symptoms, Not Causes (Mostly): Alert on what your users experience.
- Thresholds That Make Sense: Don’t just pick arbitrary numbers.
- Context is King: Give your on-call team the information they need immediately.
- Regular Review and Refinement: Alerts are not set-it-and-forget-it.
Let’s break these down.
1. Focus on Symptoms, Not Causes (Mostly)
This is probably the biggest shift in mindset for many teams. Instead of alerting when a specific component fails (a “cause”), alert when the overall system performance or user experience is degraded (a “symptom”).
For example, instead of alerting on:
- High CPU utilization on a specific web server.
- Database connection pool exhaustion.
- Increased latency to a third-party API.
Consider alerting on:
- Elevated HTTP 5xx errors on your primary API endpoint.
- Increased average response time for critical user journeys (e.g., login, checkout).
- Decreased successful transaction rate.
Why? Because the latter directly reflects user pain. The former might be a precursor, but it might also be a perfectly normal operational state that the system handles gracefully. By focusing on symptoms, you cut through the noise. If your API is returning 5xx errors, it doesn’t matter if it’s CPU, database, or network; something is broken for your users, and you need to know.
Of course, this doesn’t mean you ignore component-level monitoring. That’s crucial for debugging once an incident is declared. But for the initial alert that wakes someone up, prioritize the user experience.
Practical Example: SLI/SLO-Based Alerting
If you’ve defined Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for your critical services, these become the perfect foundation for your alerts. Let’s say your SLO for your e-commerce checkout service is “99.9% of checkout requests must complete within 2 seconds.”
You’d then set up an alert that fires when:
- The error rate for checkout requests exceeds 0.1% over a 5-minute window.
- The 95th percentile latency for checkout requests exceeds 2 seconds for 3 consecutive minutes.
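Here’s a simplified Prometheus example of SLI-based alerts for a hypothetical checkout service. One caveat: p95 is a coarser signal than the 99.9% target in the SLO. It only trips once at least 5% of requests are slower than 2 seconds, so treat the latency rule as a backstop for gross violations rather than an exact SLO check.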
groups:
  - name: checkout-service-sli-alerts
    rules:
      - alert: CheckoutHighErrorRate
        expr: |
          sum(rate(checkout_requests_total{status="5xx"}[5m]))
          /
          sum(rate(checkout_requests_total[5m]))
          > 0.001
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Checkout service experiencing high 5xx error rate"
          description: "The checkout service is returning more than 0.1% 5xx errors over the last 5 minutes. User impact is likely high."
      - alert: CheckoutHighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(checkout_request_duration_seconds_bucket[5m])))
          > 2
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Checkout service response latency exceeding SLO"
          description: "95th percentile latency for checkout requests has exceeded 2 seconds for 3 consecutive minutes. Users are experiencing slow checkouts."
These alerts directly address the user experience and business impact. When these fire, you know it’s time to act.
2. Thresholds That Make Sense
This is where the art meets the science. Arbitrary thresholds like “CPU > 80%” or “Disk > 90%” are rarely useful on their own. Instead, base your thresholds on:
- Historical Baselines: What’s “normal” for your service? Use monitoring tools to analyze historical data and identify typical ranges.
- Capacity Planning: At what point does a resource constraint actually impact performance or stability?
- Business Impact: How much degradation can your business tolerate before revenue or user experience is significantly affected?
- Error Budgets: If you have SLOs, use your error budget to define your alert thresholds.
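To make the error budget idea concrete, here is a minimal Python sketch of the arithmetic. The term "burn rate" (how fast you are spending the budget relative to plan), the function names, and the paging threshold of 10 are my own illustrative assumptions, not something your monitoring tool defines for you.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.999 -> roughly 0.001."""
    return 1.0 - slo_target


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error ratio divided by the budgeted error ratio.

    1.0 means errors are arriving exactly as fast as the SLO allows;
    5.0 means the budget is being spent five times too fast.
    """
    if requests == 0:
        return 0.0  # no traffic, nothing is being burned
    return (errors / requests) / error_budget(slo_target)


def should_page(errors: int, requests: int, slo_target: float,
                threshold: float = 10.0) -> bool:
    """Page only when the budget burns much faster than planned.

    The default threshold is an illustrative number, not a recommendation.
    """
    return burn_rate(errors, requests, slo_target) >= threshold


if __name__ == "__main__":
    # 20 failures out of 1,000 requests against a 99.9% SLO
    print(round(burn_rate(20, 1000, 0.999), 2))
    print(should_page(20, 1000, 0.999))
```

The useful property here is that the threshold is expressed in terms of the SLO rather than a raw percentage, so tightening or loosening the SLO moves the alert with it.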
One common mistake is setting thresholds too close to normal operating values. This leads to flapping alerts – alerts that trigger and resolve repeatedly, further contributing to fatigue. Consider using “for” clauses in your alerting rules (as in the Prometheus example above) to ensure a condition persists for a certain duration before firing an alert.
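If it helps to see why a hold period dampens flapping, here is a small Python sketch of the idea. It is a simplified model of the behavior, not how Prometheus implements "for" internally, and the class and parameter names (`PendingAlert`, `hold_seconds`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Fires only after a condition has been breached continuously."""

    hold_seconds: float
    breach_started: Optional[float] = None  # timestamp of first breach in the current run

    def observe(self, breached: bool, now: float) -> bool:
        """Record one evaluation; return True if the alert should fire."""
        if not breached:
            # Any recovery resets the clock, so short spikes never fire.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.hold_seconds


if __name__ == "__main__":
    alert = PendingAlert(hold_seconds=180)
    samples = [(0, True), (60, True), (120, False), (150, True), (330, True)]
    for timestamp, breached in samples:
        print(timestamp, alert.observe(breached, now=timestamp))
```

In the sample run, the brief recovery at 120 seconds resets the timer, so the alert only fires once the second breach has lasted the full three minutes.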
Also, think about different levels of urgency. Maybe a warning when latency hits 1.5 seconds, and a critical alert when it hits 2 seconds. This gives your team a chance to investigate before things go completely sideways.
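A two-tier scheme like that can be sketched in a few lines of Python. The 1.5 and 2 second cut-offs come from the example above; the function and constant names are made up for illustration.

```python
from typing import Optional

WARNING_SECONDS = 1.5   # investigate during working hours
CRITICAL_SECONDS = 2.0  # page the on-call engineer


def latency_severity(p95_seconds: float) -> Optional[str]:
    """Map a p95 latency reading to a severity, or None if healthy."""
    if p95_seconds >= CRITICAL_SECONDS:
        return "critical"
    if p95_seconds >= WARNING_SECONDS:
        return "warning"
    return None


if __name__ == "__main__":
    for reading in (1.0, 1.6, 2.3):
        print(reading, latency_severity(reading))
```

Checking the critical tier first matters: a reading above 2 seconds also clears the warning bar, and you want it classified at the higher severity.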
3. Context is King
An alert that just says “Something is wrong!” is useless. When an alert fires, your on-call engineer needs immediate context to understand:
- What service is affected?
- What is the nature of the problem? (e.g., high errors, slow response, resource exhaustion)
- What is the potential impact? (e.g., “Login is broken for all users,” “Partial outage in EU region”)
- Where can I find more information? (links to dashboards, logs, runbooks)
Your alert messages should be descriptive and include relevant labels or annotations. Don’t make your team dig for information. If your monitoring system integrates with a runbook or wiki, link directly to the troubleshooting guide for that specific alert.
For example, instead of an alert subject like “Service X – High CPU,” aim for something like:
Subject: CRITICAL: Checkout Service - High 5xx Error Rate (EU Region)
Description:
The Checkout service in the EU region is experiencing a sustained 5xx error rate of 0.5% over the last 5 minutes. This is above the SLO threshold of 0.1%.
Potential Impact: Users in the EU region are unable to complete purchases.
Relevant Dashboards:
- Checkout Service Overview: [Link to Grafana/Datadog dashboard]
- EU Region Logs: [Link to Splunk/ELK query]
Runbook:
- Troubleshooting 5xx Errors: [Link to Confluence/Wiki page]
Affected Pods: [List of affected pods/instances if available]
This provides immediate, actionable information, which should cut both Mean Time To Acknowledge (MTTA) and Mean Time To Resolve (MTTR), because the engineer starts from the dashboard and runbook instead of hunting for them.
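One way to keep alert messages consistent is to build them from a structured record and refuse to ship an alert that lacks context. The Python sketch below assumes a hypothetical `AlertContext` record and placeholder example.com URLs; it is not tied to any particular paging tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlertContext:
    severity: str
    service: str
    problem: str
    region: str
    impact: str
    links: Dict[str, str] = field(default_factory=dict)  # label -> URL


def missing_context(ctx: AlertContext) -> List[str]:
    """Names of required fields that are empty, in a fixed order."""
    required = {
        "service": ctx.service,
        "problem": ctx.problem,
        "impact": ctx.impact,
        "links": ctx.links,
    }
    return [name for name, value in required.items() if not value]


def render_alert(ctx: AlertContext) -> str:
    """Format the subject, impact, and links as plain text."""
    lines = [
        f"Subject: {ctx.severity.upper()}: {ctx.service} - {ctx.problem} ({ctx.region})",
        f"Potential Impact: {ctx.impact}",
    ]
    for label, url in ctx.links.items():
        lines.append(f"{label}: {url}")
    return "\n".join(lines)


if __name__ == "__main__":
    ctx = AlertContext(
        severity="critical",
        service="Checkout Service",
        problem="High 5xx Error Rate",
        region="EU Region",
        impact="Users in the EU region are unable to complete purchases.",
        links={"Runbook": "https://example.com/runbooks/checkout-5xx"},  # placeholder URL
    )
    print(missing_context(ctx))
    print(render_alert(ctx))
```

Running `missing_context` as a check in CI or in your alert-definition review is a cheap way to catch alerts that would page someone with no runbook attached.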
4. Regular Review and Refinement
This is perhaps the most overlooked pillar. Your system changes, traffic patterns evolve, and new features are deployed. Your alerts need to evolve with them. What was a critical alert six months ago might be noise today, and vice versa.
I advocate for a regular “alert review” meeting. This could be monthly, quarterly, or after any major incident. In these reviews:
- Analyze “noisy” alerts: Which alerts fire frequently but don’t result in meaningful action? Can their thresholds be adjusted? Can they be demoted to a warning or simply logged?
- Review “missed” incidents: Were there incidents that impacted users but didn’t trigger an alert? This indicates a gap in your alerting coverage.
- Update alert definitions: Ensure alerts reflect the current state of your services and their dependencies.
- Get feedback from on-call: The folks actually getting paged are your best source of truth. Listen to their pain points.
Just last month, we found an alert that was firing weekly for a “low disk space” issue on a log aggregation server. It turned out that our log retention policy had recently changed, and the server was aggressively deleting old logs, causing the disk usage to fluctuate. The alert was technically correct, but the issue was self-correcting and didn’t require human intervention. We adjusted the threshold and added a suppression rule for the specific file system, saving countless unnecessary pages.
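If you can export your paging history, a rough first pass at spotting alerts like that one can be automated. This Python sketch assumes a hypothetical history format of (alert name, whether anyone took action) pairs, and the cut-offs (at least 4 fires, action on at most 25% of them) are arbitrary starting points to tune in your own review.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def noisy_alerts(history: Iterable[Tuple[str, bool]],
                 min_fires: int = 4,
                 max_action_ratio: float = 0.25) -> List[str]:
    """Return alert names that fire often but rarely lead to action."""
    fires: Dict[str, int] = defaultdict(int)
    actions: Dict[str, int] = defaultdict(int)
    for name, action_taken in history:
        fires[name] += 1
        if action_taken:
            actions[name] += 1
    return sorted(
        name
        for name, count in fires.items()
        if count >= min_fires and actions[name] / count <= max_action_ratio
    )


if __name__ == "__main__":
    history = (
        [("LowDiskSpace", False)] * 4
        + [("CheckoutHighErrorRate", True)] * 4
        + [("RareAlert", False)]
    )
    print(noisy_alerts(history))
```

The output is only a shortlist for the review meeting. Whether an alert gets retuned, demoted, or deleted is still a judgment call for the people who carry the pager.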
Actionable Takeaways
Phew! That was a lot. But hopefully, it gives you a clear path forward. Here’s what you can do starting today:
- Audit Your Current Alerts: Go through every single alert you have configured. For each one, ask: “Does this alert directly indicate an impact on users or business? Does it require immediate human intervention?” If the answer is no, consider demoting it, logging it, or removing it.
- Prioritize SLI/SLO-Based Alerts: Identify your most critical services and define their SLIs and SLOs. Then, build alerts directly off these metrics. If you’re not doing SLIs/SLOs, start with a simple “error rate” and “latency” for your main user-facing endpoints.
- Improve Alert Context: Ensure every alert message includes who, what, where, and links to relevant dashboards/runbooks. This is a quick win that pays massive dividends.
- Schedule Regular Alert Reviews: Put it on the calendar. Make it a recurring meeting with your operations and development teams. Treat your alerts as living documents that need constant care.
- Embrace a “Noisy Alert” Policy: If an alert fires repeatedly without actionable follow-up, it’s noisy. Fix it. Don’t just mute it and hope it goes away.
Intelligent alerting isn’t about setting up a bunch of rules and forgetting them. It’s an ongoing process, a continuous feedback loop between your systems, your teams, and your users. Get it right, and you’ll not only have a more stable system but also a happier, less fatigued on-call team. And trust me, that’s worth more than all the lukewarm coffee in the world.
Stay sharp out there, folks. Until next time.