Alright, agntlog fam! Chris Wade here, and today we’re diving headfirst into something that’s been chewing at my brain lately: the humble (or not-so-humble) alert. Now, I know what you’re thinking – “Alerts? We’ve got alerts, Chris. They go off. Sometimes.” But that’s precisely the problem, isn’t it?
It’s 2026. We’re past the point of just having alerts. We need effective alerts. Alerts that tell us something useful, not just that a server somewhere is feeling a bit peaky. Because let’s be real, an alert that gets ignored is worse than no alert at all. It’s a boy-who-cried-wolf scenario, but instead of wolves, it’s a critical system going down while you’re staring at 30 unread Slack notifications.
My particular bugbear these days? The sheer volume of “noisy” alerts that drown out the actual emergencies. We’ve all been there. You set up a new service, meticulously configure your monitoring, and then BAM! Your inbox is a wasteland of CPU spikes, memory warnings, and disk space alerts that resolve themselves five minutes later. Before you know it, you’re hitting ‘Mark All Read’ with the efficiency of a seasoned gamer mashing buttons.
So, today, I want to talk about how we can make our alerts smarter, more contextual, and ultimately, more actionable. This isn’t just about reducing noise; it’s about building trust in our monitoring systems again. It’s about getting back to that feeling of, “Oh crap, an alert! I better check this,” instead of, “Ugh, another one.”
The Alert Fatigue Epidemic: My Own Battle Scars
I distinctly remember a project a few years back – a fairly complex microservices setup for a client in the e-commerce space. Think hundreds of services, all talking to each other, all with their own little quirks. We had monitoring everywhere, which on paper, sounded great. In practice? It was a nightmare.
Every deployment, every minor hiccup, every slightly elevated network latency would trigger a cascade of alerts. Our on-call rotation was a mess. Engineers were burned out, constantly chasing ghosts. I personally spent an entire Saturday debugging what turned out to be a flaky network card on a non-critical development server that somehow still had production-level alerting enabled. My wife was thrilled, as you can imagine.
That experience taught me a harsh lesson: more alerts do not equal more visibility. They often equal less. When everything is urgent, nothing is. We had effectively desensitized ourselves to the very signals we had put in place to protect our systems.
The solution wasn’t to turn off monitoring. It was to get surgical with our alerting. To understand what truly mattered and what was just environmental chatter.
Beyond Thresholds: Contextual Alerts for the Win
The most common type of alert, and arguably the biggest contributor to fatigue, is the simple threshold alert. CPU usage over 80%? Alert! Disk space below 10%? Alert! While these have their place, they often lack context. Is 80% CPU usage bad if it’s for two minutes during a scheduled batch job? Probably not. Is 10% disk space bad if it’s on a temporary scratch volume that self-cleans every hour? Again, probably not.
This is where we need to start thinking about contextual alerting. It’s about combining multiple signals and understanding the state of your system, not just individual metrics.
1. Baselines and Anomaly Detection
Instead of static thresholds, consider dynamic baselines. What’s “normal” for your system during specific hours or days? Tools exist now that can learn these patterns and only alert you when behavior deviates significantly from the norm. This is particularly powerful for things like request latency or error rates, which can fluctuate naturally throughout the day.
For example, if your average API response time is 50ms during peak hours and suddenly jumps to 500ms for more than 5 minutes, that’s an alert. But if it briefly hits 100ms during a deployment and quickly returns to normal, that’s just noise.
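As a sketch of what this can look like in Prometheus terms: compare the current value against its own trailing baseline using a z-score. Note that `api_request_duration_seconds:avg_5m` here is a hypothetical recording rule name — substitute whatever latency metric you actually record.

```yaml
# Sketch of a baseline-deviation alert. Fires when the 5-minute average
# latency sits more than three standard deviations above its trailing
# one-day norm, for at least five minutes.
# `api_request_duration_seconds:avg_5m` is an assumed recording rule.
groups:
  - name: latency-anomaly
    rules:
      - alert: LatencyFarAboveBaseline
        expr: |
          (
            api_request_duration_seconds:avg_5m
            - avg_over_time(api_request_duration_seconds:avg_5m[1d])
          )
          / stddev_over_time(api_request_duration_seconds:avg_5m[1d])
          > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Latency on {{ $labels.instance }} is far outside its normal range"
```

The `for: 5m` clause is what filters out the brief deployment blip: a short excursion past the baseline never fires, while a sustained jump does.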
2. Correlated Alerts: The Power of AND
This is my absolute favorite for reducing noise. Instead of alerting on a single metric, combine them. An alert shouldn’t fire unless multiple conditions are met, suggesting a genuine problem.
Let’s say you have a web server. Individually:
- CPU usage > 80%? Maybe it’s fine.
- Memory usage > 90%? Maybe it’s fine.
- Error rate > 5%? Maybe it’s a bad client request.
But what if ALL of those happen simultaneously for more than 60 seconds? Now you’re talking about a real problem. Your web server is probably struggling to serve legitimate requests.
Here’s a simplified example of how you might configure this in a monitoring system like Prometheus with Alertmanager, or even a custom script:
```yaml
# Prometheus Alerting Rule (YAML)
groups:
  - name: webserver-issues
    rules:
      - alert: HighLoadAndErrors
        expr: |
          (node_cpu_usage_percentage{job="webserver"} > 80)
          and
          (node_memory_usage_percentage{job="webserver"} > 90)
          # `on ()` is needed here: the sum() ratio below has no labels,
          # so it can never match the labeled CPU/memory vectors under
          # default one-to-one matching.
          and on ()
          (
            sum(rate(http_requests_total{job="webserver", status!~"2xx|3xx"}[5m]))
            /
            sum(rate(http_requests_total{job="webserver"}[5m]))
          ) > 0.05
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Web server {{ $labels.instance }} experiencing high load and errors"
          description: "CPU, memory, and error rates are all elevated on {{ $labels.instance }}. Immediate investigation required."
```
This rule won’t fire unless all three conditions are met for at least one minute. That’s a much stronger signal than any one of those metrics alone.
3. Service-Level Objective (SLO) Based Alerts
If you’re serious about reliability, you’re likely thinking about SLOs. These are concrete, measurable targets for your service’s performance. Instead of alerting on underlying infrastructure metrics directly, alert when your service is at risk of violating its SLO.
For example, if your SLO is “99.9% of requests must complete within 200ms over a 7-day period,” you can set up an alert that fires when your current error budget is being consumed too quickly. This tells you not just that something is slow, but that your customers are feeling the impact and you’re jeopardizing your reliability target.
This is a more mature approach to alerting because it aligns directly with business impact. You’re not just fixing a server; you’re protecting your service’s reputation and customer satisfaction.
A simple way to think about this is “error budget burn rate.” If your service has an error budget (e.g., 0.1% of requests can fail), and suddenly you’re failing at 0.5% for an extended period, that’s burning through your budget much faster than expected. An alert on this burn rate gives you time to react before you actually blow your SLO for the month.
```yaml
# Simplified SLO burn-rate alert (conceptual; Prometheus-style syntax).
# Fires if the 1-hour error rate is 10x higher than the weekly SLO allows.
# Assumes a `service` label exists on these metrics.
groups:
  - name: slo-burn
    rules:
      - alert: SLOBurnRateHigh
        expr: |
          sum by (service) (rate(my_service_errors_total[1h]))
          /
          sum by (service) (rate(my_service_requests_total[1h]))
          > (0.001 * 10)  # 0.1% allowed error rate x a burn-rate factor of 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Service {{ $labels.service }} is burning error budget quickly."
          description: "Current error rate is significantly higher than allowed by the SLO. Investigate recent deployments or upstream issues."
```
This kind of alert gives you early warning that you’re heading towards a violation, rather than just telling you when you’ve already hit it.
Actionable Takeaways for Smarter Alerting
So, how do we get from our current state of alert fatigue to a more robust, trust-inspiring alerting system? Here’s my playbook:
- Audit Your Existing Alerts: This is step one. Go through every alert you have. For each one, ask:
  - What problem does this alert indicate?
  - What’s the expected resolution time?
  - Who is responsible for acting on it?
  - What happens if this alert is ignored? (If nothing, delete it.)

  Be ruthless. If an alert frequently fires without requiring action, it’s a candidate for modification or deletion.
- Prioritize Based on Impact: Not all alerts are created equal. Use severity levels (critical, warning, info) and ensure your notification channels reflect this. Critical alerts should page someone. Warnings might go to a less intrusive channel like a dedicated Slack channel or email. Informational alerts? Probably just logs.
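To make the severity-routing idea concrete, here’s a sketch of what it can look like in an Alertmanager config. The receiver names, Slack channel, and PagerDuty key are all placeholders, and per-receiver details (API URLs, etc.) are trimmed for brevity:

```yaml
# Sketch: route alerts by severity. All names and keys are placeholders.
route:
  receiver: slack-warnings          # default for anything unmatched
  routes:
    - matchers: [ 'severity = "critical"' ]
      receiver: pagerduty-oncall    # critical alerts page a human
    - matchers: [ 'severity = "info"' ]
      receiver: info-webhook        # low-priority sink: logs, dashboards
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_WITH_INTEGRATION_KEY
  - name: slack-warnings
    slack_configs:
      - channel: '#alerts-warning'
  - name: info-webhook
    webhook_configs:
      - url: 'http://localhost:9095/ingest'
```

The key design choice is that the notification channel is decided by the `severity` label on the alert, so the alerting rules themselves stay simple and the routing policy lives in one place.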
- Embrace Correlation: Look for opportunities to combine multiple metrics into a single, more meaningful alert. If your app server is slow AND the database is showing high disk I/O, that’s one critical alert, not two separate warnings.
- Use Dynamic Thresholds or Anomaly Detection: Where static thresholds fail you, explore tools that can learn normal behavior and only alert on deviations. Many modern monitoring platforms offer this out-of-the-box or via integrations.
- Focus on SLOs, Not Just Infrastructure: Shift your mindset from “is the server healthy?” to “is the service delivering on its promise to users?” Alerts based on SLOs provide direct insight into customer experience.
- Run Drills and Test Your Alerts: Don’t wait for a production incident to find out your alerts are broken or misconfigured. Schedule regular “fire drills” where you intentionally trigger alerts and ensure they behave as expected. Do they go to the right people? Is the message clear?
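One low-effort way to drill the rules themselves is Prometheus’s `promtool test rules`, which replays synthetic series against a rule file and asserts on the alerts that fire. A sketch, assuming a correlated CPU/memory/error-rate alert named `HighLoadAndErrors` lives in a file called `webserver-alerts.yml` (both names are assumptions):

```yaml
# tests.yml -- run with: promtool test rules tests.yml
# Feeds synthetic series that keep CPU, memory, and error rates elevated,
# then asserts the HighLoadAndErrors alert actually fires.
rule_files:
  - webserver-alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'node_cpu_usage_percentage{job="webserver", instance="web-1"}'
        values: '90+0x10'   # holds at 90% for the whole window
      - series: 'node_memory_usage_percentage{job="webserver", instance="web-1"}'
        values: '95+0x10'
      - series: 'http_requests_total{job="webserver", instance="web-1", status="200"}'
        values: '0+100x10'  # counter: ~100 successful requests/min
      - series: 'http_requests_total{job="webserver", instance="web-1", status="500"}'
        values: '0+10x10'   # counter: ~9% error rate, above the 5% bar
    alert_rule_test:
      - eval_time: 10m
        alertname: HighLoadAndErrors
        exp_alerts:
          - exp_labels:
              severity: critical
              job: webserver
              instance: web-1
```

This won’t tell you whether PagerDuty actually rings a phone — you still want end-to-end drills for that — but it catches broken expressions and mis-set `for:` durations before they matter.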
- Document Your Alerts: For every alert, document what it means, why it’s important, and initial troubleshooting steps. This empowers your on-call team to react faster and more effectively.
This isn’t a one-time fix. Building a healthy alerting culture is an ongoing process. It requires continuous refinement, listening to your on-call teams, and being willing to adapt. But I promise you, investing in smarter, more contextual alerts will pay dividends in reduced fatigue, faster incident resolution, and ultimately, more reliable services for your users.
Let’s stop crying wolf and start sending out clear, actionable signals. Your future self (and your on-call team) will thank you.