Hey everyone, Chris Wade here from agntlog.com, popping in with another dive into the nitty-gritty of keeping your agents in line. Today, I want to talk about something that’s been on my mind a lot lately, especially with the move towards more distributed systems and microservices: how we approach alerting. Specifically, I want to tackle the beast that is alert fatigue and how we can get smarter about what actually triggers our alarms.
It’s 2026, and if your phone is still buzzing like a bee’s nest every time a CPU spikes above 80% for five minutes, or a single agent drops offline in a cluster of thousands, we need to have a serious chat. We’re past the days where every red light on a dashboard required a human intervention. We have to be more sophisticated, more contextual, and frankly, more respectful of our on-call engineers’ sleep schedules.
The Echo Chamber of Alerts: My Own Painful History
I’ve been there. Oh, have I been there. Back in my early days managing a fleet of payment processing agents, our alerting setup was… enthusiastic. To put it mildly. We had alerts for everything. Disk space usage? Alert. Network latency spike? Alert. A single 4xx error on an API endpoint? You guessed it, alert. My phone, and the phones of my team, were basically perpetually vibrating. It got so bad that we started developing a kind of selective deafness to the alert tone. You’d hear it, glance, and if it wasn’t the “datacenter on fire” alert, you’d just… ignore it. That’s a dangerous place to be, and it’s the classic definition of alert fatigue.
The problem wasn’t that these metrics weren’t important. They were! But the way we were alerted about them was fundamentally flawed. We were treating symptoms as critical incidents, instead of understanding the broader health of the system and, crucially, the impact on our users or business goals.
Moving Beyond Simple Thresholds: The Power of SLOs and SLIs
So, what’s the antidote to this endless stream of noise? For me, it’s been a gradual but firm shift towards focusing our alerts on Service Level Objectives (SLOs) and the underlying Service Level Indicators (SLIs). If you’re not familiar, SLIs are quantifiable measures of some aspect of the service provided – think latency, error rate, throughput. SLOs are the target values for those SLIs, the promises you make to your users (or your internal stakeholders) about how well your service will perform.
The beauty of this approach is that it forces you to ask: “What truly matters to the user experience or the business?” Instead of alerting on every little wobble, you alert when you’re actually at risk of failing to meet your commitments. This is a game-changer for agent monitoring because it shifts the focus from individual agent health to the collective health and performance of the agent fleet as it relates to its mission.
Example 1: From CPU Spikes to Processing Throughput
Let’s say you have a fleet of data processing agents. In the old world, you might have an alert like this:
```yaml
- alert: HighCPUAgent
  # Note: node_cpu_seconds_total is a counter, so we need rate() for a meaningful percentage.
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Agent {{ $labels.instance }} has high CPU usage"
    description: "CPU idle percentage on {{ $labels.instance }} has been below 20% for 5 minutes."
```
This alert is a classic example of noise. A single agent having high CPU doesn't necessarily mean your data processing is failing. Maybe it's just a particularly large batch it's working on. What we really care about is whether the rate at which data is being processed is meeting our targets. So, instead, we might alert on an SLI like "records processed per minute" or "end-to-end processing latency."
A better alert would look something like this, assuming your agents expose a metric for processed records:
```yaml
- alert: LowProcessingThroughput
  # rate() yields records/second, so multiply by 60 to compare against a per-minute target.
  expr: sum(rate(agent_records_processed_total[5m])) * 60 < 1000  # Target: 1000 records/minute
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Overall processing throughput is low"
    description: "The total rate of records processed across all agents has dropped below 1000/minute for 10 minutes, indicating a backlog or processing issue affecting our SLO."
```
See the difference? The second alert tells you there's a problem affecting the overall system's ability to do its job, rather than just pointing out a single component's internal state.
Error Budgets: The Secret Sauce for Smart Alerting
Once you've defined your SLOs, you can then define an "error budget." This is the amount of time or the percentage of requests that your service is allowed to fail or degrade without violating your SLO. For example, if your SLO is 99.9% availability, your error budget is 0.1% of downtime or errors over a given period (say, a month).
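To make that concrete, here's a minimal Python sketch (the function name is mine, not from any library) that turns an availability SLO into a downtime budget:

```python
# Hypothetical helper: convert an availability SLO into an error budget
# expressed as minutes of allowed downtime over a rolling window.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by `slo` (e.g. 0.999) over `window_days`."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

print(round(error_budget_minutes(0.999), 1))  # 99.9% over 30 days -> 43.2 minutes
print(round(error_budget_minutes(0.99), 1))   # 99%   over 30 days -> 432.0 minutes
```

Those numbers make the trade-off tangible: a 99.9% SLO gives you well under an hour of budget per month, which is exactly why you want alerts tied to burning it rather than to every component hiccup.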
Alerting when you're about to burn through your error budget is incredibly powerful. It means you're only alerted when there's a real risk of disappointing your users. This moves you away from "is this agent healthy?" to "is our service healthy enough to meet our promise?"
Example 2: From Individual Agent Errors to Collective Error Rate
Consider an API gateway agent fleet. Old alert:
```yaml
- alert: HighAgentErrorRate
  # status is typically a concrete code label (500, 502, ...), so match the 5xx class with a regex.
  expr: sum by (instance) (rate(agent_http_requests_total{status=~"5.."}[1m])) > 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Agent {{ $labels.instance }} is experiencing high 5xx errors"
    description: "Agent {{ $labels.instance }} has sustained more than 5 5xx errors per second for 2 minutes."
```
This is again problematic. A single agent might hit a bad upstream, or encounter a transient issue. If it's part of a load-balanced fleet, users might not even notice. What we care about is the overall error rate for the service.
Let's say our SLO for the API is 99.9% success rate over a 5-minute window. We can set up an error budget-based alert:
```yaml
- alert: ServiceErrorBudgetBurn
  expr: |
    sum(rate(agent_http_requests_total{status=~"5.."}[5m]))
      / sum(rate(agent_http_requests_total[5m])) > 0.001  # More than 0.1% errors
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "API service is burning error budget rapidly"
    description: "The overall 5xx error rate for the API service has been above 0.1% for 5 minutes, indicating a significant impact on our availability SLO."
```
This alert is much more impactful. It tells you that your service as a whole is in trouble, not just one of its many components.
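If you want to go one step further, a popular refinement is the multi-window, multi-burn-rate pattern: page only when the budget is burning much faster than sustainable, confirmed over both a long and a short window so the alert neither flaps nor lingers after recovery. A hedged sketch, reusing the same hypothetical `agent_http_requests_total` metric (the 14.4x factor is a commonly cited choice for a fast-burn page against a 0.1% budget, not a universal constant):

```yaml
# Sketch only: fire when the error rate exceeds 14.4x the sustainable burn rate,
# confirmed over both a 1h window (significance) and a 5m window (recency).
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(agent_http_requests_total{status=~"5.."}[1h]))
        / sum(rate(agent_http_requests_total[1h])) > (14.4 * 0.001)
    )
    and
    (
      sum(rate(agent_http_requests_total{status=~"5.."}[5m]))
        / sum(rate(agent_http_requests_total[5m])) > (14.4 * 0.001)
    )
  labels:
    severity: critical
  annotations:
    summary: "Fast error-budget burn for the API service"
```

The short window is what lets this alert clear quickly once the bleeding stops, which is a big quality-of-life win for whoever is on call.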
Context is King: Enriching Your Alerts
Beyond just triggering alerts based on SLOs, it's crucial to provide context. When an alert fires, the person on call needs to quickly understand:
- What is the impact? (e.g., "Users cannot log in," "Data processing is stalled.")
- Where is the problem? (e.g., "Agents in region A," "Database X," "Service Y.")
- What's the likely cause? (e.g., "High CPU," "Disk full," "Network partition.")
- What are the first steps to investigate? (e.g., links to dashboards, runbooks.)
This means your alert annotations shouldn't just repeat the alert expression. They should provide actionable information. Integrating with your monitoring dashboards (like Grafana, Datadog, etc.) to link directly to relevant graphs is invaluable. Even better, consider linking to a runbook or a troubleshooting guide specific to that alert.
For our ServiceErrorBudgetBurn alert, we could add:
```yaml
annotations:
  summary: "API service is burning error budget rapidly"
  description: "The overall 5xx error rate for the API service has been above 0.1% for 5 minutes, indicating a significant impact on our availability SLO. Investigate recent deployments or upstream dependencies. Grafana dashboard: [link-to-api-dashboard]. Runbook: [link-to-api-runbook]."
```
This small addition can shave minutes, sometimes hours, off incident response time.
Actionable Takeaways
So, where do you start if you're drowning in alerts and want to make a change? Here’s my advice:
- Audit Your Current Alerts: Go through every single alert you have. For each one, ask: "What problem does this alert tell me about? Is it directly impacting a user or a business goal? Does it contribute to an SLO?" If the answer isn't clear, consider demoting it to a dashboard metric or even removing it.
- Define Your SLIs and SLOs: This is the foundational step. Identify the key metrics that define the health and performance of your agent fleet from a user's perspective. Set clear, measurable objectives for these.
- Implement Error Budgeting: Once you have SLOs, calculate your error budget. Start building alerts that trigger when you're about to burn through that budget, not just when a single component hiccups.
- Prioritize Alert Severity: Not all alerts are created equal. Use severity levels (critical, warning, info) wisely. Only "critical" alerts should wake someone up. "Warning" alerts can go to a ticketing system or a less intrusive notification channel.
- Enrich Alert Context: Make sure your alerts provide clear, actionable information. Include links to dashboards, logs, and runbooks. The goal is to minimize the time an on-call engineer spends trying to figure out what's going on.
- Regularly Review and Refine: Your system evolves, and so should your alerting. Set up a regular cadence (monthly, quarterly) to review your alerts, especially after incidents. Did an alert fire that shouldn't have? Did an incident occur without an alert? Adjust accordingly.
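The severity-routing advice above can be wired directly into Alertmanager. A minimal sketch, assuming receiver names of my own invention (`pagerduty-oncall`, `team-tickets`) and whatever integrations your team actually uses:

```yaml
# Alertmanager routing sketch: only critical alerts page a human;
# everything else lands in a low-urgency channel.
route:
  receiver: team-tickets          # default for warning/info alerts
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall  # wakes someone up
receivers:
  - name: pagerduty-oncall
    # pagerduty_configs would go here
  - name: team-tickets
    # webhook/ticketing integration would go here
```

The point is structural: the routing tree, not the alert author's judgment in the moment, decides what is allowed to interrupt a human.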
Smarter alerting isn't just about reducing noise; it's about improving the reliability of your services, reducing engineer burnout, and ultimately, delivering a better experience for your users. It's a continuous process, but one that pays dividends many times over. Let's make 2026 the year we finally conquer alert fatigue.
That's all for now. Until next time, happy monitoring!