Alright, folks. Chris Wade here, back in the digital trenches, and today we’re talking about something that probably keeps half of you up at night, or at least makes you flinch every time your phone pings after hours: Alerts. Specifically, how we’re generally doing them wrong, and why it’s time for a major overhaul, especially in the agent monitoring space.
It’s 2026. We’ve got AI doing everything from writing our emails (badly, sometimes) to predicting stock market crashes (mostly inaccurately). Yet, when it comes to being notified that something important has gone sideways in our systems, we’re still stuck in the early 2010s. Flashing red lights, blaring sirens, and a cascade of irrelevant notifications that make us want to throw our phones into the nearest body of water. Sound familiar?
I’ve been there. Oh, have I been there. I remember one particularly gnarly Friday night, around 2 AM. I was deep in a coding flow state, trying to squash a tricky race condition in a new agent deployment. My phone, usually a silent sentinel, suddenly decided to imitate a fire alarm. PagerDuty. Slack. Email. All screaming about a “critical CPU spike” on a development server that, frankly, nobody cared about at 2 AM on a Friday. It was a false positive, of course. A forgotten cron job kicking off a heavy test script. But it broke my focus, spiked my blood pressure, and made me question all my life choices.
That night, I realized we weren’t just over-alerting; we were doing it mindlessly. We were treating every alert like a five-alarm blaze, regardless of its actual impact or urgency. And that’s a problem because when everything is critical, nothing is critical. Alert fatigue is real, and it’s costing us sleep, sanity, and ultimately, real incident response time.
The Problem with Our Current Alerting Paradigms
Most of us, when setting up monitoring for our agents, fall into a few common traps:
- The “Monitor Everything” Fallacy: We think more metrics equal better visibility. So we alert on CPU, memory, disk I/O, network latency, process count, log errors, queue depths, API response times, cosmic ray flux… you get the idea. While collecting all this data is good for observation and debugging, alerting on every single deviation is a recipe for disaster.
- Threshold Myopia: We set static thresholds and forget about them. “If CPU > 80% for 5 minutes, alert.” Sounds reasonable on paper, but what if 80% is normal for a brief burst during a data sync? Or what if a gradual increase to 75% over 12 hours is actually more indicative of a memory leak than a sudden spike to 90% that resolves in 30 seconds?
- Lack of Context and Correlation: An agent might be reporting high memory usage, but if it’s an agent on a development server that’s currently running a memory-intensive test suite, is that truly an issue? What if a specific agent type consistently uses more CPU but that’s expected behavior for its workload? Our alerts often lack the intelligence to understand the broader system state or the agent’s specific role.
- One-Size-Fits-All Notification Channels: Everything goes to PagerDuty. Or everything goes to Slack. Or everything goes to email. There’s no differentiation based on severity, impact, or even who needs to know. Your critical production agent failure should not trigger the same notification path as a minor warning on a staging agent.
These issues are particularly pronounced in agent monitoring because agents are often distributed, ephemeral, and diverse in their functions. A single, monolithic alerting strategy simply doesn’t cut it.
Rethinking Alerts: From Noise to Signal
So, how do we fix this? My current thinking, and what I’ve been implementing with some success, revolves around a few core principles:
1. Focus on Impact, Not Just Deviation
Before you even think about setting up an alert, ask yourself: “What is the actual impact if this metric crosses this threshold?” If the answer is “nothing immediately noticeable by a user or another critical system,” then it’s probably not an alert. It might be a warning, or something to track in a dashboard, but not something that wakes someone up.
For an agent, this means prioritizing metrics that directly affect its core function or the services it supports. For example:
- Agent Connectivity Loss: If an agent goes offline for more than X minutes, and it’s a critical production agent, that’s an alert. If it’s a non-critical agent on a test machine, maybe just a warning or an informational log.
- Data Processing Stalled: If an agent is responsible for processing a queue of items, and that queue depth isn’t decreasing (or is growing uncontrollably), that’s an alert. The impact is direct: data isn’t moving.
- Resource Exhaustion (with context): If an agent consistently hits 95% CPU AND its processing rate drops significantly, then it’s an alert. The ‘AND’ is crucial. High CPU alone might be fine if it’s still doing its job.
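The "data processing stalled" case above can be sketched as a Prometheus-style rule. This is illustrative only: the metric names `agent_queue_depth` (a gauge) and `agent_items_processed_total` (a counter) are assumptions, so adjust to whatever your agents actually export:

```yaml
alert: AgentQueueStalled
# Assumed metrics: agent_queue_depth (gauge) and
# agent_items_processed_total (counter).
expr: agent_queue_depth > 0 and increase(agent_items_processed_total[10m]) == 0
for: 10m
labels:
  severity: critical
annotations:
  summary: "Agent {{ $labels.agent_id }} has a backlog but is not processing items."
  description: "Queue depth is nonzero while the processed-items counter has not moved for 10 minutes."
```

Note the shape of the condition: it alerts on the combination of "there is work" and "no work is being done," which is exactly the impact-first framing, rather than alerting on queue depth alone.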
2. Dynamic Thresholds and Anomaly Detection
Static thresholds are a blunt instrument. They don’t account for normal fluctuations, seasonality, or evolving system behavior. Instead, we should move towards dynamic thresholds or, even better, anomaly detection.
Many modern monitoring platforms (and even open-source tools) offer some form of baseline modeling. They learn what “normal” looks like for your agent metrics over time and alert you when there’s a significant deviation from that learned pattern. This is far more effective than guessing a magic number.
For example, instead of “CPU > 80%,” consider something like:
```yaml
alert: AgentHighCPUAnomaly
expr: avg by (agent_id) (agent_cpu_usage_percentage) > 1.5 * avg by (agent_id) (avg_over_time(agent_cpu_usage_percentage[1w]))
for: 5m
labels:
  severity: warning
annotations:
  summary: "Agent {{ $labels.agent_id }} is experiencing abnormally high CPU usage."
  description: "CPU usage for agent {{ $labels.agent_id }} is 50% higher than its weekly average for the last 5 minutes. Consider checking its workload."
```
This Prometheus-style alert snippet tries to catch when an agent’s CPU usage is 50% higher than its average over the last week. It’s a simple example, but it’s a step up from a fixed 80%.
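If your stack doesn't speak PromQL, the same idea can be sketched in a few lines of plain Python. The 1.5x multiplier mirrors the rule above; the window size is arbitrary, and this is a toy rolling-average detector, not a production anomaly model:

```python
from collections import deque

class BaselineDetector:
    """Flags a sample as anomalous when it exceeds the rolling
    average of recent samples by a configurable multiplier."""

    def __init__(self, window=1000, multiplier=1.5):
        self.samples = deque(maxlen=window)  # rolling history of readings
        self.multiplier = multiplier

    def observe(self, value):
        """Record a sample; return True if it is anomalously high
        relative to the baseline learned so far."""
        baseline = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(value)
        if baseline is None:
            # No history yet -- never alert on the first sample.
            return False
        return value > baseline * self.multiplier
```

In practice you would feed each CPU reading into `observe()` and pair a `True` result with a debounce (the equivalent of the rule's `for: 5m`) before actually notifying anyone.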
3. Intelligent Alert Routing and Escalation
Not every alert needs to wake up the on-call engineer. We need tiered alerting based on severity and impact.
- Critical: PagerDuty (or similar), immediate high-priority Slack notification. Reserved for production-impacting issues.
- Warning: Low-priority Slack channel, email to relevant team. Something to look at during working hours.
- Informational: Log entry, dashboard change. No active notification, just for historical tracking or future debugging.
Furthermore, consider adding escalation policies. If a critical alert isn’t acknowledged or resolved within a certain timeframe, it escalates to a wider group or a higher authority. And for agents, you might even route alerts based on the agent’s type or the team responsible for it. For instance, if you have agents collecting network telemetry, and another set for database health, the alerts should go to the respective network and database teams.
Here’s a conceptual example of that routing logic (the `send_*` helpers are placeholders for your actual notification integrations):
```python
def process_alert(alert_data):
    # Route by environment and severity; agent_type can feed
    # team-specific routing as described above.
    agent_type = alert_data.get('agent_type')
    severity = alert_data.get('severity')
    environment = alert_data.get('environment')

    if environment == 'production' and severity == 'critical':
        send_to_pagerduty(alert_data)
        send_to_slack_channel('#prod-incidents', alert_data)
    elif environment == 'production' and severity == 'warning':
        send_to_slack_channel('#prod-warnings', alert_data)
        send_email_to_team_lead(alert_data)
    elif environment == 'development' and severity == 'critical':
        send_to_slack_channel('#dev-alerts', alert_data)
    else:  # Default for less critical dev/staging alerts
        log_alert(alert_data)
```
This simple logic ensures that only truly critical production issues trigger the most disruptive notifications, while less urgent or non-production issues are routed appropriately.
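The escalation side can be sketched just as simply. A sketch under stated assumptions: the 15-minute window is made up, and the alert is a plain dict with `created_at` / `acknowledged_at` timestamps:

```python
from datetime import datetime, timedelta

ESCALATION_WINDOW = timedelta(minutes=15)  # assumption: tune per team

def needs_escalation(alert, now):
    """Escalate a critical alert that has sat unacknowledged
    for longer than the escalation window."""
    if alert["severity"] != "critical":
        return False
    if alert.get("acknowledged_at") is not None:
        return False  # someone is already on it
    return now - alert["created_at"] > ESCALATION_WINDOW
```

A scheduler would run this check periodically over open alerts and notify the wider group (or the higher authority) for any that return `True`.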
4. Self-Healing and Automated Remediation (Where Possible)
This is the holy grail. If an alert triggers a known, repeatable issue, can we automate its resolution? For agents, this could mean:
- If an agent process crashes, automatically restart it.
- If an agent reports low disk space (and it’s a non-critical log volume), automatically purge older logs.
- If an agent stops reporting, automatically attempt to ping its host or check its service status.
Of course, this requires careful consideration and testing. You don’t want to automate yourself into a bigger hole. But for common, low-risk issues, automated remediation can significantly reduce alert noise and operational burden.
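A minimal sketch of the "restart the crashed agent, but don't loop forever" guardrail. Everything here is hypothetical: `restart_agent` and `page_oncall` stand in for your process supervisor and paging hooks, and the restart cap of 3 is an arbitrary choice:

```python
MAX_RESTARTS = 3  # assumption: attempts before a human gets paged

def remediate_crash(agent_id, restart_counts, restart_agent, page_oncall):
    """Try an automatic restart; after MAX_RESTARTS consecutive
    failures, stop self-healing and page a human instead."""
    attempts = restart_counts.get(agent_id, 0)
    if attempts >= MAX_RESTARTS:
        page_oncall(f"Agent {agent_id} crashed {attempts} times; manual intervention needed")
        return False
    restart_counts[agent_id] = attempts + 1
    restart_agent(agent_id)
    return True
```

The cap is the important part: it keeps the automation from masking a genuinely broken agent by restarting it forever, which is exactly the "automating yourself into a bigger hole" failure mode.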
Actionable Takeaways for Your Agent Monitoring
Alright, enough theory. Here’s what you can start doing today to get your alerts under control:
- Audit Your Existing Alerts: Go through every single alert you have. For each one, ask: “What happens if this triggers? What’s the actual business impact? Who needs to know? Can this be automated?” Be ruthless. If it’s not truly impactful, demote it to a warning or simply a dashboard metric.
- Implement Tiered Notifications: Stop sending everything to everyone. Define clear severity levels (Critical, Warning, Informational) and map them to distinct notification channels and escalation paths.
- Embrace Dynamic Thresholds: If your monitoring system supports it, start experimenting with anomaly detection or dynamic baselines for key agent metrics. If not, consider tools that do.
- Add Context to Alerts: Ensure your alerts include enough information to be actionable. Agent ID, type, environment, affected service, and a link to relevant dashboards or runbooks are essential.
- Review and Refine Regularly: Your systems evolve, and so should your alerts. Schedule quarterly reviews of your alerting strategy. Are you still getting false positives? Are you missing critical issues? Adjust accordingly.
- Document Your Runbooks: For every alert, there should be a clear, concise runbook outlining the steps to diagnose and resolve the issue. This reduces the cognitive load during an incident.
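Tying those last two points together: if you use Prometheus-style rules, the context and the runbook link can ride along in the alert's own annotations. The label names and URLs below are purely illustrative:

```yaml
annotations:
  summary: "Agent {{ $labels.agent_id }} ({{ $labels.agent_type }}) is down in {{ $labels.environment }}"
  dashboard: "https://grafana.example.com/d/agents?var-agent={{ $labels.agent_id }}"
  runbook_url: "https://wiki.example.com/runbooks/agent-down"
```

That way the on-call engineer lands on the dashboard and the runbook in one click, instead of reconstructing context at 2 AM.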
Getting alerts right is less about technology and more about discipline and a deep understanding of your systems and their impact. It’s about moving from a mindset of “catch every error” to “catch every important error, efficiently.” Your sleep schedule, and your team’s sanity, will thank you for it.
That’s it for me today. Go forth and conquer your alert fatigue! And as always, let me know your thoughts and experiences in the comments below.