Hey everyone, Chris Wade here, back on agntlog.com. It’s March 2026, and honestly, if you’re still thinking about monitoring your agents the way we did five years ago, you’re leaving money, sanity, and maybe even your job on the table. The world has moved on, and so should your strategy. Today, I want to talk about something specific, something that’s been bugging me (and saving my bacon) a lot lately: Observability for Agent States, Not Just Metrics.
I know, “observability” is a buzzword that gets thrown around like a frisbee at a tech picnic. But hear me out. For us, in the agent monitoring space, it’s not just a fancy term; it’s a fundamental shift in how we understand our systems. We’re not just looking at CPU usage or memory consumption anymore. We’re trying to understand the *internal state* of our agents, their journey through a workflow, and the subtle signals that tell us something is about to go sideways – or already has.
My journey into this started about a year and a half ago. We had a new client, a massive call center operation with thousands of agents using our custom CRM integration. The old monitoring approach was failing spectacularly. We had dashboards glowing red for “high CPU” or “low disk space” on agent workstations, but the actual problem, the one causing customer complaints and missed SLAs, was deeper. Agents were getting stuck in a “post-call wrap-up” state indefinitely, or their CTI (Computer Telephony Integration) wasn’t correctly registering outbound calls. The traditional metrics were telling us the machines were fine, but the *agents* (and by extension, the business) were clearly not.
That’s when I realized we needed to stop just monitoring infrastructure and start observing agent behavior and state transitions. It’s not enough to know if an agent’s application is running; we need to know *what* that application thinks the agent is doing, and if that aligns with reality.
The Problem with Just Monitoring Metrics
Think about your car. If your engine light comes on, that’s a metric. It tells you *something* is wrong. But it doesn’t tell you *what*. Is it a loose gas cap? A failing oxygen sensor? A catastrophic engine failure? You need more context, more internal visibility, to truly understand the problem.
Traditional monitoring is like that engine light. It tells you *if* a resource is constrained or *if* a service is down. But for complex, stateful agents, especially those interacting with multiple systems (CRM, telephony, knowledge base, custom scripts), a simple metric often doesn’t give you the full picture. An agent might be “logged in” (according to your basic presence monitoring) but effectively frozen, unable to perform any actions, because of a subtle deadlock in their application’s state machine.
We saw this vividly with our call center client. Our basic monitoring showed agents were logged into the CRM. Their network connectivity was fine. CPU was normal. Yet, calls weren’t being processed. Digging deeper, we found that a specific sequence of actions (transferring a call, then immediately trying to access a customer’s payment history) was causing the CRM’s internal state for that agent to get stuck in a “waiting for external service response” loop, even after the external service had responded. The application wasn’t crashing; it was just frozen in a non-responsive state from the agent’s perspective. No metric would have caught that.
Observability for Agent States: What It Means
For me, observability in this context means instrumenting your agent applications and workflows to emit detailed, structured data about their internal state changes, events, and context. It’s about asking:
- What is the agent *currently doing* according to the application? (e.g., “On Call,” “Wrap-up,” “Available,” “Training,” “Paused”)
- What was the *previous state*?
- What *events* triggered the state change? (e.g., “Call Incoming,” “Call Answered,” “Manual Pause Button Clicked”)
- What *context* is associated with this state? (e.g., call ID, customer ID, reason for pause, duration in current state)
This isn’t just logging errors; it’s logging the *journey* of an agent through their workday, step by step, with rich context.
Example 1: Tracking Agent Workflow States
Let’s say you have a custom agent desktop application. Instead of just monitoring if the process is running, you want to know its internal state. You can emit structured logs or traces whenever a key state changes.
// Inside your agent desktop application (plain JavaScript; the original
// pseudo-code "enum" becomes a frozen map of state names)
const AgentState = Object.freeze({
  LoggedIn: "LoggedIn",
  Available: "Available",
  OnCall: "OnCall",
  WrapUp: "WrapUp",
  Paused: "Paused",
  Offline: "Offline"
});

let currentAgentState = AgentState.Offline;

function transitionState(newState, eventDetails = {}) {
  const timestamp = new Date().toISOString();
  const agentId = getAgentId();            // from application context
  const sessionId = getCurrentSessionId(); // from application context

  // Log the state transition with rich context
  console.log(JSON.stringify({
    timestamp: timestamp,
    agentId: agentId,
    sessionId: sessionId,
    oldState: currentAgentState,
    newState: newState,
    eventType: eventDetails.type || "STATE_CHANGE",
    eventData: eventDetails.data || {}
  }));

  currentAgentState = newState;
}

// Example usage:
// When an agent answers a call:
transitionState(AgentState.OnCall, { type: "CALL_ANSWERED", data: { callId: "12345", customerId: "98765" } });
// When an agent goes into wrap-up:
transitionState(AgentState.WrapUp, { type: "CALL_ENDED", data: { callId: "12345", dispositionTime: 30 } });
// When an agent manually pauses:
transitionState(AgentState.Paused, { type: "MANUAL_PAUSE", data: { reason: "Break", duration: 15 } });
This isn’t rocket science, right? But the power comes when you collect these events centrally. You can then build dashboards that show real-time agent states, identify agents stuck in unexpected states, and even analyze state transition patterns to spot workflow bottlenecks.
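To make that concrete, here's a minimal sketch of the kind of analysis that falls out once the events are collected. `timeInStates` is my own helper, not part of any library: it turns a chronologically sorted stream of the transition events above into time-spent-per-state totals, which is exactly what a "time in wrap-up" dashboard panel needs.

```javascript
// Compute how long an agent spent in each state, given a chronologically
// sorted list of transition events shaped like the JSON emitted above.
function timeInStates(events) {
  const totals = {};
  // Each event opens a state; the next event's timestamp closes it.
  for (let i = 0; i < events.length - 1; i++) {
    const state = events[i].newState;
    const ms = Date.parse(events[i + 1].timestamp) - Date.parse(events[i].timestamp);
    totals[state] = (totals[state] || 0) + ms;
  }
  return totals; // milliseconds per state; the last (still open) state is excluded
}
```

The same event stream can feed histograms, per-team comparisons, or the stuck-agent detection we'll get to in Example 2.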
Building Observability into Your Agent Tooling
This isn’t just about adding a few print statements. It requires a thoughtful approach to instrumentation. Here’s how we tackled it:
1. Define Key Agent States and Transitions
Before you write any code, sit down with your product managers, team leads, and even some agents. Map out the critical states an agent can be in and the legitimate ways they transition between them. This becomes your “ground truth” for what a healthy agent workflow looks like. Document this explicitly.
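Once that map exists on paper, it can live in code too. Here's a sketch; the transition table below is a hypothetical example, not our client's actual workflow, so substitute your own map. The payoff is that any transition not in the table is, by definition, an anomaly worth logging.

```javascript
// Hypothetical transition table: which states an agent may legally
// move to from each state. Your documented workflow map goes here.
const VALID_TRANSITIONS = {
  Offline:   ["LoggedIn"],
  LoggedIn:  ["Available", "Offline"],
  Available: ["OnCall", "Paused", "Offline"],
  OnCall:    ["WrapUp"],
  WrapUp:    ["Available", "Paused"],
  Paused:    ["Available", "Offline"]
};

// Returns true only if the workflow map allows moving from -> to.
function isValidTransition(from, to) {
  return (VALID_TRANSITIONS[from] || []).includes(to);
}
```

With this in place, your state-transition function can flag any move that isn't in the table, which is exactly the kind of event you'll want to alert on later.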
2. Instrument Everything Relevant
In your agent applications (desktop, web, backend services supporting agents), emit structured logs or traces at every significant state change or event. Focus on:
- State Changes: When an agent’s status changes (e.g., “Available” to “On Call”).
- Key Actions: When an agent performs a critical action (e.g., “Transferred Call,” “Updated Customer Record,” “Submitted Form”).
- External System Interactions: When the agent application interacts with a CRM, telephony system, or database. Record the request, response, and any errors.
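For that last bullet in particular, a thin wrapper keeps the instrumentation in one place instead of scattered around every CRM or telephony call. A sketch, assuming an async function performs the actual external call; `instrumented` and the field names are my own illustration, not a specific library's API:

```javascript
// Wrap any call to an external system so that the operation, outcome,
// latency, and error (if any) are all emitted as one structured event.
async function instrumented(systemName, operation, fn) {
  const started = Date.now();
  const base = { timestamp: new Date().toISOString(), system: systemName, operation: operation };
  try {
    const result = await fn();
    console.log(JSON.stringify({ ...base, outcome: "success", durationMs: Date.now() - started }));
    return result;
  } catch (err) {
    console.log(JSON.stringify({ ...base, outcome: "error", error: String(err), durationMs: Date.now() - started }));
    throw err; // instrumentation must never swallow the failure
  }
}

// Example usage (fetchCustomer is whatever your CRM client exposes):
// const customer = await instrumented("CRM", "getCustomer", () => fetchCustomer("98765"));
```

Had we had this wrapper earlier, the "waiting for external service response" loop from the call center story would have shown up immediately as a response that was logged but never acted on.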
3. Use Structured Logging
JSON is your friend here. Don’t just dump plain text. Structured logs make it infinitely easier to parse, filter, and analyze your data later. Include common attributes in every log entry (timestamp, agent ID, session ID, application version) and specific attributes for each event type.
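One way to enforce that consistency is a small logger factory that bakes the common attributes in once, so call sites only supply what's unique to the event. A sketch (`makeAgentLogger` and the field names are illustrative):

```javascript
// Build a logger that stamps every entry with the common attributes
// (agentId, sessionId, appVersion), merged with per-event data.
function makeAgentLogger(common) {
  return function log(eventType, data = {}) {
    const entry = { timestamp: new Date().toISOString(), ...common, eventType, ...data };
    console.log(JSON.stringify(entry));
    return entry; // returned so callers/tests can inspect it
  };
}

// Example usage:
// const log = makeAgentLogger({ agentId: "a-17", sessionId: "s-1", appVersion: "2.3.0" });
// log("CALL_ANSWERED", { callId: "12345" });
```

Now every entry is guaranteed to carry the fields your queries will filter on, which is the difference between a five-second Kibana search and an afternoon of grep.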
4. Centralize and Analyze
Ship all these logs to a central logging system (Elasticsearch, Splunk, Loki, DataDog, etc.). This is where the magic happens. Now you can:
- Build Real-time Dashboards: Visualize the current state of all agents, identify outliers (e.g., agents stuck in “Wrap-up” for too long).
- Query and Filter: Quickly find all events for a specific agent, customer, or call ID.
- Alert on Anomalies: Set up alerts for unexpected state transitions or durations (e.g., “Alert if an agent is in ‘Paused’ state for more than 30 minutes without a valid reason code”).
- Trace Workflows: Follow an agent’s journey through a complex process by linking related log entries.
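That last bullet, tracing, can be as simple as filtering on a shared correlation ID. A sketch assuming events shaped like Example 1's output, where the call ID rides along in eventData:

```javascript
// Reconstruct an agent's journey through one call by filtering on the
// shared callId and sorting the matching events chronologically.
function traceByCallId(events, callId) {
  return events
    .filter(e => e.eventData && e.eventData.callId === callId)
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}
```

The same pattern works for any correlation key: customer ID, ticket number, or a session ID that spans CRM and telephony events.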
Example 2: Detecting Stuck Agents with Log Analysis
Using a logging system like Elasticsearch with Kibana, you could set up a query to find agents stuck in an undesirable state. Imagine you’re looking for agents who entered “WrapUp” state but haven’t transitioned out of it for more than 5 minutes.
// Elasticsearch query (Kibana Discover or Dev Tools)
{
  "query": {
    "bool": {
      "must": [
        { "term": { "newState.keyword": "WrapUp" } },
        { "range": { "timestamp": { "lte": "now-5m" } } }
      ],
      "must_not": [
        { "exists": { "field": "nextState" } } // assumes you also log a "nextState" or "transitionedOut" event
      ]
    }
  },
  "aggs": {
    "stuck_agents": {
      "terms": {
        "field": "agentId.keyword",
        "size": 100
      }
    }
  }
}
This query would give you a list of agent IDs that entered the “WrapUp” state more than 5 minutes ago and haven’t had a subsequent state change event logged. This is a very powerful way to proactively identify and address issues before they escalate.
My Takeaways and Actionable Steps
If you’re managing agents, whether they’re human or automated bots, you absolutely need to move beyond basic uptime monitoring. Here’s what I recommend:
- Stop Monitoring Just Resources, Start Observing States: Shift your focus from “Is the CPU high?” to “What is the agent application *doing* right now, and is that correct?”
- Map Your Agent Workflows: Document every significant state an agent can be in and the valid transitions. This is your blueprint for observability.
- Instrument Your Agent Applications: Build structured logging or tracing directly into your agent tools. Every critical state change or action should emit a detailed event. Use JSON.
- Centralize Your Data: Get those events into a system where you can query, visualize, and alert on them.
- Build Proactive Alerts: Don’t wait for agents to complain. Set up alerts for unexpected state durations or invalid transitions. For example, alert if an agent’s average time in “WrapUp” exceeds 180 seconds, or if an agent tries to accept a call while in the “Paused” state.
- Educate Your Teams: Show your support, operations, and product teams how to use these new dashboards and queries. The more eyes on the data, the faster you’ll spot problems.
This isn’t a one-and-done project. It’s an ongoing process of refining your instrumentation, improving your dashboards, and asking deeper questions about your agent’s experience. But I promise you, the investment pays off. We went from reactive firefighting to proactively identifying and resolving agent-impacting issues within minutes, sometimes even before the agent themselves noticed a problem. It improved our client’s SLA adherence, reduced agent frustration, and frankly, made my job a lot less stressful. And that, in March 2026, is a win in my book.
Got thoughts on agent observability? Hit me up in the comments below!
Related Articles
- AI agent log-driven development
- My Debugging Strategy: From Chaos to Calm
- AI agent logging in production