Observability for LLM Apps: A Practical Case Study - AgntLog

Observability for LLM Apps: A Practical Case Study

📖 10 min read · 1,939 words · Updated Mar 26, 2026

The Rise of LLM Applications and the Need for Observability

The landscape of software development has been dramatically reshaped by the large language model (LLM) revolution. From sophisticated chatbots and intelligent content generators to code assistants and data analysis tools, LLMs are being integrated into an ever-expanding array of applications. This rapid adoption, while exciting, introduces a new class of challenges, particularly in understanding and maintaining these systems in production. Traditional monitoring tools, designed for deterministic, rule-based software, often fall short in capturing the nuances of LLM behavior.

Enter observability. Observability, in the context of LLM applications, is the ability to understand the internal state of an LLM-powered system from its external outputs. It goes beyond simple health checks to provide deep insights into how the LLM is interpreting prompts, generating responses, managing context, and interacting with external tools or data sources. Without solid observability, debugging performance issues, ensuring fairness, identifying hallucinations, or optimizing cost can become a daunting, almost impossible, task. This article presents a practical case study on implementing observability for an LLM application, demonstrating key principles and tools.

Case Study: ‘DocuChat’ – An Internal Knowledge Base Assistant

Let’s consider a hypothetical application called ‘DocuChat’. DocuChat is an internal knowledge base assistant designed to help employees quickly find answers within a vast collection of internal documents (Confluence pages, Slack archives, internal wikis, etc.). It uses a Retrieval-Augmented Generation (RAG) architecture:

  1. User Query: An employee asks a question (e.g., “How do I request a new laptop?”).
  2. Retrieval: The system searches a vector database (populated with embeddings of internal documents) to find relevant document chunks.
  3. Augmentation: These retrieved chunks, along with the original user query, are provided as context to an LLM.
  4. Generation: The LLM synthesizes an answer based on the provided context and the user’s query.
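
The four steps above can be sketched as a single function. This is a minimal illustration, not DocuChat's actual code: `vector_db` and `llm` are hypothetical stand-ins for the real Pinecone/Weaviate and OpenAI clients, with assumed `search` and `complete` methods.

```python
# Minimal sketch of the RAG flow. `vector_db` and `llm` are hypothetical
# stand-ins for the real retrieval and generation clients.

def answer_query(query: str, vector_db, llm, k: int = 5) -> str:
    # 2. Retrieval: find the k most relevant document chunks
    chunks = vector_db.search(query, top_k=k)
    # 3. Augmentation: build a prompt containing the retrieved context
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # 4. Generation: the LLM synthesizes an answer from query + context
    return llm.complete(prompt)
```

The key design point is that the prompt contains *only* retrieved context plus the question, which is what makes per-query logging of the retrieved chunks (shown below) so valuable for debugging.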

DocuChat uses OpenAI’s GPT-4 for generation and a self-hosted embedding model for retrieval. It’s built using Python, LangChain, and a vector database like Pinecone or Weaviate. Initially, the development team focused on functionality, but as the application gained users, a host of operational questions arose:

  • Why are some responses slow?
  • Is the LLM hallucinating, or is the retrieval failing?
  • Are users satisfied with the answers?
  • How much is this costing us per query?
  • Are we hitting API rate limits?
  • What types of questions are users asking most frequently?

These questions highlight the critical need for observability.

Pillars of LLM Observability for DocuChat

For DocuChat, we’ll focus on the three traditional pillars of observability, tailored for LLM applications:

  1. Logs: Structured records of events.
  2. Metrics: Aggregated numerical measurements over time.
  3. Traces: End-to-end views of requests across distributed systems.

Additionally, we’ll consider a fourth, crucial pillar for LLMs: Human Feedback/Evaluation.

1. Logs: The ‘What Happened’

Traditional application logs are vital, but for LLM apps, they need to capture specific LLM-centric details. For DocuChat, we instrumented various stages of the RAG pipeline to emit structured logs.

Example Log Structure (JSON):

[
  {
    "timestamp": "2023-10-27T10:30:00Z",
    "service": "docuchat-rag-api",
    "level": "INFO",
    "event_type": "user_query",
    "session_id": "abcd-1234-efgh",
    "user_id": "user_123",
    "query": "How do I request a new laptop?",
    "metadata": {
      "user_agent": "Mozilla/5.0...",
      "ip_address": "192.168.1.10"
    }
  },
  {
    "timestamp": "2023-10-27T10:30:01Z",
    "service": "docuchat-rag-api",
    "level": "INFO",
    "event_type": "retrieval_start",
    "session_id": "abcd-1234-efgh",
    "query": "How do I request a new laptop?",
    "vector_db_query": "laptop request policy",
    "k_value": 5
  },
  {
    "timestamp": "2023-10-27T10:30:02Z",
    "service": "docuchat-rag-api",
    "level": "INFO",
    "event_type": "retrieval_complete",
    "session_id": "abcd-1234-efgh",
    "retrieved_docs_count": 3,
    "retrieved_docs": [
      {"id": "doc-456", "source": "confluence/laptop-policy", "score": 0.89},
      {"id": "doc-789", "source": "slack/it-announcements", "score": 0.81},
      {"id": "doc-101", "source": "wiki/hardware-requests", "score": 0.74}
    ],
    "retrieval_duration_ms": 1000
  },
  {
    "timestamp": "2023-10-27T10:30:03Z",
    "service": "docuchat-rag-api",
    "level": "INFO",
    "event_type": "llm_generation_start",
    "session_id": "abcd-1234-efgh",
    "model_name": "gpt-4",
    "temperature": 0.7,
    "prompt_tokens": 500,
    "system_prompt_hash": "abc123def456"
  },
  {
    "timestamp": "2023-10-27T10:30:08Z",
    "service": "docuchat-rag-api",
    "level": "INFO",
    "event_type": "llm_generation_complete",
    "session_id": "abcd-1234-efgh",
    "response": "To request a new laptop, please visit the IT portal...",
    "completion_tokens": 150,
    "total_tokens": 650,
    "llm_duration_ms": 5000,
    "finish_reason": "stop"
  },
  {
    "timestamp": "2023-10-27T10:30:09Z",
    "service": "docuchat-rag-api",
    "level": "INFO",
    "event_type": "response_sent",
    "session_id": "abcd-1234-efgh",
    "total_request_duration_ms": 9000
  }
]

By centralizing these logs (e.g., using ELK Stack, Splunk, Datadog), the team can:

  • Filter by session_id to reconstruct an entire interaction.
  • Identify slow retrieval or LLM calls.
  • Analyze common queries and their outcomes.
  • Debug issues by examining the exact prompt provided to the LLM and the retrieved documents.
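
Emitting logs in this shape needs only the standard library. Here is a minimal sketch using Python's `logging` module with a custom JSON formatter; the `service` name and `log_event` helper are illustrative choices, not a specific library's API.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object, ready to ship to Loki/ELK."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "service": "docuchat-rag-api",
            "level": record.levelname,
        }
        # Merge structured fields passed in via the `extra=` mechanism
        payload.update(getattr(record, "event", {}))
        return json.dumps(payload)

logger = logging.getLogger("docuchat")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_event(event_type: str, session_id: str, **fields):
    """Emit one structured RAG-pipeline event, e.g. retrieval_complete."""
    logger.info(event_type, extra={"event": {
        "event_type": event_type, "session_id": session_id, **fields,
    }})
```

A call like `log_event("retrieval_complete", "abcd-1234-efgh", retrieved_docs_count=3)` then produces one line matching the structure above.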

2. Metrics: The ‘How Much’ and ‘How Fast’

Metrics provide aggregated, quantifiable insights into the system’s performance and usage. For DocuChat, we track key metrics using Prometheus and visualize them with Grafana.

Key LLM-Specific Metrics:

  • Latency:
    • docuchat_total_request_duration_seconds_sum/_count (overall system response time)
    • docuchat_retrieval_duration_seconds_sum/_count
    • docuchat_llm_api_duration_seconds_sum/_count
  • Token Usage & Cost:
    • docuchat_llm_prompt_tokens_total (counter)
    • docuchat_llm_completion_tokens_total (counter)
    • docuchat_llm_total_tokens_cost_usd_total (calculated based on model pricing)
  • API Health:
    • docuchat_llm_api_calls_total (counter, labeled by model_name, status_code)
    • docuchat_vector_db_queries_total (counter, labeled by status_code)
    • docuchat_llm_api_rate_limit_errors_total (counter)
  • Retrieval Performance:
    • docuchat_retrieved_documents_count_sum/_count (average documents retrieved per query)
    • docuchat_retrieved_documents_avg_score (gauge/summary)
  • User Engagement:
    • docuchat_successful_responses_total (counter)
    • docuchat_failed_responses_total (counter)
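
The cost counter deserves a note: it is derived, not observed. A sketch of the per-query calculation that feeds `docuchat_llm_total_tokens_cost_usd_total` is below; the per-1K-token prices are placeholders for illustration, not OpenAI's actual rates, which change over time.

```python
# Illustrative per-1K-token prices. Treat these as placeholders and load
# real, current pricing from configuration.
PRICE_PER_1K = {
    "gpt-4": {"prompt": 0.03, "completion": 0.06},
}

def query_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """USD cost of one LLM call; add the result to the cost counter."""
    p = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * p["prompt"] \
         + (completion_tokens / 1000) * p["completion"]

# e.g. the logged query above: 500 prompt + 150 completion tokens
# -> 0.015 + 0.009 = $0.024 at these placeholder rates
```

Incrementing a Prometheus counter by this value after every `llm_generation_complete` event gives real-time spend per model label.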

These metrics enable the team to:

  • Set up alerts for high latency or increasing error rates.
  • Monitor the financial cost of LLM usage in real-time.
  • Identify bottlenecks in the RAG pipeline (e.g., slow retrieval vs. slow LLM).
  • Understand the distribution of retrieved document scores, indicating retrieval effectiveness.

3. Traces: The ‘Path Taken’

Traces provide a holistic view of a single request’s journey through a distributed system. For DocuChat, OpenTelemetry is used to instrument the LangChain pipeline and any external API calls. Each step in the RAG process becomes a span within a trace.

Example Trace Flow for DocuChat:

  1. Span: User Request (Root Span)
    • Attributes: user_id, query, session_id
  2. Span: Retrieval
    • Attributes: vector_db_query, k_value, retrieval_engine
    • Child Span: Vector DB Query
      • Attributes: query_vector_size, response_time_ms
    • Child Span: Document Pre-processing (if any)
      • Attributes: num_docs_processed
    • Events: e.g. “Retrieved 3 documents”, or “No relevant documents found” when retrieval misses
  3. Span: LLM Generation
    • Attributes: model_name, temperature, prompt_tokens, completion_tokens
    • Child Span: OpenAI API Call
      • Attributes: api_endpoint, status_code, request_id
    • Events: “LLM started generation”, “LLM finished generation”, “LLM hallucination detected (low confidence)”
  4. Span: Post-processing & Response
    • Attributes: response_length_chars, sentiment_score
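
In production this structure comes from the OpenTelemetry SDK, but the parent/child mechanics are easy to see without any dependency. The toy tracer below is a stdlib-only sketch that mimics how nested spans record names, attributes, durations, and parent links; it is for illustration, not a replacement for OpenTelemetry.

```python
import time
from contextlib import contextmanager

finished = []  # closed spans, innermost first (like an exporter would receive)
_stack = []    # currently open spans: the parent chain

@contextmanager
def span(name: str, **attributes):
    """Record one timed span; nesting establishes parent/child links."""
    record = {
        "name": name,
        "attributes": attributes,
        "parent": _stack[-1]["name"] if _stack else None,
        "start": time.monotonic(),
    }
    _stack.append(record)
    try:
        yield record
    finally:
        _stack.pop()
        record["duration_ms"] = (time.monotonic() - record["start"]) * 1000
        finished.append(record)

# The DocuChat trace flow from the text:
with span("user_request", session_id="abcd-1234-efgh"):
    with span("retrieval", k_value=5):
        with span("vector_db_query"):
            pass  # vector DB round-trip would happen here
    with span("llm_generation", model_name="gpt-4"):
        with span("openai_api_call"):
            pass  # OpenAI HTTP call would happen here
```

Swapping `span(...)` for `tracer.start_as_current_span(...)` from the OpenTelemetry Python SDK yields the same tree, exported to Jaeger instead of a list.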

Using a tracing backend (like Jaeger, Honeycomb, Datadog APM), the team can:

  • Visually inspect the exact sequence of operations for any given query.
  • Pinpoint which stage of the RAG pipeline is consuming the most time.
  • Understand the context passed to the LLM (retrieved documents) for a specific trace.
  • Identify errors or unexpected behaviors within specific spans.

4. Human Feedback & Evaluation: The ‘How Good’

LLMs, by their nature, are probabilistic. While logs, metrics, and traces tell us about the system’s mechanics, they don’t always tell us about the quality of the LLM’s output. This is where human feedback and automated evaluation metrics become crucial.

For DocuChat, we implemented:

  1. User Upvote/Downvote Buttons: Simple feedback mechanism on each response. Downvotes trigger an alert for manual review if a certain threshold is met.
  2. “Report Issue” Feature: Allows users to provide free-text feedback on hallucinations, incorrect answers, or irrelevant information.
  3. Golden Dataset Evaluation: A set of known questions with expert-verified answers. Regularly run DocuChat against this dataset and evaluate responses using metrics like:
    • RAGAS: A framework specifically for RAG evaluation, measuring aspects like faithfulness, answer relevance, context precision, and context recall.
    • Semantic Similarity: Using embedding models to compare the generated answer with the golden answer.
    • Keyword Overlap: Simple checks for presence of key terms.
  4. LLM-as-a-Judge: In some cases, a more powerful LLM (e.g., GPT-4) can be used to evaluate the output of a less powerful LLM (e.g., GPT-3.5) against a reference or for specific criteria (e.g., conciseness, tone).
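
The simplest of these signals, keyword overlap, fits in a few lines. The sketch below scores a generated answer against a golden answer with Jaccard word overlap and flags low scorers for review; the `0.3` threshold and the dataset field names (`question`, `golden_answer`) are illustrative assumptions.

```python
def keyword_overlap(answer: str, golden: str) -> float:
    """Jaccard overlap between the word sets of a generated answer and the
    expert-verified golden answer -- the crudest evaluation signal above."""
    a, g = set(answer.lower().split()), set(golden.lower().split())
    return len(a & g) / len(a | g) if (a | g) else 1.0

def evaluate(golden_dataset, answer_fn, threshold=0.3):
    """Run the app over the golden dataset; return (question, score) pairs
    whose overlap falls below the review threshold."""
    flagged = []
    for item in golden_dataset:
        score = keyword_overlap(answer_fn(item["question"]), item["golden_answer"])
        if score < threshold:
            flagged.append((item["question"], score))
    return flagged
```

Semantic-similarity and RAGAS scoring slot into the same loop: only the scoring function changes, so regressions can be tracked per question across releases.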

Integrating this feedback allows the team to:

  • Directly measure user satisfaction and identify problematic areas.
  • Iteratively improve the RAG pipeline (e.g., refine chunking strategy, improve embedding model, adjust prompt engineering).
  • Track the performance of different LLM versions or retrieval algorithms over time.

Practical Implementation and Tools

The observability stack for DocuChat uses a combination of open-source and commercial tools:

  • Logging: Python’s logging module, configured to output JSON, shipped to Loki (Grafana Labs’ log aggregation system).
  • Metrics: Prometheus for time-series data collection, exposed via a custom FastAPI endpoint with prometheus_client. Grafana for dashboarding and alerting.
  • Tracing: OpenTelemetry Python SDK for instrumentation, sending traces to Jaeger. LangChain offers built-in callbacks that can be adapted for OpenTelemetry.
  • LLM-Specific Platforms: Tools like LangSmith (for LangChain tracing and evaluation), Weights & Biases (for experiment tracking and model evaluation), or Arize AI (for LLM observability and monitoring) can significantly streamline the process, especially for advanced use cases and large teams. For this case study, we opted for a more DIY approach using open-source components to illustrate the fundamentals.
  • Human Feedback: Custom UI elements integrated with a dedicated feedback service that logs to Loki and updates a metrics counter.

Benefits and Outcomes for DocuChat

Implementing this thorough observability strategy yielded significant benefits for the DocuChat team:

  • Faster Debugging: When a user reported an incorrect answer, engineers could quickly trace the query, examine the retrieved documents, the exact prompt, and the LLM’s response, often identifying issues within minutes.
  • Cost Optimization: By monitoring token usage, the team identified opportunities to reduce prompt size and experiment with cheaper LLM models for specific use cases, leading to a 15% reduction in API costs.
  • Improved Performance: Metrics revealed that the vector database was occasionally slow. This led to optimizing the vector indexing strategy and scaling up the database, resulting in a 20% improvement in average response time.
  • Enhanced Reliability: Alerts for LLM API errors or high hallucination rates allowed proactive intervention, preventing widespread user dissatisfaction.
  • Data-Driven Iteration: Human feedback and evaluation metrics provided concrete evidence for which changes to the RAG pipeline or prompt engineering were most effective, driving continuous improvement.
  • Transparency: Stakeholders gained better insight into the system’s performance and the value it was delivering.

Challenges and Future Work

While successful, the implementation wasn’t without its challenges:

  • Data Volume: LLM interactions can generate a lot of data (prompts, responses, context). Managing and storing this data efficiently is crucial.
  • Contextual Understanding: Interpreting LLM outputs still often requires human judgment. Automated evaluation is improving but not perfect.
  • Prompt Engineering Changes: Frequent changes to prompts require careful versioning and ensuring evaluation metrics remain relevant.
  • Hallucination Detection: While some metrics (like RAGAS faithfulness) help, solid, real-time hallucination detection remains an active research area.

Future work for DocuChat includes:

  • Integrating advanced anomaly detection for LLM outputs.
  • Developing more sophisticated A/B testing frameworks for different RAG configurations.
  • Exploring proactive detection of adversarial prompts or prompt injection attempts.
  • Using fine-tuned smaller models for specific tasks to further optimize cost and latency.

Conclusion

Observability is not merely a nice-to-have but a fundamental requirement for building reliable, performant, and cost-effective LLM applications. As demonstrated by the DocuChat case study, a thorough strategy incorporating structured logs, detailed metrics, end-to-end traces, and crucial human feedback provides the insights necessary to navigate the unique complexities of LLM-powered systems. By investing in observability from the outset, development teams can gain confidence in their LLM applications, debug issues rapidly, optimize performance, and ultimately deliver a superior user experience.

Originally published: January 14, 2026

Written by Jake Chen, AI technology writer and researcher.
