
Observability for LLM Apps: Best Practices and Practical Examples

📖 12 min read · 2,268 words · Updated Mar 26, 2026

The Rise of LLM Applications and the Need for Advanced Observability

Large Language Models (LLMs) have rapidly transitioned from academic curiosities to foundational components of new applications across industries. From intelligent chatbots and content generators to code assistants and data analysis tools, LLM-powered applications are redefining user experiences and business processes. However, this transformative power comes with a unique set of operational challenges. Unlike traditional software, LLM applications introduce a new layer of complexity stemming from their probabilistic nature, dependency on external model providers, intricate prompt engineering, and the subjective quality of their outputs.

Traditional observability tools, designed for deterministic systems, often fall short when attempting to diagnose issues within LLM applications. A simple 5xx error might indicate a failing API call, but it doesn’t tell you if the model hallucinated, if the prompt was poorly constructed, or if the user’s input was misinterpreted. This gap necessitates a specialized approach to observability, one that focuses not just on system health but also on the quality, relevance, and safety of the LLM’s outputs, as well as the intricate dance between user input, prompt, model, and external tools.

What Makes LLM Observability Different?

The core differences in LLM application architecture and behavior drive the need for a distinct observability strategy:

  • Probabilistic Nature: LLMs don’t always produce the same output for the same input. This non-determinism makes debugging challenging.
  • Black Box Problem: While we can influence LLMs via prompts and fine-tuning, the internal reasoning process remains largely opaque.
  • Prompt Engineering Sensitivity: Small changes in prompts can lead to vastly different outputs, requiring careful tracking of prompt versions and their impact.
  • Subjective Quality: The ‘correctness’ or ‘usefulness’ of an LLM’s output is often subjective and context-dependent, making automated evaluation difficult.
  • External Dependencies: Many LLM apps rely on external APIs (for models, vector databases, RAG sources, etc.), introducing external points of failure and latency.
  • Hallucinations & Bias: LLMs can generate factually incorrect information or exhibit biases, which must be detected and mitigated.
  • Token Usage & Cost: LLM API calls are often billed per token, making cost monitoring a critical aspect of observability.
  • Chaining & Agent Behavior: Complex LLM applications often involve multiple LLM calls, tool usage, and decision-making agents, creating intricate execution paths.

Key Pillars of LLM Observability

Effective LLM observability can be broken down into several key pillars, each addressing a specific aspect of the application’s health and performance:

1. Request & Response Tracing

Just like traditional microservices, tracing individual requests through your LLM application is fundamental. However, for LLMs, this tracing needs to capture significantly more context.

Best Practices:

  • Capture Full Request/Response Payloads: Log the complete user input, the final prompt sent to the LLM, the LLM’s raw response, and your application’s processed output. This is crucial for post-hoc analysis and debugging.
  • Track Prompt Templates & Variables: If you’re using prompt templates, log the template ID/version and the specific variables injected into it for each request. This helps understand how prompt changes affect outcomes.
  • Record Model & Parameters: Log the specific LLM model used (e.g., GPT-4, Claude 3 Opus), temperature, top_p, max_tokens, stop sequences, and any other relevant parameters.
  • Timestamp & Latency: Standard practice, but critical for LLM calls due to potential rate limiting and varying response times. Track end-to-end latency and individual LLM API call latency.
  • User & Session IDs: Associate requests with specific users or sessions to understand individual user experiences and identify patterns.
  • Tool Usage: If your LLM app uses tools (e.g., search API, database query, code interpreter), log which tools were invoked, their inputs, and their outputs. This is especially important for agent-based systems.

Practical Example (Python/LangChain):

from langchain_core.tracers import ConsoleCallbackHandler
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# A simple LLM chain
model = ChatOpenAI(model="gpt-3.5-turbo")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant."),
    ("user", "{query}"),
])
chain = prompt | model

# To see basic tracing in console (for local dev/debug)
response = chain.invoke({"query": "What is the capital of France?"}, config={"callbacks": [ConsoleCallbackHandler()]})

# For production, integrate with a dedicated tracing platform (e.g., Langsmith, OpenTelemetry)
# Example with Langsmith (conceptual, requires setup):
# import os
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = "your_langsmith_api_key"
# os.environ["LANGCHAIN_PROJECT"] = "my_llm_app"
# response = chain.invoke({"query": "Tell me a fun fact about Paris."})
# print(response.content)
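Beyond callback-based tracing, the fields listed in the best practices above can be combined into one structured trace record per LLM call. A minimal sketch, where the `log_llm_trace` helper and its field names are illustrative rather than part of any specific platform:

```python
import json
import time
import uuid

def log_llm_trace(*, user_id, session_id, prompt_template_id, model, params,
                  user_input, final_prompt, raw_response, latency_ms):
    """Emit one structured trace record per LLM call (field names are illustrative)."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "session_id": session_id,
        "prompt_template_id": prompt_template_id,
        "model": model,
        "params": params,  # temperature, top_p, max_tokens, stop sequences, ...
        "user_input": user_input,
        "final_prompt": final_prompt,
        "raw_response": raw_response,
        "latency_ms": latency_ms,
    }
    # In production, ship this JSON to your log pipeline instead of stdout.
    print(json.dumps(record))
    return record
```

Emitting one flat JSON record per call makes it trivial to filter by prompt template version or session ID later in any log aggregator.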

2. Output Quality & Evaluation

Monitoring the subjective quality of LLM outputs is perhaps the most challenging, yet crucial, aspect of LLM observability.

Best Practices:

  • Human Feedback Loops: Implement mechanisms for users to rate or provide feedback on LLM responses (e.g., thumbs up/down, free-text feedback forms). This is invaluable for identifying regressions and areas for improvement.
  • Automated Evaluation Metrics: Where the task permits, use automated metrics: ROUGE scores for summarization, exact match or F1 against a ground truth for factual retrieval, and unit-test pass rates for code generation.
  • LLM-as-a-Judge: Use a more powerful LLM to evaluate the output of another LLM based on predefined criteria (e.g., coherence, relevance, factual accuracy, safety). This can be surprisingly effective for scaling evaluation.
  • Categorization of Failures: When a problem is detected (human or automated), categorize it (e.g., hallucination, irrelevant, incomplete, unsafe, bad formatting, sentiment mismatch). This helps pinpoint specific weaknesses.
  • Ground Truth Data Collection: Continuously collect and label examples of good and bad outputs to build a solid dataset for future model fine-tuning and automated evaluation.

Practical Example (LLM-as-a-Judge):

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

def evaluate_response(user_query, generated_response):
    evaluator_llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
    eval_prompt = ChatPromptTemplate.from_messages([
        ("system", "You are an AI assistant designed to evaluate the quality of another AI's response. Rate the response on a scale of 1 to 5 for relevance, accuracy, and completeness. Provide a brief explanation for your rating."),
        ("user", "User Query: {query}\nAI Response: {response}\nEvaluation:"),
    ])
    evaluation_chain = eval_prompt | evaluator_llm
    return evaluation_chain.invoke({"query": user_query, "response": generated_response}).content

# Example usage
query = "Tell me about the history of the internet."
response = "The internet was invented by Al Gore in 1990." # Incorrect response
eval_result = evaluate_response(query, response)
print(eval_result)
# Expected output might be: "Relevance: 5/5, Accuracy: 1/5, Completeness: 2/5. Explanation: While relevant, the response contains a significant factual inaccuracy about Al Gore inventing the internet." 
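For tasks that do have a ground truth, the automated metrics mentioned in the best practices can be as simple as token-level F1. A minimal, dependency-free sketch (the `token_f1` helper is illustrative, not a standard library function):

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a model answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not pred_tokens or not truth_tokens:
        return 0.0
    # Count overlapping tokens (multiset intersection)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

Tracking a metric like this over time on a fixed evaluation set gives an early warning when a prompt or model change causes a regression.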

3. Cost & Latency Monitoring

LLM API calls can be expensive, and latency directly impacts user experience. Monitoring these metrics is non-negotiable.

Best Practices:

  • Token Usage Tracking: Monitor input and output token counts for every LLM call. This is the primary driver of cost.
  • Cost Estimation: Translate token counts into estimated dollar costs based on provider pricing. Set alerts for unusual cost spikes.
  • API Latency: Track the time taken for each LLM API call. Differentiate between your application’s processing time and the external API’s response time.
  • Rate Limit Monitoring: Keep an eye on API rate limit errors (e.g., 429 Too Many Requests). Implement retry mechanisms with exponential backoff and alert if rate limits are frequently hit.
  • Provider Health: Monitor the status pages of your LLM providers (OpenAI, Anthropic, etc.) and integrate with their status APIs if available.

Practical Example (Cost/Latency Logging):

import time
from langchain_openai import ChatOpenAI

def call_llm_and_log_metrics(model_name, prompt_text):
    start_time = time.time()
    llm = ChatOpenAI(model=model_name)
    try:
        response = llm.invoke(prompt_text)
        end_time = time.time()

        # LangChain often provides token usage in the response metadata
        token_usage = response.response_metadata.get('token_usage', {})
        input_tokens = token_usage.get('prompt_tokens', 0)
        output_tokens = token_usage.get('completion_tokens', 0)
        total_tokens = input_tokens + output_tokens

        latency = (end_time - start_time) * 1000  # milliseconds

        # Simple print for illustration; in production, send to Prometheus/Grafana, Datadog, etc.
        print("LLM Call Metrics:")
        print(f"  Model: {model_name}")
        print(f"  Latency: {latency:.2f} ms")
        print(f"  Input Tokens: {input_tokens}")
        print(f"  Output Tokens: {output_tokens}")
        print(f"  Total Tokens: {total_tokens}")

        # Estimate cost (example pricing, adjust as needed)
        # For gpt-3.5-turbo-0125: input $0.50/M tokens, output $1.50/M tokens
        input_cost = (input_tokens / 1_000_000) * 0.50
        output_cost = (output_tokens / 1_000_000) * 1.50
        print(f"  Estimated Cost: ${input_cost + output_cost:.5f}")

        return response.content
    except Exception as e:
        print(f"LLM Call Error: {e}")
        # Log the error to an error tracking system
        return None

# Example usage
call_llm_and_log_metrics("gpt-3.5-turbo-0125", "Explain quantum entanglement in simple terms.")
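The retry-with-exponential-backoff practice from the list above can be wrapped around any LLM call. A generic sketch (the `call_with_backoff` helper is an illustration, not a library API; real code should catch the provider's specific rate-limit exception rather than bare `Exception`):

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, retryable=(Exception,)):
    """Retry fn() with exponential backoff and jitter; re-raise after max_retries."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable as e:
            if attempt == max_retries - 1:
                raise  # exhausted retries; surface the error
            # Exponential delay with up to 10% jitter to avoid thundering herds
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay:.2f}s")
            time.sleep(delay)
```

Logging each retry (as above) also gives you the signal needed to alert when rate limits are hit frequently.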

4. Safety & Security Monitoring

Preventing the generation of harmful, biased, or inappropriate content is a critical concern for LLM applications.

Best Practices:

  • Input/Output Moderation: Use content moderation APIs (e.g., OpenAI’s moderation endpoint, custom solutions) to scan both user inputs and LLM outputs for harmful content (hate speech, self-harm, sexual content, violence).
  • Jailbreak Detection: Monitor for attempts by users to bypass safety filters or coax the LLM into generating undesirable content (jailbreaking).
  • PII/PHI Detection: Implement PII (Personally Identifiable Information) and PHI (Protected Health Information) detection and redaction to prevent sensitive data leakage.
  • Bias Detection: While complex, try to detect statistically significant biases in LLM outputs over time (e.g., gender, racial bias in generated text).
  • Prompt Injection Monitoring: Look for patterns in user input that suggest prompt injection attempts.

Practical Example (OpenAI Moderation):

from openai import OpenAI

client = OpenAI()

def moderate_text(text):
    try:
        response = client.moderations.create(input=text)
        result = response.results[0]
        if result.flagged:
            print(f"Content Flagged: {result.categories}")
            return True, result.categories
        print("Content is safe.")
        return False, {}
    except Exception as e:
        print(f"Moderation API error: {e}")
        return False, {"error": str(e)}

# Example usage
is_flagged, categories = moderate_text("I hate everyone and want to hurt them.")
# is_flagged, categories = moderate_text("The quick brown fox jumps over the lazy dog.")

if is_flagged:
 # Take action: block response, escalate, etc.
 pass
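PII redaction before logging can start with simple pattern matching. A rough sketch with illustrative regexes (the patterns and the `redact_pii` helper are examples only; production systems should use a dedicated PII/PHI detection service rather than hand-rolled regexes):

```python
import re

# Illustrative patterns only; these will miss many real-world PII formats.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace likely PII spans with typed placeholders before logging."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text
```

Running redaction on both user inputs and LLM outputs before they reach your trace store keeps sensitive data out of logs while preserving the rest of the payload for debugging.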

5. Data & Context Monitoring (for RAG/Agents)

For applications using Retrieval Augmented Generation (RAG) or complex agents, monitoring the data sources and context is vital.

Best Practices:

  • Retrieval Performance: Track the latency and success rate of your vector database queries or external API calls for retrieving context.
  • Retrieved Document Quality: Log the content of the documents retrieved for each query. Evaluate their relevance to the user’s question. This can be done via human review or LLM-as-a-judge.
  • Context Window Utilization: Monitor how much of the LLM’s context window is being used by retrieved documents and the prompt. Overfilling can lead to truncation or decreased performance.
  • Data Freshness: For RAG systems, ensure that the underlying data sources (e.g., vector database, knowledge base) are up-to-date and that your indexing/embedding pipelines are functioning correctly.
  • Tool Selection Accuracy: For agents, monitor if the correct tools are being selected for given user queries. Log the agent’s reasoning process.

Practical Example (RAG Context Logging):

from langchain_core.documents import Document

# Assuming you have a RAG chain set up (e.g., LangChain's RAG example).
# This is conceptual; the specific implementation depends on your RAG setup.

def query_rag_and_log_context(retriever, llm_chain, user_query):
    retrieved_docs = retriever.invoke(user_query)

    # Log retrieved documents for observability
    print(f"\n--- Retrieved Documents for Query: '{user_query}' ---")
    for i, doc in enumerate(retrieved_docs):
        print(f"Doc {i + 1} (Source: {doc.metadata.get('source', 'N/A')}):")
        print(f"  Content Snippet: {doc.page_content[:200]}...")
    print("---------------------------------------------------")

    # Pass documents to the LLM chain along with the query
    return llm_chain.invoke({"context": retrieved_docs, "question": user_query})

# Placeholder retriever and LLM chain for illustration
class MockRetriever:
    def invoke(self, query):
        if "internet" in query:
            return [
                Document(page_content='The ARPANET, funded by DARPA, was the precursor to the Internet. Its development began in the late 1960s.', metadata={'source': 'Wikipedia'}),
                Document(page_content='Vinton Cerf and Robert Kahn developed the TCP/IP protocols, which became the standard for communication on the internet.', metadata={'source': 'RFC 793'}),
            ]
        return []

class MockLLMChain:
    def invoke(self, inputs):
        context_summary = " ".join(d.page_content for d in inputs['context'])
        return f'Based on the provided context, here\'s an answer to "{inputs["question"]}": {context_summary[:150]}...'

# Example usage
my_retriever = MockRetriever()
my_llm_chain = MockLLMChain()

query_rag_and_log_context(my_retriever, my_llm_chain, "Who developed the early internet protocols?")
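Context window utilization, also flagged in the best practices above, can be estimated before each call. A rough sketch assuming retrieved documents are plain strings, using a ~4-characters-per-token heuristic (the `check_context_budget` helper and its thresholds are illustrative; use your model's actual tokenizer, e.g. tiktoken, for precise counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def check_context_budget(docs, question, system_prompt,
                         context_window=16_000, reserve_for_output=1_000):
    """Warn when retrieved docs + prompt approach the model's context window."""
    used = estimate_tokens(system_prompt) + estimate_tokens(question)
    used += sum(estimate_tokens(d) for d in docs)
    budget = context_window - reserve_for_output
    utilization = used / budget
    if utilization > 0.9:
        print(f"WARNING: context {utilization:.0%} full; input may be truncated")
    return utilization
```

Emitting this utilization ratio as a metric per request makes it easy to spot when a retriever starts returning too many or too-long documents.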

Choosing the Right Tools for LLM Observability

A thorough LLM observability strategy often involves a combination of tools:

  • Specialized LLM Observability Platforms: Tools like Langsmith, Helicone, Athina.ai, Arize AI, and Phoenix are purpose-built for LLM applications, offering features like prompt versioning, trace visualization, human feedback integration, and automated evaluation.
  • Traditional APM/Logging Platforms: Datadog, New Relic, Splunk, Elastic Stack (ELK) are still essential for infrastructure monitoring, application logs, and aggregating metrics. Integrate LLM-specific metrics (token usage, latency) into these.
  • Vector Databases: Crucial for RAG, but also for storing embeddings of prompts and responses for similarity search-based analysis of problematic outputs.
  • Experiment Tracking Tools: MLflow, Weights & Biases can be adapted to track prompt engineering experiments and their associated metrics.
  • Custom Dashboards & Scripts: For specific needs, custom Python scripts combined with dashboarding tools like Grafana can provide targeted insights.
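To feed LLM-specific metrics into the traditional platforms listed above, a small in-process aggregator can sit between your LLM calls and the exporter. A conceptual sketch (the `LLMMetricsAggregator` class is illustrative; in production you would export these counters to Prometheus, Datadog, or similar rather than keep them in memory):

```python
from collections import defaultdict

class LLMMetricsAggregator:
    """Minimal in-process aggregator of per-model token and latency metrics."""

    def __init__(self):
        self.token_totals = defaultdict(int)
        self.call_counts = defaultdict(int)
        self.latency_sums = defaultdict(float)

    def record(self, model: str, tokens: int, latency_ms: float):
        """Record one LLM call's token usage and latency under its model name."""
        self.token_totals[model] += tokens
        self.call_counts[model] += 1
        self.latency_sums[model] += latency_ms

    def avg_latency(self, model: str) -> float:
        n = self.call_counts[model]
        return self.latency_sums[model] / n if n else 0.0
```

Keying everything by model name makes per-model cost and latency dashboards straightforward once the counters reach your metrics backend.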

Conclusion

Observability for LLM applications is not a ‘nice-to-have’ but a fundamental requirement for building solid, reliable, and responsible systems. The unique characteristics of LLMs demand a shift from traditional monitoring to a more nuanced approach that encompasses input/output tracing, subjective quality evaluation, cost management, safety checks, and context awareness. By implementing the best practices outlined above and using appropriate tooling, developers and MLOps teams can gain the deep insights necessary to understand, debug, and continuously improve their LLM-powered applications, ensuring they deliver consistent value and maintain user trust.

🕒 Originally published: December 11, 2025

✍️
Written by Jake Chen

AI technology writer and researcher.
