A comprehensive and detailed guide to implementing observability in AI applications using LangSmith
- Why LangSmith?
- Core Concepts
- Getting Started
- RAG Implementation Evolution
- LangGraph Integration
- Advanced Features
- Best Practices
- Resources
Large Language Models (LLMs) present unique challenges that conventional debugging tools cannot handle:
Core Issues:
- Non-Deterministic Behavior - The same input can produce different outputs
- Black Box Nature - No visibility into internal decision-making; we only see the input and the output
- Silent Failures - Output quality degrades without any error message
- Complex Workflows - Multi-step pipelines make it hard to trace where a problem originates
Application: Automated job matching and cover letter generation
- Normal Performance: 1 minute execution time
- Production Issue: Sudden spike to 10 minutes
- Problem: Cannot identify which component in the complex workflow is causing the slowdown
Application: Detailed research query processing
- Normal Cost: ₹1 per query
- Production Issue: Cost jumps to ₹20 per query
- Problem: No visibility into token consumption across pipeline components
Application: Company policy and leave management assistant
- Production Issue: Hallucinations providing incorrect information
- Root Causes: Either a retriever malfunction or an LLM generation issue
- Problem: Cannot determine which component is failing
- ❌ No stack traces for quality degradation
- ❌ No error messages for incorrect outputs
- ❌ Cannot inspect LLM reasoning process
- ❌ Difficult to track performance across multi-step workflows
Observability is the ability to understand your system's internal state by examining its external outputs.
- Logs - What happened and when
- Metrics - Quantitative measurements
- Traces - Request flow through system components
| Traditional Apps | AI Applications |
|---|---|
| Predictable flows | Dynamic, context-dependent paths |
| Clear error messages | Silent quality degradation |
| Deterministic outputs | Non-deterministic responses |
| Standard monitoring | Custom evaluation metrics needed |
LangSmith is a unified platform for debugging, testing, and monitoring AI applications.
- Input/Output Data - Complete request/response pairs
- Intermediate Steps - Every component execution
- Performance Metrics - Latency, token usage, costs
- Error Tracking - Failures and their contexts
- Custom Metadata - Tags, versions, user information
- User Feedback - Real-world quality assessments
Project: the complete AI application or workflow
Example: Customer Support Bot
Trace: a single execution of your entire project
User Query: "What is our vacation policy?" = 1 Trace
Run: an individual component execution within a trace
Trace: "Vacation policy query"
βββ Run 1: Query Processing (Prompt Engineering)
βββ Run 2: Document Retrieval (Vector Search)
βββ Run 3: Context Ranking (Reranking)
βββ Run 4: Answer Generation (LLM)
graph LR
A[User Query] --> B[Prompt Engineering]
B --> C[Vector Search]
C --> D[Reranking]
D --> E[LLM Generation]
E --> F[Response]
- Python 3.8+
- LangSmith Account (Free tier available)
- Groq API Key (Free tier available)
git clone https://github.com/Shubham8831/LangSmith-Tutorial
cd LangSmith-Tutorial
pip install -r requirements.txt

Create a .env file in your project root:
# API Keys
GROQ_API_KEY=your_groq_api_key_here
LANGCHAIN_API_KEY=your_langsmith_api_key_here
# LangSmith Configuration
LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_PROJECT=your_project_name

Get API keys from the Groq console (https://console.groq.com) and the LangSmith dashboard (https://smith.langchain.com).

Run the first example (a minimal sketch of a similar script appears after the list below):

python 1_simple_llm_call.py

- Open the LangSmith Dashboard
- Navigate to your project
- Explore traces, metrics, and performance data
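For orientation, here is a minimal sketch of what a first traced LLM call can look like. It assumes the langchain-groq and python-dotenv packages and a Groq-hosted model name; the actual 1_simple_llm_call.py in the repository may differ in its details.

```python
from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

load_dotenv()  # loads GROQ_API_KEY and the LANGCHAIN_* tracing variables from .env

prompt = ChatPromptTemplate.from_template("Answer concisely: {question}")
llm = ChatGroq(model="llama-3.1-8b-instant")  # any Groq-hosted chat model works here
chain = prompt | llm

# With LANGCHAIN_TRACING_V2=true, this invocation is traced to your project automatically.
response = chain.invoke({"question": "What is observability?"})
print(response.content)
```

Note that no explicit LangSmith code is needed for basic tracing; the environment variables alone enable it.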
import os
from datetime import datetime

# Static project name
os.environ['LANGCHAIN_PROJECT'] = "your-project-name"

# Or a dynamic, time-based project name
os.environ['LANGCHAIN_PROJECT'] = f"Customer-Support-{datetime.now().strftime('%Y-%m')}"

from langchain_core.runnables import RunnableConfig
config = RunnableConfig(
    run_name="Policy Query Handler",
    tags=["production", "policy-bot", "v2.1"],
    metadata={
        "user_id": "user_12345",
        "session_id": "session_67890",
        "version": "2.1.0",
        "environment": "production",
        "model_version": "gpt-4-turbo",
        "retrieval_method": "semantic_search"
    }
)

response = chain.invoke(
    {"question": "What is the vacation policy?"},
    config=config
)

# Trace-level configuration
trace_config = {
    "tags": ["high-priority", "policy-query"],
    "metadata": {"urgency": "high", "department": "HR"}
}

# Component-level configuration
llm_config = {
    "run_name": "GPT-4 Policy Generation",
    "tags": ["llm-generation"],
    "metadata": {"temperature": 0.1, "max_tokens": 500}
}

Tutorial Files: GitHub Repository
Issues Identified:
- ❌ Only LangChain runnables traced
- ❌ Document loading invisible
- ❌ Text chunking not monitored
- ❌ Vector store recreation on each run
- ❌ Incomplete pipeline visibility
Observation: Limited debugging capability for RAG components
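To make the v1 limitation concrete, here is a hypothetical sketch of the pattern: loading, chunking, and indexing happen in plain Python outside any runnable, so only the final chain invocation shows up in LangSmith. The component choices (PyPDFLoader, FAISS, HuggingFaceEmbeddings, ChatGroq) are illustrative assumptions, not necessarily what the tutorial files use.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# None of these preprocessing steps produce a trace:
docs = PyPDFLoader("policy.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
store = FAISS.from_documents(chunks, HuggingFaceEmbeddings())  # rebuilt on every run
retriever = store.as_retriever()

# Only this runnable chain is traced:
prompt = ChatPromptTemplate.from_template("Answer using the context.\n\n{context}\n\nQuestion: {question}")
chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | ChatGroq(model="llama-3.1-8b-instant")
chain.invoke("What is the vacation policy?")
```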
Solution: apply the @traceable decorator to individual functions (it also accepts a name, tags, and metadata)
from langsmith import traceable
@traceable(
    name="Document Loader",
    tags=["preprocessing", "io"],
    metadata={"loader_type": "pdf", "chunk_size": 1000}
)
def load_documents(file_path: str):
    """Load and process documents with full tracing"""
    # Document loading implementation
    return documents

@traceable(name="Text Chunking", tags=["preprocessing"])
def chunk_documents(documents):
    """Split documents into chunks with tracing"""
    # Chunking implementation
    return chunks

@traceable(name="Vector Store Creation", tags=["embedding"])
def create_vector_store(chunks):
    """Create vector database with tracing"""
    # Vector store implementation
    return vector_store

New Issue: Components are traced individually but lack a unified pipeline view
Solution: Unified pipeline with hierarchical tracing
- ✅ All components traced with proper relationships
- ✅ Unified pipeline visualization
- ✅ Organized metadata and tagging
- ❌ Performance inefficiency remains
Achievement: Complete pipeline observability
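A sketch of how that hierarchy can be achieved: wrap the traced helper functions defined above in a single @traceable parent so they appear as nested child runs under one pipeline trace. The generate_answer helper and model name here are illustrative assumptions.

```python
from langsmith import traceable
from langchain_groq import ChatGroq

@traceable(name="Answer Generation", tags=["llm"])
def generate_answer(vector_store, question: str) -> str:
    docs = vector_store.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)
    llm = ChatGroq(model="llama-3.1-8b-instant")
    return llm.invoke(f"Context:\n{context}\n\nQuestion: {question}").content

@traceable(name="RAG Pipeline", tags=["rag", "v3"])
def rag_pipeline(file_path: str, question: str) -> str:
    """Parent run: each traced helper below shows up as a nested child run."""
    documents = load_documents(file_path)       # child run
    chunks = chunk_documents(documents)         # child run
    vector_store = create_vector_store(chunks)  # child run
    return generate_answer(vector_store, question)  # child run
```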
Solution: a persistent vector store plus comprehensive tracing (sketched after the list below)
- ✅ Vector store created once, reused efficiently
- ✅ Complete component tracing
- ✅ Production-grade performance
- ✅ Comprehensive error handling
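One way to get the "create once, reuse" behavior, sketched with a FAISS index persisted to disk. The index path, embeddings choice, and FAISS itself are assumptions; any vector store with save/load support follows the same pattern.

```python
import os
from langsmith import traceable
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

INDEX_DIR = "faiss_index"  # hypothetical on-disk location for the persisted index
embeddings = HuggingFaceEmbeddings()

@traceable(name="Get Vector Store", tags=["embedding", "v4"])
def get_vector_store(chunks):
    """Load the persisted index if it exists; otherwise build it once and save it."""
    if os.path.isdir(INDEX_DIR):
        return FAISS.load_local(INDEX_DIR, embeddings, allow_dangerous_deserialization=True)
    store = FAISS.from_documents(chunks, embeddings)
    store.save_local(INDEX_DIR)
    return store
```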
| Feature | v1 | v2 | v3 | v4 |
|---|---|---|---|---|
| Chain Tracing | ✅ | ✅ | ✅ | ✅ |
| Function Tracing | ❌ | ✅ | ✅ | ✅ |
| Pipeline Unity | ❌ | ❌ | ✅ | ✅ |
| Performance Optimized | ❌ | ❌ | ❌ | ✅ |
| Production Ready | ❌ | ❌ | ❌ | ✅ |
Example File: 5_langgraph.py
Problem: LangGraph creates complex, branching workflows that are difficult to debug and monitor. LangSmith maps this structure onto traces as follows (a minimal traced graph is sketched after the benefits list below):
- Graph Execution → Complete trace with all nodes
- Each Node → Individual run (retriever, LLM, tools, sub-graphs)
- Path Visualization → Clear execution flow mapping
- Branch Tracking → Conditional and parallel path monitoring
START → Query Analysis → Document Retrieval → Relevance Check
                                                    ↓
        Response Generation ← Context Ranking ← [Relevant?]
                                                    ↓
                            Fallback Response ← [Not Relevant?]
Benefits:
- Complete visibility into graph execution
- Performance metrics for each node
- Easy identification of bottlenecks
- Understanding of decision paths
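Below is a minimal, self-contained sketch of a graph in this shape; the node logic is stubbed out, and the actual 5_langgraph.py may be structured differently. With the tracing environment variables set, compiling and invoking the graph is enough for every node and conditional branch to appear in LangSmith.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    docs: list
    answer: str

def retrieve(state: State) -> dict:
    # plug in a real retriever here; each node becomes its own run in the trace
    return {"docs": ["...retrieved context..."]}

def grade(state: State) -> str:
    # conditional edge: route based on whether anything relevant was found
    return "generate" if state["docs"] else "fallback"

def generate(state: State) -> dict:
    return {"answer": "answer grounded in the retrieved docs"}

def fallback(state: State) -> dict:
    return {"answer": "Sorry, I couldn't find anything relevant."}

builder = StateGraph(State)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_node("fallback", fallback)
builder.add_edge(START, "retrieve")
builder.add_conditional_edges("retrieve", grade, {"generate": "generate", "fallback": "fallback"})
builder.add_edge("generate", END)
builder.add_edge("fallback", END)

graph = builder.compile()
result = graph.invoke({"question": "What is the vacation policy?"})
```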
Implementation:
from langsmith import Client
client = Client()
# Capture user feedback
client.create_feedback(
    run_id=run_id,
    key="user_rating",
    score=0.8,
    comment="Accurate and helpful response"
)

Capabilities:
- Real-time feedback collection
- Feedback-trace correlation
- Bulk analysis and insights
- Quality improvement loops
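The snippet above assumes you already have the run_id of the invocation being rated. One way to capture it with LangChain is the collect_runs context manager, sketched below; `chain` stands for any runnable you have already built.

```python
from langchain_core.tracers.context import collect_runs
from langsmith import Client

client = Client()

with collect_runs() as cb:
    answer = chain.invoke({"question": "What is the vacation policy?"})
    run_id = cb.traced_runs[0].id  # top-level run of this invocation

# Later, once the user has rated the answer:
client.create_feedback(run_id=run_id, key="user_rating", score=1.0, comment="Helpful")
```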
Features:
- Cross-functional dashboard access
- Trace annotation and commenting
- Shared experiment tracking
- Role-based permissions
Use Cases:
- Product managers reviewing AI quality
- QA teams annotating edge cases
- Engineers debugging issues collaboratively
Workflow:
from langsmith import Client
client = Client()
# Create evaluation dataset
dataset = client.create_dataset(
    dataset_name="Customer_Support_Eval",
    description="Evaluation dataset for policy questions"
)

# Add examples
client.create_example(
    dataset_id=dataset.id,
    inputs={"question": "What is the vacation policy?"},
    outputs={"answer": "Employees get 20 days annual leave..."}
)

Applications:
- Model evaluation benchmarks
- Fine-tuning data preparation
- A/B testing frameworks
- Performance regression testing
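Once the dataset exists, it can drive an experiment. Here is a sketch using the evaluate helper from the langsmith SDK; the target function and the toy evaluator are illustrative assumptions, not part of the tutorial code.

```python
from langsmith import evaluate

def contains_reference(run, example) -> dict:
    """Toy evaluator: does the reference answer's opening appear in the prediction?"""
    predicted = (run.outputs or {}).get("answer", "")
    reference = (example.outputs or {}).get("answer", "")
    return {"key": "contains_reference", "score": float(reference[:20] in predicted)}

def target(inputs: dict) -> dict:
    # Replace with a call into your real chain or pipeline.
    return {"answer": "Employees get 20 days annual leave..."}

results = evaluate(
    target,
    data="Customer_Support_Eval",     # the dataset created above
    evaluators=[contains_reference],
    experiment_prefix="policy-bot",
)
```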
A/B Testing Framework:
def run_prompt_experiment():
    prompts = {
        "formal": "Please provide a professional response...",
        "casual": "Hey! Can you help me understand...",
        "detailed": "Provide a comprehensive explanation..."
    }
    for prompt_type, prompt_text in prompts.items():
        config = {"tags": [f"experiment_{prompt_type}"]}
        # Run evaluation with each prompt
        ...

Comprehensive Quality Metrics:
from langsmith.evaluation import LangChainStringEvaluator
# Faithfulness evaluation
faithfulness_evaluator = LangChainStringEvaluator("faithfulness")
# Relevance evaluation
relevance_evaluator = LangChainStringEvaluator("relevance")
# Custom evaluation
def custom_policy_evaluator(run, example):
    """Custom evaluator for policy-specific responses"""
    # Implementation
    return {"score": score, "reasoning": explanation}

Evaluation Types:
- Offline: Pre-deployment batch testing
- Online: Continuous production monitoring
- Custom: Domain-specific quality metrics
Alert Configuration:
# Monitor key metrics
metrics_to_track = {
    "latency_p95": {"threshold": 2000, "unit": "ms"},
    "cost_per_query": {"threshold": 0.50, "unit": "USD"},
    "error_rate": {"threshold": 0.05, "unit": "percentage"},
    "user_satisfaction": {"threshold": 0.8, "unit": "score"}
}

Monitoring Capabilities:
- Real-time performance dashboards
- Automated alert systems
- Cost tracking and optimization
- Quality degradation detection
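Run data is also available programmatically, which is one way to compute numbers like the thresholds above outside the dashboard. A sketch using Client.list_runs; the project name is a placeholder.

```python
from datetime import datetime, timedelta, timezone
from langsmith import Client

client = Client()

# Fetch the last hour of top-level runs for a project and compute simple stats.
runs = list(client.list_runs(
    project_name="your-project-name",
    start_time=datetime.now(timezone.utc) - timedelta(hours=1),
    is_root=True,
))

if runs:
    error_rate = sum(1 for r in runs if r.error) / len(runs)
    latencies_ms = [(r.end_time - r.start_time).total_seconds() * 1000 for r in runs if r.end_time]
    print(f"runs={len(runs)}, error rate={error_rate:.2%}, slowest={max(latencies_ms, default=0):.0f} ms")
```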
# Use descriptive project names
os.environ['LANGCHAIN_PROJECT'] = f"PolicyBot-{environment}-v{version}"
# Implement consistent tagging
tags = [
    f"env-{environment}",
    f"version-{app_version}",
    f"model-{model_name}",
    "production" if is_production else "development"
]

@traceable(
    name="Document Retrieval",
    tags=["rag", "retrieval"],
    metadata={
        "retrieval_method": "semantic_search",
        "top_k": 5,
        "similarity_threshold": 0.8
    }
)
def retrieve_documents(query: str):
    # Clear, descriptive function names
    # Comprehensive metadata
    # Appropriate tagging
    pass

- Define Clear Metrics: Establish what "good" looks like
- Create Diverse Datasets: Cover edge cases and typical usage
- Automate Evaluations: Integrate into CI/CD pipeline
- Monitor Continuously: Track performance over time
- Act on Insights: Use data to drive improvements
- Sanitize sensitive data in traces
- Use appropriate access controls
- Implement data retention policies
- Regular security audits
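For the first point on sanitizing data, one option is to scrub traced inputs and outputs before they leave your process. This is a sketch assuming the hide_inputs / hide_outputs hooks of langsmith.Client; the regex and the wiring comments are illustrative.

```python
import re
from langsmith import Client

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub(data: dict) -> dict:
    """Redact email addresses before run data is sent to LangSmith."""
    return {k: EMAIL.sub("<redacted>", v) if isinstance(v, str) else v for k, v in data.items()}

# Use this client wherever you trace (for example via @traceable(client=client)),
# so all inputs and outputs pass through scrub() first.
client = Client(hide_inputs=scrub, hide_outputs=scrub)
```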
Traces Not Appearing
# Verify environment variables
import os
print(f"LANGCHAIN_TRACING_V2: {os.getenv('LANGCHAIN_TRACING_V2')}")
print(f"LANGCHAIN_API_KEY: {'Set' if os.getenv('LANGCHAIN_API_KEY') else 'Not Set'}")π΄ High Costs
- Monitor token usage patterns
- Implement caching strategies
- Optimize prompt efficiency
- Use appropriate model tiers
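For the caching suggestion above, LangChain's global LLM cache is a low-effort starting point (in-memory here; persistent caches such as SQLite are also available).

```python
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

# Repeated identical prompts are served from the cache instead of calling the API again,
# which shows up in LangSmith as lower latency and reduced token spend.
set_llm_cache(InMemoryCache())
```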
Performance Issues
- Check network latency
- Optimize retrieval components
- Implement async operations
- Monitor resource utilization
- Beginner: Start with 1_simple_llm_call.py
- Intermediate: Progress through the RAG implementations (v1-v4)
- Advanced: Explore LangGraph integration and evaluation
- Expert: Implement custom evaluators and monitoring
LangSmith transforms AI development from experimental guesswork to systematic, data-driven engineering:
- Complete Visibility - See every step of your AI workflow
- Data-Driven Decisions - Optimize based on real performance data
- Faster Debugging - Quickly identify and resolve issues
- Team Collaboration - Enable cross-functional AI development
- Continuous Improvement - Build feedback loops for quality enhancement
- Start Here: Clone the tutorial repository
- Setup: Configure your environment following this guide
- Practice: Run examples from simple LLM to complex RAG
- Explore: Use the LangSmith dashboard for insights
- Build: Apply these concepts to your AI projects
- Scale: Implement production monitoring and evaluation
In the era of non-deterministic AI, observability isn't just helpful; it's essential for building reliable, scalable applications that users can trust.
Ready to build better AI? Start your journey with LangSmith today!