
LangSmith: Complete Observability Guide for AI Applications


A comprehensive guide to implementing observability in AI applications using LangSmith

Table of Contents

  • Why LangSmith?
  • What is Observability?
  • LangSmith: AI Observability Platform
  • Core Concepts
  • Getting Started
  • Advanced Configuration
  • RAG Implementation Evolution
  • LangGraph Integration
  • Advanced Features
  • Best Practices
  • Troubleshooting
  • Resources
  • Conclusion

Why LangSmith?

The AI Debugging Challenge

Large Language Models (LLMs) present unique challenges that normal debugging tools cannot handle:

Core Issues:

  • Non-Deterministic Behavior - The same input can produce different outputs
  • Black Box Nature - No visibility into internal decision-making; you only see the input and the output
  • Silent Failures - Output quality degrades without any error message
  • Complex Workflows - Multi-step pipelines are hard to trace, so it is unclear where a failure originates

Real-World Production Issues

🔍 Case Study 1: AI Job Assistant

Application: Automated job matching and cover letter generation

  • Normal Performance: 1 minute execution time
  • Production Issue: Sudden spike to 10 minutes
  • Problem: Cannot identify which component in the complex workflow is causing the slowdown

🔍 Case Study 2: Research Assistant Agent

Application: Detailed research query processing

  • Normal Cost: ₹1 per query
  • Production Issue: Cost jumps to ₹20 per query
  • Problem: No visibility into token consumption across pipeline components

🔍 Case Study 3: RAG-Based Policy Chatbot

Application: Company policy and leave management assistant

  • Production Issue: Hallucinations providing incorrect information
  • Possible Root Causes: either a retriever malfunction or an LLM generation issue
  • Problem: Cannot determine which component is failing

Why Traditional Debugging Fails

  • ❌ No stack traces for quality degradation
  • ❌ No error messages for incorrect outputs
  • ❌ Cannot inspect LLM reasoning process
  • ❌ Difficult to track performance across multi-step workflows

What is Observability?

Observability is the ability to understand your system's internal state by examining its external outputs.

Key Components:

  • 📊 Logs - What happened and when
  • 📈 Metrics - Quantitative measurements
  • 🔍 Traces - Request flow through system components

AI-Specific Observability Challenges:

| Traditional Apps      | AI Applications                  |
|-----------------------|----------------------------------|
| Predictable flows     | Dynamic, context-dependent paths |
| Clear error messages  | Silent quality degradation       |
| Deterministic outputs | Non-deterministic responses      |
| Standard monitoring   | Custom evaluation metrics needed |

LangSmith: AI Observability Platform

LangSmith is a unified platform for debugging, testing, and monitoring AI applications.

What LangSmith Captures:

  • 📥 Input/Output Data - Complete request/response pairs
  • 🔄 Intermediate Steps - Every component execution
  • ⏱️ Performance Metrics - Latency, token usage, costs
  • ❌ Error Tracking - Failures and their contexts
  • 🏷️ Custom Metadata - Tags, versions, user information
  • 💬 User Feedback - Real-world quality assessments

Core Concepts

📁 Project

The complete AI application or workflow

Example: Customer Support Bot

🔗 Trace

A single execution of your entire project

User Query: "What is our vacation policy?" = 1 Trace

⚙️ Run

Individual component executions within a trace

Trace: "Vacation policy query"
├── Run 1: Query Processing (Prompt Engineering)
├── Run 2: Document Retrieval (Vector Search)
├── Run 3: Context Ranking (Reranking)
└── Run 4: Answer Generation (LLM)

Visual Representation:

graph LR
    A[User Query] --> B[Prompt Engineering]
    B --> C[Vector Search]
    C --> D[Reranking]
    D --> E[LLM Generation]
    E --> F[Response]

Getting Started

Prerequisites

Step 1: Repository Setup

git clone https://github.com/Shubham8831/LangSmith-Tutorial
cd LangSmith-Tutorial
pip install -r requirements.txt

Step 2: Environment Configuration

Create a .env file in your project root:

# API Keys
GROQ_API_KEY=your_groq_api_key_here
LANGCHAIN_API_KEY=your_langsmith_api_key_here

# LangSmith Configuration
LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_PROJECT=your_project_name
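The tutorial scripts read these variables at startup. A minimal sketch of how that typically looks with python-dotenv (assuming the package is listed in requirements.txt):

# Load .env so LANGCHAIN_TRACING_V2 and the API keys are already in the
# environment before any LangChain code runs.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

assert os.getenv("LANGCHAIN_API_KEY"), "LANGCHAIN_API_KEY is missing from .env"
print("Tracing enabled:", os.getenv("LANGCHAIN_TRACING_V2"))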

💡 Get API Keys:

Step 3: Run First Example

python 1_simple_llm_call.py
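For orientation, here is a minimal sketch of what a script like 1_simple_llm_call.py might contain; the use of ChatGroq and the model name are illustrative assumptions, not the repository's exact code:

# Hypothetical single traced LLM call. With LANGCHAIN_TRACING_V2=true set,
# LangChain sends this run to LangSmith automatically; no extra tracing code is needed.
from dotenv import load_dotenv
from langchain_groq import ChatGroq

load_dotenv()

llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)  # model name is illustrative
response = llm.invoke("Explain observability in one sentence.")
print(response.content)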

Step 4: View Results

  1. Open LangSmith Dashboard
  2. Navigate to your project
  3. Explore traces, metrics, and performance data

Advanced Configuration

Dynamic Project Naming

from datetime import datetime
import os

# Time-stamped project name, e.g. "Customer-Support-2025-06"
os.environ['LANGCHAIN_PROJECT'] = f"Customer-Support-{datetime.now().strftime('%Y-%m')}"

# ...or a simple static name
# os.environ['LANGCHAIN_PROJECT'] = "your-project-name"

Comprehensive Tracing Configuration (adding run_name, metadata, and tags)

from langchain_core.runnables import RunnableConfig

config = RunnableConfig(
    run_name="Policy Query Handler",
    tags=["production", "policy-bot", "v2.1"],
    metadata={
        "user_id": "user_12345",
        "session_id": "session_67890",
        "version": "2.1.0",
        "environment": "production",
        "model_version": "gpt-4-turbo",
        "retrieval_method": "semantic_search"
    }
)

response = chain.invoke(
    {"question": "What is the vacation policy?"},
    config=config
)

Hierarchical Trace Organization (separate trace-level and component-level run_name, metadata, and tags)

# Trace-level configuration
trace_config = {
    "tags": ["high-priority", "policy-query"],
    "metadata": {"urgency": "high", "department": "HR"}
}

# Component-level configuration
llm_config = {
    "run_name": "GPT-4 Policy Generation",
    "tags": ["llm-generation"],
    "metadata": {"temperature": 0.1, "max_tokens": 500}
}
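These are plain RunnableConfig-style mappings. A hedged sketch of one way to apply them, with the component-level settings attached to the LLM and the trace-level settings passed at invocation time (prompt and llm are assumed to be defined elsewhere):

# Component-level config binds to the LLM run; trace-level config binds to the parent trace.
configured_llm = llm.with_config(llm_config)   # run_name/tags/metadata for the LLM child run
chain = prompt | configured_llm                # `prompt` is an assumed ChatPromptTemplate

response = chain.invoke(
    {"question": "How many vacation days do new employees get?"},
    config=trace_config,                       # tags/metadata for the whole trace
)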

RAG Implementation Evolution

Tutorial Files: GitHub Repository

Version 1: Basic RAG (3_rag_v1.py)

Issues Identified:

  • ❌ Only LangChain runnables traced
  • ❌ Document loading invisible
  • ❌ Text chunking not monitored
  • ❌ Vector store recreated on each run
  • ❌ Incomplete pipeline visibility

Observation: Limited debugging capability for RAG components

Version 2: Adding Observability (3_rag_v2.py)

Solution: apply the @traceable decorator to individual functions (it also accepts name, tags, and metadata)

from langsmith import traceable

@traceable(
    name="Document Loader",
    tags=["preprocessing", "io"],
    metadata={"loader_type": "pdf", "chunk_size": 1000}
)
def load_documents(file_path: str):
    """Load and process documents with full tracing"""
    # Document loading implementation
    return documents

@traceable(name="Text Chunking", tags=["preprocessing"])
def chunk_documents(documents):
    """Split documents into chunks with tracing"""
    # Chunking implementation
    return chunks

@traceable(name="Vector Store Creation", tags=["embedding"])
def create_vector_store(chunks):
    """Create vector database with tracing"""
    # Vector store implementation
    return vector_store

New Issue: components are traced individually, but there is no unified pipeline view connecting them

Version 3: Pipeline Integration (3_rag_v3.py)

Solution: Unified pipeline with hierarchical tracing

  • ✅ All components traced with proper relationships
  • ✅ Unified pipeline visualization
  • ✅ Organized metadata and tagging
  • ❌ Performance inefficiency remains

Achievement: Complete pipeline observability
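The essential change in v3 is a single top-level @traceable function that calls the helpers traced in v2, so their runs nest under one parent trace. A rough sketch (function names follow the v2 example above; generate_answer is an assumed helper):

from langsmith import traceable

@traceable(name="RAG Pipeline", tags=["rag", "v3"])
def rag_pipeline(file_path: str, question: str):
    """Parent run: every traced helper called here nests under this trace."""
    documents = load_documents(file_path)              # child run: Document Loader
    chunks = chunk_documents(documents)                # child run: Text Chunking
    vector_store = create_vector_store(chunks)         # child run: Vector Store Creation
    return generate_answer(vector_store, question)     # child run: assumed retrieval + LLM step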

Version 4: Production-Ready (3_rag_v4.py)

Solution: an optimized, persistent vector store plus comprehensive tracing (sketched after the checklist below)

  • ✅ Vector store created once, reused efficiently
  • ✅ Complete component tracing
  • ✅ Production-grade performance
  • ✅ Comprehensive error handling
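One common way to get the "created once, reused" behavior is to persist the vector store to disk and rebuild it only when it is missing. A sketch assuming Chroma via langchain-chroma and HuggingFace embeddings; the repository's actual store and embedding model may differ:

# Build the vector store once, then reuse the persisted copy on later runs.
import os
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

PERSIST_DIR = "chroma_store"
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

def get_vector_store(chunks=None):
    """Reuse the persisted store if present; otherwise build and persist it."""
    if os.path.isdir(PERSIST_DIR):
        return Chroma(persist_directory=PERSIST_DIR, embedding_function=embeddings)
    return Chroma.from_documents(documents=chunks, embedding=embeddings,
                                 persist_directory=PERSIST_DIR)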

Evolution Comparison

| Feature               | v1 | v2 | v3 | v4 |
|-----------------------|----|----|----|----|
| Chain Tracing         | ✅ | ✅ | ✅ | ✅ |
| Function Tracing      | ❌ | ✅ | ✅ | ✅ |
| Pipeline Unity        | ❌ | ❌ | ✅ | ✅ |
| Performance Optimized | ❌ | ❌ | ❌ | ✅ |
| Production Ready      | ❌ | ❌ | ❌ | ✅ |

LangGraph Integration

Example File: 5_langgraph.py

Complex Workflow Challenges

Problem: LangGraph creates complex, branching workflows that are difficult to debug and monitor.

LangSmith Solution:

  • Graph Execution → Complete trace with all nodes
  • Each Node → Individual run (retriever, LLM, tools, sub-graphs)
  • Path Visualization → Clear execution flow mapping
  • Branch Tracking → Conditional and parallel path monitoring

Example Workflow:

START → Query Analysis → Document Retrieval → Relevance Check
                                                   ├─ [Relevant]     → Context Ranking → Response Generation
                                                   └─ [Not Relevant] → Fallback Response
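A stripped-down sketch of how such a branching graph can be wired with LangGraph; the node bodies are stubs and the real 5_langgraph.py will differ, but each node below appears as its own run in LangSmith:

from typing import List, TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    documents: List[str]
    answer: str

def retrieve(state: State) -> dict:
    return {"documents": ["...retrieved policy chunks..."]}       # stub retriever

def generate(state: State) -> dict:
    return {"answer": "...answer grounded in retrieved docs..."}  # stub LLM call

def fallback(state: State) -> dict:
    return {"answer": "Sorry, I couldn't find that in the policy documents."}

def is_relevant(state: State) -> str:
    # Conditional edge: route to generation only when retrieval found something.
    return "generate" if state["documents"] else "fallback"

graph = StateGraph(State)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_node("fallback", fallback)
graph.add_edge(START, "retrieve")
graph.add_conditional_edges("retrieve", is_relevant, {"generate": "generate", "fallback": "fallback"})
graph.add_edge("generate", END)
graph.add_edge("fallback", END)

app = graph.compile()
print(app.invoke({"question": "What is the vacation policy?", "documents": [], "answer": ""}))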

Benefits:

  • 🔍 Complete visibility into graph execution
  • 📊 Performance metrics for each node
  • 🐛 Easy identification of bottlenecks
  • 🔀 Understanding of decision paths

Advanced Features

👥 User Feedback Integration

Implementation:

from langsmith import Client

client = Client()

# Capture user feedback
client.create_feedback(
    run_id=run_id,
    key="user_rating",
    score=0.8,
    comment="Accurate and helpful response"
)

Capabilities:

  • Real-time feedback collection
  • Feedback-trace correlation
  • Bulk analysis and insights
  • Quality improvement loops

🀝 Team Collaboration

Features:

  • Cross-functional dashboard access
  • Trace annotation and commenting
  • Shared experiment tracking
  • Role-based permissions

Use Cases:

  • Product managers reviewing AI quality
  • QA teams annotating edge cases
  • Engineers debugging issues collaboratively

📊 Dataset Management

Workflow:

from langsmith import Client

client = Client()

# Create evaluation dataset
dataset = client.create_dataset(
    dataset_name="Customer_Support_Eval",
    description="Evaluation dataset for policy questions"
)

# Add examples
client.create_example(
    dataset_id=dataset.id,
    inputs={"question": "What is the vacation policy?"},
    outputs={"answer": "Employees get 20 days annual leave..."}
)

Applications:

  • Model evaluation benchmarks
  • Fine-tuning data preparation
  • A/B testing frameworks
  • Performance regression testing

🧪 Prompt Experimentation

A/B Testing Framework:

def run_prompt_experiment():
    prompts = {
        "formal": "Please provide a professional response...",
        "casual": "Hey! Can you help me understand...",
        "detailed": "Provide a comprehensive explanation..."
    }
    
    for prompt_type, prompt_text in prompts.items():
        config = {"tags": [f"experiment_{prompt_type}"]}
        # Run the same evaluation question with each prompt variant; the tag
        # makes the variants easy to compare side by side in the LangSmith UI.

📈 Evaluation System

Comprehensive Quality Metrics:

from langsmith.evaluation import LangChainStringEvaluator

# Faithfulness evaluation
faithfulness_evaluator = LangChainStringEvaluator("faithfulness")

# Relevance evaluation
relevance_evaluator = LangChainStringEvaluator("relevance")

# Custom evaluation (placeholder logic: checks that the expected answer text appears in the response)
def custom_policy_evaluator(run, example):
    """Custom evaluator for policy-specific responses"""
    answer = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    score = 1.0 if expected and expected.lower() in answer.lower() else 0.0
    return {"key": "policy_match", "score": score}
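Evaluators like these are typically passed to langsmith's evaluate() together with a dataset. A hedged sketch that reuses the Customer_Support_Eval dataset created above; answer_question and chain are illustrative assumptions:

from langsmith.evaluation import evaluate

def answer_question(inputs: dict) -> dict:
    """Target under test: wraps the application chain (assumed to exist)."""
    return {"answer": chain.invoke({"question": inputs["question"]})}

results = evaluate(
    answer_question,
    data="Customer_Support_Eval",          # dataset created in the section above
    evaluators=[custom_policy_evaluator],  # custom evaluator defined above
    experiment_prefix="policy-bot",
)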

Evaluation Types:

  • Offline: Pre-deployment batch testing
  • Online: Continuous production monitoring
  • Custom: Domain-specific quality metrics

🚨 Production Monitoring

Alert Configuration:

# Monitor key metrics
metrics_to_track = {
    "latency_p95": {"threshold": 2000, "unit": "ms"},
    "cost_per_query": {"threshold": 0.50, "unit": "USD"},
    "error_rate": {"threshold": 0.05, "unit": "percentage"},
    "user_satisfaction": {"threshold": 0.8, "unit": "score"}
}
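These thresholds can also be checked programmatically by pulling recent runs with the LangSmith client. A rough sketch that derives latency from run timestamps; the project name is a placeholder:

from datetime import datetime, timedelta, timezone
from langsmith import Client

client = Client()
since = datetime.now(timezone.utc) - timedelta(hours=1)

# Flag slow or failing top-level runs from the last hour against the latency threshold above.
for run in client.list_runs(project_name="your_project_name", start_time=since, is_root=True):
    if run.end_time:
        latency_ms = (run.end_time - run.start_time).total_seconds() * 1000
        if latency_ms > metrics_to_track["latency_p95"]["threshold"]:
            print(f"Slow run {run.id}: {latency_ms:.0f} ms")
    if run.error:
        print(f"Failed run {run.id}: {run.error}")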

Monitoring Capabilities:

  • Real-time performance dashboards
  • Automated alert systems
  • Cost tracking and optimization
  • Quality degradation detection

Best Practices

🏗️ Project Organization

# Use descriptive project names; environment and version are placeholders from your deployment config
os.environ['LANGCHAIN_PROJECT'] = f"PolicyBot-{environment}-v{version}"

# Implement consistent tagging
tags = [
    f"env-{environment}",
    f"version-{app_version}",
    f"model-{model_name}",
    "production" if is_production else "development"
]

🔍 Effective Tracing

@traceable(
    name="Document Retrieval",
    tags=["rag", "retrieval"],
    metadata={
        "retrieval_method": "semantic_search",
        "top_k": 5,
        "similarity_threshold": 0.8
    }
)
def retrieve_documents(query: str):
    # Clear, descriptive function names
    # Comprehensive metadata
    # Appropriate tagging
    pass

📊 Evaluation Strategy

  1. Define Clear Metrics: Establish what "good" looks like
  2. Create Diverse Datasets: Cover edge cases and typical usage
  3. Automate Evaluations: Integrate into CI/CD pipeline
  4. Monitor Continuously: Track performance over time
  5. Act on Insights: Use data to drive improvements

🔐 Security Considerations

  • Sanitize sensitive data in traces
  • Use appropriate access controls
  • Implement data retention policies
  • Regular security audits

Troubleshooting

Common Issues:

🔴 Traces Not Appearing

# Verify environment variables
import os
print(f"LANGCHAIN_TRACING_V2: {os.getenv('LANGCHAIN_TRACING_V2')}")
print(f"LANGCHAIN_API_KEY: {'Set' if os.getenv('LANGCHAIN_API_KEY') else 'Not Set'}")

🔴 High Costs

  • Monitor token usage patterns
  • Implement caching strategies (see the sketch below)
  • Optimize prompt efficiency
  • Use appropriate model tiers
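For repeated or near-identical prompts, an LLM cache avoids paying for the same completion twice. A minimal in-memory sketch using LangChain's cache hook (swap in a persistent cache for production):

# Cache identical LLM calls in memory so a repeated prompt costs no extra tokens.
from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

# The first call with a given prompt hits the model; an identical second call
# is answered from the cache instead of the API.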

🔴 Performance Issues

  • Check network latency
  • Optimize retrieval components
  • Implement async operations
  • Monitor resource utilization

Resources

📚 Documentation

🎯 Learning Path

  1. Beginner: Start with 1_simple_llm_call.py
  2. Intermediate: Progress through RAG implementations (v1-v4)
  3. Advanced: Explore LangGraph integration and evaluation
  4. Expert: Implement custom evaluators and monitoring

🛠️ Tools & Integrations

🤝 Community


Conclusion

LangSmith transforms AI development from experimental guesswork to systematic, data-driven engineering:

Key Benefits:

  • 🔍 Complete Visibility - See every step of your AI workflow
  • 📊 Data-Driven Decisions - Optimize based on real performance data
  • 🚀 Faster Debugging - Quickly identify and resolve issues
  • 👥 Team Collaboration - Enable cross-functional AI development
  • 📈 Continuous Improvement - Build feedback loops for quality enhancement

Your Learning Journey:

  1. 🚀 Start Here: Clone the tutorial repository
  2. 🔧 Setup: Configure your environment following this guide
  3. 🎯 Practice: Run examples from simple LLM to complex RAG
  4. 📊 Explore: Use LangSmith dashboard for insights
  5. 🏗️ Build: Apply these concepts to your AI projects
  6. 📈 Scale: Implement production monitoring and evaluation

Final Thoughts

In the era of non-deterministic AI, observability isn't just helpful; it's essential for building reliable, scalable applications that users can trust.

Ready to build better AI? Start your journey with LangSmith today! 🚀


Happy Building! 🎯

GitHub LangSmith
