Anomalyze is a production-grade prototype demonstrating a hybrid architecture for Customer Experience (CX) observability. It combines the statistical rigor of Classical Machine Learning for signal detection with the advanced reasoning of Generative AI Agents for root cause analysis.
This system is designed not just to flag anomalies, but to deliver actionable, structured insights that engineering and support teams can use immediately.
The architecture follows a "Detect-Diagnose-Report" pipeline, using the best tool for each stage.
```mermaid
graph LR
    A[Ticket Stream] --> B{ML Model<br/>Detect}
    B -->|Anomaly Found| C{AI Agent<br/>Diagnose}
    C --> D[Fetch Tools:<br/>Stats & Samples]
    D --> C
    C --> E[Generate<br/>Report]
    E --> F[Structured<br/>JSON Output]

    style B fill:#d4e6f1,stroke:#333,stroke-width:2px
    style C fill:#d5f5e3,stroke:#333,stroke-width:2px
    style A fill:#f9f9f9,stroke:#666
    style F fill:#fff4e6,stroke:#666
```
- Detect (Classical ML): A time-series model (Prophet, LightGBM, or ARIMA) is trained on historical ticket volume. It continuously forecasts expected volume and flags any significant deviation from its prediction as an anomaly signal.
- Diagnose (GenAI Agent): When an anomaly is detected, a Google ADK agent is invoked with the anomaly's timestamp. The agent uses a set of tools to investigate, comparing ticket distributions against a baseline and reading raw customer complaints to understand the semantic nature of the incident.
- Report (Structured Output): The agent synthesizes its findings into a structured JSON report, identifying the root cause, quantifying the impact, and suggesting actionable recommendations.
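In code, the pipeline reduces to a short detect-then-diagnose loop. A minimal sketch, assuming hypothetical `detect_anomalies` and `diagnose` helpers (the actual entry points are `python3 -m ml.run` and `python3 -m agent.run`):

```python
# Hypothetical glue code; function names are illustrative, not the repo's API.
from ml.run import detect_anomalies   # Detect: classical ML flags suspicious dates
from agent.run import diagnose        # Diagnose: ADK agent investigates each date

anomalous_dates = detect_anomalies(model="lgbm", interval=0.99)
for date in anomalous_dates:          # e.g. ["2023-10-05"]
    report = diagnose(date=date)      # Report: structured JSON incident report
    print(report["executive_summary"])
```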
This project demonstrates a production-oriented approach from data to deployment.
- Google ADK Agent: A stateful agent built with the Google Agent Development Kit, featuring custom, data-aware tools and structured `pydantic` schemas for reliable output.
- Modular ML Backend: A swappable model architecture with a consistent `fit`/`predict`/`detect_anomalies` interface supporting LightGBM, Prophet, and ARIMA.
- Production-Grade LightGBM: Features time-series cross-validation for robust uncertainty estimation, avoiding the pitfalls of using in-sample residuals.
- Full Observability: Integrated with Arize Phoenix via OpenInference for complete LLM call tracing. Visualize agent reasoning, tool executions, token usage, and latency in real-time through an intuitive web dashboard (see Phoenix screenshot in Quick Start).
- Realistic Anomaly Simulation: Uses real customer support text from HuggingFace, programmatically mapped to a synthetic time-series. This provides a reliable ground truth for validation without circular logic.
- Ops-Ready: Comes with `Makefile` automation, Docker Compose for services, a `pyproject.toml` for dependency management, and a comprehensive `pytest` suite.
This project leverages modern, production-grade tools across the AI and ML ecosystem:
| Component | Technology | Purpose |
|---|---|---|
| Agent Framework | Google ADK | Stateful agent orchestration with tool calling |
| LLM Gateway | LiteLLM + OpenRouter | Unified interface to Claude, GPT-4, and other models |
| Time-Series ML | Prophet, LightGBM, ARIMA | Forecasting and anomaly detection |
| Observability | Arize Phoenix + OpenInference | LLM tracing and debugging |
| Data Source | HuggingFace Datasets | Real customer support ticket text |
| Validation | Pydantic | Schema validation and type safety |
| Testing | pytest | Comprehensive data pipeline testing |
```
.
├── agent/                 # Google ADK Agent for root cause analysis
│   ├── tools/             # Custom tools for data fetching
│   └── schemas.py         # Pydantic schemas for structured output
├── ml/                    # Machine Learning pipeline for signal detection
│   ├── models/            # Pluggable model architecture (LGBM, Prophet, ARIMA)
│   └── features.py        # Time-series feature engineering
├── dataset/               # Data generation, exploration, and visualization
│   ├── preprocess.py      # Synthetic dataset generation with anomaly injection
│   └── visualize.py       # Time-series plotting utilities
├── tests/                 # Pytest suite for data pipeline validation
├── docker-compose.yml     # Phoenix observability stack
├── Makefile               # Automation commands for setup and testing
└── pyproject.toml         # Dependency management and package metadata
```
- Python 3.10+
- Docker (for Phoenix observability)
- An OpenRouter API Key (supports Anthropic, OpenAI, etc.)
```bash
# Clone the repository
git clone https://github.com/abdullahmeda/anomalyze.git
cd anomalyze

# Create virtual environment and install dependencies
make env
make install

# Configure your API key
cp .env.example .env
# Now, edit the .env file and add your OPENROUTER_API_KEY
```

```bash
# Step 1: Generate the dataset with an injected anomaly on Oct 5, 2023
make dataset
make prepare

# Step 2: Run the ML model to detect WHEN the anomaly occurred
# (Activate the virtual environment first)
source .venv/bin/activate
python3 -m ml.run --model lgbm --interval 0.99

# Expected Output:
# LightGBM Anomaly Detection (99% CI)
# Detected: 1 anomaly(s) | Ground truth: 2023-10-05
# → 2023-10-05: 589 tickets (expected 198, bound 245)
# Metrics: P=100% R=100% F1=100%

# Step 3: Run the ADK agent to diagnose WHY it occurred
python3 -m agent.run --date 2023-10-05
# Expected Output: A structured JSON incident report (see below)

# Step 4 (Optional): View agent traces in Phoenix
make start
# Open http://localhost:6006 in your browser
```

Phoenix Dashboard Preview:
Live trace of the ADK agent showing tool calls, reasoning steps, and structured output generation
Building a reliable anomaly detection system requires ground truth data: you need to know when anomalies actually occurred to validate your models. However, publicly available customer support datasets with labeled anomalies are virtually non-existent. The few that exist suffer from:
- Synthetic oversimplification: Random noise or step functions that don't reflect real-world patterns
- Privacy constraints: Real production data with genuine incidents is rarely shared
- Lack of semantic richness: Most time-series datasets are just numbers, missing the qualitative ticket text needed for root cause analysis
- Validation circularity: Using a model's own predictions as "ground truth" defeats the purpose of benchmarking
Rather than relying on pre-labeled data, this project uses a controlled synthesis approach that combines real customer support text with programmatic anomaly injection:
We start with the Tobi-Bueck/customer-support-tickets dataset from HuggingFace, which contains ~62,000 authentic customer support tickets with rich metadata (priority, type, queue, tags, subject, body text).
Tickets are assigned timestamps following a realistic business pattern:
- Weekly seasonality: Higher volume on weekdays (Monday = 1.15×, Sunday = 0.40×)
- Hourly patterns: Peak during business hours (9 AM-3 PM), minimal overnight
- Natural variance: Random jitter (±15%) to simulate day-to-day fluctuations
```python
WEEKLY_PATTERN = {
    0: 1.15,  # Monday
    1: 1.05,  # Tuesday
    2: 1.00,  # Wednesday
    3: 1.00,  # Thursday
    4: 0.95,  # Friday
    5: 0.45,  # Saturday
    6: 0.40,  # Sunday
}
```

On the designated anomaly date (Oct 5, 2023), we don't just add random tickets. Instead, we use weighted sampling based on ticket characteristics to simulate a real production incident:
- Tickets with `priority=high/critical` are 3-5× more likely to be selected
- Tickets tagged with `Bug`, `Outage`, `Crash`, or `Security` are 3-5× more likely
- Tickets in the `Technical Support` queue are 3× more likely
- Result: The anomaly day naturally exhibits the distributional shifts you'd see in a real outage (84% high/critical priority vs. 38% baseline)
This creates a semantically coherent anomaly: the spike isn't just volume, it's a spike in the right kind of tickets.
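A minimal sketch of this weighted sampling with pandas, assuming illustrative column names (`priority`, `tags`, `queue`); the actual logic lives in `dataset/preprocess.py`:

```python
import numpy as np
import pandas as pd

def incident_weights(df: pd.DataFrame) -> np.ndarray:
    """Up-weight tickets that look like a real production incident."""
    w = np.ones(len(df))
    # priority=high/critical tickets are several times more likely to be drawn
    w *= np.where(df["priority"].isin(["high", "critical"]), 4.0, 1.0)
    # incident-flavoured tags (Bug, Outage, Crash, Security) likewise
    incident_tags = {"Bug", "Outage", "Crash", "Security"}
    w *= df["tags"].apply(lambda t: 4.0 if incident_tags & set(t) else 1.0)
    # Technical Support queue is ~3x more likely
    w *= np.where(df["queue"] == "Technical Support", 3.0, 1.0)
    return w / w.sum()  # normalize to a probability distribution

tickets = pd.DataFrame({
    "priority": ["high", "low", "critical", "medium"],
    "tags": [["Bug"], ["Billing"], ["Outage"], ["Feedback"]],
    "queue": ["Technical Support", "Billing", "Technical Support", "Sales"],
})
rng = np.random.default_rng(42)  # mirrors the project's RANDOM_SEED=42
extra = rng.choice(tickets.index, size=2, replace=False, p=incident_weights(tickets))
```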
Because we programmatically inject the anomaly, we have perfect ground truth:

- Exact date: `2023-10-05`
- Exact volume multiplier: `1.5×` (configurable)
- Exact affected tickets: 296 tickets with an `is_anomaly=True` flag
- Metadata exported to `anomaly_metadata.json` for reproducible validation
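The exported ground truth might look like the following sketch; the field names are illustrative, not the actual schema of `anomaly_metadata.json`:

```python
import json

# Illustrative only: the real schema is defined in dataset/preprocess.py.
anomaly_metadata = {
    "anomaly_date": "2023-10-05",   # all values below come from the README
    "volume_multiplier": 1.5,
    "n_anomalous_tickets": 296,
    "random_seed": 42,
}
with open("anomaly_metadata.json", "w") as f:
    json.dump(anomaly_metadata, f, indent=2)
```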
The anomaly only appears in the test set (Oct-Dec 2023), ensuring the ML models are evaluated on truly unseen data. All 22 pytest tests validate this split integrity.
- ✅ Realistic patterns: Models must learn genuine weekly/hourly seasonality, not toy data
- ✅ Semantic richness: Agent can analyze real customer complaints for root cause analysis
- ✅ Reproducible: `RANDOM_SEED=42` ensures identical datasets across runs
- ✅ Flexible: Easy to adjust anomaly magnitude, date, or characteristics for experimentation
- ✅ Validated: Comprehensive test suite ensures data quality and split integrity
This methodology bridges the gap between purely synthetic (unrealistic) and purely real (unavailable) datasets, providing a controlled environment that's still faithful to production complexity.
We use specialized time-series models to learn the business's normal rhythm, including strong weekly seasonality. This provides a robust and cost-effective signal for the more expensive LLM agent.
- LightGBM (Recommended): A gradient boosting model trained on a rich set of engineered features (`day_of_week`, `is_weekend`, `lag_1/7/14`, `rolling_mean_7d`). Uncertainty is estimated using residuals from 5-fold time-series cross-validation, providing a realistic measure of out-of-sample error (see the sketch after this list).
- Prophet: Facebook's additive model, excellent for handling seasonality and trend with minimal feature engineering. Serves as a powerful baseline.
- ARIMA: A classical statistical benchmark (SARIMAX) configured to handle weekly seasonality.
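A minimal sketch of that cross-validated uncertainty estimate, assuming scikit-learn's `TimeSeriesSplit` and the `lightgbm` package (the helper name is illustrative):

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import TimeSeriesSplit

def cv_residual_std(X: np.ndarray, y: np.ndarray, n_splits: int = 5) -> float:
    """Estimate forecast error from out-of-fold residuals, not in-sample fit."""
    residuals = []
    for train_idx, val_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = LGBMRegressor().fit(X[train_idx], y[train_idx])
        residuals.append(y[val_idx] - model.predict(X[val_idx]))
    return float(np.concatenate(residuals).std())

# A day is flagged when observed volume falls outside forecast ± z * sigma,
# where z comes from the requested interval (e.g. ~2.58 for a 99% CI).
```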
All models are modular and can be extended by implementing the `AnomalyModel` interface.
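A sketch of what that interface might look like; the exact signatures in `ml/models/` may differ:

```python
from abc import ABC, abstractmethod

import pandas as pd

class AnomalyModel(ABC):
    """Common contract for the LightGBM, Prophet, and ARIMA backends."""

    @abstractmethod
    def fit(self, history: pd.DataFrame) -> "AnomalyModel":
        """Learn the normal rhythm from historical ticket volume."""

    @abstractmethod
    def predict(self, dates: pd.DatetimeIndex) -> pd.DataFrame:
        """Forecast expected volume with lower/upper uncertainty bounds."""

    @abstractmethod
    def detect_anomalies(self, observed: pd.DataFrame, interval: float = 0.99) -> pd.DataFrame:
        """Flag dates where observed volume exceeds the forecast bound."""
```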
Once a date is flagged, the Google ADK agent takes over. It is designed to mimic the workflow of a human Site Reliability Engineer.
- `fetch_ticket_stats(date)`: Moves beyond simple volume. It retrieves distributional shifts across key categories (priority, ticket type, queue, tags) and compares them to a 7-day baseline average. This helps answer, "Is this spike different in character?"
- `fetch_ticket_samples(date, limit)`: Provides the qualitative evidence. It pulls raw text from customer tickets, allowing the agent to identify common error messages, complaint patterns, and customer sentiment.

Production Enhancement: This currently uses random sampling for simplicity and speed. A production system would benefit from embedding-based clustering (e.g., HDBSCAN with sentence transformers) to automatically surface the most representative and semantically distinct complaint themes, reducing redundancy and improving signal quality.
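A minimal sketch of a data-aware tool in the spirit of `fetch_ticket_stats` (the data path and column names are illustrative; the real tools live in `agent/tools/`):

```python
import pandas as pd

def fetch_ticket_stats(date: str) -> dict:
    """Compare the flagged day's ticket distributions to a 7-day baseline."""
    df = pd.read_parquet("data/tickets.parquet")  # illustrative path
    df["date"] = pd.to_datetime(df["created_at"]).dt.normalize()
    day = pd.Timestamp(date)
    target = df[df["date"] == day]
    baseline = df[(df["date"] >= day - pd.Timedelta(days=7)) & (df["date"] < day)]
    return {
        "volume": len(target),
        "baseline_daily_volume": len(baseline) / 7,
        "priority_share": target["priority"].value_counts(normalize=True).to_dict(),
        "baseline_priority_share": baseline["priority"].value_counts(normalize=True).to_dict(),
        "top_tags": target["tags"].explode().value_counts().head(5).to_dict(),
    }
```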
The agent is instructed via the system prompt to generate a JSON object that conforms to a `pydantic` schema. This ensures the output is machine-readable and reliable for downstream automation (e.g., creating a Jira ticket).
Sample Output (`report.json`):
```json
{
  "title": "Major Service Disruption Due to Server Overload",
  "executive_summary": "On October 5th, a critical issue led to a 268.8% increase in ticket volume, primarily affecting the Technical Support queue with high-priority incidents related to server performance and crashes.",
  "root_cause": "Server overload during peak times affecting multiple platforms, indicated by a surge in tags like 'Bug', 'Technical', 'Security', 'Outage', and 'Crash'.",
  "impact_metrics": {
    "volume_increase_pct": 268.8,
    "primary_priority": "high",
    "primary_queue": "Technical Support",
    "primary_type": "Incident"
  },
  "affected_services": ["Data Analytics Tool", "SaaS Platform", "Digital Campaign Integration"],
  "customer_sentiment": "Frustrated",
  "sample_complaints": [
    "Critical issue with data analytics tool crashing during report generation.",
    "SaaS platform crash due to server overload and resource constraints.",
    "Our digital campaign integration is failing due to repeated server timeouts."
  ],
  "recommendations": [
    "Immediately scale server capacity to handle the increased load and stabilize services.",
    "Investigate the root cause of the server overload, possibly related to a recent deployment or inefficient query.",
    "Implement more robust monitoring and alerting for server resource utilization to prevent future occurrences."
  ]
}
```

Note on ADK Structured Output: Google ADK does not currently support native Pydantic-based structured output when combined with tool calling in a single agent (see discussion). This prototype uses a proven workaround: the desired JSON schema is embedded in the system prompt with explicit instructions for the LLM to format its final response accordingly. For production use cases requiring stricter guarantees, consider implementing a two-agent pattern where a parent agent with tools delegates to a child agent with structured output, or use `response_mime_type="application/json"` for schema-free JSON responses.
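A minimal sketch of the schema-in-prompt workaround, assuming a Pydantic model along the lines of `agent/schemas.py` (the model fields below are trimmed and illustrative):

```python
from pydantic import BaseModel

class IncidentReport(BaseModel):
    title: str
    executive_summary: str
    root_cause: str
    affected_services: list[str]
    recommendations: list[str]

# Embed the JSON schema in the system prompt so the LLM's final answer is parseable.
SYSTEM_PROMPT = f"""
You are a Site Reliability Engineer diagnosing a ticket-volume anomaly.
Use your tools to investigate, then respond ONLY with JSON matching this schema:

{IncidentReport.model_json_schema()}
"""

# After the agent finishes, validate its final message against the schema:
# report = IncidentReport.model_validate_json(final_response_text)
```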
This prototype is designed with production considerations in mind. Here's how it could be deployed in a real-world environment:
```mermaid
graph LR
    A["Ticket System<br/>(Zendesk, Jira, etc.)<br/>Webhook/API"] --> B["Data Pipeline<br/>(Airflow, Prefect, etc.)<br/>ETL + Storage"]
    B --> C["ML Detector<br/>(Batch/Stream)"]
    C --> D["Alert Queue<br/>(SQS, Kafka)"]
    D --> E["ADK Agent<br/>(Lambda/ECS)"]
    E --> F["Output<br/>(Slack, PagerDuty)"]

    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style B fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style C fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style D fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    style E fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style F fill:#fff9c4,stroke:#f9a825,stroke-width:2px
```
- Real-Time Monitoring: Deploy the ML model as a streaming service (e.g., AWS Kinesis, Kafka Streams) that processes ticket volumes in near-real-time. When an anomaly is detected, trigger the agent via an event queue.
- Scheduled Batch Analysis: Run the detector on a cron schedule (e.g., every 15 minutes) using Airflow or similar. This is more cost-effective for systems with moderate ticket volumes.
- Incident Management Integration: Automatically create Jira tickets or PagerDuty incidents from the agent's structured JSON output. The schema is designed to map directly to incident fields.
- Slack/Teams Alerts: Use the agent's `executive_summary` and `recommendations` to generate human-readable messages. The structured format ensures consistent, actionable alerts (see the sketch below).
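As an illustration, a minimal sketch of rendering the report as a Slack incoming-webhook message (the field mapping is an assumption, not part of this repo):

```python
import json
import urllib.request

def post_to_slack(report: dict, webhook_url: str) -> None:
    """Render the agent's structured report as a human-readable alert."""
    recs = "\n".join(f"• {r}" for r in report["recommendations"])
    text = f"*{report['title']}*\n{report['executive_summary']}\n\n*Recommendations:*\n{recs}"
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```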
- ML Model: The LightGBM model is lightweight (~50ms inference on CPU). Prophet and ARIMA are slower but still suitable for batch processing. For high-frequency updates, consider caching predictions or using a model server (e.g., TorchServe, BentoML).
- Agent Invocation: The ADK agent makes 2-4 LLM calls per analysis (~30-60 seconds total). To reduce latency, consider using faster models (e.g., GPT-4o-mini, Claude Haiku) for non-critical periods or implementing request batching.
- Cost Management: Agent analysis costs ~$0.10-0.30 per incident with Claude Sonnet 4.5. Use confidence thresholds on the ML detector to minimize false positives and unnecessary agent invocations.
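One possible gating rule, sketched with an illustrative excess-ratio threshold:

```python
def should_invoke_agent(observed: float, upper_bound: float,
                        min_excess_ratio: float = 1.2) -> bool:
    """Only pay for an LLM diagnosis when a spike clearly clears the bound."""
    # min_excess_ratio is an illustrative knob: 1.2 means the observed volume
    # must exceed the detector's upper bound by at least 20%.
    return observed > upper_bound * min_excess_ratio

# With the Quick Start numbers: 589 > 245 * 1.2, so the agent is invoked.
should_invoke_agent(observed=589, upper_bound=245)  # True
```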
- ML Performance: Track precision, recall, and false positive rate over time. The included evaluation metrics provide a starting point.
- Agent Quality: Use Phoenix to monitor LLM token usage, latency, and error rates. Review agent reasoning traces to identify failure modes.
- Business Metrics: Measure time-to-detection (how quickly anomalies are flagged) and time-to-resolution (how actionable the agent's recommendations are).
This project was built with engineering rigor to ensure it is robust, reproducible, and extensible.
- Testing: A comprehensive 22-test suite using `pytest` validates the entire data pipeline. Tests cover:
  - Data quality and schema validation
  - Train/test split integrity (ensuring anomalies only appear in the test set)
  - Anomaly volume characteristics (~3× baseline with 20% tolerance)
  - Metadata consistency across all generated files
  - Temporal ordering and uniqueness constraints

  Run with `make test` from the project root. All tests must pass before the ML models can be trained. A sketch of a split-integrity test appears after this list.
- Reproducibility: The dataset generation process is deterministic, controlled by `RANDOM_SEED=42` in `dataset/preprocess.py`. This ensures identical anomaly characteristics across runs, enabling reliable benchmarking and debugging.
- Modularity: The ML models implement a consistent `AnomalyModel` interface with `fit()`, `predict()`, and `detect_anomalies()` methods. The agent's tools are stateless functions that can be extended or replaced without modifying the agent logic.
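A minimal sketch of what a split-integrity test might look like; file paths and column handling are illustrative, not the actual suite:

```python
import pandas as pd

def test_anomaly_only_in_test_split():
    """The injected anomaly must never leak into the training data."""
    # Illustrative paths; the real pipeline's artifact layout may differ.
    train = pd.read_csv("data/train.csv", parse_dates=["created_at"])
    test = pd.read_csv("data/test.csv", parse_dates=["created_at"])
    assert not train.get("is_anomaly", pd.Series(dtype=bool)).any()
    assert test["is_anomaly"].any()
    # Temporal ordering: training data strictly precedes the test window
    assert train["created_at"].max() < test["created_at"].min()
```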
