Add Prometheus Metrics Endpoint for Observability #79

@AnkanMisra

Problem

Production deployments have zero visibility into system health, performance, and usage patterns. There's no way to monitor request latency, cache hit rates, verification success rates, or set up alerting.

flowchart LR
    subgraph Current
        G[Gateway] --> L[log.Printf]
        V[Verifier] --> L2[println!]
        L --> N[Nothing]
        L2 --> N
    end
    
    style N fill:#f66

What We Can't Answer Today

| Question | Current Answer |
| --- | --- |
| What's the p99 latency? | Unknown |
| How many requests per minute? | Unknown |
| Cache hit rate? | Unknown |
| Signature verification success rate? | Unknown |
| Is the verifier healthy? | Check logs manually |

Solution

Add Prometheus-compatible /metrics endpoints to both Gateway (Go) and Verifier (Rust) services, exposing standard RED metrics (Rate, Errors, Duration).

flowchart LR
    subgraph Services
        G["Gateway :3000"] --> M1["/metrics"]
        V["Verifier :3002"] --> M2["/metrics"]
    end

    subgraph Monitoring
        P["Prometheus"] --> M1
        P --> M2
        P --> GF["Grafana Dashboard"]
    end

    style M1 fill:#6f6
    style M2 fill:#6f6
    style GF fill:#6f6


Metrics to Expose

Gateway (Go)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gateway_requests_total | Counter | method, path, status | Total HTTP requests |
| gateway_request_duration_seconds | Histogram | method, path | Request latency |
| gateway_cache_hits_total | Counter | path | Cache hits |
| gateway_cache_misses_total | Counter | path | Cache misses |
| gateway_verification_total | Counter | result (success/failure) | Signature verifications |
| gateway_rate_limit_hits_total | Counter | path | Rate limit rejections |
| gateway_active_requests | Gauge | - | Current in-flight requests |

Verifier (Rust)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| verifier_requests_total | Counter | status | Total verification requests |
| verifier_request_duration_seconds | Histogram | - | Verification latency |
| verifier_signature_valid_total | Counter | - | Valid signatures |
| verifier_signature_invalid_total | Counter | reason | Invalid signatures by error type |

Implementation

Gateway (Go)

Add dependency:

go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promhttp

File: gateway/metrics.go

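package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// promauto registers each metric with the default registry, which is the
// registry that promhttp.Handler() serves.
var (
    requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "gateway_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "gateway_request_duration_seconds",
            Help:    "Request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "path"},
    )

    cacheHits = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "gateway_cache_hits_total",
            Help: "Total cache hits",
        },
        []string{"path"},
    )

    cacheMisses = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "gateway_cache_misses_total",
            Help: "Total cache misses",
        },
        []string{"path"},
    )

    // result is "success" or "failure", as listed in the metrics table.
    verificationTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "gateway_verification_total",
            Help: "Total signature verifications",
        },
        []string{"result"},
    )

    rateLimitHits = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "gateway_rate_limit_hits_total",
            Help: "Total requests rejected by the rate limiter",
        },
        []string{"path"},
    )

    activeRequests = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "gateway_active_requests",
            Help: "Number of in-flight requests",
        },
    )
)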

File: gateway/middleware.go (add metrics middleware)

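package main

import (
    "strconv"
    "time"

    "github.com/gin-gonic/gin"
)

// MetricsMiddleware records request count, latency, and in-flight requests.
func MetricsMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        start := time.Now()

        // Prefer the route template (e.g. /users/:id) so label cardinality stays bounded.
        path := c.FullPath()
        if path == "" {
            // No route matched. The raw URL path is unbounded, so a fixed
            // placeholder label may be safer here if 404 traffic is high.
            path = c.Request.URL.Path
        }

        activeRequests.Inc()
        defer activeRequests.Dec()

        // Run the remaining handlers, then read the final status code.
        c.Next()

        duration := time.Since(start).Seconds()
        status := strconv.Itoa(c.Writer.Status())

        requestsTotal.WithLabelValues(c.Request.Method, path, status).Inc()
        requestDuration.WithLabelValues(c.Request.Method, path).Observe(duration)
    }
}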

File: gateway/main.go (add endpoint)

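import (
    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // ... existing setup

    // Metrics endpoint (no auth required). It is registered before r.Use below,
    // so Prometheus scrapes are not themselves counted by MetricsMiddleware.
    r.GET("/metrics", gin.WrapH(promhttp.Handler()))

    // Apply metrics middleware. gin attaches it only to routes registered
    // after this call, so all instrumented routes must follow it.
    r.Use(MetricsMiddleware())

    // ... rest of routes
}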

Verifier (Rust)

Add dependencies to Cargo.toml:

[dependencies]
metrics = "0.21"
metrics-exporter-prometheus = "0.12"

File: verifier/src/metrics.rs

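// `::metrics` names the external crate explicitly; a bare `metrics` is easy to
// confuse with this module, which main.rs declares as `mod metrics;`.
use ::metrics::{counter, histogram};
use metrics_exporter_prometheus::{PrometheusBuilder, PrometheusHandle};

/// Installs the global recorder and returns the handle used to render /metrics.
/// `install_recorder()` only installs the recorder; `install()` would also start
/// the exporter's own HTTP listener, which is not needed when axum serves /metrics.
pub fn init_metrics() -> PrometheusHandle {
    PrometheusBuilder::new()
        .install_recorder()
        .expect("Failed to install Prometheus recorder")
}

/// Records one verification attempt. `duration` is in seconds.
pub fn record_verification(valid: bool, duration: f64, error_reason: Option<&str>) {
    // The `status` label follows the metrics table; the values used here are a suggestion.
    let status = if valid { "valid" } else { "invalid" };
    counter!("verifier_requests_total", 1, "status" => status);
    histogram!("verifier_request_duration_seconds", duration);

    if valid {
        counter!("verifier_signature_valid_total", 1);
    } else {
        let reason = error_reason.unwrap_or("unknown");
        counter!("verifier_signature_invalid_total", 1, "reason" => reason.to_string());
    }
}

Update verifier/src/main.rs:

mod metrics;

use axum::{
    routing::{get, post},
    Router,
};
use std::future::ready;

#[tokio::main]
async fn main() {
    // Install the recorder once; the handle renders the Prometheus text format.
    let handle = crate::metrics::init_metrics();

    let app = Router::new()
        .route("/health", get(health))
        .route("/verify", post(verify_signature))
        .route("/metrics", get(move || ready(handle.render())));

    // ... rest of setup
}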

Architecture

flowchart TD
    subgraph Gateway
        G[Request] --> MW[MetricsMiddleware]
        MW --> H[Handler]
        H --> MW
        MW --> M1[Prometheus Registry]
    end

    subgraph Verifier
        V[Request] --> VH[verify_signature]
        VH --> M2[Prometheus Registry]
    end

    subgraph Scraping
        P[Prometheus] -->|/metrics| M1
        P -->|/metrics| M2
        P --> GF[Grafana]
    end

Acceptance Criteria

Gateway (Go)

  • Add prometheus/client_golang dependency
  • Create metrics.go with metric definitions
  • Add MetricsMiddleware for automatic request instrumentation
  • Instrument cache hits/misses in cache.go
  • Instrument rate limiting in ratelimit.go
  • Add /metrics endpoint with promhttp.Handler()
  • Add unit tests for metrics middleware
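
One possible shape for the cache instrumentation item, as a rough sketch only: the existing cache.go is not shown in this issue, so InstrumentedCache, NewInstrumentedCache, and the Counter interface below are hypothetical names. The sketch uses only the standard library; prometheus.Counter has an Inc() method, so the real counters would satisfy the small interface, and a fake counter keeps the logic unit-testable.

```go
package main

import "sync"

// Counter is the subset of prometheus.Counter this sketch needs (hypothetical
// interface; prometheus.Counter satisfies it through its Inc method).
type Counter interface {
	Inc()
}

// InstrumentedCache is a hypothetical in-memory cache that reports hits and misses.
type InstrumentedCache struct {
	mu     sync.RWMutex
	items  map[string][]byte
	hits   Counter
	misses Counter
}

// NewInstrumentedCache wires the cache to the two counters it should update.
func NewInstrumentedCache(hits, misses Counter) *InstrumentedCache {
	return &InstrumentedCache{
		items:  make(map[string][]byte),
		hits:   hits,
		misses: misses,
	}
}

// Get returns the cached value and records exactly one hit or miss per lookup.
func (c *InstrumentedCache) Get(key string) ([]byte, bool) {
	c.mu.RLock()
	v, ok := c.items[key]
	c.mu.RUnlock()

	if ok {
		c.hits.Inc()
	} else {
		c.misses.Inc()
	}
	return v, ok
}

// Set stores a value; writes are not counted as hits or misses.
func (c *InstrumentedCache) Set(key string, value []byte) {
	c.mu.Lock()
	c.items[key] = value
	c.mu.Unlock()
}

// fakeCounter is a test double that counts Inc calls.
type fakeCounter struct{ n int }

func (f *fakeCounter) Inc() { f.n++ }
```

With the real library, the counters passed in could be cacheHits.WithLabelValues(path) and cacheMisses.WithLabelValues(path); because the path label varies per request, the actual cache may need to look the counter up inside Get rather than at construction time.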

Verifier (Rust)

  • Add metrics and metrics-exporter-prometheus dependencies
  • Create metrics.rs module
  • Instrument verify_signature function
  • Add /metrics endpoint
  • Add tests for metrics recording

Documentation

  • Update gateway README with metrics documentation
  • Update verifier README with metrics documentation
  • Add example Prometheus scrape config
  • Add example Grafana dashboard JSON (optional)

Environment Variables

# Enable/disable metrics endpoint (default: true)
METRICS_ENABLED=true

# Metrics endpoint path (default: /metrics)
METRICS_PATH=/metrics
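
A minimal sketch of how the Gateway might read these two variables, assuming nothing beyond the defaults stated above. MetricsConfig and LoadMetricsConfig are hypothetical names, not existing code; the getenv parameter exists so the function can be tested without touching the process environment.

```go
package main

import (
	"strconv"
	"strings"
)

// MetricsConfig holds the two settings described above (hypothetical type).
type MetricsConfig struct {
	Enabled bool
	Path    string
}

// LoadMetricsConfig reads METRICS_ENABLED and METRICS_PATH through getenv
// (pass os.Getenv in production). Unset or unparsable values fall back to the
// documented defaults: enabled, served at /metrics.
func LoadMetricsConfig(getenv func(string) string) MetricsConfig {
	cfg := MetricsConfig{Enabled: true, Path: "/metrics"}

	if raw := strings.TrimSpace(getenv("METRICS_ENABLED")); raw != "" {
		// ParseBool accepts 1, t, true, 0, f, false and their case variants.
		if enabled, err := strconv.ParseBool(raw); err == nil {
			cfg.Enabled = enabled
		}
	}

	if p := strings.TrimSpace(getenv("METRICS_PATH")); p != "" {
		// Route paths must start with a slash.
		if !strings.HasPrefix(p, "/") {
			p = "/" + p
		}
		cfg.Path = p
	}

	return cfg
}
```

In main.go the result could gate the route registration, for example: cfg := LoadMetricsConfig(os.Getenv), then register r.GET(cfg.Path, gin.WrapH(promhttp.Handler())) only when cfg.Enabled is true.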

Testing

# Gateway (subshell, so the working directory is unchanged afterwards)
(cd gateway && go test -v -run TestMetrics)

# Verifier
(cd verifier && cargo test)

# Manual verification
bun run stack

# Check Gateway metrics
curl http://localhost:3000/metrics | grep gateway_

# Check Verifier metrics
curl http://localhost:3002/metrics | grep verifier_

# Make some requests and verify counters increase
curl -X POST http://localhost:3000/api/ai/summarize -H 'Content-Type: application/json' -d '{"text":"test"}'
curl http://localhost:3000/metrics | grep gateway_requests_total
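
The "verify counters increase" step could also be automated. The sketch below is an assumption about how that might look, not part of the proposed implementation: SumMetric is a hypothetical helper that totals one metric from the text returned by /metrics, so a test can compare the value before and after sending requests. It handles only the simple `name{labels} value` sample lines of the text exposition format.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// SumMetric adds up every sample of the named metric in Prometheus text
// exposition output, across all label combinations. Comment lines (# HELP,
// # TYPE) and samples of other metrics are skipped.
func SumMetric(exposition, name string) (float64, error) {
	var total float64

	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if !strings.HasPrefix(line, name) {
			continue
		}

		// After the name, a sample continues with "{labels} value" or " value".
		// Anything else is a longer metric name that merely shares the prefix.
		rest := line[len(name):]
		if rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue
		}
		if rest[0] == '{' {
			end := strings.LastIndex(rest, "}")
			if end < 0 {
				continue // malformed label set
			}
			rest = rest[end+1:]
		}

		// The first field is the value; an optional timestamp may follow.
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, err
		}
		total += v
	}
	return total, sc.Err()
}
```

A test would fetch /metrics, call SumMetric(body, "gateway_requests_total"), send a request, fetch again, and assert that the second total is larger.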

Example Prometheus Config

# prometheus.yml
scrape_configs:
  - job_name: 'microai-gateway'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: /metrics

  - job_name: 'microai-verifier'
    static_configs:
      - targets: ['localhost:3002']
    metrics_path: /metrics

Example Queries

# Request rate (requests per second, averaged over the last 5 min)
sum(rate(gateway_requests_total[5m]))

# p99 latency (aggregate buckets across labels, keeping le)
histogram_quantile(0.99, sum by (le) (rate(gateway_request_duration_seconds_bucket[5m])))

# Cache hit ratio
sum(rate(gateway_cache_hits_total[5m])) /
(sum(rate(gateway_cache_hits_total[5m])) + sum(rate(gateway_cache_misses_total[5m])))

# Verification success rate
sum(rate(verifier_signature_valid_total[5m])) /
sum(rate(verifier_requests_total[5m]))
