Load Testing Guide - Degree Flowchart Application

📋 Overview

This directory contains load testing tools and scripts for exercising the Degree Flowchart application with up to 1000 concurrent users and extrapolating the results to 1 million concurrent users.

🎯 Non-Functional Requirement

Requirement: The system should accommodate 1 million concurrent active sessions/users.

Testing Strategy:

  1. Run load tests with 1000 concurrent users (practical test)
  2. Collect performance metrics and identify bottlenecks
  3. Calculate resource requirements for 1M users
  4. Provide architectural recommendations for scaling

🛠️ Tools Provided

1. k6 (Recommended)

  • Best for: High-performance load testing, easy scripting
  • Language: JavaScript
  • Files:
    • k6-load-test.js - Comprehensive test with multiple scenarios
    • k6-simple-1000-users.js - Simple 1000 user test

2. Locust

  • Best for: Python developers, distributed testing
  • Language: Python
  • Files: locust-test.py

3. Monitoring Stack

  • Prometheus: Metrics collection from services
  • Grafana: Metrics visualization
  • InfluxDB: k6 results storage
  • Files: docker-compose-monitoring.yml

🚀 Quick Start

Prerequisites

# Install k6 (macOS)
brew install k6

# Or download from: https://k6.io/docs/get-started/installation/

# Install Locust (Python)
pip install locust

# Ensure main application is running
cd ..
docker-compose up -d

Option 1: Run Simple 1000 User Test with k6

# Navigate to load-testing directory
cd load-testing

# Run 1000 concurrent users for 5 minutes
k6 run --vus 1000 --duration 5m k6-simple-1000-users.js

# Generate HTML report
k6 run --vus 1000 --duration 5m \
   --out json=results.json \
   k6-simple-1000-users.js

Option 2: Run Comprehensive k6 Test

# Runs staged test: 100 → 500 → 1000 users
k6 run k6-load-test.js

# With custom environment
k6 run --env BASE_URL=http://localhost:9000 k6-load-test.js

Option 3: Run Locust Test

# Start Locust web UI
locust -f locust-test.py --host=http://localhost:9000

# Open browser: http://localhost:8089
# Configure:
#   - Number of users: 1000
#   - Spawn rate: 50/second
#   - Host: http://localhost:9000

# Or run headless
locust -f locust-test.py --host=http://localhost:9000 \
       --users 1000 --spawn-rate 50 --run-time 5m \
       --html=report.html --csv=results

Option 4: Run with Monitoring Stack

# Start monitoring services
docker-compose -f docker-compose-monitoring.yml up -d

# Wait 30 seconds for services to start
sleep 30

# Run k6 test with InfluxDB output
k6 run --out influxdb=http://localhost:8086 k6-simple-1000-users.js

# View metrics in Grafana
open http://localhost:3000
# Login: admin / admin

# View Prometheus
open http://localhost:9090

📊 Test Scenarios

Scenario 1: Browse Courses (40% of users)

  • GET /courses
  • GET /courses?level=GRADUATE
  • GET /courses?department=Computer%20Science
  • GET /courses/code/{code}

Scenario 2: View Degrees (30% of users)

  • GET /degrees
  • GET /degrees/code/MSCS
  • GET /degrees/{id}/requirements
  • GET /degrees/{id}/constraints

Scenario 3: Authenticated Users (30% of users)

  • Check authentication status
  • View student profile
  • View transcript
  • Plan courses
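The 40/30/30 mix above can be sketched as a weighted scenario picker; a minimal illustration in Python (the scenario names are shorthand for the sections above, not identifiers from the actual test scripts):

```python
import random

# Weighted scenario mix from the lists above: 40% browse, 30% degrees, 30% authenticated.
SCENARIOS = [
    ("browse_courses", 0.40),
    ("view_degrees", 0.30),
    ("authenticated", 0.30),
]

def pick_scenario(r: float) -> str:
    """Map a uniform draw r in [0, 1) onto the weighted scenario mix."""
    cumulative = 0.0
    for name, weight in SCENARIOS:
        cumulative += weight
        if r < cumulative:
            return name
    return SCENARIOS[-1][0]  # guard against floating-point rounding

def random_scenario() -> str:
    return pick_scenario(random.random())
```

In Locust, the same mix is usually expressed with task weights instead, e.g. `@task(4)` for browsing versus `@task(3)` for each of the other two scenarios.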

📈 Metrics to Monitor

Application Metrics (from Spring Boot Actuator)

  1. Request Rate

    • Requests per second (RPS)
    • Target: > 1000 RPS for 1000 users
  2. Response Time

    • p50 (median): < 500ms
    • p95: < 2000ms
    • p99: < 5000ms
  3. Error Rate

    • Target: < 5%
  4. JVM Metrics

    • Heap memory usage
    • GC pauses
    • Thread count
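The response-time targets above can also be checked offline against exported results; a minimal sketch using a nearest-rank percentile (`check_slos` and the millisecond inputs are illustrative helpers, not part of the k6 output format):

```python
def percentile(samples, p):
    """Nearest-rank percentile: p in [0, 100] over a non-empty list of latencies (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def check_slos(latencies_ms):
    """Evaluate the p50/p95/p99 targets listed above."""
    return {
        "p50_ok": percentile(latencies_ms, 50) < 500,
        "p95_ok": percentile(latencies_ms, 95) < 2000,
        "p99_ok": percentile(latencies_ms, 99) < 5000,
    }
```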

Database Metrics (from PostgreSQL)

  1. Connection Pool

    • Active connections
    • Waiting connections
    • Pool utilization
  2. Query Performance

    • Slow queries (> 100ms)
    • Query rate
  3. Resource Usage

    • CPU usage
    • Memory usage
    • Disk I/O

System Metrics

  1. CPU Usage

    • Per service
    • System-wide
  2. Memory Usage

    • Per service
    • System-wide
  3. Network

    • Bandwidth usage
    • Connection count

🔬 Running Tests

Step 1: Baseline Test (10 users)

k6 run --vus 10 --duration 1m k6-simple-1000-users.js

Expected Results:

  • Error rate: 0%
  • p95 response time: < 500ms

Step 2: Load Test (100 users)

k6 run --vus 100 --duration 5m k6-simple-1000-users.js

Expected Results:

  • Error rate: < 1%
  • p95 response time: < 1000ms

Step 3: Stress Test (500 users)

k6 run --vus 500 --duration 5m k6-simple-1000-users.js

Expected Results:

  • Error rate: < 3%
  • p95 response time: < 2000ms

Step 4: Peak Test (1000 users)

k6 run --vus 1000 --duration 10m k6-simple-1000-users.js

Expected Results:

  • Error rate: < 5%
  • p95 response time: < 3000ms
  • System remains stable

Step 5: Spike Test

k6 run --stage 1m:100,30s:1000,1m:100,30s:1000 k6-load-test.js

Tests: System recovery after sudden load spikes


📊 Example k6 Output

     ✓ status is 2xx
     ✓ response time OK

     checks.........................: 95.23% ✓ 95230  ✗ 4770
     data_received..................: 245 MB 408 kB/s
     data_sent......................: 12 MB  20 kB/s
     http_req_blocked...............: avg=1.2ms   min=1µs    med=3µs    max=145ms  p(90)=5µs    p(95)=7µs
     http_req_connecting............: avg=523µs   min=0s     med=0s     max=98ms   p(90)=0s     p(95)=0s
     http_req_duration..............: avg=285ms   min=12ms   med=245ms  max=5.2s   p(90)=456ms  p(95)=678ms
       { expected_response:true }...: avg=278ms   min=12ms   med=243ms  max=1.8s   p(90)=445ms  p(95)=654ms
     http_req_failed................: 4.76%  ✓ 4770   ✗ 95230
     http_req_receiving.............: avg=145µs   min=11µs   med=98µs   max=12ms   p(90)=234µs  p(95)=345µs
     http_req_sending...............: avg=45µs    min=5µs    med=32µs   max=8ms    p(90)=78µs   p(95)=123µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s     max=0s     p(90)=0s     p(95)=0s
     http_req_waiting...............: avg=284ms   min=11ms   med=244ms  max=5.2s   p(90)=455ms  p(95)=676ms
     http_reqs......................: 100000 166.67/s
     iteration_duration.............: avg=5.9s    min=2.1s   med=5.8s   max=15.2s  p(90)=8.2s   p(95)=9.5s
     iterations.....................: 16667  27.78/s
     vus............................: 1000   min=1000 max=1000
     vus_max........................: 1000   min=1000 max=1000

Key Metrics Explained:

  • http_reqs: Total requests = 100,000 at 166.67 req/sec
  • http_req_duration (p95): 95% of requests completed in < 678ms
  • http_req_failed: 4.76% error rate
  • vus: 1000 concurrent users maintained

🧮 Extrapolation to 1 Million Users

Methodology

Based on 1000 concurrent user test results:

Observed Metrics (Example):

  • Request rate: 166 RPS
  • CPU usage: 40% per service
  • Memory: 1GB per service
  • Database connections: 50 active

Calculation for 1M users:

Scaling Factor = 1,000,000 / 1,000 = 1,000x

Required Resources:
- Request rate: 166 * 1,000 = 166,000 RPS
- Service instances: 1,000 instances (with load balancing)
- Database connections: 50,000 (distributed across replicas)
- Memory: 1TB total (1GB × 1000 instances)
- CPU: Proportional scaling
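The arithmetic above can be captured in a small script; a sketch of the same linear extrapolation (treat the outputs as first-order estimates, since real systems rarely scale perfectly linearly due to coordination overhead and connection limits):

```python
TARGET_USERS = 1_000_000

def extrapolate(measured: dict, target_users: int = TARGET_USERS) -> dict:
    """Linearly scale measured 1000-user resources up to the target user count."""
    factor = target_users / measured["users"]
    return {
        "scaling_factor": factor,
        "rps": measured["rps"] * factor,
        "service_instances": measured["instances"] * factor,
        "db_connections": measured["db_connections"] * factor,
        "memory_gb": measured["memory_gb_per_instance"] * measured["instances"] * factor,
    }

# Observed metrics from the example above.
baseline = {"users": 1_000, "rps": 166, "instances": 1,
            "db_connections": 50, "memory_gb_per_instance": 1}
# extrapolate(baseline) -> 166,000 RPS, 50,000 DB connections, 1,000 GB (1 TB) memory
```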

Resource Requirements (1M Users)

Component                   1K Users   1M Users (1000x)
------------------------    --------   ----------------
API Gateway Instances       1          100-200
Course Service Instances    1          200-300
Student Service Instances   1          300-400
Degree Service Instances    1          200-300
PostgreSQL Replicas         1          20-30 (read replicas)
Redis Cluster Nodes         1          10-15
Total Memory                ~8GB       ~8TB
Total CPU Cores             ~8         ~8000

🏗️ Architectural Changes Required for 1M Users

1. Horizontal Scaling

# Kubernetes Horizontal Pod Autoscaler (HPA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: course-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: course-service
  minReplicas: 10
  maxReplicas: 300
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

2. Database Scaling

  • Primary-Replica Replication: 1 primary + 20-30 read replicas
  • Connection Pooling: PgBouncer in front of PostgreSQL, plus tuned application-side pools
  • Sharding: Partition data by degree program or user ID
  • Caching: Aggressive Redis caching (95%+ cache hit rate)
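The value of the 95%+ cache-hit target is easy to quantify: only cache misses reach PostgreSQL. A small sketch using the 166,000 RPS figure from the extrapolation above:

```python
def db_load(total_rps: float, cache_hit_rate: float) -> float:
    """Requests per second that miss the cache and reach the database tier."""
    return total_rps * (1.0 - cache_hit_rate)

# At 166,000 RPS, a 95% hit rate leaves ~8,300 RPS for PostgreSQL --
# roughly 275-415 RPS per replica when spread over 20-30 read replicas.
```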

3. API Gateway Scaling

  • Multiple Gateway Instances: 100-200 instances
  • Load Balancer: AWS ALB, GCP Load Balancer, or Nginx
  • Rate Limiting: Per-user rate limits
  • Circuit Breakers: Prevent cascade failures
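Per-user rate limiting is typically implemented as a token bucket; a minimal in-process sketch (a production gateway would use its built-in limiter, usually backed by Redis, rather than a class like this):

```python
import time

class TokenBucket:
    """Per-user token bucket: allow up to `burst` requests at once,
    refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```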

4. Session Management

  • Redis Cluster: Distributed session storage
  • JWT Tokens: Stateless authentication
  • Session Timeout: Short-lived sessions (30 min)
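Stateless authentication means each request carries a self-validating token, so no instance needs session affinity. A minimal HMAC-signed token sketch (illustrative only, not a full JWT implementation; `SECRET` is a placeholder, and the 30-minute default matches the timeout above):

```python
import base64, hashlib, hmac, json, time

SECRET = b"demo-secret"  # placeholder; load from a secret store in practice

def issue_token(user_id: str, ttl_seconds: int = 1800) -> str:
    """Sign a compact stateless session token: base64(payload).base64(hmac)."""
    payload = json.dumps({"sub": user_id, "exp": time.time() + ttl_seconds}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(sig).decode())

def verify_token(token: str):
    """Return the claims if the signature is valid and unexpired, else None."""
    body_b64, sig_b64 = token.split(".")
    payload = base64.urlsafe_b64decode(body_b64)
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, base64.urlsafe_b64decode(sig_b64)):
        return None
    claims = json.loads(payload)
    return claims if claims["exp"] > time.time() else None
```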

5. Caching Strategy

  • L1 Cache: In-memory (per service instance)
  • L2 Cache: Redis (distributed)
  • CDN: CloudFront, Cloudflare for static content
  • Cache Invalidation: Event-driven updates

6. Message Queue

  • Async Processing: Kafka or RabbitMQ
  • Event-Driven Architecture: Course updates, degree changes
  • CQRS: Separate read/write models

7. Monitoring & Observability

  • Distributed Tracing: Jaeger, Zipkin
  • Log Aggregation: ELK Stack, Splunk
  • Metrics: Prometheus + Grafana
  • Alerting: PagerDuty, Opsgenie

📄 Deliverables

1. Test Results

  • k6 test results for 1000 users
  • Locust test results
  • Grafana dashboards screenshots
  • Performance metrics CSV/JSON

2. Analysis Document

  • Bottleneck identification
  • Resource utilization analysis
  • Scaling calculations
  • Cost estimation

3. Recommendations

  • Architecture diagram for 1M users
  • Infrastructure requirements
  • Implementation roadmap
  • Risk assessment

🎯 Success Criteria

For 1000 Users (Actual Test)

  • ✅ 95% of requests complete in < 3 seconds
  • ✅ Error rate < 5%
  • ✅ System remains stable for 10+ minutes
  • ✅ No memory leaks or resource exhaustion

For 1M Users (Extrapolation)

  • ✅ Mathematical justification for resource scaling
  • ✅ Architecture supports horizontal scaling
  • ✅ Database can handle load with replication
  • ✅ Cost estimation provided
  • ✅ Implementation plan documented

🐛 Troubleshooting

High Error Rates

# Check service logs
docker logs degree-edge-service
docker logs course-service

# Check database connections
docker exec -it postgres psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check memory
docker stats

Slow Response Times

# Check database slow queries
docker exec -it postgres psql -U postgres -c "SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Check JVM heap
curl http://localhost:9001/actuator/metrics/jvm.memory.used

# Check Redis performance
docker exec -it redis redis-cli INFO stats

Connection Timeouts

  • Increase connection pool size in application.yml
  • Add more database replicas
  • Check network latency
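For the first remedy, the relevant keys live under Spring Boot's HikariCP namespace in application.yml; the values below are illustrative starting points, not tuned recommendations:

```yaml
spring:
  datasource:
    hikari:
      maximum-pool-size: 50      # Spring Boot default is 10
      minimum-idle: 10
      connection-timeout: 30000  # ms to wait for a connection before failing
```

Size the pool against PostgreSQL's max_connections summed across all service instances, or route through PgBouncer when the totals get large.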

🔗 Next Steps

  1. Run baseline tests (10, 100, 500, 1000 users)
  2. Collect and analyze metrics
  3. Identify bottlenecks
  4. Create scaling plan
  5. Document findings
  6. Present results with justification for 1M users
