- Built an AWS DataOps pipeline using S3, Lambda (Python 3.12), EventBridge Scheduler, CloudWatch Logs & Dashboards.
- Lambda reads `diabetes_preprocessed.csv` from S3, imputes missing values, normalizes, computes correlations, and writes outputs to `processed/`.
- EventBridge triggers the Lambda every 2 minutes.
- CloudWatch Dashboard shows Invocations, Errors, Duration (avg & p99), EventBridge invocations, S3 growth, and a Log Insights activity feed.
- Bucket name: `dataops-pipeline-bucket`
- Region: `ap-south-1` (Mumbai)
Steps:
- Uploaded `diabetes_preprocessed.csv` to the bucket.
- Created a `processed/` folder to store outputs generated by the pipeline.
- Function name: `DataOpsPipelineFunction`
- Runtime: Python 3.12
- Memory: 1024 MB
- Timeout: 60 seconds
- Environment variables: `BUCKET_NAME=dataops-pipeline-bucket`, `INPUT_KEY=diabetes_preprocessed.csv`, `OUTPUT_PREFIX=processed/`
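A minimal sketch of reading this configuration inside the handler; the variable names match the settings above, and the fallback defaults are illustrative only:

```python
import os

# Pipeline configuration comes from Lambda environment variables;
# the defaults here mirror the values listed above.
BUCKET_NAME = os.environ.get("BUCKET_NAME", "dataops-pipeline-bucket")
INPUT_KEY = os.environ.get("INPUT_KEY", "diabetes_preprocessed.csv")
OUTPUT_PREFIX = os.environ.get("OUTPUT_PREFIX", "processed/")
```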
- Attached the AWS managed Pandas layer (`AWSSDKPandas-Python312`) to use Pandas and NumPy without packaging dependencies.
The Lambda:
- Reads the file from S3
- Fills missing values and normalizes numeric columns
- Saves the correlation matrix, cleaned file, summary JSON, and dtypes JSON to `processed/`
- Logs each run to CloudWatch Logs and S3
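A condensed sketch of the impute/normalize/correlate step, assuming pandas from the `AWSSDKPandas-Python312` layer; mean imputation and min-max normalization are assumptions (the original does not name the exact methods), and the S3 read/write calls are omitted:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Impute missing numeric values, normalize, and compute correlations."""
    numeric = df.select_dtypes(include="number").columns
    # Impute: fill missing numeric values with each column's mean (assumed strategy).
    df[numeric] = df[numeric].fillna(df[numeric].mean())
    # Normalize: rescale each numeric column to [0, 1] (assumed min-max scaling);
    # a zero range is replaced by 1 to avoid division by zero.
    rng = (df[numeric].max() - df[numeric].min()).replace(0, 1)
    df[numeric] = (df[numeric] - df[numeric].min()) / rng
    # Correlation matrix over the normalized numeric columns.
    corr = df[numeric].corr()
    return df, corr
```

In the real function, `df` would be loaded with `pd.read_csv` from the S3 object body and the outputs written back under `processed/`.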
To automate pipeline execution:
- Type: Rate-based
- Expression: `rate(2 minutes)`
- Flexible time window: OFF
- Target: Lambda function
- Execution role: EventBridge scheduler role with Lambda invoke permission
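The same schedule can also be created programmatically via `boto3.client("scheduler").create_schedule(...)`. A hedged sketch of the request matching the settings above; the schedule name and ARNs are placeholders:

```python
def build_schedule_request(function_arn: str, role_arn: str) -> dict:
    """Build an EventBridge Scheduler create_schedule request matching the config above."""
    return {
        "Name": "DataOpsPipelineSchedule",  # illustrative name, not from the original
        "ScheduleExpression": "rate(2 minutes)",
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {"Arn": function_arn, "RoleArn": role_arn},
    }

# Usage (requires AWS credentials; boto3 ships with the Lambda runtime):
# import boto3
# boto3.client("scheduler").create_schedule(**build_schedule_request(fn_arn, role_arn))
```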
The Lambda runs automatically every 2 minutes.
Lambda logs go to: `/aws/lambda/DataOpsPipelineFunction`
Every run logs:
- Start time
- Dataset shape
- Missing values handled
- Files saved
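The per-run log record might be emitted as one structured line per invocation; a sketch assuming JSON-formatted log lines (the field names are illustrative, not the function's actual keys):

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_run(shape: tuple, missing_filled: int, saved_files: list) -> str:
    """Emit one structured record per run; CloudWatch Logs captures logger output."""
    record = {
        "start_time": datetime.now(timezone.utc).isoformat(),
        "dataset_shape": list(shape),
        "missing_values_handled": missing_filled,
        "files_saved": saved_files,
    }
    line = json.dumps(record)
    logger.info(line)
    return line
```

Structured (JSON) lines keep the Log Insights queries used by the dashboard simple.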
Widgets added:
- Lambda Invocations (Sum, 120 sec)
- Lambda Errors (Sum, 120 sec)
- Lambda Duration (Average + p99, 120 sec)
- EventBridge Invocations (Sum, 120 sec)
- CloudWatch Log Insights table with custom emoji status markers
- Optional: S3 NumberOfObjects / BucketSizeBytes (to show processed data growth)
Tip: 120-second periods line up with each scheduled run.
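One of these widgets can be expressed as a fragment of the CloudWatch dashboard body; a sketch building the Invocations widget in Python (widget position and size are illustrative):

```python
import json

def invocations_widget(function_name: str = "DataOpsPipelineFunction") -> dict:
    """Metric widget: Lambda Invocations, Sum statistic, 120-second period."""
    return {
        "type": "metric",
        "width": 6,   # illustrative layout values
        "height": 6,
        "properties": {
            "metrics": [["AWS/Lambda", "Invocations", "FunctionName", function_name]],
            "stat": "Sum",
            "period": 120,  # one datapoint per scheduled 2-minute run
            "region": "ap-south-1",
            "title": "Lambda Invocations",
        },
    }

dashboard_body = json.dumps({"widgets": [invocations_widget()]})
```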
- S3 processed folder shows multiple output files.
- 455+ scheduled runs executed reliably by EventBridge every 2 minutes.
- Zero errors during the observation period: stable processing, no unhandled exceptions.
- No throttling, indicating Lambda is well within capacity.
- Async Event Age remains low: no backlog or delay.
- Avg execution: 250–300 ms
- p99: occasional spikes to 450–500 ms (cold starts / input-size variation)
- Performance: consistent and stable over time.
- Pipeline Activity Feed: clear run-level traceability (dataset shape, start time, output file paths).
- Error Feed: empty, confirming reliability.
- KPI tiles (Invocations, Errors, Throttles, Duration, Event Age) give an instant operational overview.
- Monitoring setup reflects production-grade observability suitable for scaling and operational visibility.
