jagriti10/Lambda-Handler-For-Normalizing-Data

DataOps Pipeline on AWS

Overview

  • Built an AWS DataOps pipeline using S3, Lambda (Python 3.12), EventBridge Scheduler, CloudWatch Logs & Dashboards.
  • Lambda reads diabetes_preprocessed.csv from S3, imputes, normalizes, computes correlations, and writes outputs to processed/.
  • EventBridge triggers Lambda every 2 minutes.
  • CloudWatch Dashboard shows Invocations, Errors, Duration (avg & p99), EventBridge invocations, S3 growth, and a Logs Insights activity feed.

Project Architecture


S3 Bucket Setup

  • Bucket name: dataops-pipeline-bucket
  • Region: ap-south-1 (Mumbai)

Steps:

  1. Uploaded diabetes_preprocessed.csv to the bucket.
  2. Created a folder processed/ to store outputs generated by the pipeline.
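
The two steps above can also be scripted. The following is a minimal sketch, assuming boto3's standard S3 client; the helper name seed_bucket and the local file path are hypothetical and not part of the original project.

```python
def seed_bucket(s3_client, local_path,
                bucket="dataops-pipeline-bucket",
                input_key="diabetes_preprocessed.csv",
                output_prefix="processed/"):
    """Upload the input CSV and create the processed/ folder marker."""
    # upload_file(Filename, Bucket, Key) is the standard boto3 S3 client call.
    s3_client.upload_file(local_path, bucket, input_key)
    # S3 has no real folders; a zero-byte object whose key ends in "/"
    # is what the console displays as a folder.
    s3_client.put_object(Bucket=bucket, Key=output_prefix)
    return [input_key, output_prefix]

# Typical use (requires boto3 and AWS credentials):
#   import boto3
#   s3 = boto3.client("s3", region_name="ap-south-1")
#   seed_bucket(s3, "diabetes_preprocessed.csv")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real bucket.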

Lambda Function

  • Function name: DataOpsPipelineFunction
  • Runtime: Python 3.12
  • Memory: 1024 MB
  • Timeout: 60 seconds

Environment Variables

BUCKET_NAME = dataops-pipeline-bucket
INPUT_KEY = diabetes_preprocessed.csv
OUTPUT_PREFIX = processed/
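
One way the handler might read these three variables is sketched below. The function name load_config and the returned dictionary keys are illustrative assumptions; the repository's actual code may differ.

```python
import os

def load_config(env=os.environ):
    """Read the pipeline settings from environment variables."""
    # Indexing (rather than .get) makes a missing variable fail fast
    # with a KeyError at the start of the invocation.
    prefix = env["OUTPUT_PREFIX"]
    return {
        "bucket": env["BUCKET_NAME"],
        "input_key": env["INPUT_KEY"],
        # Normalize to exactly one trailing slash so keys join cleanly.
        "output_prefix": prefix.rstrip("/") + "/",
    }
```

Accepting the environment as a parameter allows the function to be called with a plain dictionary during local testing.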

Additional Setup

  • Attached AWS managed Pandas layer (AWSSDKPandas-Python312) to use Pandas and NumPy without packaging dependencies.

Lambda Code

The Lambda:

  • Reads the file from S3
  • Fills missing values and normalizes numeric columns
  • Saves correlation matrix, cleaned file, summary JSON, and dtypes JSON to processed/
  • Logs each run to CloudWatch Logs and S3
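
The README does not say which imputation or normalization method the Lambda uses. The sketch below assumes median imputation and min-max scaling purely for illustration; the function name clean_and_normalize is hypothetical, and the S3 read and write steps are left out.

```python
import pandas as pd

def clean_and_normalize(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Fill missing numeric values, scale to [0, 1], and compute correlations."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns

    # Assumption: fill gaps with each column's median.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    # Assumption: min-max scaling. A constant column has range 0, so its
    # divisor is replaced with 1 and the column becomes all zeros.
    col_min = out[numeric_cols].min()
    col_range = (out[numeric_cols].max() - col_min).replace(0, 1)
    out[numeric_cols] = (out[numeric_cols] - col_min) / col_range

    # Pearson correlation matrix over the numeric columns.
    corr = out[numeric_cols].corr()
    return out, corr
```

Non-numeric columns pass through unchanged, so the cleaned frame keeps the same shape as the input.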

EventBridge Scheduler

To automate pipeline execution:

  • Type: Rate-based
  • Expression: rate(2 minutes)
  • Flexible time window: OFF
  • Target: Lambda function
  • Execution role: EventBridge scheduler role with Lambda invoke permission

The Lambda runs automatically every 2 minutes.
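
The same schedule could be created programmatically. The sketch below builds the request for boto3's EventBridge Scheduler create_schedule call from the settings listed above. The schedule name and both ARNs are placeholders, not values from the project.

```python
def build_schedule_request(lambda_arn, role_arn,
                           name="dataops-pipeline-every-2-min"):
    """Return keyword arguments for scheduler.create_schedule."""
    return {
        "Name": name,                              # hypothetical schedule name
        "ScheduleExpression": "rate(2 minutes)",   # rate-based, every 2 minutes
        "FlexibleTimeWindow": {"Mode": "OFF"},     # flexible time window: OFF
        "Target": {
            "Arn": lambda_arn,                     # the Lambda function to invoke
            "RoleArn": role_arn,                   # role allowed to invoke it
        },
    }

# Typical use (requires boto3 and AWS credentials; ARNs are placeholders):
#   import boto3
#   request = build_schedule_request(
#       "arn:aws:lambda:ap-south-1:123456789012:function:DataOpsPipelineFunction",
#       "arn:aws:iam::123456789012:role/example-scheduler-role",
#   )
#   boto3.client("scheduler", region_name="ap-south-1").create_schedule(**request)
```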


Logging with CloudWatch

Lambda logs go to: /aws/lambda/DataOpsPipelineFunction

Every run logs:

  • Start time
  • Dataset shape
  • Missing values handled
  • Files saved
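
A run record with these four fields could be emitted as one JSON line per invocation, which keeps it easy to query later. This is a sketch only: the field names and the helper log_run are assumptions, and the original handler may format its log lines differently.

```python
import json
import logging
from datetime import datetime, timezone

# In Lambda, records sent to the root logger end up in the function's log group.
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_run(shape, missing_filled, files_saved, started_at=None):
    """Log one structured record describing a pipeline run."""
    started_at = started_at or datetime.now(timezone.utc)
    record = {
        "start_time": started_at.isoformat(),
        "rows": shape[0],
        "columns": shape[1],
        "missing_values_filled": missing_filled,
        "files_saved": list(files_saved),
    }
    line = json.dumps(record)
    logger.info(line)
    return line
```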

📈 CloudWatch Dashboard

Widgets added:

  • Lambda Invocations (Sum, 120 sec)
  • Lambda Errors (Sum, 120 sec)
  • Lambda Duration (Average + p99, 120 sec)
  • EventBridge Invocations (Sum, 120 sec)
  • CloudWatch Logs Insights table with custom emojis 🚀📊📈✅
  • Optional: S3 NumberOfObjects / BucketSizeBytes (to show processed data growth)

Tip: 120-second periods line up with each scheduled run.
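
For reference, part of such a dashboard can be expressed as a dashboard body for the CloudWatch put_dashboard API. The sketch below covers only the three Lambda metric widgets and a Logs Insights table. Widget titles, sizes, and the query text are assumptions, and the EventBridge and S3 widgets are omitted.

```python
import json

FUNCTION_NAME = "DataOpsPipelineFunction"
REGION = "ap-south-1"
PERIOD = 120  # seconds, matching the 2-minute schedule
LOG_GROUP = f"/aws/lambda/{FUNCTION_NAME}"

def metric_widget(title, metrics, stat=None):
    """Build one metric widget; x/y are omitted so CloudWatch auto-places it."""
    props = {"title": title, "metrics": metrics, "period": PERIOD,
             "region": REGION, "view": "timeSeries"}
    if stat:
        props["stat"] = stat
    return {"type": "metric", "width": 12, "height": 6, "properties": props}

def build_dashboard_body():
    dims = ["FunctionName", FUNCTION_NAME]
    widgets = [
        metric_widget("Lambda Invocations", [["AWS/Lambda", "Invocations", *dims]], "Sum"),
        metric_widget("Lambda Errors", [["AWS/Lambda", "Errors", *dims]], "Sum"),
        # Two series on one widget, each with its own statistic.
        metric_widget("Lambda Duration", [
            ["AWS/Lambda", "Duration", *dims, {"stat": "Average"}],
            ["AWS/Lambda", "Duration", *dims, {"stat": "p99"}],
        ]),
        {"type": "log", "width": 24, "height": 6, "properties": {
            "title": "Pipeline activity",
            "region": REGION,
            "view": "table",
            "query": f"SOURCE '{LOG_GROUP}' | fields @timestamp, @message"
                     " | sort @timestamp desc | limit 20",
        }},
    ]
    return json.dumps({"widgets": widgets})

# Typical use (requires boto3 and AWS credentials; dashboard name is a placeholder):
#   import boto3
#   boto3.client("cloudwatch", region_name=REGION).put_dashboard(
#       DashboardName="example-dataops-dashboard",
#       DashboardBody=build_dashboard_body())
```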


Verifying the Pipeline

  • S3 processed folder shows multiple output files.
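
The same check can be done from a script instead of the console. This is a minimal sketch using boto3's list_objects_v2; the helper name is hypothetical, and a single call returns at most 1,000 keys, so a long-running pipeline would need pagination.

```python
def list_processed_outputs(s3_client,
                           bucket="dataops-pipeline-bucket",
                           prefix="processed/"):
    """Return the keys of objects written under the output prefix."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent when nothing matches; skip the folder marker itself.
    return [obj["Key"] for obj in response.get("Contents", [])
            if obj["Key"] != prefix]

# Typical use (requires boto3 and AWS credentials):
#   import boto3
#   print(list_processed_outputs(boto3.client("s3", region_name="ap-south-1")))
```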

KPI / Dashboard Insights

  • 455+ scheduled runs executed reliably by EventBridge every 2 minutes.
  • Zero errors during the observation period: stable processing, no unhandled exceptions.
  • No throttling, indicating Lambda is well within capacity.
  • Async Event Age remains low, so there is no backlog or delay.
  • Avg execution: 250–300 ms
    p99: occasional spikes to 450–500 ms (cold starts / input size variation).
  • Performance: consistent and stable over time.
  • Pipeline Activity Feed: clear run-level traceability (dataset shape, start time, output file paths).
  • Error Feed: empty, which confirms that no runs failed.
  • KPI tiles (Invocations, Errors, Throttles, Duration, Event Age) → instant operational overview.
  • The monitoring setup covers invocations, errors, throttles, duration, and event age, giving the operational visibility needed to run and scale the pipeline.
