
Add realtime metric for data freshness#2

Open
jugomezv wants to merge 3 commits into master from jugomez/RealtimeLastMileIngestionDelay

Conversation

@jugomezv
Owner

@jugomezv jugomezv commented Nov 2, 2022

Add a metric that measures the time lag from when an event was posted to the last stage of the ingestion stream until the event was consumed by the realtime server.

Tests performed: ran the realtime LLC cluster integration test, which includes a table with two partitions, and used jvisualvm to visualize metrics for both partitions after the test completed. Enabled warn logging and monitored the time delta for each consumed row in the realtime server, then confirmed via jvisualvm that the last values logged are the last values contained in the metrics. Confirmed we get a metric per table, per partition.

Sample log traces of delay in milliseconds:
14:14:40.789 WARN [LLRealtimeSegmentDataManager_mytable__1__23__20221102T2114Z] [mytable__1__23__20221102T2114Z] ADDING REALTIME DATA FRESHNESS: delta=905[MS]
14:14:40.791 WARN [LLRealtimeSegmentDataManager_mytable__1__23__20221102T2114Z] [mytable__1__23__20221102T2114Z] ADDING REALTIME DATA FRESHNESS: delta=907[MS]
[Screenshot: OSSDataFreshnessMetricPartition1]
[Screenshot: OSSDataFreshnessMetricPartition0]
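
For reference, a minimal, self-contained sketch of the measurement itself, assuming the stream message metadata exposes the timestamp at which the record reached the last stage of the ingestion stream (the class and method names below are illustrative, not the exact PR diff):

// Illustrative only: the lag is the realtime server's wall-clock time minus the record's
// ingestion timestamp, clamped at zero to guard against clock skew between hosts.
public final class IngestionLagSketch {
  private IngestionLagSketch() {
  }

  public static long computeIngestionLagMs(long recordIngestionTimeMs) {
    long lagMs = System.currentTimeMillis() - recordIngestionTimeMs;
    return Math.max(lagMs, 0L);
  }

  public static void main(String[] args) {
    // Simulate a record posted to the stream ~905 ms ago (matching the sample log above).
    long recordIngestionTimeMs = System.currentTimeMillis() - 905L;
    System.out.println("ingestion lag = " + computeIngestionLagMs(recordIngestionTimeMs) + " ms");
  }
}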

vvivekiyer and others added 2 commits November 2, 2022 13:44
…#9678)

* Enable dictionary code changes

* Address review comments.

* Checkstyle violation

* Add e2e query execution test

* Review comments
* Customize stopword for Lucene Index



Nit: suggest renaming the metric to something more descriptive like PARTITION_INGESTION_LAG_MS.


@GSharayu GSharayu Nov 7, 2022


I think we should also change milliseconds to something like partitionIngestionLagMs


@mcvsubbu mcvsubbu left a comment


  1. Consider emitting one metric for a table-host combination. That can be the maximum lag amongst all partitions for that table in the host. We have other metrics that indicate consumption rate on a per-partition basis, so that should help if we want to debug further.
  2. We should be emitting zero if we get back 0 events from the poll, right?
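
A rough sketch of what point 1 might look like: track the latest lag per partition on this server and expose the maximum as a single per-table, per-host gauge value. The class below is hypothetical, not an existing Pinot API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-table, per-host tracker: each consuming partition reports its latest lag,
// and the table-level gauge is the worst (maximum) lag across partitions on this host.
public class TableIngestionLagTracker {
  private final Map<Integer, Long> _latestLagMsPerPartition = new ConcurrentHashMap<>();

  // Called from the consuming loop after each batch; an empty poll would report 0 (point 2).
  public void updatePartitionLag(int partitionGroupId, long lagMs) {
    _latestLagMsPerPartition.put(partitionGroupId, Math.max(lagMs, 0L));
  }

  // Value emitted for the table-level gauge on this host.
  public long getMaxLagMs() {
    return _latestLagMsPerPartition.values().stream().mapToLong(Long::longValue).max().orElse(0L);
  }
}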

@jugomezv jugomezv force-pushed the jugomez/RealtimeLastMileIngestionDelay branch from d68f32c to 7d58fd2 on November 3, 2022 20:22
@jugomezv
Owner Author

jugomezv commented Nov 3, 2022

@mcvsubbu We considered a metric per table, but that won't work because partitions are distributed across servers. I'm not sure what value would be added by keeping a metric per host, per table; I have scheduled a discussion for next Monday, and it would be nice to close on this then.

Zero is now issued when we hit an empty batch, as of the most recent change.

@jugomezv jugomezv force-pushed the jugomez/RealtimeLastMileIngestionDelay branch from 7d58fd2 to ed1c4a3 on November 3, 2022 20:37
@jugomezv jugomezv force-pushed the jugomez/RealtimeLastMileIngestionDelay branch from ed1c4a3 to 1315db0 on November 3, 2022 20:41
@GSharayu

GSharayu commented Nov 7, 2022

Hey @jugomezv, can you please rebase this PR onto the latest master? I am confused about which changes belong to the Stop words PR and which are yours. It will help with the review.

@sajjad-moradi

Hey @jugomezv, can you please rebase this PR onto the latest master? I am confused about which changes belong to the Stop words PR and which are yours. It will help with the review.

+1

@GSharayu I think you only need to review LLRealtimeSegmentDataManager.java and ServerGauge.java.


@sajjad-moradi sajjad-moradi left a comment


Looks good overall.

One thing to consider:
I believe we need to come up with a strategy for the period in which the consuming segment is being completed. There's no consumption during that period, so we have no idea what the lag is. Should we set the value of the new metric to zero? Should we keep reporting the latest value? Or maybe increase it linearly as time passes? None of these options represents reality, because we don't even know whether there is data available in the stream at that time!
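
For illustration, the "increase it linearly" option could be expressed by aging the last observed lag at read time. The sketch below is just one way to frame that trade-off and is not part of this PR; the class and method names are hypothetical.

// Hypothetical: while no consumption is happening (e.g. during segment completion),
// the reported lag grows with the time elapsed since the last observed value.
public class AgedIngestionLagGauge {
  private volatile long _lastObservedLagMs = 0L;
  private volatile long _lastUpdateTimeMs = System.currentTimeMillis();

  // Called whenever a batch is consumed and a fresh lag value is available.
  public void onLagObserved(long lagMs) {
    _lastObservedLagMs = Math.max(lagMs, 0L);
    _lastUpdateTimeMs = System.currentTimeMillis();
  }

  // Gauge read: last observed lag plus the time since it was observed.
  public long currentLagMs() {
    return _lastObservedLagMs + (System.currentTimeMillis() - _lastUpdateTimeMs);
  }
}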

long lastMileIngestionDelayMs = System.currentTimeMillis() - msgMetadata.getRecordIngestionTimeMs();
lastMileIngestionDelayMs = lastMileIngestionDelayMs < 0 ? 0 : lastMileIngestionDelayMs;
_serverMetrics.setValueOfPartitionGauge(_tableNameWithType, _partitionGroupId,
ServerGauge.PARTITION_INGESTION_LAG_MS, lastMileIngestionDelayMs);


lastMile and end2end (for the _t header) are internal to our projects at LinkedIn. From Pinot OSS's perspective, there's only one Kafka topic, and that's the one the RT servers consume from. Maybe drop lastMile, or simply change the variable name to match the metric, partitionIngestionLagMs?
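
In other words, the reviewed lines could read roughly as follows after the rename; this is only a sketch of the suggestion, not the final diff:

long partitionIngestionLagMs =
    Math.max(System.currentTimeMillis() - msgMetadata.getRecordIngestionTimeMs(), 0L);
_serverMetrics.setValueOfPartitionGauge(_tableNameWithType, _partitionGroupId,
    ServerGauge.PARTITION_INGESTION_LAG_MS, partitionIngestionLagMs);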

@sajjad-moradi

... used jvisualvm to visualize metrics for both partitions after test completes ...

The metrics you highlighted in the screenshots are not the ones you added.

