Description
Add the capability to report the actual OS memory usage (RSS) for the pytorch_inference process, similar to what was implemented for autodetect in #2846.
Background
PR #2846 introduced reporting of actual memory usage (via getrusage RSS) for the autodetect process. This provides valuable insight into the real memory footprint of anomaly detection jobs as reported by the OS, rather than relying solely on internal memory tracking.
The `pytorch_inference` process currently:
- Has the infrastructure to report RSS values via `writeProcessStats()` (called on-demand via the `E_ProcessStats` control message)
- Uses `CProcessStats::residentSetSize()` and `CProcessStats::maxResidentSetSize()` in `Main.cc`
- Does not periodically report this information back to the Java process
Proposed Changes
- Add periodic reporting of system memory usage for `pytorch_inference`, similar to how `autodetect` updates the `E_TSADSystemMemoryUsage` and `E_TSADMaxSystemMemoryUsage` program counters.
- Include the RSS values in the output stream so that they can be consumed by the Java side. Options include:
  - Adding new fields to an existing result type
  - Creating a new periodic stats message
  - Extending the response from `E_ProcessStats` to be sent periodically
- The values to report:
  - `system_memory_bytes` - current resident set size (`CProcessStats::residentSetSize()`)
  - `max_system_memory_bytes` - peak resident set size (`CProcessStats::maxResidentSetSize()`)
Files likely to be modified
- `bin/pytorch_inference/Main.cc` - Add periodic memory reporting
- `bin/pytorch_inference/CResultWriter.cc` / `CResultWriter.h` - Potentially extend the output format
- `bin/pytorch_inference/CCommandParser.cc` / `CCommandParser.h` - If new message types are needed
Relates to
- [ML] Report the "actual" memory usage of the autodetect process #2846 (autodetect actual memory reporting)
- [ML] Report actual memory usage for trained model deployments in TrainedModelSizeStats elasticsearch#139233 (Java-side changes for trained model stats)