RDKB-63377 : Add patch for SH Logic for reap hung child#127
RDKB-63377 : Add patch for SH Logic for reap hung child#127shivabhaskar wants to merge 2 commits intosupport/1.1.0from
Conversation
Reason for change: Self heal logic to detect and recover hung child withouit impacting main process. Test Procedure: create a shellscript with name telemetry2_0 and run the selfheal script and see if it reaps the process. NOTE The test procedure is just to simulate the issue as we real issue frequency is very low Risks: Low Priority : P0
There was a problem hiding this comment.
Pull request overview
This pull request adds self-heal logic to detect and terminate hung child processes of telemetry2_0 without impacting the main process, and enhances error detection for rbus communication failures.
Changes:
- Added new
detect_and_kill_locked_pids()function to identify and kill telemetry2_0 child processes running longer than 120 seconds - Integrated the new function into
self_heal_t2()to run before existing health checks - Added "Caching the event to File" as an additional error string to detect rbus communication failures
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| # Check for rbus communication failure | ||
| ERROR_STRING="rbus_set Failed for \[Telemetry.ReportProfiles.EventMarker\]" | ||
| ERROR_STRING_NEW="Caching the event to File" |
There was a problem hiding this comment.
Mixed indentation detected. Line 494 uses tabs while the surrounding code uses spaces. This violates consistent code formatting standards.
| ERROR_STRING_NEW="Caching the event to File" | |
| ERROR_STRING_NEW="Caching the event to File" |
|
|
||
| kill -15 "$pid" 2>/dev/null | ||
| sleep 5 | ||
|
|
||
| if [ -d "/proc/$pid" ]; then | ||
| echo " Forcing SIGKILL $pid" | ||
| kill -9 "$pid" 2>/dev/null | ||
| sleep 1 |
There was a problem hiding this comment.
Race condition in process state checks. The code checks if a process exists with "[ ! -d "/proc/$pid" ]" on line 405, but by the time the code tries to read from /proc/$pid/stat on line 409, the process might have exited. While the sed command has error redirection (2>/dev/null), there's no check to verify the process still exists before attempting to kill it on lines 430 and 435. If the process exits naturally between the age calculation and the kill attempt, the kill commands will fail silently (due to 2>/dev/null), but this could lead to attempting to kill a PID that has been recycled and assigned to a different process. Consider adding another existence check right before the kill commands.
| kill -15 "$pid" 2>/dev/null | |
| sleep 5 | |
| if [ -d "/proc/$pid" ]; then | |
| echo " Forcing SIGKILL $pid" | |
| kill -9 "$pid" 2>/dev/null | |
| sleep 1 | |
| # Re-validate that the PID still refers to the same process before sending SIGTERM | |
| if [ -d "/proc/$pid" ]; then | |
| local current_stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null) | |
| if [ -n "$current_stat_data" ]; then | |
| local current_start_ticks=$(echo "$current_stat_data" | cut -d' ' -f20) | |
| if [ "$current_start_ticks" = "$start_ticks" ]; then | |
| kill -15 "$pid" 2>/dev/null | |
| fi | |
| fi | |
| fi | |
| sleep 5 | |
| # Re-validate again before forcing SIGKILL | |
| if [ -d "/proc/$pid" ]; then | |
| local current_stat_data_kill=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null) | |
| if [ -n "$current_stat_data_kill" ]; then | |
| local current_start_ticks_kill=$(echo "$current_stat_data_kill" | cut -d' ' -f20) | |
| if [ "$current_start_ticks_kill" = "$start_ticks" ]; then | |
| echo " Forcing SIGKILL $pid" | |
| kill -9 "$pid" 2>/dev/null | |
| sleep 1 | |
| fi | |
| fi |
| detect_and_kill_locked_pids() { | ||
| local name="$1" THRESH="${2:-120}" | ||
|
|
||
| if [ -z "$name" ]; then | ||
| return 2 | ||
| fi | ||
|
|
||
| local pids=$(pidof "$name") | ||
| if [ -z "$pids" ]; then | ||
| return 0 | ||
| fi | ||
|
|
||
| local pid_count | ||
| pid_count=$(set -- $pids; echo $#) | ||
| if [ "$pid_count" -le 1 ]; then | ||
| return 0 | ||
| fi | ||
|
|
||
| echo_t "[RDKB_SELFHEAL_T2] Multiple telemetry pids are running $pids" | ||
|
|
||
| # 1. CLK_TCK (USER_HZ) & Uptime | ||
| # USER_HZ is almost always 100 on Linux regardless of CONFIG_HZ | ||
| local hz=1000 | ||
| if [ -r /proc/config.gz ]; then | ||
| local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2) | ||
| if [ -n "$detected_hz" ]; then | ||
| hz=$detected_hz | ||
| fi | ||
| fi | ||
|
|
||
| local uptime_sec=$(awk '{print int($1)}' /proc/uptime) | ||
|
|
||
| # 2. Identify Parent (prefer multi-threaded; fallback to oldest) | ||
| local parent="" | ||
| local oldest_pid="" | ||
| local oldest_start_ticks="" | ||
|
|
||
| for pid in $pids; do | ||
| if [ ! -r "/proc/$pid/stat" ]; then | ||
| continue | ||
| fi | ||
|
|
||
| local stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null) | ||
| if [ -z "$stat_data" ]; then | ||
| continue | ||
| fi | ||
|
|
||
| local start_ticks=$(echo "$stat_data" | cut -d' ' -f20) | ||
| if [ -z "$start_ticks" ]; then | ||
| continue | ||
| fi | ||
|
|
||
| # Track oldest (smallest start_ticks) | ||
| if [ -z "$oldest_pid" ] || [ "$start_ticks" -lt "$oldest_start_ticks" ]; then | ||
| oldest_pid="$pid" | ||
| oldest_start_ticks="$start_ticks" | ||
| fi | ||
|
|
||
| # Prefer "multi-threaded" heuristic if present | ||
| if [ -z "$parent" ]; then | ||
| if [ -d "/proc/$pid/task" ]; then | ||
| local task_count=$(ls "/proc/$pid/task" 2>/dev/null | wc -l) | ||
| if [ "$task_count" -gt 1 ]; then | ||
| parent="$pid" | ||
| fi | ||
| fi | ||
| fi | ||
| done | ||
|
|
||
| echo_t "[RDKB_SELFHEAL_T2] Received Parent PID: $parent, Oldest PID: $oldest_pid" | ||
|
|
||
| if [ -z "$parent" ]; then | ||
| parent="$oldest_pid" | ||
| fi | ||
|
|
||
| if [ -z "$parent" ]; then | ||
| return 0 | ||
| fi | ||
|
|
||
| echo_t "[RDKB_SELFHEAL_T2] Selected Parent PID: $parent" | ||
|
|
||
| # 3. Loop Children | ||
| for pid in $pids; do | ||
| # Skip if it is the parent or process is already gone | ||
| if [ "$pid" = "$parent" ]; then | ||
| continue | ||
| fi | ||
| if [ ! -d "/proc/$pid" ]; then | ||
| continue | ||
| fi | ||
|
|
||
| local stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null) | ||
| if [ -z "$stat_data" ]; then | ||
| continue | ||
| fi | ||
|
|
||
| local state=$(echo "$stat_data" | cut -d' ' -f1) | ||
| local ppid=$(echo "$stat_data" | cut -d' ' -f2) | ||
| local start_ticks=$(echo "$stat_data" | cut -d' ' -f20) | ||
|
|
||
| if [ "$ppid" != "$parent" ]; then | ||
| continue | ||
| fi | ||
|
|
||
| # Calculate Age | ||
| local started_sec=$(( start_ticks / hz )) | ||
| local elapsed=$(( uptime_sec - started_sec )) | ||
|
|
||
| # 4. Action Logic | ||
| if [ "$elapsed" -ge "$THRESH" ]; then | ||
| echo_t "[RDKB_SELFHEAL_T2] : PID $pid: State=$state, Age=${elapsed}s (Kill Triggered)" | ||
|
|
||
| kill -15 "$pid" 2>/dev/null | ||
| sleep 5 | ||
|
|
||
| if [ -d "/proc/$pid" ]; then | ||
| echo " Forcing SIGKILL $pid" | ||
| kill -9 "$pid" 2>/dev/null | ||
| sleep 1 | ||
| fi | ||
| fi | ||
| done | ||
| } |
There was a problem hiding this comment.
Potential performance issue: The function calls pidof which scans the entire process table, then iterates through all matching PIDs twice (once to identify parent, once to check/kill children). For each PID, it performs multiple file reads from /proc. If telemetry2_0 has many child processes, this could cause noticeable overhead. Additionally, the sleep commands (5 seconds on line 431, 1 second on line 436) will block the selfheal script execution. If this function is called from a critical path or frequently, consider optimizing the process detection or making the sleep durations configurable.
| continue | ||
| fi | ||
|
|
||
| local state=$(echo "$stat_data" | cut -d' ' -f1) |
There was a problem hiding this comment.
The 'state' variable extracted on line 414 is logged but never actually used in any logic. If the intent was to only kill processes in certain states (e.g., not killing processes in 'R' running state or 'S' sleeping state), this logic is missing. If state checking is not needed, the variable extraction and logging could be removed to simplify the code.
scripts/task_health_monitor.sh
Outdated
| sleep 5 | ||
|
|
||
| if [ -d "/proc/$pid" ]; then | ||
| echo " Forcing SIGKILL $pid" |
There was a problem hiding this comment.
Inconsistent use of echo_t versus plain echo. Line 434 uses plain "echo" while all other logging in this function and throughout the file uses "echo_t" for logging (as seen on lines 336, 387, 397, 428). This inconsistency means this particular log message won't be timestamped or formatted consistently with other log messages.
| echo " Forcing SIGKILL $pid" | |
| echo_t " Forcing SIGKILL $pid" |
| # Prefer "multi-threaded" heuristic if present | ||
| if [ -z "$parent" ]; then | ||
| if [ -d "/proc/$pid/task" ]; then | ||
| local task_count=$(ls "/proc/$pid/task" 2>/dev/null | wc -l) | ||
| if [ "$task_count" -gt 1 ]; then | ||
| parent="$pid" | ||
| fi | ||
| fi | ||
| fi |
There was a problem hiding this comment.
The parent identification logic has a potential issue. The code prefers a process with multiple threads (task_count > 1) as the parent, but once a parent is found this way, it never gets reassessed. However, this heuristic may select the wrong process as the parent since having multiple threads doesn't necessarily mean a process is the parent of other processes with the same name. A more reliable approach would be to check the actual parent-child relationship by examining the PPID field in /proc/pid/stat to determine which PID is the parent. The fallback to "oldest_pid" is reasonable, but the multi-threaded heuristic is questionable and could lead to incorrect identification of the parent process.
scripts/task_health_monitor.sh
Outdated
| } | ||
|
|
||
| # This is a workaround till fork calls are removed from t2 | ||
| # Purpose of this selfheal is to kil t2 telemetry2_0 childs if it is : |
There was a problem hiding this comment.
Typo in the comment. "kil" should be "kill".
| # Purpose of this selfheal is to kil t2 telemetry2_0 childs if it is : | |
| # Purpose of this selfheal is to kill t2 telemetry2_0 childs if it is : |
| local hz=1000 | ||
| if [ -r /proc/config.gz ]; then | ||
| local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2) | ||
| if [ -n "$detected_hz" ]; then | ||
| hz=$detected_hz |
There was a problem hiding this comment.
The comment on line 339 states "USER_HZ is almost always 100 on Linux" but the code initializes hz to 1000 instead of 100. This is incorrect. The comment is referring to USER_HZ (the value used for converting /proc/pid/stat clock ticks to seconds), which is typically 100 on Linux systems. However, the code is confusing USER_HZ with CONFIG_HZ (the kernel's internal tick rate). For calculating process age from /proc/pid/stat field 22 (starttime), USER_HZ=100 should be used, not CONFIG_HZ. The code should initialize hz=100, or better yet, use getconf CLK_TCK to get the actual USER_HZ value.
| local hz=1000 | |
| if [ -r /proc/config.gz ]; then | |
| local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2) | |
| if [ -n "$detected_hz" ]; then | |
| hz=$detected_hz | |
| # Prefer the actual USER_HZ from getconf; fall back to 100 if unavailable. | |
| local hz=100 | |
| if command -v getconf >/dev/null 2>&1; then | |
| local detected_hz | |
| detected_hz=$(getconf CLK_TCK 2>/dev/null || echo "") | |
| if [ -n "$detected_hz" ]; then | |
| hz=$detected_hz |
scripts/task_health_monitor.sh
Outdated
|
|
||
| restartNeeded=0 | ||
|
|
||
| detect_and_kill_locked_pids "telemetry2_0" |
There was a problem hiding this comment.
Mixed indentation detected. Line 450 uses tabs while the surrounding code (lines 448, 452-453) uses spaces. This violates consistent code formatting standards and can cause issues in shell scripts.
| detect_and_kill_locked_pids "telemetry2_0" | |
| detect_and_kill_locked_pids "telemetry2_0" |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # USER_HZ is almost always 100 on Linux regardless of CONFIG_HZ | ||
| local hz=1000 | ||
| if [ -r /proc/config.gz ]; then | ||
| local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2) | ||
| if [ -n "$detected_hz" ]; then | ||
| hz=$detected_hz | ||
| fi | ||
| fi | ||
|
|
There was a problem hiding this comment.
Clock-tick handling looks incorrect: /proc/[pid]/stat starttime is in USER_HZ (sysconf/_SC_CLK_TCK), which is typically 100. Defaulting hz to 1000 and/or reading CONFIG_HZ will skew the age calculation (often by ~10x) and can cause killing healthy children. Prefer getconf CLK_TCK (or equivalent) with a safe fallback (commonly 100) and avoid using CONFIG_HZ as a proxy for USER_HZ.
| # USER_HZ is almost always 100 on Linux regardless of CONFIG_HZ | |
| local hz=1000 | |
| if [ -r /proc/config.gz ]; then | |
| local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2) | |
| if [ -n "$detected_hz" ]; then | |
| hz=$detected_hz | |
| fi | |
| fi | |
| # USER_HZ is obtained from getconf(_SC_CLK_TCK); default safely to 100 if unavailable | |
| local hz="" | |
| if command -v getconf >/dev/null 2>&1; then | |
| hz=$(getconf CLK_TCK 2>/dev/null || echo "") | |
| fi | |
| case "$hz" in | |
| ''|*[!0-9]*) | |
| hz=100 | |
| ;; | |
| esac |
| telemetry2_0_client "TEST_RT_CONNECTION" "1" > /tmp/t2_test_broker_health 2>&1 | ||
| if [ -f /tmp/t2_test_broker_health ]; then | ||
| if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ]; then | ||
| if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ] || [ `grep -c "$ERROR_STRING_NEW" /tmp/t2_test_broker_health` -gt 0 ]; then |
There was a problem hiding this comment.
This condition runs grep twice and uses command substitution inside [ ... ], which is more error-prone and slower than necessary. Consider switching to a single grep -Eq with multiple patterns (or one grep with -e) and checking the exit status instead of counting matches.
| if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ] || [ `grep -c "$ERROR_STRING_NEW" /tmp/t2_test_broker_health` -gt 0 ]; then | |
| if grep -q -e "$ERROR_STRING" -e "$ERROR_STRING_NEW" /tmp/t2_test_broker_health; then |
Reason for change: Self heal logic to detect and recover hung child withouit impacting main process.
Test Procedure: create a shellscript with name telemetry2_0 and run the selfheal script and see if it reaps the process. NOTE The test procedure is just to simulate the issue as we real issue frequency is very low
Risks: Low