Skip to content

RDKB-63377 : Add patch for SH Logic for reap hung child#127

Open
shivabhaskar wants to merge 2 commits intosupport/1.1.0from
topic/-RDKB-63377
Open

RDKB-63377 : Add patch for SH Logic for reap hung child#127
shivabhaskar wants to merge 2 commits intosupport/1.1.0from
topic/-RDKB-63377

Conversation

@shivabhaskar
Copy link

Reason for change: Self heal logic to detect and recover hung child withouit impacting main process.
Test Procedure: create a shellscript with name telemetry2_0 and run the selfheal script and see if it reaps the process. NOTE The test procedure is just to simulate the issue as we real issue frequency is very low
Risks: Low

Reason for change: Self heal logic to detect and recover hung child withouit impacting main process.
Test Procedure: create a shellscript with name telemetry2_0 and run the selfheal script and see if it reaps the process. 
NOTE The test procedure is just to simulate the issue as we real issue frequency is very low
Risks: Low
Priority : P0
@shivabhaskar shivabhaskar requested a review from a team as a code owner February 5, 2026 11:22
Copilot AI review requested due to automatic review settings February 5, 2026 11:22
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds self-heal logic to detect and terminate hung child processes of telemetry2_0 without impacting the main process, and enhances error detection for rbus communication failures.

Changes:

  • Added new detect_and_kill_locked_pids() function to identify and kill telemetry2_0 child processes running longer than 120 seconds
  • Integrated the new function into self_heal_t2() to run before existing health checks
  • Added "Caching the event to File" as an additional error string to detect rbus communication failures

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


# Check for rbus communication failure
ERROR_STRING="rbus_set Failed for \[Telemetry.ReportProfiles.EventMarker\]"
ERROR_STRING_NEW="Caching the event to File"
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mixed indentation detected. Line 494 uses tabs while the surrounding code uses spaces. This violates consistent code formatting standards.

Suggested change
ERROR_STRING_NEW="Caching the event to File"
ERROR_STRING_NEW="Caching the event to File"

Copilot uses AI. Check for mistakes.
Comment on lines 429 to 436

kill -15 "$pid" 2>/dev/null
sleep 5

if [ -d "/proc/$pid" ]; then
echo " Forcing SIGKILL $pid"
kill -9 "$pid" 2>/dev/null
sleep 1
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition in process state checks. The code checks if a process exists with "[ ! -d "/proc/$pid" ]" on line 405, but by the time the code tries to read from /proc/$pid/stat on line 409, the process might have exited. While the sed command has error redirection (2>/dev/null), there's no check to verify the process still exists before attempting to kill it on lines 430 and 435. If the process exits naturally between the age calculation and the kill attempt, the kill commands will fail silently (due to 2>/dev/null), but this could lead to attempting to kill a PID that has been recycled and assigned to a different process. Consider adding another existence check right before the kill commands.

Suggested change
kill -15 "$pid" 2>/dev/null
sleep 5
if [ -d "/proc/$pid" ]; then
echo " Forcing SIGKILL $pid"
kill -9 "$pid" 2>/dev/null
sleep 1
# Re-validate that the PID still refers to the same process before sending SIGTERM
if [ -d "/proc/$pid" ]; then
local current_stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null)
if [ -n "$current_stat_data" ]; then
local current_start_ticks=$(echo "$current_stat_data" | cut -d' ' -f20)
if [ "$current_start_ticks" = "$start_ticks" ]; then
kill -15 "$pid" 2>/dev/null
fi
fi
fi
sleep 5
# Re-validate again before forcing SIGKILL
if [ -d "/proc/$pid" ]; then
local current_stat_data_kill=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null)
if [ -n "$current_stat_data_kill" ]; then
local current_start_ticks_kill=$(echo "$current_stat_data_kill" | cut -d' ' -f20)
if [ "$current_start_ticks_kill" = "$start_ticks" ]; then
echo " Forcing SIGKILL $pid"
kill -9 "$pid" 2>/dev/null
sleep 1
fi
fi

Copilot uses AI. Check for mistakes.
Comment on lines 318 to 440
detect_and_kill_locked_pids() {
local name="$1" THRESH="${2:-120}"

if [ -z "$name" ]; then
return 2
fi

local pids=$(pidof "$name")
if [ -z "$pids" ]; then
return 0
fi

local pid_count
pid_count=$(set -- $pids; echo $#)
if [ "$pid_count" -le 1 ]; then
return 0
fi

echo_t "[RDKB_SELFHEAL_T2] Multiple telemetry pids are running $pids"

# 1. CLK_TCK (USER_HZ) & Uptime
# USER_HZ is almost always 100 on Linux regardless of CONFIG_HZ
local hz=1000
if [ -r /proc/config.gz ]; then
local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
if [ -n "$detected_hz" ]; then
hz=$detected_hz
fi
fi

local uptime_sec=$(awk '{print int($1)}' /proc/uptime)

# 2. Identify Parent (prefer multi-threaded; fallback to oldest)
local parent=""
local oldest_pid=""
local oldest_start_ticks=""

for pid in $pids; do
if [ ! -r "/proc/$pid/stat" ]; then
continue
fi

local stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null)
if [ -z "$stat_data" ]; then
continue
fi

local start_ticks=$(echo "$stat_data" | cut -d' ' -f20)
if [ -z "$start_ticks" ]; then
continue
fi

# Track oldest (smallest start_ticks)
if [ -z "$oldest_pid" ] || [ "$start_ticks" -lt "$oldest_start_ticks" ]; then
oldest_pid="$pid"
oldest_start_ticks="$start_ticks"
fi

# Prefer "multi-threaded" heuristic if present
if [ -z "$parent" ]; then
if [ -d "/proc/$pid/task" ]; then
local task_count=$(ls "/proc/$pid/task" 2>/dev/null | wc -l)
if [ "$task_count" -gt 1 ]; then
parent="$pid"
fi
fi
fi
done

echo_t "[RDKB_SELFHEAL_T2] Received Parent PID: $parent, Oldest PID: $oldest_pid"

if [ -z "$parent" ]; then
parent="$oldest_pid"
fi

if [ -z "$parent" ]; then
return 0
fi

echo_t "[RDKB_SELFHEAL_T2] Selected Parent PID: $parent"

# 3. Loop Children
for pid in $pids; do
# Skip if it is the parent or process is already gone
if [ "$pid" = "$parent" ]; then
continue
fi
if [ ! -d "/proc/$pid" ]; then
continue
fi

local stat_data=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null)
if [ -z "$stat_data" ]; then
continue
fi

local state=$(echo "$stat_data" | cut -d' ' -f1)
local ppid=$(echo "$stat_data" | cut -d' ' -f2)
local start_ticks=$(echo "$stat_data" | cut -d' ' -f20)

if [ "$ppid" != "$parent" ]; then
continue
fi

# Calculate Age
local started_sec=$(( start_ticks / hz ))
local elapsed=$(( uptime_sec - started_sec ))

# 4. Action Logic
if [ "$elapsed" -ge "$THRESH" ]; then
echo_t "[RDKB_SELFHEAL_T2] : PID $pid: State=$state, Age=${elapsed}s (Kill Triggered)"

kill -15 "$pid" 2>/dev/null
sleep 5

if [ -d "/proc/$pid" ]; then
echo " Forcing SIGKILL $pid"
kill -9 "$pid" 2>/dev/null
sleep 1
fi
fi
done
}
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential performance issue: The function calls pidof which scans the entire process table, then iterates through all matching PIDs twice (once to identify parent, once to check/kill children). For each PID, it performs multiple file reads from /proc. If telemetry2_0 has many child processes, this could cause noticeable overhead. Additionally, the sleep commands (5 seconds on line 431, 1 second on line 436) will block the selfheal script execution. If this function is called from a critical path or frequently, consider optimizing the process detection or making the sleep durations configurable.

Copilot uses AI. Check for mistakes.
continue
fi

local state=$(echo "$stat_data" | cut -d' ' -f1)
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'state' variable extracted on line 414 is logged but never actually used in any logic. If the intent was to only kill processes in certain states (e.g., not killing processes in 'R' running state or 'S' sleeping state), this logic is missing. If state checking is not needed, the variable extraction and logging could be removed to simplify the code.

Copilot uses AI. Check for mistakes.
sleep 5

if [ -d "/proc/$pid" ]; then
echo " Forcing SIGKILL $pid"
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent use of echo_t versus plain echo. Line 434 uses plain "echo" while all other logging in this function and throughout the file uses "echo_t" for logging (as seen on lines 336, 387, 397, 428). This inconsistency means this particular log message won't be timestamped or formatted consistently with other log messages.

Suggested change
echo " Forcing SIGKILL $pid"
echo_t " Forcing SIGKILL $pid"

Copilot uses AI. Check for mistakes.
Comment on lines +376 to +384
# Prefer "multi-threaded" heuristic if present
if [ -z "$parent" ]; then
if [ -d "/proc/$pid/task" ]; then
local task_count=$(ls "/proc/$pid/task" 2>/dev/null | wc -l)
if [ "$task_count" -gt 1 ]; then
parent="$pid"
fi
fi
fi
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent identification logic has a potential issue. The code prefers a process with multiple threads (task_count > 1) as the parent, but once a parent is found this way, it never gets reassessed. However, this heuristic may select the wrong process as the parent since having multiple threads doesn't necessarily mean a process is the parent of other processes with the same name. A more reliable approach would be to check the actual parent-child relationship by examining the PPID field in /proc/pid/stat to determine which PID is the parent. The fallback to "oldest_pid" is reasonable, but the multi-threaded heuristic is questionable and could lead to incorrect identification of the parent process.

Copilot uses AI. Check for mistakes.
}

# This is a workaround till fork calls are removed from t2
# Purpose of this selfheal is to kil t2 telemetry2_0 childs if it is :
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in the comment. "kil" should be "kill".

Suggested change
# Purpose of this selfheal is to kil t2 telemetry2_0 childs if it is :
# Purpose of this selfheal is to kill t2 telemetry2_0 childs if it is :

Copilot uses AI. Check for mistakes.
Comment on lines +340 to +344
local hz=1000
if [ -r /proc/config.gz ]; then
local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
if [ -n "$detected_hz" ]; then
hz=$detected_hz
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 339 states "USER_HZ is almost always 100 on Linux" but the code initializes hz to 1000 instead of 100. This is incorrect. The comment is referring to USER_HZ (the value used for converting /proc/pid/stat clock ticks to seconds), which is typically 100 on Linux systems. However, the code is confusing USER_HZ with CONFIG_HZ (the kernel's internal tick rate). For calculating process age from /proc/pid/stat field 22 (starttime), USER_HZ=100 should be used, not CONFIG_HZ. The code should initialize hz=100, or better yet, use getconf CLK_TCK to get the actual USER_HZ value.

Suggested change
local hz=1000
if [ -r /proc/config.gz ]; then
local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
if [ -n "$detected_hz" ]; then
hz=$detected_hz
# Prefer the actual USER_HZ from getconf; fall back to 100 if unavailable.
local hz=100
if command -v getconf >/dev/null 2>&1; then
local detected_hz
detected_hz=$(getconf CLK_TCK 2>/dev/null || echo "")
if [ -n "$detected_hz" ]; then
hz=$detected_hz

Copilot uses AI. Check for mistakes.

restartNeeded=0

detect_and_kill_locked_pids "telemetry2_0"
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mixed indentation detected. Line 450 uses tabs while the surrounding code (lines 448, 452-453) uses spaces. This violates consistent code formatting standards and can cause issues in shell scripts.

Suggested change
detect_and_kill_locked_pids "telemetry2_0"
detect_and_kill_locked_pids "telemetry2_0"

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings February 6, 2026 17:01
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +339 to +347
# USER_HZ is almost always 100 on Linux regardless of CONFIG_HZ
local hz=1000
if [ -r /proc/config.gz ]; then
local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
if [ -n "$detected_hz" ]; then
hz=$detected_hz
fi
fi

Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clock-tick handling looks incorrect: /proc/[pid]/stat starttime is in USER_HZ (sysconf/_SC_CLK_TCK), which is typically 100. Defaulting hz to 1000 and/or reading CONFIG_HZ will skew the age calculation (often by ~10x) and can cause killing healthy children. Prefer getconf CLK_TCK (or equivalent) with a safe fallback (commonly 100) and avoid using CONFIG_HZ as a proxy for USER_HZ.

Suggested change
# USER_HZ is almost always 100 on Linux regardless of CONFIG_HZ
local hz=1000
if [ -r /proc/config.gz ]; then
local detected_hz=$(zcat /proc/config.gz 2>/dev/null | grep "^CONFIG_HZ=" | cut -d= -f2)
if [ -n "$detected_hz" ]; then
hz=$detected_hz
fi
fi
# USER_HZ is obtained from getconf(_SC_CLK_TCK); default safely to 100 if unavailable
local hz=""
if command -v getconf >/dev/null 2>&1; then
hz=$(getconf CLK_TCK 2>/dev/null || echo "")
fi
case "$hz" in
''|*[!0-9]*)
hz=100
;;
esac

Copilot uses AI. Check for mistakes.
telemetry2_0_client "TEST_RT_CONNECTION" "1" > /tmp/t2_test_broker_health 2>&1
if [ -f /tmp/t2_test_broker_health ]; then
if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ]; then
if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ] || [ `grep -c "$ERROR_STRING_NEW" /tmp/t2_test_broker_health` -gt 0 ]; then
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition runs grep twice and uses command substitution inside [ ... ], which is more error-prone and slower than necessary. Consider switching to a single grep -Eq with multiple patterns (or one grep with -e) and checking the exit status instead of counting matches.

Suggested change
if [ `grep -c "$ERROR_STRING" /tmp/t2_test_broker_health` -gt 0 ] || [ `grep -c "$ERROR_STRING_NEW" /tmp/t2_test_broker_health` -gt 0 ]; then
if grep -q -e "$ERROR_STRING" -e "$ERROR_STRING_NEW" /tmp/t2_test_broker_health; then

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant