Cst jack performance problem tester #153

jackchallen · 2025-02-03T13:10:16Z

Separate directory so should be entirely merge-safe

…cific criteria If we can classify problem conditions by specific criteria in stats and then check to see if those values apply, maybe we can start diagnosing or at least pointing to hardware problems. The basic theory being (e.g. for a faulty NVME): If one disk has read latency > 5ms And cluster average read latency > 100usec And ... And ... Then maybe we can build a set of test parameters for each one of those outcomes. This is not going to be a small undertaking, and unfortunately we're stuck with bash. However, if the outline test conditions and theory can be proven useful, this is a step towards a useful enhancement. Ideally the logic/tests that come out of this would make it into the product somehow.

These tests all come from: https://docs.google.com/document/d/1IhWTKu4SkQ9ubYQ8szv9ex5rs_kHDvQxcLlHz6lLzmY/edit?tab=t.0

Wekachecker assumes everything is a shell script. Fortunately it's based on a glob so we can skip it

…script

Wekachecker tries to be the governing over-arching controller, so having a single script that parses and tests multiple conditions separately doesn't fit with that architecture. Therefore make every test self-contained

vrragosta

I think these are the major changes, but I will re-review once changes are made.

Please note, I added comments to the first file, but they are applicable globally.

vrragosta · 2025-03-03T17:28:16Z

scripts.d/performance/001_small_io_writes.sh

+
+DESCRIPTION="Examine WEKA stats for small IO writes"
+SCRIPT_TYPE="single"
+INTERNAL_REFERENCE="WEKAPP-XXXXX"


All of these tests, except for 002_faulty_drive.sh, refer to a "fake" internal reference of WEKAPP-XXXXX. Can this be changed to an actual reference, or if there is none, omit it from the output?

vrragosta · 2025-03-03T17:36:17Z

scripts.d/performance/001_small_io_writes.sh

+#there's a chance we will need to break this down into very different structures of testing, so
+# keeping DRIVE vs CLUSTER tests very separate at the moment, despite the fact it looks ripe
+# for re-factoring.
+for DRIVE_TEST in ${!DRIVE_TESTS[@]} ; do


Instead of duplicating (most) of the code for DRIVE_TEST and CLUSTER_TEST, can this logic be turned into a function and called from within the for loops?

vrragosta · 2025-03-03T18:07:21Z

scripts.d/performance/001_small_io_writes.sh

+    # Now we need to convert VALUE to standardised units. We can't rely on the existence of units(1) or $(systemd-analyze timestamp), unfortunately
+    #   This means we're going to convert using sed and awk, for which I apologize
+    # right now we're only calculating based on units of time, and we only seem to see micro- and milliseconds.
+    VALUE_CALC=$(echo ${VALUE} | sed 's! *µs!/1000000!g;s! *Microseconds!/1000000!g;s! *ms!/1000!g;s! *!!g;s!s$!!g;s!%!!g')


I don't think this is handing the drive stats correctly -- on my test cluster, the values are reported in units of Operations/Sec. Trace output:

++ weka stats --show-internal --stat ssd/DRIVE_IO_TOO_LONG --param 'disk:*' --sort value --start-time -2m --output value --raw-units ++ tail -n 1 + HIGHEST_VALUE='0.0166667 Operations/Sec' ++ convert_to_standard_units '0.0166667 Operations/Sec' ++ local 'VALUE=0.0166667 Operations/Sec' +++ echo 0.0166667 Operations/Sec +++ sed 's! *µs!/1000000!g;s! *Microseconds!/1000000!g;s! *ms!/1000!g;s! *!!g;s!s$!!g;s!%!!g' ++ VALUE_CALC=0.0166667Operations/Sec +++ echo +++ awk '{print 0.0166667Operations/Sec}' awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted

jackchallen added 10 commits November 23, 2024 17:39

junk commit while branch-splitting

47e3c18

First pass at 7 (hopefully) known performance-impacting fault conditions

dfccde6

These tests all come from: https://docs.google.com/document/d/1IhWTKu4SkQ9ubYQ8szv9ex5rs_kHDvQxcLlHz6lLzmY/edit?tab=t.0

Bugfix: forgot to include the preamble

24e3489

Rename to hide test files from wekachecker

23d16db

Wekachecker assumes everything is a shell script. Fortunately it's based on a glob so we can skip it

Include variables necessary to cause wekachecker to execute the test …

c79be25

…script

wekachecker only runs scripts that start with a digit

b141ce6

Bugfix: now tests are hidden by .

5711235

Bugfix: strip percentage symbols from stats

b242f0b

wekachecker architecture doesn't handle separated test like this

30b06b4

Wekachecker tries to be the governing over-arching controller, so having a single script that parses and tests multiple conditions separately doesn't fit with that architecture. Therefore make every test self-contained

jackchallen requested a review from vrragosta February 3, 2025 13:10

Bugfix: different versions of weka return "microseconds" instead of "µs"

7e1c1f6

vrragosta requested changes Mar 3, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Cst jack performance problem tester #153

Cst jack performance problem tester #153

Uh oh!

jackchallen commented Feb 3, 2025

Uh oh!

vrragosta left a comment

Uh oh!

vrragosta Mar 3, 2025

Uh oh!

vrragosta Mar 3, 2025

Uh oh!

vrragosta Mar 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Cst jack performance problem tester #153

Are you sure you want to change the base?

Cst jack performance problem tester #153

Uh oh!

Conversation

jackchallen commented Feb 3, 2025

Uh oh!

vrragosta left a comment

Choose a reason for hiding this comment

Uh oh!

vrragosta Mar 3, 2025

Choose a reason for hiding this comment

Uh oh!

vrragosta Mar 3, 2025

Choose a reason for hiding this comment

Uh oh!

vrragosta Mar 3, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants