Skip to content

Conversation

@jackchallen
Copy link
Collaborator

Separate directory so should be entirely merge-safe

…cific criteria

If we can classify problem conditions by specific criteria in stats
and then check to see if those values apply, maybe we can start
diagnosing or at least pointing to hardware problems.

The basic theory being (e.g. for a faulty NVME):
    If one disk has read latency > 5ms
    And cluster average read latency > 100usec
    And ...
    And ...

Then maybe we can build a set of test parameters for each one of
those outcomes.
This is not going to be a small undertaking, and unfortunately
we're stuck with bash. However, if the outline test conditions
and theory can be proven useful, this is a step towards a useful
enhancement.

Ideally the logic/tests that come out of this would make it into
the product somehow.
Wekachecker assumes everything is a shell script. Fortunately it's based on a
glob so we can skip it
Wekachecker tries to be the governing over-arching controller, so
having a single script that parses and tests multiple conditions
separately doesn't fit with that architecture.
Therefore make every test self-contained
@jackchallen jackchallen requested a review from vrragosta February 3, 2025 13:10
Copy link
Contributor

@vrragosta vrragosta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are the major changes, but I will re-review once changes are made.

Please note, I added comments to the first file, but they are applicable globally.


DESCRIPTION="Examine WEKA stats for small IO writes"
SCRIPT_TYPE="single"
INTERNAL_REFERENCE="WEKAPP-XXXXX"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these tests, except for 002_faulty_drive.sh, refer to a "fake" internal reference of WEKAPP-XXXXX. Can this be changed to an actual reference, or if there is none, omit it from the output?

#there's a chance we will need to break this down into very different structures of testing, so
# keeping DRIVE vs CLUSTER tests very separate at the moment, despite the fact it looks ripe
# for re-factoring.
for DRIVE_TEST in ${!DRIVE_TESTS[@]} ; do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of duplicating (most) of the code for DRIVE_TEST and CLUSTER_TEST, can this logic be turned into a function and called from within the for loops?

# Now we need to convert VALUE to standardised units. We can't rely on the existence of units(1) or $(systemd-analyze timestamp), unfortunately
# This means we're going to convert using sed and awk, for which I apologize
# right now we're only calculating based on units of time, and we only seem to see micro- and milliseconds.
VALUE_CALC=$(echo ${VALUE} | sed 's! *µs!/1000000!g;s! *Microseconds!/1000000!g;s! *ms!/1000!g;s! *!!g;s!s$!!g;s!%!!g')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is handing the drive stats correctly -- on my test cluster, the values are reported in units of Operations/Sec. Trace output:

++ weka stats --show-internal --stat ssd/DRIVE_IO_TOO_LONG --param 'disk:*' --sort value --start-time -2m --output value --raw-units
++ tail -n 1
+ HIGHEST_VALUE='0.0166667 Operations/Sec'
++ convert_to_standard_units '0.0166667 Operations/Sec'
++ local 'VALUE=0.0166667 Operations/Sec'
+++ echo 0.0166667 Operations/Sec
+++ sed 's! *µs!/1000000!g;s! *Microseconds!/1000000!g;s! *ms!/1000!g;s! *!!g;s!s$!!g;s!%!!g'
++ VALUE_CALC=0.0166667Operations/Sec
+++ echo
+++ awk '{print 0.0166667Operations/Sec}'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants