-
Notifications
You must be signed in to change notification settings - Fork 0
Cst jack performance problem tester #153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…cific criteria
If we can classify problem conditions by specific criteria in stats
and then check to see if those values apply, maybe we can start
diagnosing or at least pointing to hardware problems.
The basic theory being (e.g. for a faulty NVME):
If one disk has read latency > 5ms
And cluster average read latency > 100usec
And ...
And ...
Then maybe we can build a set of test parameters for each one of
those outcomes.
This is not going to be a small undertaking, and unfortunately
we're stuck with bash. However, if the outline test conditions
and theory can be proven useful, this is a step towards a useful
enhancement.
Ideally the logic/tests that come out of this would make it into
the product somehow.
Wekachecker assumes everything is a shell script. Fortunately it's based on a glob so we can skip it
Wekachecker tries to be the governing over-arching controller, so having a single script that parses and tests multiple conditions separately doesn't fit with that architecture. Therefore make every test self-contained
vrragosta
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are the major changes, but I will re-review once changes are made.
Please note, I added comments to the first file, but they are applicable globally.
|
|
||
| DESCRIPTION="Examine WEKA stats for small IO writes" | ||
| SCRIPT_TYPE="single" | ||
| INTERNAL_REFERENCE="WEKAPP-XXXXX" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these tests, except for 002_faulty_drive.sh, refer to a "fake" internal reference of WEKAPP-XXXXX. Can this be changed to an actual reference, or if there is none, omit it from the output?
| #there's a chance we will need to break this down into very different structures of testing, so | ||
| # keeping DRIVE vs CLUSTER tests very separate at the moment, despite the fact it looks ripe | ||
| # for re-factoring. | ||
| for DRIVE_TEST in ${!DRIVE_TESTS[@]} ; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of duplicating (most) of the code for DRIVE_TEST and CLUSTER_TEST, can this logic be turned into a function and called from within the for loops?
| # Now we need to convert VALUE to standardised units. We can't rely on the existence of units(1) or $(systemd-analyze timestamp), unfortunately | ||
| # This means we're going to convert using sed and awk, for which I apologize | ||
| # right now we're only calculating based on units of time, and we only seem to see micro- and milliseconds. | ||
| VALUE_CALC=$(echo ${VALUE} | sed 's! *µs!/1000000!g;s! *Microseconds!/1000000!g;s! *ms!/1000!g;s! *!!g;s!s$!!g;s!%!!g') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is handing the drive stats correctly -- on my test cluster, the values are reported in units of Operations/Sec. Trace output:
++ weka stats --show-internal --stat ssd/DRIVE_IO_TOO_LONG --param 'disk:*' --sort value --start-time -2m --output value --raw-units
++ tail -n 1
+ HIGHEST_VALUE='0.0166667 Operations/Sec'
++ convert_to_standard_units '0.0166667 Operations/Sec'
++ local 'VALUE=0.0166667 Operations/Sec'
+++ echo 0.0166667 Operations/Sec
+++ sed 's! *µs!/1000000!g;s! *Microseconds!/1000000!g;s! *ms!/1000!g;s! *!!g;s!s$!!g;s!%!!g'
++ VALUE_CALC=0.0166667Operations/Sec
+++ echo
+++ awk '{print 0.0166667Operations/Sec}'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted
Separate directory so should be entirely merge-safe