Simd/v11 #14496
Rename to match coding style. Update callers.
Systems with SSE 4.1 as the highest SSE version are getting pretty rare, so it's hard to test.
AVX2 implementation that compares 32 bytes at a time. Rearrange code to make parts reusable. When the remaining buffer is smaller than 32 bytes, fall back to the other SIMD implementations that handle 16 bytes of data per iteration. Add 16/32/64 byte implementations using AVX512.
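A minimal sketch of the 32-bytes-at-a-time idea, not the code from this PR. The names `memcmp_eq` and `memcmp_eq_avx2` are hypothetical, the return convention (0 for equal, 1 for different) is an assumption, and the tail is handed to libc `memcmp` here instead of a 16-byte SIMD routine to keep the example short. It assumes GCC or Clang for the `target` attribute and `__builtin_cpu_supports`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: compare 32 bytes per iteration with AVX2.
 * Returns 0 when the buffers are equal, 1 otherwise. */
__attribute__((target("avx2")))
static int memcmp_eq_avx2(const uint8_t *a, const uint8_t *b, size_t len)
{
    size_t off = 0;

    while (len - off >= 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + off));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + off));
        /* Each equal byte yields 0xFF, so a full match sets all 32 mask bits. */
        __m256i eq = _mm256_cmpeq_epi8(va, vb);
        if ((uint32_t)_mm256_movemask_epi8(eq) != 0xFFFFFFFFu)
            return 1;
        off += 32;
    }
    /* Fewer than 32 bytes left. Assumption: libc memcmp stands in for the
     * smaller SIMD implementations mentioned in the commit message. */
    return memcmp(a + off, b + off, len - off) != 0;
}
#endif

/* Hypothetical entry point: pick the AVX2 path at runtime when available. */
int memcmp_eq(const void *a, const void *b, size_t len)
{
#if defined(__x86_64__) || defined(__i386__)
    if (len >= 32 && __builtin_cpu_supports("avx2"))
        return memcmp_eq_avx2((const uint8_t *)a, (const uint8_t *)b, len);
#endif
    return memcmp(a, b, len) != 0;
}
```

Inputs shorter than 32 bytes never reach the vector loop, which is why short test inputs say little about the SIMD path.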
Implement for AVX512, AVX2 and SSE42.
Wrapper around `memmem`. The case-sensitive search is implemented by calling `memmem` directly. As there is no case-insensitive variant available, a wrapper around `memmem` is created that takes a sliding window approach:
1. take a slice of the haystack
2. convert it to lowercase
3. search it using memmem
4. move window forward
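The four steps could look roughly like the sketch below. This is an illustration under assumptions, not the wrapper from this PR: `memmem_nocase` and `WINDOW_SIZE` are hypothetical names, the needle is assumed to be lowercase already, and needles longer than the window are simply rejected. Consecutive windows overlap by `needle_len - 1` bytes so a match that straddles a window boundary is still found.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* memmem() is a GNU/BSD extension, not ISO C */
#endif
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SIZE 256 /* hypothetical window size */

/* Hypothetical case-insensitive search built on memmem().
 * needle_lc must already be lowercase. Returns a pointer into hay,
 * or NULL when there is no match. */
const uint8_t *memmem_nocase(const uint8_t *hay, size_t hay_len,
                             const uint8_t *needle_lc, size_t needle_len)
{
    uint8_t win[WINDOW_SIZE];
    size_t off = 0;

    if (needle_len == 0)
        return hay;
    /* Sketch limitation: the needle has to fit in one window. */
    if (needle_len > hay_len || needle_len > WINDOW_SIZE)
        return NULL;

    while (hay_len - off >= needle_len) {
        /* 1. take a slice of the haystack */
        size_t n = hay_len - off;
        if (n > WINDOW_SIZE)
            n = WINDOW_SIZE;
        /* 2. convert it to lowercase */
        for (size_t i = 0; i < n; i++)
            win[i] = (uint8_t)tolower(hay[off + i]);
        /* 3. search it using memmem */
        const uint8_t *hit = memmem(win, n, needle_lc, needle_len);
        if (hit != NULL)
            return hay + off + (size_t)(hit - win);
        /* 4. move window forward, keeping needle_len - 1 bytes of overlap */
        off += n - (needle_len - 1);
    }
    return NULL;
}
```

Copying and lowercasing each slice costs an extra pass over the data, so the window size trades stack usage against the number of `memmem` calls.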
Tool to benchmark detection engine content inspection, which is the inspection of individual groups of content, etc. matches for a buffer. Also add a set of basic tests for the various single pattern matching implementations. Output is in CSV: to files for the rule-based tests, to stdout for the spm tests.
To show differences between 2 result files or between spm algos in a single result file.
TEST AVX512 6144
Test multiple lengths in each test. Many of the inputs are too short to take SIMD code paths.
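A rough sketch of what sweeping lengths buys, assuming a compare function that returns 0 for equal buffers and non-zero otherwise. All names here (`check_lengths`, `libc_eq`, `chunks_only_eq`) and the chosen lengths are hypothetical and are not taken from the PR's test code. The lengths sit just below, at, and just above 16/32/64/128 so that both the vector loop and the remainder handling get exercised.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int (*cmp_fn)(const void *, const void *, size_t);

/* Reference comparer: libc memcmp folded to 0 (equal) or 1 (different). */
int libc_eq(const void *a, const void *b, size_t n)
{
    return memcmp(a, b, n) != 0;
}

/* Deliberately buggy comparer: only looks at whole 16-byte chunks, like a
 * SIMD loop that forgets to handle the remaining bytes. */
int chunks_only_eq(const void *a, const void *b, size_t n)
{
    return memcmp(a, b, n & ~(size_t)15) != 0;
}

/* Run fn over several lengths. Returns the number of wrong answers. */
int check_lengths(cmp_fn fn)
{
    static const size_t lens[] = { 0, 1, 15, 16, 17, 31, 32, 33,
                                   63, 64, 65, 127, 128, 129, 1024 };
    uint8_t a[1024], b[1024];
    int failures = 0;

    for (size_t i = 0; i < sizeof(a); i++)
        a[i] = (uint8_t)(i * 31u + 7u);

    for (size_t t = 0; t < sizeof(lens) / sizeof(lens[0]); t++) {
        size_t len = lens[t];

        memcpy(b, a, len);
        if (fn(a, b, len) != 0)
            failures++; /* identical buffers reported as different */
        if (len == 0)
            continue;

        b[len - 1] ^= 0x01; /* mismatch in the very last byte */
        if (fn(a, b, len) == 0)
            failures++;
        b[len - 1] ^= 0x01;

        b[0] ^= 0x01; /* mismatch in the very first byte */
        if (fn(a, b, len) == 0)
            failures++;
    }
    return failures;
}
```

With only multiples of 16 in the list, `chunks_only_eq` would pass every check; the odd lengths are what expose the missing tail handling.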
NOTE: This PR may contain new authors.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14496      +/-   ##
==========================================
- Coverage   82.11%   82.01%   -0.11%
==========================================
  Files        1013     1014       +1
  Lines      262322   263020     +698
==========================================
+ Hits       215408   215705     +297
- Misses      46914    47315     +401
Here is
Another result, this time from
Again SSE3 for lowercase, libc for memcmp.
SSE3 again here, but the
SSE3 looks best for lowercase again.
Apple M1 result is not really useful, need a longer test? Still, Neon better than no SIMD?
Apple M4: Neon doing well.
Arm A55 core: Neon better as well.
Arm A76 core: unclear result.
Overall it seems: The minisforum is an outlier. Not sure what is up with that.
ERROR: ERROR: QA failed on SURI_TLPW2_autofp_suri_time. Pipeline = 28784
The minisforum result was with clang 21. When I use gcc 15.2, results look more in line with my expectations.
Arm A520 core:
#14295 + #11725 + more Intel SIMD experiments with unrolling loops, etc.
My conclusion about the SIMD stuff is that it's not really worth it for exact memcmp: the SIMD implementations are sometimes faster than libc in certain tests, but overall libc memcmp is just much better.
For MemcmpLowercase it seems we have some success on Intel with the SSE3 implementation. NEON on Arm seems mostly better than the non-SIMD version.
@AGSaidi SVE doesn't seem worth it here.
These are the numbers from an AWS Graviton 3 instance:
A bit better in the small tests, far worse in the bigger checks. I see similar on Intel.
Interestingly my new hardware, a Minisforum R1, is way worse (EDIT: on clang only, gcc is fine. See below):
The SVE/Neon case is about 8x slower here.
For reference, here is the result for an Intel W-2245:
The `MemcmpTestLowercaseSSE3` may be the only one worth keeping.
Here is AMD Ryzen Threadripper PRO 5965WX: