Use four samples at a time for estimating corr. #374

Conversation
Getting a lot of ctest failures. I did recently do a full reinstall of MacPorts, so I might have missed something; will investigate.
Looks like only […] failed (master had a lot more failures, due to […]). Anyway, the code in this PR should be equivalent to what master does, so I'm not sure why it's causing failures.
Hmm, I might have fixed the ctests, but now we're back to about the same amount of time as before, or maybe a tiny bit faster. Will have to think some more to see if we can improve on the ~0.4s savings.
x86_64 results on master: […] This PR: […] aarch64 with this PR (Mac M1): […] Looks like results are slightly worse on x86 but improved on ARM, but at least the ctests pass. The latest version also can't be used on embedded platforms due to the use of double. Might be worth seeing how CMSIS does it, as presumably they can do it without introducing calculation errors.
@DJ2LS ☝️

Thanks for taking a look at this @tmiw 👍 Here are my results for a 60-second file of noise (I generated it with […]): master: 9.2s […] I'm sure you're close, just something we are missing...
Now averaging 7.06s on x86_64 (vs. 8.54s with master) and did even better with aarch64 (2.55s with this branch vs. ~4.85s with master). 👍 ctests continue to pass, too. EDIT: oh, and this is without using double, so SM1000/ezDV could potentially have performance improvements, too. |
Actually, macOS ctests are still failing, but I'm not sure if it's related to this PR. For example: […]
I'm getting: […]

Yes, I think breaking it up into 4 real dot products is the way to go. I wonder if there would be any improvement with 4 for loops 🤔
Some brainstorms: […]
@tmiw - yes, it would be nice to automatically ensure the ctests run on macOS; they seem to slip out of sync every now and again.
It is curious that the vectorisation is not reducing the CPU load by a factor of 4. I wonder if the CPU is being used somewhere else?
src/ofdm.c
Outdated
    vec2 = mvec[0].c, mvec[0].d, mvec[1].c, mvec[1].d, ... */
    float4 vec1 = { rxPtr[0], rxPtr[1], rxPtr[2], rxPtr[3] };
    float4 vec2 = { vecPtr[0], vecPtr[1], vecPtr[2], vecPtr[3] };
These assignments could be chewing up some CPU. Could we use pointers pvec1 & pvec2 so the inner loop is just the vector operations?
gcc seems to be using the following assembly for the assignments, so I suspect there won't be much of a difference:
float4 vec2 = { vecPtr[0], vecPtr[1], vecPtr[2], vecPtr[3] };
26740: f3 0f 10 52 0c movss 0xc(%rdx),%xmm2
float4 vec1 = { rxPtr[0], rxPtr[1], rxPtr[2], rxPtr[3] };
26745: f3 0f 10 28 movss (%rax),%xmm5
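(Interleaved source/assembly listings like the one above can be reproduced with objdump; a typical invocation, with illustrative file names, looks like this.)

```shell
# Build with optimization plus debug info so the source can be interleaved
gcc -O2 -g -c ofdm.c -o ofdm.o
# -d disassembles, -S mixes in the corresponding C source lines
objdump -d -S ofdm.o | less
```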
src/ofdm.c
Outdated
    accumPos += vec1 * vec2;
    accumNeg -= vec1 * vec2;
So is accumPos == -accumNeg? vec1*vec2 appears to be computed twice, although I imagine the compiler reuses the result.
Yep, the compiler is indeed reusing the result:
accumPos += vec1 * vec2;
26797: 0f 59 ca mulps %xmm2,%xmm1
accumImag += vec3 * vec2;
2679a: 0f 59 c2 mulps %xmm2,%xmm0
accumPos += vec1 * vec2;
2679d: 0f 58 f9 addps %xmm1,%xmm7
accumNeg -= vec1 * vec2;
267a0: 0f 5c f1 subps %xmm1,%xmm6
accumImag += vec3 * vec2;
267a3: 0f 58 e0 addps %xmm0,%xmm
src/ofdm.c
Outdated
    Multiply vec3 by vec2 to get us bc, ad, bc, ad
    and add to second accumulator. */
    float4 vec3 = { rxPtr[1], rxPtr[0], rxPtr[3], rxPtr[2] };
    accumImag += vec3 * vec2;
Could the ordering rxPtr[1], rxPtr[0], rxPtr[3], rxPtr[2] be done outside the inner loop, and just have pvec4?
Tried using an mvecRev array outside the loop that already does the reversing, and it didn't seem to make a difference.
Interestingly, I just tried this PR on my 2019 MacBook Pro and got different results: master: 11.27s […] (Previous results were on an "Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz" per /proc/cpuinfo, whereas my 2019 MBP runs a 2.3 GHz 8-core Core i9.)

Per Stack Overflow, it shouldn't matter either way for anything Intel-based.
src/codec2_fifo.c
Outdated
        if (pin == (fifo->buf + fifo->nshort))
            pin = fifo->buf;
    }
    if ((pin + n) >= (fifo->buf + fifo->nshort))
Not necessary for test_fifo to pass, but since the TBD was already in the comments, I figured I might as well. 👍
Current ctest failures on macOS now: […]
src/ofdm.c
Outdated
    }
    #elif __EMBEDDED__
    float corrReal = 0, corrImag = 0;
    codec2_complex_dot_product_f32((COMP*)&rx[t], (COMP*)mvec, Npsam, &corrReal, &corrImag);
Looks like the current embedded platforms don't use this function, which makes sense since this only seems to deal with data modes. That said, I did find out that codec2_complex_dot_product_f32 was implemented incorrectly on ezDV (I had been using __REAL__, so it wasn't getting used before).
Some benchmarks for 700D syncing on ezDV, BTW:

- gcc vector approach in this PR: ~160ms per 1280 samples
- ESP-DSP (see here): ~61ms per 1280 samples
- __REAL__ along with __EMBEDDED__: ~24ms per 1280 samples
…be used on the voice side as well.
@tmiw - we're starting to mix OSX ctest fixes and other optimizations into this PR, which was really targeted at the raw data mode acquisition. This makes it harder to review, and I don't have time available to review these changes, being focused on data modes and another project. Can you move everything unrelated to raw data mode acquisition to another PR please? I'll take a look at it when I can.
…onsistently pass." This reverts commit 64c1470.
…s-ofdm-timing-vec
My test results: time src/freedv_data_raw_rx datac3 test /dev/null

- 14-year-old MacBook running Ubuntu, dual core - master: 23,805s
- 10-year-old laptop, dual core - master: 10,588s
- 3-year-old MacBook Pro, i9 8-core - master: 6,46s
Interesting how the speedup differs depending on the generation of Intel processor; I'm not fully sure why. The good news is that there does seem to be a significant improvement. Next question: is this level of improvement acceptable, or do we need to aim for faster? Could it perhaps be parallelized onto multiple threads somehow?
@tmiw, it's great seeing there's a good improvement, especially on old CPUs. That's the area where these improvements are most needed. I'd like to test this in a TNC environment over some time, but I'm sure it's a great improvement! Thanks, @tmiw! I also want to have a look at Raspberry Pi sometime in the next few days. Are you talking about multithreading as part of this PR?
Probably a future one, as that would need more investigation.
I'd prefer to have any multithreading outside of codec2, perhaps managed by the caller of the API function. So far the rest of libcodec2 (apart from some tests) is thread-free. You could allocate different parts of the acquisition search range to different threads/cores, e.g. -50 Hz to 0 on one core.
Also, could we perhaps short-circuit out of the calculation if […]
I'd say we can probably merge this. If we need to improve speeds further, we can always revisit.
Yes the stretch goal here is to dramatically reduce CPU when it's idling, just sitting there listening to noise. Medium term - the FFT approach I mentioned above is probably our best shot at that. Another approach is to select special sequences for the preamble that are easy to detect efficiently. There are some papers on this I've seen, would mean waveform changes too. |
Re merge, I'll take a closer look this weekend. @DJ2LS - as a further check, could you please try this branch with FreeDATA on a few machines?
@drowe67 yes, that's no problem. Do we need statistics, or is it just checking that everything is working as expected?
Just a basic sanity check to make sure it works OK.
@drowe67 some test runs within FreeDATA on different machines are looking promising.
Got some positive feedback from testers; no problems so far. AFAIK there's some confusion about CPU load, where some more improvement was expected, but this seems to depend on CPU generation and architecture.
Note to self: ctest to make sure the algorithm is OK: […]
Note for future: another option to look at here is the […]
On supported compilers, use vector types to force the compiler to generate SIMD instructions inside est_timing_and_freq(). This improves CPU usage during acquisition vs. current master. For example, on an M1 Mac using master: […] vs. this PR: […]

(Test file generated using dd if=/dev/random of=test bs=128k count=8.)

See #364 for more details.