
Conversation


@tmiw tmiw commented Dec 31, 2022

On supported compilers, use vector types to force the compiler to generate SIMD instructions inside est_timing_and_freq(). This improves CPU usage during acquisition vs. current master. For example, on an M1 Mac using master:

mooneer@ubuntu-server src % time ./freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:   nan
./freedv_data_raw_rx datac3 test /dev/null  4.87s user 0.02s system 99% cpu 4.901 total
mooneer@ubuntu-server src % time ./freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:   nan
./freedv_data_raw_rx datac3 test /dev/null  4.82s user 0.02s system 99% cpu 4.845 total
mooneer@ubuntu-server src % time ./freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:   nan
./freedv_data_raw_rx datac3 test /dev/null  4.87s user 0.02s system 99% cpu 4.899 total
mooneer@ubuntu-server src %

vs. this PR:

mooneer@ubuntu-server src % time ./freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    510 bytes:     0 Frms.:     0 SNRAv:   nan
./freedv_data_raw_rx datac3 test /dev/null  1.99s user 0.01s system 99% cpu 2.012 total
mooneer@ubuntu-server src % time ./freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    510 bytes:     0 Frms.:     0 SNRAv:   nan
./freedv_data_raw_rx datac3 test /dev/null  1.90s user 0.01s system 99% cpu 1.912 total
mooneer@ubuntu-server src % time ./freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    510 bytes:     0 Frms.:     0 SNRAv:   nan
./freedv_data_raw_rx datac3 test /dev/null  1.91s user 0.01s system 99% cpu 1.925 total
mooneer@ubuntu-server src %

(Test file generated using dd if=/dev/random of=test bs=128k count=8.)

See #364 for more details.


tmiw commented Dec 31, 2022

Getting a lot of ctest failures. I recently did a full reinstall of MacPorts, so I might have missed something; will investigate.


tmiw commented Dec 31, 2022

Looks like only test_OFDM_modem_burst_acq_port is failing in the GitHub environment now. I have a few more failures on macOS:

The following tests FAILED:
	 20 - test_OFDM_modem_octave_port (Failed)
	 21 - test_OFDM_modem_octave_port_Nc_31 (Failed)
	 27 - test_OFDM_modem_burst_acq_port (Failed)
	 73 - test_freedv_api_rawdata_800XA (Failed)
	116 - test_freedv_data_raw_fsk_ldpc_100 (Failed)

(master had a lot more failures due to ofdm_demod accessing already-freed memory, which was fixed in 49a2398.)

Anyway, the code in this PR should be equivalent to what master does, so I'm not sure why it's causing test_OFDM_modem_burst_acq_port to fail.


tmiw commented Dec 31, 2022

Hmm, I might have fixed the ctests but now we're back to about the same amount of time as before. Or maybe a tiny bit faster:

mooneer@ubuntu-server build % time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:   nan
src/freedv_data_raw_rx datac3 test /dev/null  4.43s user 0.01s system 99% cpu 4.442 total
mooneer@ubuntu-server build % time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:   nan
src/freedv_data_raw_rx datac3 test /dev/null  4.43s user 0.01s system 99% cpu 4.449 total
mooneer@ubuntu-server build % time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:   nan
src/freedv_data_raw_rx datac3 test /dev/null  4.43s user 0.01s system 99% cpu 4.449 total
mooneer@ubuntu-server build %

Will have to think some more to see if we can improve on the ~0.4s savings.


tmiw commented Dec 31, 2022

x86_64 results on master:

mooneer@hamradio:~/codec2/build$ time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:  -nan

real	0m8.542s
user	0m8.539s
sys	0m0.000s
mooneer@hamradio:~/codec2/build$ time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:  -nan

real	0m8.545s
user	0m8.544s
sys	0m0.000s
mooneer@hamradio:~/codec2/build$ time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:  -nan

real	0m8.546s
user	0m8.546s
sys	0m0.000s
mooneer@hamradio:~/codec2/build$ 

This PR:

mooneer@hamradio:~/codec2/build$ time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:  -nan

real	0m10.202s
user	0m10.201s
sys	0m0.000s
mooneer@hamradio:~/codec2/build$ time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:  -nan

real	0m10.201s
user	0m10.200s
sys	0m0.000s
mooneer@hamradio:~/codec2/build$ time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:  -nan

real	0m10.238s
user	0m10.234s
sys	0m0.004s
mooneer@hamradio:~/codec2/build$

aarch64 with this PR (Mac M1):

mooneer@ubuntu-server build % time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:   nan
src/freedv_data_raw_rx datac3 test /dev/null  3.33s user 0.02s system 99% cpu 3.369 total
mooneer@ubuntu-server build % time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:   nan
src/freedv_data_raw_rx datac3 test /dev/null  3.34s user 0.02s system 99% cpu 3.366 total
mooneer@ubuntu-server build % time src/freedv_data_raw_rx datac3 test /dev/null
payload bytes_per_modem_frame: 126
modembufs:    595 bytes:     0 Frms.:     0 SNRAv:   nan
src/freedv_data_raw_rx datac3 test /dev/null  3.34s user 0.02s system 99% cpu 3.369 total
mooneer@ubuntu-server build %

Looks like results are slightly worse on x86 but improved on ARM; at least the ctests pass. The latest also can't be used on embedded platforms due to the use of double. Might be worth seeing how CMSIS does it, as presumably they can do it without introducing calculation errors.


drowe67 commented Dec 31, 2022

@DJ2LS ☝️

Thanks for taking a look at this @tmiw 👍 Here are my results for a 60 second file of noise (I generated it with ch).

master: 9.2s
drowe67/codec2@89569c1: 12.3s

I'm sure you're close, just something we are missing....


tmiw commented Dec 31, 2022

Now averaging 7.06s on x86_64 (vs. 8.54s with master) and did even better with aarch64 (2.55s with this branch vs. ~4.85s with master). 👍 ctests continue to pass, too.

EDIT: oh, and this is without using double, so SM1000/ezDV could potentially have performance improvements, too.


tmiw commented Dec 31, 2022

Actually, macOS ctests are still failing, but I'm not sure if it's related to this PR:

The following tests FAILED:
	 20 - test_OFDM_modem_octave_port (Failed)
	 21 - test_OFDM_modem_octave_port_Nc_31 (Failed)
	 73 - test_freedv_api_rawdata_800XA (Failed)
	116 - test_freedv_data_raw_fsk_ldpc_100 (Failed)

For example:

    Start 20: test_OFDM_modem_octave_port

20: Test command: /bin/sh "-c" "PATH_TO_TOFDM=/Users/mooneer/codec2/build/unittest/tofdm DISPLAY="" octave-cli --no-gui -qf /Users/mooneer/codec2/octave/tofdm.m"
20: Working Directory: /Users/mooneer/codec2/octave
20: Environment variables: 
20:  CML_PATH=/Users/mooneer/codec2/build/cml
20: Test timeout computed to be: 1500
20: cml_support = 1
20: Nc = 17 LDPC testing: 1
20: Nbitsperframe: 238
20: 
20: Running C version....
20: path_to_tofdm = ../build_linux/unittest/tofdm
20: path_to_tofdm = /Users/mooneer/codec2/build/unittest/tofdm
20: setting path from env var
20: Nc = 17
20: Assertion failed: (isnan(EsNodB) == 0), function ofdm_esno_est_calc, file ofdm.c, line 1884.
20: Nc = 17 ofdm_bitsperframe: 238
20: error: load: unable to find file tofdm_out.txt
20: error: called from
20:     tofdm at line 206 column 1
1/2 Test #20: test_OFDM_modem_octave_port .........***Failed  Required regular expression not found. Regex=[fails: 0
]  0.68 sec

and

    Start 73: test_freedv_api_rawdata_800XA

73: Test command: /bin/sh "-c" "./tfreedv_800XA_rawdata"
73: Working Directory: /Users/mooneer/codec2/build/unittest
73: Test timeout computed to be: 1500
73: freedv_api tests for mode 800XA
73: freedv_open(FREEDV_MODE_800XA) Passed
73: freedv_get_mode() Passed
73: freedv_get_n_max_modem_samples() 660 Passed
73: freedv_get_n_nom_modem_samples() 640 Passed
73: freedv_get_n_speech_samples() 640 Passed
73: freedv_get_n_bits_per_codec_frame() 28 Passed
73: freedv_get_n_bits_per_modem_frame() 56 Passed
73: freedv_codec_frames_from_rawdata() Passed
73: freedv_rawdata_from_codec_frames() byte 0: 0x12 does not match expected 0x00
73: Test failed
1/1 Test #73: test_freedv_api_rawdata_800XA ....***Failed    0.00 sec


drowe67 commented Dec 31, 2022

Now averaging 7.06s on x86_64 (vs. 8.54s with master) and did even better with aarch64 (2.55s with this branch vs. ~4.85s with master). +1 ctests continue to pass, too.

EDIT: oh, and this is without using double, so SM1000/ezDV could potentially have performance improvements, too.

I'm getting:
master: 9.2s
drowe67/codec2@9353b24: 7.6s

Yes I think breaking it up into 4 real dot products is the way to go. I wonder if there would be any improvement with 4 for loops 🤔


drowe67 commented Dec 31, 2022

Some brainstorms:

  1. Another way to compute this is with an FFT. A multiplication in the freq domain is a correlation (well, convolution, so time reversed) in the time domain. This would need to be prototyped first, to get the algorithm right. The known signal FFT could be pre-computed, so it would be a single multiplication, an inverse FFT, and a peak search.
  2. Are float operations faster than doubles on an x86 FPU?


drowe67 commented Dec 31, 2022

@tmiw - yes it would be nice to automatically ensure the ctests run on macOS, they seem to slip out of sync every now and again.


drowe67 commented Dec 31, 2022

It is curious that the vectorisation is not reducing the CPU load by a factor of 4. I wonder if the CPU is being used somewhere else?

src/ofdm.c Outdated
vec2 = mvec[0].c, mvec[0].d, mvec[1].c, mvec[1].d, ... */
float4 vec1 = { rxPtr[0], rxPtr[1], rxPtr[2], rxPtr[3] };
float4 vec2 = { vecPtr[0], vecPtr[1], vecPtr[2], vecPtr[3] };

Owner

These assignments could be chewing up some CPU. Could we use pointers pvec1 & pvec2 so the inner loop is just the vector operations?

Collaborator Author

gcc seems to be using the following assembly for the assignments, so I suspect there won't be much of a difference:

                float4 vec2 = { vecPtr[0], vecPtr[1], vecPtr[2], vecPtr[3] };
   26740:       f3 0f 10 52 0c          movss  0xc(%rdx),%xmm2
                float4 vec1 = { rxPtr[0], rxPtr[1], rxPtr[2], rxPtr[3] };
   26745:       f3 0f 10 28             movss  (%rax),%xmm5

src/ofdm.c Outdated

accumPos += vec1 * vec2;
accumNeg -= vec1 * vec2;

Owner

So is accumPos == -accumNeg ? vec1*vec2 appears to be computed twice, although I imagine the compiler reuses the result.

Collaborator Author

Yep, the compiler is indeed reusing the result:

                accumPos += vec1 * vec2;
   26797:       0f 59 ca                mulps  %xmm2,%xmm1
                accumImag += vec3 * vec2;
   2679a:       0f 59 c2                mulps  %xmm2,%xmm0
                accumPos += vec1 * vec2;
   2679d:       0f 58 f9                addps  %xmm1,%xmm7
                accumNeg -= vec1 * vec2;
   267a0:       0f 5c f1                subps  %xmm1,%xmm6
                accumImag += vec3 * vec2;
   267a3:       0f 58 e0                addps  %xmm0,%xmm

src/ofdm.c Outdated
Multiply vec3 by vec2 to get us bc, ad, bc, ad
and add to second accumulator. */
float4 vec3 = { rxPtr[1], rxPtr[0], rxPtr[3], rxPtr[2] };
accumImag += vec3 * vec2;
Owner

Could the ordering rxPtr[1], rxPtr[0], rxPtr[3], rxPtr[2] be done outside the inner loop, and just have pvec4?

Collaborator Author

Tried using a mvecRev array outside the loop that already does the reversing and it didn't seem to make a difference.


tmiw commented Jan 1, 2023

It is curious that the vectorisation is not reducing the CPU load by a factor of 4. I wonder if the CPU is being used somewhere else?

Interestingly, I just tried this PR on my 2019 MacBook Pro and got different results:

master: 11.27s
9353b24: 4.82s

(Previous results were on a "Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz" per /proc/cpuinfo, whereas my 2019 MBP runs a 2.3 GHz 8-core Core i9.)

  1. Are float operations faster than doubles on an x86 FPU?

Per Stack Overflow, it shouldn't matter either way for anything Intel-based.

if (pin == (fifo->buf + fifo->nshort))
pin = fifo->buf;
}
if ((pin + n) >= (fifo->buf + fifo->nshort))
Collaborator Author

Not necessary for test_fifo to pass, but since the TBD was already in the comments, I figured I might as well. 👍


tmiw commented Jan 1, 2023

Current ctest failures on macOS now:

The following tests FAILED:
	 20 - test_OFDM_modem_octave_port (Failed)
	 73 - test_freedv_api_rawdata_800XA (Failed)

src/ofdm.c Outdated
}
#elif __EMBEDDED__
float corrReal = 0, corrImag = 0;
codec2_complex_dot_product_f32((COMP*)&rx[t], (COMP*)mvec, Npsam, &corrReal, &corrImag);
Collaborator Author

Looks like the current embedded platforms don't use this function, which makes sense since this only seems to deal with data modes. That said, I did find out that codec2_complex_dot_product_f32 was implemented incorrectly on ezDV (I had been using __REAL__, so it wasn't getting used before).

Some benchmarks for 700D syncing on ezDV, BTW:

  • gcc vector approach in this PR: ~160ms per 1280 samples
  • ESP-DSP (see here): ~61ms per 1280 samples
  • __REAL__ along with __EMBEDDED__: ~24ms per 1280 samples


drowe67 commented Jan 1, 2023

@tmiw - we're starting to mix OSX ctest fixes and other optimizations into this PR - which was really targeted at the raw data mode acquisition. This makes it harder to review, and I don't have time available to review these changes, being focused on data modes and another project.

Can you move everything unrelated to raw data mode acquisition to another PR please? I'll take a look at it when I can.

@tmiw tmiw mentioned this pull request Jan 2, 2023

tmiw commented Jan 2, 2023

@tmiw - we're starting to mix OSX ctest fixes and other optimizations into this PR - which was really targeted at the raw data mode acquisition. This makes it harder to review, and I don't have time available to review these changes, being focused on data modes and another project.

Can you move everything unrelated to raw data mode acquisition to another PR please? I'll take a look at it when I can.

Done. See #375, #376.


DJ2LS commented Jan 8, 2023

My test results:

time src/freedv_data_raw_rx datac3 test /dev/null
dd if=/dev/random of=test bs=128k count=8

14-year-old MacBook, running Ubuntu, dual core:

master: 23.805s
this PR: 14.855s

10-year-old laptop, dual core:

master: 10.588s
this PR: 8.822s

3-year-old MacBook Pro, i9 8-core:

master: 6.46s
this PR: 2.86s


tmiw commented Jan 8, 2023

My test results:

time src/freedv_data_raw_rx datac3 test /dev/null
dd if=/dev/random of=test bs=128k count=8

14-year-old MacBook, running Ubuntu, dual core:

master: 23.805s, this PR: 14.855s

10-year-old laptop, dual core:

master: 10.588s, this PR: 8.822s

3-year-old MacBook Pro, i9 8-core:

master: 6.46s, this PR: 2.86s

Interesting how the speedup is different depending on the generation of Intel processor. I'm not fully sure why, either. The good news is that there does seem to be a significant improvement.

Next question: is this level of improvement acceptable enough or do we need to aim for faster? Could it perhaps be parallelized onto multiple threads somehow?


DJ2LS commented Jan 8, 2023

@tmiw, it's great to see there's a good improvement, especially on old CPUs. That's the area where these improvements are most needed. I'd like to test this in a TNC environment over some time, but I'm sure it's a great improvement! Thanks, @tmiw! I also want to have a look at a Raspberry Pi sometime in the next few days.

Are you talking about multithreading as part of this PR?


tmiw commented Jan 8, 2023

Are you talking about multithreading as part of this PR?

Probably a future one as that would need more investigation.


drowe67 commented Jan 8, 2023

I'd prefer to have any multithreading outside of codec2, perhaps managed by the caller of the API function. So far the rest of libcodec2 (apart from some tests) is thread free.

You could allocate different parts of the acquisition search range to different threads/cores, e.g. -50 Hz to 0 on one core.


tmiw commented Jan 9, 2023

I'd prefer to have any multithreading outside of codec2, perhaps managed by the caller of the API function. So far the rest of libcodec2 (apart from some tests) is thread free.

You could allocate different parts of the acquisition search range to different threads/cores, e.g. -50 Hz to 0 on one core.

est_timing_and_freq() is pretty deep in the internal code so this would likely need additional thought before implementing.

Also, could we perhaps short-circuit out of the calculation if max_corr exceeds some limit (possibly determined experimentally)? This might only improve performance in the case where there's sync, though.


DJ2LS commented Jan 16, 2023

@tmiw @drowe67 do we need any further testing here before merging? Is anything else planned here for the near future? Is there anything else I can do to support with testing?

Reducing CPU load is no minor improvement, so I'd be happy to see this in the real world soon.


tmiw commented Jan 16, 2023

@tmiw @drowe67 do we need any further testing here before merging? Is anything else planned here for the near future? Is there anything else I can do to support with testing?

Reducing CPU load is no minor improvement, so I'd be happy to see this in the real world soon.

I'd say we can probably merge this. If we need to improve speeds further, we can always revisit.


drowe67 commented Jan 16, 2023

Also, could we perhaps short-circuit out of the calculation if max_corr exceeds some limit (possibly determined experimentally)? This might only improve performance in the case where there's sync, though.

Yes the stretch goal here is to dramatically reduce CPU when it's idling, just sitting there listening to noise. Medium term - the FFT approach I mentioned above is probably our best shot at that. Another approach is to select special sequences for the preamble that are easy to detect efficiently. There are some papers on this I've seen, would mean waveform changes too.


drowe67 commented Jan 16, 2023

Re merge, I'll take a closer look this weekend.

@DJ2LS - as a further check could you please try this branch with FreeDATA on a few machines?


DJ2LS commented Jan 16, 2023

@drowe67 yes, that's no problem. Do we need statistics, or is it just checking that everything is working as expected?


drowe67 commented Jan 16, 2023

@drowe67 yes, that's no problem. Do we need statistics, or is it just checking that everything is working as expected?

Just a basic sanity check to make sure it works OK.


DJ2LS commented Jan 19, 2023

@drowe67 some test runs within FreeDATA on different machines are looking promising.


DJ2LS commented Jan 21, 2023

Got some positive feedback from testers. No problems so far. AFAIK there's some confusion about CPU load, where some more improvement was expected, but this seems to depend on CPU generation and architecture.


drowe67 commented Jan 22, 2023

Note to self: ctest to make sure algorithm is OK:

ctest -V -R test_OFDM_modem_burst_acq_port


drowe67 commented Jan 22, 2023

Note for future: another option to look at here is the vec.h arch-specific assembler in LPCNet, e.g. https://github.com/drowe67/LPCNet/blob/master/src/vec_sse.h.

@drowe67 drowe67 merged commit d662ee0 into master Jan 22, 2023