CI: self hosting #941

Draft
Maxwell-Rosen wants to merge 26 commits into main from ci-local-host

Conversation

Maxwell-Rosen (Collaborator) commented Feb 4, 2026

Documentation Changes

Purpose

The current Mac CI build fails intermittently. Rather than patching that CI file, we can solve this issue permanently by hosting CI on a dedicated machine. @JunoRavin has volunteered to host this on his super server at PPPL. Self-hosting gives us more control over the build environment and lets us test more thoroughly on that machine.

CI is expanded to ensure all unit tests pass, including those that trigger compiler warnings and errors.

Both CPU and GPU builds are tested.

Affected Areas

The .github/workflows directory and the workflow files inside it, which are modified to change the CI.

Failing unit tests are commented out so we can establish a baseline. I am not going to spend time doing the hard work of fixing these unit tests or identifying why they fail. That work is left to the individuals to whom these unit tests are relevant. There is an open issue about this (#845).

Failing unit tests are:

  • gyrokinetic/unit/ctest_gyrokinetic_cross_prim_moms_bgk.c test_1x2v_p1
  • gyrokinetic/unit/ctest_mom_gyrokinetic.c test_2x2v_p1 test_2x2v_p1_cu
  • gyrokinetic/unit/ctest_rescale_ghost_jacf.c test_1x1v_ho test_2x2v_ho and device versions
  • gyrokinetic/unit/ctest_dg_gyrokinetic_kern_tm.c -- all tests fail and are removed.
  • gyrokinetic/unit/ctest_dg_interpolate.c -- all tests related to _gk_ fail.
  • gyrokinetic/unit/ctest_dg_rad_gyrokinetic.c -- All tests under _Li1 have erroneous print statements and warnings that they are not set up correctly.
  • moments/unit/ctest_gr_spacetime.c test_gr_schwarzschild test_gr_kerr
  • moments/unit/ctest_wv_gr_mhd.c test_gr_mhd_waves_schwarzschild
  • moments/unit/ctest_wv_gr_mhd_tetrad.c test_gr_mhd_tetrad_waves_schwarzschild
  • moments/unit/ctest_wv_gr_ultra_rel_euler.c test_gr_ultra_rel_euler_waves_schwarzschild
  • moments/unit/ctest_wv_gr_ultra_rel_euler_tetrad.c test_gr_ultra_rel_euler_tetrad_waves_schwarzschild
  • vlasov/unit/ctest_hyper3x_dg.c -- Has some print statements which are removed
  • vlasov/unit/ctest_dg_em_vars.c -- All GPU tests are commented out, as well as test_2x and 3x tensor p2.
  • core/unit/ctest_cudss.cu -- test_simple has some print statements removed

Additional Notes

To keep CI fast, I have so far decided not to run regression tests.
To enhance CI, a few regression tests can be run and compared against main; however, these should be a selective sample. Rather than building main for each CI run, it would be more efficient to have a cron job initialize the runregression system after each push to main (or every day, but that seems excessive).
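
One way to avoid rebuilding main for every CI run is to record which commit the stored baseline came from and regenerate it only when main has moved. The following is a minimal sketch of that idea, not what this branch does. It assumes a stamp file that holds the commit SHA of the baseline. The function names and the stamp file are hypothetical, and the actual runregression invocation is left as a placeholder because it is not specified here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a scheduled baseline job (names are illustrative).

# baseline_is_stale STAMP_FILE CURRENT_SHA
# Succeeds (exit 0) when the stamp file is missing or records a different SHA,
# i.e. when the baseline needs to be regenerated.
baseline_is_stale() {
  local stamp_file="$1" current_sha="$2"
  [ -f "$stamp_file" ] || return 0
  [ "$(cat "$stamp_file")" != "$current_sha" ]
}

# mark_baseline STAMP_FILE CURRENT_SHA
# Records the SHA that the freshly generated baseline was built from.
mark_baseline() {
  local stamp_file="$1" current_sha="$2"
  printf '%s\n' "$current_sha" > "$stamp_file"
}

# Intended use from a cron job (placeholder steps, not run here):
#   git fetch origin
#   sha="$(git rev-parse origin/main)"
#   if baseline_is_stale "$stamp" "$sha"; then
#     ... build main and regenerate the regression baseline ...
#     mark_baseline "$stamp" "$sha"
#   fi
```

With this check in place, a job that runs every day does no work on days when main has not changed, so the schedule can be frequent without wasting build time.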

We can add Valgrind testing to the CPU build and/or memory sanitizer checks to the GPU build.

This work is a step toward using @JunoRavin's super server to make Gkeyll testing more robust. Future work will include nightly runregression testing, driven by cron jobs.

Relevant issues:
#913
#784
fixes #116 (I just found out that using the keyword "fixes ###" adds the issue to the "Development" section, and that issue will be closed when the PR is merged)

Checklist

  • I have reviewed the documentation for accuracy.
  • All technical terms and code examples have been double-checked.
  • The update aligns with the overall style and tone of the documentation.
  • The updated documentation builds correctly (if applicable).

… and have every unit test create a null position map
…ing in ctest_gk_geometry_tok which needed to be updated after the filepath of the .geqdsk was moved in a recent PR. It's really difficult to tell if I broke something in this branch because so many unit tests are failing that the errors exceed my terminal context length. I will update the issue about failing unit tests. It's very important that our unit tests pass so that we can have reliable checks that we didn't break anything. It would be really great to have nightly reminders about any unit tests which are broken on main. Also, it's really annoying that when some unit tests fail, they spit out like 10 thousand lines of failures instead of just one line saying that the unit test failed.
…r build unit, build regression, make check, for all modules
…king baseline for CI. People really need to fix their failing unit tests. These are only the CPU versions, but I'm sure the gpu version fails too.

Disable pkpm unit testing because pkpm has zero unit tests.... That's kind of concerning
…PU will be quick and easy, but the GPU one will take a bit more time. I'm pretty sure I configured the scripts correctly, but I'd like Jimmy to ensure that the standard configure.linux.###.sh works on his server with the correct modules. We do not need to mkdeps, which saves time.
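
The complaint in the commits above, that a failing unit test prints thousands of lines instead of a single failure line, could be handled by a thin wrapper around the test executables. This is only a sketch. It assumes each unit test is a standalone executable that returns a nonzero exit status on failure, and `run_quiet` and `LOG_DIR` are hypothetical names that do not exist in this branch.

```shell
#!/usr/bin/env bash
# Run each test executable passed as an argument, send its full output to a
# per-test log file, and print exactly one PASS/FAIL line per test.
# LOG_DIR is a hypothetical setting; it defaults to ./test-logs.
run_quiet() {
  local log_dir="${LOG_DIR:-./test-logs}" failed=0 t log
  mkdir -p "$log_dir"
  for t in "$@"; do
    log="$log_dir/$(basename "$t").log"
    if "$t" > "$log" 2>&1; then
      echo "PASS $t"
    else
      echo "FAIL $t (see $log)"
      failed=$((failed + 1))
    fi
  done
  # Exit status is nonzero if any test failed, so CI still fails the job.
  [ "$failed" -eq 0 ]
}
```

The full output stays available in the log directory for debugging, and that directory can be deleted at the end of the job if CI storage is a concern.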
Maxwell-Rosen changed the title from "CI: Hosting on a local computer" to "CI: self hosting" on Feb 5, 2026
…x build. Maybe the reason it was failing was a timeout error because we were building everything all at once. Also, according to https://docs.github.com/en/actions/reference/runners/github-hosted-runners#single-cpu-runners, the Mac runner uses an M1 (arm64) architecture and the maximum number of make -j processes we can use is 3. Maybe using -j 3 will help with this issue too
… says it's because my laptop has bash 4 but the CI Mac uses bash 3, which doesn't support the ^^ syntax
…ixes. I think it's important that we remove the logs at the end of each make-module so that we don't hit our storage limits for CI. We are relatively constrained in this and those build logs can be big files
…ore, I was just testing, but the Mac build shouldn't launch for drafts. I have it set so only the Linux one launches for drafts. We have some limits on how many times per month we can launch the Mac build, so we should be more stringent on its use cases, but we can run lots of CI jobs on Jimmy's cluster since it can do several at a time
…KPM does not have unit tests, so it doesn't have to make check or make unit. Format the mac build to have consistent indenting
…ments from unit tests. There were a few warnings I was able to fix, but there are a lot that I don't know how to fix, and others in the code should fix them on their own time. Failing unit tests should not be a reason that we do not have a working CI baseline. CI that does not work is useless to us all. ctest_cudss.cu has some print statement checks and I'm not sure why they're necessary. The other CUDA unit tests do not check their accuracy in this way. There are some tests in ctest_dg_rad_gyrokinetic which have a warning print statement deep down, so something is wrong with the unit test, but I don't have the knowledge to fix them.
…d to do unit tests since the same machine is doing the GPU unit tests. The CPU build can do valgrind checks so that it can complement the GPU build
…y very not valgrind clean, so I'm disabling it
…y tests which were reading files were not reading them correctly
…e genuinely not valgrind clean and I had to do a few releases. The valcheck takes quite a while, maybe 15 minutes on my laptop, so we should consider making some of the heavier unit tests lighter. dg_em_vars has a very heavy unit test. I made a few of the tests lighter, with fewer cells, but I didn't gain much performance. Now, core, moments, and vlasov all pass valcheck
…nd runs in 15 minutes. I did have to merge a fix for position map that has been sitting around for a month in order to get everything valgrind clean.
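
The bash 3 problem mentioned in the commits above presumably refers to the `${var^^}` upper-casing expansion, which was introduced in bash 4 and is not available in the bash 3.2 that macOS ships. A minimal portable sketch, assuming the scripts only need to uppercase a string; the function name `to_upper` is hypothetical and is not taken from this branch.

```shell
#!/usr/bin/env bash
# Portable stand-in for the bash 4+ expansion ${var^^}.
# tr with POSIX character classes works the same under bash 3.2 and bash 4+.
to_upper() {
  printf '%s' "$1" | tr '[:lower:]' '[:upper:]'
}

# Example: replace  name_uc="${name^^}"  with
#   name_uc="$(to_upper "$name")"
```

Using `tr` costs one extra process per call, which is negligible in a build script and keeps the same script usable on a laptop and on the Mac runner.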

Development

Successfully merging this pull request may close these issues.

Failing Linux Action & Other Github Action issues

1 participant