Inflated Raft state size is very hard to clean up #27442

@nh2

Description

I recently suffered from an oversized raft.db and snapshots/ that I found very difficult to clean up. It made my Nomad agents take 30 minutes to start, consuming 90 GB of memory and 100-400% CPU during that time, and it generally caused a lot of downtime.

Below is a report of the details, and what I think the project should do to improve this.

Details

Nomad version

v1.10.5

Operating system and Environment details

Linux

Issue

Reproduction steps

My task was to run a couple of image processing jobs, where each job would have to process a couple thousand images.

I thought that a good way to do that would be to launch a couple thousand Task Groups per job, so that these would be scheduled across my cluster and I could stop and start the whole job with one button press in the GUI.

Thus I started 5 jobs, each with 2000 Task Groups, where each Task Group had 1 task with its own command line of < 200 characters (think of it as some equivalent of process-image myimage123.jpg ...).
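
For illustration, each job looked roughly like this (a simplified sketch; the job, group, and file names here are made up, not my real ones):

    job "process-images-1" {
      type = "batch"

      group "image-0001" {
        task "process" {
          driver = "exec"
          config {
            command = "/usr/local/bin/process-image"
            args    = ["myimage0001.jpg"]
          }
        }
      }

      # ... ~2000 such groups per job, each differing only in its args ...
    }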

I thought that scheduling a queue of 10k command lines across my 20 servers should be no problem for Nomad.

Unfortunately, it didn't work well: it caused high CPU and memory usage on both Nomad servers and clients, apparently due to a hugely inflated Raft log, with a 7 GB raft.db and a 22 GB snapshots/ dir.

I still don't fully understand how it can get that large, because even when you put 2000 200-char strings into a Raft log, you shouldn't end up with this many GB. There seems to be some process that amplifies this greatly, perhaps making another full copy of everything on every scheduling decision; I am not sure.

(Note that I also run with high GC settings to try to prevent Nomad from deleting my logs, see #26765; but independent of that, the problem I'm describing below is that even when I GC manually, some large things don't get cleaned up.)

After plenty of research (none of this seems to be in the Nomad docs; I got lucky with the Google Gemini LLM doing some transfer thinking from Kubernetes, as searching the Internet gave no good results), I found that this can be improved with a trick: use count = 2000 for the task group, and use the NOMAD_ALLOC_INDEX environment variable at runtime to determine which allocation is running, and thus which command line to run. Apparently this creates a much smaller Raft state because only 1 command line is stored.

Apparently this pattern is called an "Indexed Job" or "Static Partitioning" in Kubernetes, but it is not well documented for Nomad.
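
Here is a minimal sketch of the pattern as I understand it (the wrapper script path is hypothetical; NOMAD_ALLOC_INDEX ranges from 0 to count - 1):

    job "process-images-1" {
      type = "batch"

      group "workers" {
        count = 2000

        task "process" {
          driver = "exec"
          config {
            # The wrapper picks its work item via NOMAD_ALLOC_INDEX, e.g.:
            #   image=$(sed -n "$((NOMAD_ALLOC_INDEX + 1))p" /data/images.txt)
            #   exec process-image "$image"
            command = "/usr/local/bin/process-nth-image.sh"
          }
        }
      }
    }

This way the job specification (and thus the Raft state) contains a single task definition instead of 2000.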

After doing that, I had a tough time recovering my Nomad servers to a good state. Even with all the old jobs/allocations stopped, I still observed that:

  • In /var/lib/nomad/server/raft, raft.db and snapshots/ remained huge.
  • nomad agent, upon startup, would load the 22 GB snapshots/ into memory (quite quickly, as I checked with strace), and then do something with it for 30 minutes at 100-400% CPU and with 90 GB RES memory.
    • During these 30 minutes, the Nomad server would not work, and apparently would not serve telemetry or log anything either. After that time, the CPU would drop below 100%.
    • I have no clue what it could possibly be doing with the 22 GB for that long. Sure, 22 GB is a reasonable amount of data, but not enough to keep a modern server busy for half an hour.
    • It is also worth pointing out that I was lucky to be using Nomad only for batch jobs. Had I been running production services as well, all Nomad servers "doing nothing" for 30 minutes after a reboot or restart would have caused a huge downtime with no evident recourse.

After looking in detail at the GUI, I noticed that my old allocations with the 10000 command lines were still visible. So I concluded that those must still be in the Raft state.

I ran nomad system gc a couple of times. Eventually that brought me to a state with a 3.5 GB raft.db. I restarted all Nomad servers and again observed 100-400% CPU and 12 GB RES for 30 minutes.

During that time, the Nomad API and UI were completely unavailable and everything returned Not ready to serve consistent reads.

I believe that at that point the raft.db was smaller at 3.5 GB, but the snapshots/ dir was still large.

Here are some logs of such a "startup run":

==> WARNING: mTLS is not configured - Nomad is not secure without mTLS!
==> Loaded configuration from /etc/nomad.json, /etc/nomad.custom.json
==> Starting Nomad agent...
==> Nomad agent configuration:
       Advertise Addrs: HTTP: 10.0.0.4:4646; RPC: 10.0.0.4:4647; Serf: 10.0.0.4:4648
            Bind Addrs: HTTP: [10.0.0.4:4646 127.0.0.1:4646]; RPC: 10.0.0.4:4647; Serf: 10.0.0.4:4648
                Client: false
             Log Level: INFO
               Node Id: af2ad40a-c83f-88b0-a935-9c95b71c8c50
                Region: eu (DC: hetzner)
                Server: true
               Version: 1.10.5
==> Nomad agent started! Log data will stream in below:
    2026-01-20T17:33:48.045Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2026-01-20T17:33:48.075Z [INFO]  nomad.raft: starting restore from snapshot: id=77964-3959432-1768883552843 last-index=3959432 last-term=77964 size-in-bytes=3734640279
    2026-01-20T17:34:05.619Z [INFO]  nomad.raft: snapshot restore progress: id=77964-3959432-1768883552843 last-index=3959432 last-term=77964 size-in-bytes=3734640279 read-bytes=734331681 percent-complete="19.66%"
    2026-01-20T17:34:15.619Z [INFO]  nomad.raft: snapshot restore progress: id=77964-3959432-1768883552843 last-index=3959432 last-term=77964 size-in-bytes=3734640279 read-bytes=1498481171 percent-complete="40.12%"
    2026-01-20T17:34:25.619Z [INFO]  nomad.raft: snapshot restore progress: id=77964-3959432-1768883552843 last-index=3959432 last-term=77964 size-in-bytes=3734640279 read-bytes=2259967590 percent-complete="60.51%"
    2026-01-20T17:34:35.618Z [INFO]  nomad.raft: snapshot restore progress: id=77964-3959432-1768883552843 last-index=3959432 last-term=77964 size-in-bytes=3734640279 read-bytes=3030747232 percent-complete="81.15%"
    2026-01-20T17:34:44.906Z [INFO]  nomad.raft: restored from snapshot: id=77964-3959432-1768883552843 last-index=3959432 last-term=77964 size-in-bytes=3734640279
    2026-01-20T17:34:44.920Z [INFO]  nomad.raft: initial configuration: index=3951414 servers="[{Suffrage:Voter ID:e60cda86-cd2b-0931-5c49-8c67820267aa Address:10.0.0.5:4647} {Suffrage:Voter ID:64494c81-d562-dab7-1168-de2883a5f470 Address:10.0.0.6:4647} {Suffrage:Voter I>
    2026-01-20T17:34:44.920Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.0.4:4647 [Follower]" leader-address= leader-id=
    2026-01-20T17:34:44.920Z [INFO]  nomad: serf: EventMemberJoin: node-4.eu 10.0.0.4
    2026-01-20T17:34:44.920Z [INFO]  nomad: starting scheduling worker(s): num_workers=16 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2026-01-20T17:34:44.921Z [INFO]  nomad: started scheduling worker(s): num_workers=16 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2026-01-20T17:34:44.921Z [INFO]  nomad: serf: Attempting re-join to previously known node: node-5.eu: 10.0.0.5:4648
    2026-01-20T17:34:44.921Z [INFO]  nomad: adding server: server="node-4.eu (Addr: 10.0.0.4:4647) (DC: hetzner)"
    2026-01-20T17:34:44.921Z [WARN]  nomad.raft: failed to get previous log: previous-index=3960565 last-index=3960544 error="log not found"
    2026-01-20T17:34:44.921Z [WARN]  nomad.raft: failed to get previous log: previous-index=3960565 last-index=3960544 error="log not found"
    2026-01-20T17:34:44.921Z [WARN]  nomad.raft: failed to get previous log: previous-index=3960565 last-index=3960544 error="log not found"
    2026-01-20T17:34:44.921Z [WARN]  nomad.raft: failed to get previous log: previous-index=3960565 last-index=3960544 error="log not found"
    2026-01-20T17:34:44.922Z [WARN]  nomad: memberlist: Refuting a suspect message (from: node-4.eu)
    2026-01-20T17:34:44.922Z [INFO]  nomad: serf: EventMemberJoin: node-5.eu 10.0.0.5
    2026-01-20T17:34:44.922Z [INFO]  nomad: serf: Re-joined to previously known node: node-5.eu: 10.0.0.5:4648
    2026-01-20T17:34:44.922Z [INFO]  nomad: adding server: server="node-5.eu (Addr: 10.0.0.5:4647) (DC: hetzner)"
    2026-01-20T17:34:53.209Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=server error="{\"server\":{\"ok\":false,\"message\":\"rpc error: Not ready to serve consistent reads\"}}" code=500
    2026-01-20T17:34:55.080Z [ERROR] worker: failed to dequeue evaluation: worker_id=3d2a4409-59db-5387-3139-a4e56232a796 error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.088Z [ERROR] worker: failed to dequeue evaluation: worker_id=bd67a095-a0d8-4a39-f1e0-33208b6ff9e3 error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.161Z [ERROR] worker: failed to dequeue evaluation: worker_id=b2eef47f-5d95-eb9b-d514-6d44fb07b8f8 error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.231Z [ERROR] worker: failed to dequeue evaluation: worker_id=f8611cef-4321-49d4-c5b8-39ff402d5f90 error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.245Z [ERROR] worker: failed to dequeue evaluation: worker_id=cf324183-5db6-c1dd-e08c-361e7acbc365 error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.291Z [ERROR] worker: failed to dequeue evaluation: worker_id=571213c7-1955-9597-7f91-7d4bedd6a5d9 error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.330Z [ERROR] worker: failed to dequeue evaluation: worker_id=e498c2df-36e3-102a-d250-ad0b78eaf320 error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.357Z [ERROR] worker: failed to dequeue evaluation: worker_id=df2c59d7-4797-6cbf-119a-ca9a2b4eefeb error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.378Z [ERROR] worker: failed to dequeue evaluation: worker_id=8fe8d7cd-1961-f467-2206-5c9223e3be3f error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.388Z [ERROR] worker: failed to dequeue evaluation: worker_id=c0cf0d8c-a116-6468-4964-d2fcf413b582 error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.428Z [ERROR] worker: failed to dequeue evaluation: worker_id=289e7563-b098-67c4-de05-a08ddd607c8a error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.434Z [ERROR] worker: failed to dequeue evaluation: worker_id=3dd12ee0-54fb-e270-dd6f-aaa258af2efb error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.442Z [ERROR] worker: failed to dequeue evaluation: worker_id=6d310d30-741e-652c-9e06-b9287f142670 error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.502Z [ERROR] worker: failed to dequeue evaluation: worker_id=a18e8f80-c0f5-bfff-e427-93593cafdf0c error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.566Z [ERROR] worker: failed to dequeue evaluation: worker_id=53370c56-f7f9-0088-6dbd-38ad25c9a6be error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:55.581Z [ERROR] worker: failed to dequeue evaluation: worker_id=190fef86-164d-2040-177c-53d16fa450e9 error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:34:56.710Z [ERROR] http: request failed: method=GET path=/v1/job/process-post-clean-scan-id-ce479116-1806-4a84-a672-44c6f64ad6c5-job-ec75028d-fcc3-4775-b0f3-01c555eca6e9 error="rpc error: Not ready to serve consistent reads" code=500
    2026-01-20T17:34:56.730Z [ERROR] http: request failed: method=GET path=/v1/job/process-pre-clean-scan-id-040b6f74-d87b-4ec6-be79-f893ac2c0412-job-9ed39932-31f8-49b5-8b2a-3bcfaff93dfb error="rpc error: Not ready to serve consistent reads" code=500
... 11 minutes pass ...
    2026-01-20T17:44:35.628Z [ERROR] worker: failed to dequeue evaluation: worker_id=bd67a095-a0d8-4a39-f1e0-33208b6ff9e3 error="rpc error: Not ready to serve consistent reads"
    2026-01-20T17:44:35.980Z [ERROR] http: request failed: method=GET path=/v1/job/process-pre-clean-scan-id-0674e45f-646e-4abb-bd00-f83b1c7ad4cf-job-2daaf110-cd24-4c5d-bd92-1e1314787fb4 error="rpc error: Not ready to serve consistent reads" code=500
    2026-01-20T17:44:38.449Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=server error="{\"server\":{\"ok\":false,\"message\":\"rpc error: Not ready to serve consistent reads\"}}" code=500

I changed all GC settings to 1m in the hope this would help. raft.db decreased to 13 MB (but snapshots/ stayed large) and the servers still returned Not ready to serve consistent reads.
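
The knobs in question are the server-stanza GC settings; what I deployed looked roughly like this (the exact set of keys I touched may have differed):

    server {
      enabled = true

      job_gc_interval         = "1m"
      job_gc_threshold        = "1m"
      eval_gc_threshold       = "1m"
      deployment_gc_threshold = "1m"
    }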

I ran nomad job stop -purge on all the jobs I had converted from 2000 explicit command lines to the single count-based version, and recreated them. This took 1 minute, and it did remove the old entries from the GUI.
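
A sketch of that loop (the awk parsing assumes the first column of the nomad job status listing is the job ID):

    for job in $(nomad job status | awk 'NR>1 {print $1}'); do
      nomad job stop -purge "$job"
    done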

After that, I had 3 GB snapshots/.

I ran systemctl stop nomad && rm -r /var/lib/nomad/server/raft/ && systemctl start nomad sequentially across all servers, and observed that each server re-fetched the snapshot from the others on restart.

At that point I noted:

The raft.db is now 30 MB, snapshots/ is 3.5 GB, and Nomad RES RAM usage is ~1 GB and low CPU after full startup (good).

But during Nomad restarts (systemctl restart nomad), the Nomad agent is still at 100-400% CPU and consumes 13 GB of RAM, and this takes ~20 minutes wall time / 35 minutes CPU time (bad).

I do not understand what it could possibly be doing with a 3 GB state that takes 20 minutes of full CPU.

I collected another log:

Log output from the start, all the way until CPU goes down to ~0%, and memory reduces to 9 GB:

==> WARNING: mTLS is not configured - Nomad is not secure without mTLS!
==> Loaded configuration from /etc/nomad.json, /etc/nomad.custom.json
==> Starting Nomad agent...


==> Nomad agent configuration:
       Advertise Addrs: HTTP: 10.0.0.5:4646; RPC: 10.0.0.5:4647; Serf: 10.0.0.5:4648
            Bind Addrs: HTTP: [10.0.0.5:4646 127.0.0.1:4646]; RPC: 10.0.0.5:4647; Serf: 10.0.0.5:4648
                Client: false
             Log Level: INFO
               Node Id: a9538128-bd71-09b7-5ced-74dfc36044dd
                Region: eu (DC: hetzner)
                Server: true
               Version: 1.10.5
==> Nomad agent started! Log data will stream in below:
    2026-01-20T20:12:24.766Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2026-01-20T20:12:24.767Z [INFO]  nomad.raft: starting restore from snapshot: id=77964-3959432-1768932080725 last-index=3959432 last-term=77964 size-in-bytes=3734640279
    2026-01-20T20:12:37.210Z [INFO]  nomad.raft: snapshot restore progress: id=77964-3959432-1768932080725 last-index=3959432 last-term=77964 size-in-bytes=3734640279 read-bytes=763531579 percent-complete="20.44%"
    2026-01-20T20:12:47.210Z [INFO]  nomad.raft: snapshot restore progress: id=77964-3959432-1768932080725 last-index=3959432 last-term=77964 size-in-bytes=3734640279 read-bytes=1557533190 percent-complete="41.71%"
    2026-01-20T20:12:57.210Z [INFO]  nomad.raft: snapshot restore progress: id=77964-3959432-1768932080725 last-index=3959432 last-term=77964 size-in-bytes=3734640279 read-bytes=2363582880 percent-complete="63.29%"
    2026-01-20T20:13:07.210Z [INFO]  nomad.raft: snapshot restore progress: id=77964-3959432-1768932080725 last-index=3959432 last-term=77964 size-in-bytes=3734640279 read-bytes=3161056822 percent-complete="84.64%"
    2026-01-20T20:13:14.312Z [INFO]  nomad.raft: restored from snapshot: id=77964-3959432-1768932080725 last-index=3959432 last-term=77964 size-in-bytes=3734640279
    2026-01-20T20:13:14.646Z [INFO]  nomad.raft: initial configuration: index=3961242 servers="[{Suffrage:Voter ID:7d724e7d-03bd-5603-1b07-25c45301075c Address:10.0.0.4:4647} {Suffrage:Voter ID:64494c81-d562-dab7-1168-de2883a5f470 Address:10.0.0.6:4647} {Suffrage:Voter ID:e60cda86-cd2b-0931-5c49-8c67820267aa Address:10.0.0.5:4647}]"
    2026-01-20T20:13:14.646Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.0.5:4647 [Follower]" leader-address= leader-id=
    2026-01-20T20:13:14.646Z [INFO]  nomad: serf: EventMemberJoin: node-5.eu 10.0.0.5
    2026-01-20T20:13:14.646Z [INFO]  nomad: starting scheduling worker(s): num_workers=16 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2026-01-20T20:13:14.647Z [INFO]  nomad: started scheduling worker(s): num_workers=16 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2026-01-20T20:13:14.647Z [INFO]  nomad: serf: Attempting re-join to previously known node: node-6.eu: 10.0.0.6:4648
    2026-01-20T20:13:14.647Z [INFO]  nomad: adding server: server="node-5.eu (Addr: 10.0.0.5:4647) (DC: hetzner)"
    2026-01-20T20:13:14.647Z [WARN]  nomad.rpc: yamux: failed to send ping reply: session shutdown
    2026-01-20T20:13:14.648Z [WARN]  nomad: memberlist: Refuting a suspect message (from: node-5.eu)
    2026-01-20T20:13:14.648Z [INFO]  nomad: serf: EventMemberJoin: node-6.eu 10.0.0.6
    2026-01-20T20:13:14.648Z [INFO]  nomad: serf: EventMemberJoin: node-4.eu 10.0.0.4
    2026-01-20T20:13:14.648Z [INFO]  nomad: serf: Re-joined to previously known node: node-6.eu: 10.0.0.6:4648
    2026-01-20T20:13:14.648Z [INFO]  nomad: adding server: server="node-6.eu (Addr: 10.0.0.6:4647) (DC: hetzner)"
    2026-01-20T20:13:14.648Z [INFO]  nomad: adding server: server="node-4.eu (Addr: 10.0.0.4:4647) (DC: hetzner)"
    2026-01-20T20:13:15.148Z [WARN]  nomad.raft: failed to get previous log: previous-index=3962264 last-index=3962262 error="log not found"

==> Newer Nomad version available: 1.11.1 (currently running: 1.10.5)
    2026-01-20T20:13:30.938Z [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=21c0314d-3b32-7ad1-dc65-6c950d7ecf01 eval=ca9d9347-29b5-a8c5-bb99-f375a1df6f74 index=3962262 timeout=5s
    2026-01-20T20:13:30.938Z [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=8f5aa08e-29be-35f7-48bb-159efcb95009 eval=111ad9a5-30ae-a397-7633-df90df2785db index=3962262 timeout=5s
    2026-01-20T20:14:20.939Z [WARN]  worker: server is unable to catch up to last eval's index: worker_id=8f5aa08e-29be-35f7-48bb-159efcb95009 error="timed out after 50s waiting for index=3962262"
    2026-01-20T20:14:20.939Z [WARN]  worker: server is unable to catch up to last eval's index: worker_id=21c0314d-3b32-7ad1-dc65-6c950d7ecf01 error="timed out after 50s waiting for index=3962262"

...

    2026-01-20T20:31:13.257Z [WARN]  worker: server is unable to catch up to last eval's index: worker_id=0ee88751-5e1a-7c35-9777-58df92099f24 error="timed out after 50s waiting for index=3962283"
    2026-01-20T20:31:13.257Z [WARN]  worker: server is unable to catch up to last eval's index: worker_id=5008c817-a45e-c8c9-615a-36426c1d057d error="timed out after 50s waiting for index=3962310"
    2026-01-20T20:31:13.257Z [WARN]  worker: server is unable to catch up to last eval's index: worker_id=c034e957-2f08-79f4-780e-03e1f6808319 error="timed out after 50s waiting for index=3962080"
    2026-01-20T20:31:16.771Z [ERROR] nomad.fsm: ApplyPlan failed: error="alloc d029765e-54ea-4521-c0b4-161cba332921 doesn't exist"
    2026-01-20T20:32:01.792Z [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=db680144-7223-ae90-3018-8a444ded11b2 eval=9a432db3-e25e-00cf-7a12-3d761689b45a index=3962326 timeout=5s
    2026-01-20T20:32:07.793Z [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=c034e957-2f08-79f4-780e-03e1f6808319 eval=9a432db3-e25e-00cf-7a12-3d761689b45a index=3962326 timeout=5s
    2026-01-20T20:32:51.792Z [WARN]  worker: server is unable to catch up to last eval's index: worker_id=db680144-7223-ae90-3018-8a444ded11b2 error="timed out after 50s waiting for index=3962326"
    2026-01-20T20:32:57.795Z [WARN]  worker: server is unable to catch up to last eval's index: worker_id=c034e957-2f08-79f4-780e-03e1f6808319 error="timed out after 50s waiting for index=3962326"
    2026-01-20T20:33:30.938Z [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=21c0314d-3b32-7ad1-dc65-6c950d7ecf01 eval=e3fd3751-6059-89fd-e65f-fe412357a18d index=3962329 timeout=5s
    2026-01-20T20:33:30.938Z [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=8d4efc41-0836-1f8d-5bf4-1ede51b8bea4 eval=fb660bf1-1f4b-f3ca-9df4-b08c2eef967e index=3962329 timeout=5s
    2026-01-20T20:33:30.938Z [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=f05f5e58-bce7-5536-acc9-8e0ec4d29184 eval=d0de6133-05ef-0759-8440-3d3a6112d478 index=3962329 timeout=5s
    2026-01-20T20:33:30.938Z [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=49ad3cf9-65da-27f7-9e5a-6966e6cc1b90 eval=216d7087-1d2e-4514-ba6d-5e12e538f383 index=3962329 timeout=5s
    2026-01-20T20:33:30.938Z [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=5008c817-a45e-c8c9-615a-36426c1d057d eval=d2ddefff-4242-e878-4266-e0a787370232 index=3962329 timeout=5s
    2026-01-20T20:33:30.938Z [WARN]  worker: timeout waiting for Raft index required by eval: worker_id=0ee88751-5e1a-7c35-9777-58df92099f24 eval=cc5fa525-c528-7aaa-a6d1-9c844bae1a18 index=3962329 timeout=5s
    2026-01-20T20:33:51.327Z [ERROR] nomad.fsm: ApplyPlan failed: error="alloc 4edd4bcb-6b18-9e83-2a8b-4f49d9d35c52 doesn't exist"

I validated with strings /var/lib/nomad/server/raft/snapshots/77964-3959432-1768932080725/state.bin | grep name-of-my-task-group-1234 that, despite having disappeared from the GUI, the 10000 CLI invocations were still in there.

So Nomad hadn't cleaned those up, no matter how much GC I ran.

I learned from some reading that raft.db is the append-only Raft log, which periodically gets compacted into the state snapshot in snapshots/.

So the last snapshot had apparently not been recreated since I deleted all the old state with purge and GC.

Then I finally found the solution to clean up the snapshots in a reasonable time:

I temporarily deployed raft_snapshot_threshold = 5 (the default is 8192).
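
In agent-config terms, the temporary change was (server stanza; to be reverted once the snapshot has been rewritten):

    server {
      enabled = true

      # Take a new Raft snapshot after only 5 new log entries instead of the
      # default 8192, forcing the bloated on-disk snapshot to be rewritten.
      raft_snapshot_threshold = 5
    }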

This caused a snapshot recreation: snapshots/ shrank from 3.5 GB to 1 MB, and raft.db was 32 MB at the time.
With this change, Nomad started instantly instead of taking 30 minutes.

So in summary, the repro is:

  • Start 10000 command lines using 5 jobs of 2000 Task Groups each
  • Observe huge RAM usage, slow startup, and high CPU.
  • Try, and fail, to get it back into shape by using nomad commands.

Expected Result

  • Nomad logs what exactly it's doing while spinning CPU for 30 minutes.
  • nomad system gc, or some other documented command, actually forces a Raft cleanup (snapshot recreation) to GC the old stuff in there.
  • Nomad docs describe somewhere what the correct way is to "just launch a batch of a thousand CLI invocations that take a minute each".

Thanks!

I hope this report is useful to other Nomad users.
