Here's some feedback (which I hope is useful) from the last 6 days of running the full history (2005–2025). Even on a Mac Studio M3 Ultra (96GB RAM), I hit a few engineering bottlenecks that might be worth knowing about for the library:
- My script kept hanging at 100% CPU on seemingly random batches (July 2018, specifically). I dug in and found those archives contain massive 250MB+ XML files (mostly Exhibit 10 "Material Contracts"). Because the default extract_tar_files uses f.read() to check metadata, it tried to swallow them whole, causing the garbage collector to thrash even with 96GB of RAM.
  The fix: I patched it to cap the read at f.read(500000). It now streams through those batches instantly (patch below).
- When the script crashed (due to the memory issue above), the cleanup block never ran, leaving stale temp_ingest folders behind. I woke up to 1.94TB of disk usage, mostly from these "zombie" folders left by failed batches (see the cleanup sketch after this list).
- Standard macOS issue, but I had to manually enforce ulimit -n 10240 to stop OSError: Too many open files under the concurrency (a way to set this from inside the script is sketched below).
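
For the zombie folders, here's a minimal sketch of the try/finally pattern I'd suggest, assuming the batch loop can own its scratch directory; run_batch_safely and process_batch are placeholder names, not the library's API:

```python
import shutil
import tempfile

def run_batch_safely(batch):
    # Create the scratch directory up front so the finally block always
    # knows what to remove, even if the batch dies mid-extraction.
    temp_dir = tempfile.mkdtemp(prefix="temp_ingest_")
    try:
        # process_batch stands in for the real extract/parse step
        return process_batch(batch, temp_dir)
    finally:
        # Runs on success, on exceptions, and on KeyboardInterrupt alike,
        # so a crashed batch no longer leaves a multi-GB folder behind.
        shutil.rmtree(temp_dir, ignore_errors=True)
```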
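
And for the open-files limit, instead of remembering to run ulimit -n 10240 in every shell, the soft limit can be raised from inside the script with the stdlib resource module (sketch only, macOS/Linux; 10240 is just the value that worked for me):

```python
import resource

# Raise the soft limit on open file descriptors toward 10240,
# without ever exceeding the hard limit the OS allows.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 10240
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```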
My patched extract_tar_files logic: I modified the extraction loop so it no longer swallows those massive files:
```python
# In extract_tar_files(temp_dir):
# ... inside the loop ...
if fname.endswith(('.htm', '.html', '.xml', '.txt')):
    f = tar.extractfile(member)
    # OLD: content = f.read().decode('utf-8', errors='ignore')
    # NEW: limit read to first 500KB to prevent memory crash on 200MB+ Exhibits
    content = f.read(500000).decode('utf-8', errors='ignore')
    parsed = parse_document_content(content, fname)
    if parsed:
        doc_filename = fname
        # ... rest of logic
```
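
For what it's worth, TarInfo also exposes each member's uncompressed size (member.size), so the cap could be applied per member before any bytes are read. A self-contained sketch of that variant; iter_document_texts and MAX_DOC_BYTES are my own names, not anything in the library:

```python
import tarfile

MAX_DOC_BYTES = 500_000  # hypothetical cap, matching the 500KB read limit above

def iter_document_texts(tar_path):
    """Yield (filename, text) for document members, reading at most
    MAX_DOC_BYTES of each so 250MB exhibits never land in memory in full."""
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.name.endswith(('.htm', '.html', '.xml', '.txt')):
                continue
            f = tar.extractfile(member)
            if f is None:  # skip directories, links, etc.
                continue
            # member.size is the uncompressed size, known before reading
            text = f.read(min(member.size, MAX_DOC_BYTES)).decode('utf-8', errors='ignore')
            yield member.name, text
```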
Low priority bug for now, will look into it. tyvm.