
performance issues july 2018 #105

@john-friedman

Description

From email

Here’s some feedback (which you’ll hopefully find useful) from the last six days of running the full history (2005–2025). Even on a Mac Studio M3 Ultra (96GB RAM), I hit a few engineering bottlenecks worth flagging for the library:

  1. My script kept hanging at 100% CPU on random batches (specifically July 2018). I dug in and found the archives contained massive 250MB+ XML files (mostly Exhibit 10 "Material Contracts"). Because the default extract_tar_files uses f.read() to check metadata, it tried to swallow them whole, causing the GC to thrash even with 96GB RAM.

The Fix: I patched it to use f.read(500000) to limit the read buffer. It now streams through them instantly.

  2. When the script crashed (due to the memory issue above), the cleanup block wouldn't execute, leaving stale temp_ingest folders behind. I woke up to 1.94TB of disk usage, mostly from these "zombie" runs from failed batches; see the cleanup sketch after this list.

  3. Standard macOS issue, but I had to manually enforce ulimit -n 10240 to stop OSError: Too many open files under the concurrency; a programmatic version is sketched after this list.
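
For the zombie-folder problem, here's a minimal sketch of one way to guarantee cleanup even when a batch crashes. The names ingest_dir, sweep_stale, and the temp_ingest_ prefix are my own assumptions for illustration, not the library's API:

import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def ingest_dir(base):
    # Hypothetical helper: create one temp_ingest folder per batch and
    # guarantee removal even if the batch raises (e.g. the memory crash
    # described above).
    path = Path(tempfile.mkdtemp(prefix="temp_ingest_", dir=base))
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)

def sweep_stale(base):
    # On startup, clear "zombie" folders left behind by earlier crashed runs.
    for stale in Path(base).glob("temp_ingest_*"):
        shutil.rmtree(stale, ignore_errors=True)

A run would then call sweep_stale(work_dir) once at startup and wrap each batch in "with ingest_dir(work_dir) as tmp:", so the finally block handles removal no matter how the batch exits.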
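
For the file-descriptor limit, an alternative to running ulimit -n manually is to raise the soft limit from inside the script with the standard resource module (POSIX-only; a sketch, and it must run before the concurrent workers start opening files):

import resource

# Programmatic equivalent of `ulimit -n 10240` (macOS/Linux only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
wanted = 10240
if hard != resource.RLIM_INFINITY:
    # The soft limit can never exceed the hard limit.
    wanted = min(wanted, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))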

My patched extract_tar_files logic: I modified the extraction loop so it no longer swallows those massive files:

In extract_tar_files(temp_dir), inside the member loop:

if fname.endswith(('.htm', '.html', '.xml', '.txt')):
    f = tar.extractfile(member)
    if f is None:
        # Directories and special members have no file object; skip them.
        continue

    # OLD: content = f.read().decode('utf-8', errors='ignore')
    # NEW: limit the read to the first 500KB to prevent the memory crash
    # on 200MB+ exhibits
    content = f.read(500000).decode('utf-8', errors='ignore')

    parsed = parse_document_content(content, fname)
    if parsed:
        doc_filename = fname
        # ... rest of logic
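
One note on the design choice: truncating at 500KB is safe here only because the content is used for metadata detection, so losing the tail of a 250MB exhibit doesn't matter; anything downstream that needed the full body would silently lose it. A complementary guard (my assumption, not part of the patch) is to skip oversized members before reading anything, since tarfile already parses each member's size from its header:

def should_parse(member, max_bytes=200 * 1024 * 1024):
    # Hypothetical helper: member is a tarfile.TarInfo, and .size comes
    # straight from the tar header, so this check costs nothing and avoids
    # ever reading the 250MB+ exhibits.
    return member.isfile() and member.size <= max_bytes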

Low-priority bug for now, will look into it. tyvm.
