Here's some feedback (which I hope is useful) from the last 6 days of running the full history (2005–2025). Even on a Mac Studio M3 Ultra (96GB RAM), I hit a few engineering bottlenecks that might be worth knowing about for the library:
- My script kept hanging at 100% CPU on seemingly random batches (July 2018, specifically). I dug in and found those archives contain massive 250MB+ XML files (mostly Exhibit 10 "Material Contracts"). Because the default extract_tar_files uses f.read() to check metadata, it tried to swallow them whole, causing the garbage collector to thrash even with 96GB of RAM.
  The fix: I patched it to cap the read at f.read(500000). It now streams through those batches instantly (patch below).
- When the script crashed (due to the memory issue above), the cleanup block never ran, leaving stale temp_ingest folders behind. I woke up to 1.94TB of disk usage, mostly from these "zombie" folders left by failed batches (see the cleanup sketch after this list).
- Standard macOS issue, but I had to manually enforce ulimit -n 10240 to stop OSError: Too many open files under the concurrency (a way to set this from inside the script is sketched below).
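
For the zombie folders, here's a minimal sketch of the try/finally pattern I'd suggest, assuming the batch loop can own its scratch directory; run_batch_safely and process_batch are placeholder names, not the library's API:

```python
import shutil
import tempfile

def run_batch_safely(batch):
    # Create the scratch directory up front so the finally block always
    # knows what to remove, even if the batch dies mid-extraction.
    temp_dir = tempfile.mkdtemp(prefix="temp_ingest_")
    try:
        # process_batch stands in for the real extract/parse step
        return process_batch(batch, temp_dir)
    finally:
        # Runs on success, on exceptions, and on KeyboardInterrupt alike,
        # so a crashed batch no longer leaves a multi-GB folder behind.
        shutil.rmtree(temp_dir, ignore_errors=True)
```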
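
And for the open-files limit, instead of remembering to run ulimit -n 10240 in every shell, the soft limit can be raised from inside the script with the stdlib resource module (sketch only, macOS/Linux; 10240 is just the value that worked for me):

```python
import resource

# Raise the soft limit on open file descriptors toward 10240,
# without ever exceeding the hard limit the OS allows.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 10240
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```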
My patched extract_tar_files logic: I modified the extraction loop so it no longer swallows those massive files:
```python
# In extract_tar_files(temp_dir):
# ... inside the loop ...
if fname.endswith(('.htm', '.html', '.xml', '.txt')):
    f = tar.extractfile(member)
    # OLD: content = f.read().decode('utf-8', errors='ignore')
    # NEW: limit read to first 500KB to prevent memory crash on 200MB+ Exhibits
    content = f.read(500000).decode('utf-8', errors='ignore')
    parsed = parse_document_content(content, fname)
    if parsed:
        doc_filename = fname
        # ... rest of logic
```
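
For what it's worth, TarInfo also exposes each member's uncompressed size (member.size), so the cap could be applied per member before any bytes are read. A self-contained sketch of that variant; iter_document_texts and MAX_DOC_BYTES are my own names, not anything in the library:

```python
import tarfile

MAX_DOC_BYTES = 500_000  # hypothetical cap, matching the 500KB read limit above

def iter_document_texts(tar_path):
    """Yield (filename, text) for document members, reading at most
    MAX_DOC_BYTES of each so 250MB exhibits never land in memory in full."""
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.name.endswith(('.htm', '.html', '.xml', '.txt')):
                continue
            f = tar.extractfile(member)
            if f is None:  # skip directories, links, etc.
                continue
            # member.size is the uncompressed size, known before reading
            text = f.read(min(member.size, MAX_DOC_BYTES)).decode('utf-8', errors='ignore')
            yield member.name, text
```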
Low priority bug for now, will look into it. tyvm.