Slightly corrupted PDFs throw on parsing #28

We would like this library to handle slightly corrupted PDFs more gracefully. We have hit the following error in production twice when parsing PDFs:

```json
{
  "stack": "Error: AES ciphertext length must be multiple of 16, got 7204\n    at aesDecrypt (/dist/main.js:1473705:55)\n    at AES256Handler.decrypt (/dist/main.js:1473956:12)\n    at AES256Handler.decryptString (/dist/main.js:1473674:17)\n    at StandardSecurityHandler.decryptString (/dist/main.js:1474436:40)\n    at decryptObject (/dist/main.js:1475806:74)\n    at decryptObject (/dist/main.js:1475824:74)\n    at Object.getObject (/dist/main.js:1475842:55)\n    at ObjectRegistry.resolver (/dist/main.js:1481323:43)\n    at ObjectRegistry.resolve (/dist/main.js:1472836:24)\n    at PDFContext.resolve (/dist/main.js:1476967:26)",
  "message": "AES ciphertext length must be multiple of 16, got 7204"
}
```

After inspecting the source code with Claude Code, I got this response:

```
 Root Cause

  The error originates at src/security/ciphers/aes.ts:74. The aesDecrypt function:

  1. Receives the raw data (IV + ciphertext)
  2. Extracts the 16-byte IV from the front
  3. Strictly validates that the remaining ciphertext is a multiple of 16 bytes (AES block size)
  4. Throws if it's not

  In your case, the ciphertext portion is 7204 bytes (7204 % 16 = 4 extra bytes), so the check fails.

  Why It Happens

  The full call chain has zero error handling:

  document-parser.ts:542  →  securityHandler.decryptString(obj.bytes, ...)   // no try-catch
    → standard-handler.ts:455  →  this.handlers.stringHandler.decryptString(...)
      → abstract.ts:36  →  this.decrypt(...)
        → aes256.ts:24  →  aesDecrypt(this.fileKey, data)
          → aes.ts:74   →  💥 throws Error

  No layer in this chain catches the error, so it propagates straight to the caller.

  Why the Data Is Misaligned

  This typically happens with real-world PDFs when:

  - Buggy PDF generators didn't properly PKCS#7-pad the data before encryption
  - String boundary parsing grabbed slightly too many or too few bytes from the hex/literal string
  - Corruption — a few bytes were lost or added during processing
  - Mixed encryption — the PDF claims AES but some objects were written with RC4 or no encryption

  Comparison to pdf.js / PDFBox

  Both reference libraries handle this gracefully rather than crashing:
  - pdf.js wraps decryption in try-catch and returns the original (undecrypted) bytes on failure
  - PDFBox catches exceptions, logs a warning, and continues parsing

  The Fix

  Two things should happen, following the project's "be super lenient" design principle:

  1. aesDecrypt should truncate to the nearest block boundary instead of throwing — this way the maximum amount of data can still be decrypted
  2. decryptObject should wrap decryption in try-catch as a safety net — if decryption fails for any reason, return the original object with a warning instead of crashing
```
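To make the first suggestion in the response concrete, here is a minimal sketch of truncating to the nearest block boundary. `splitAlignedCiphertext` is a hypothetical helper I made up for illustration, not a function in this library, and it assumes the input is laid out as a 16-byte IV followed by the ciphertext, as described in the analysis.

```typescript
const AES_BLOCK_SIZE = 16;

interface AlignedInput {
  iv: Uint8Array;
  ciphertext: Uint8Array;
  /** Number of trailing bytes that were ignored. */
  dropped: number;
}

// Hypothetical helper (not part of the library): splits IV-prefixed data
// and ignores trailing bytes that do not fill a whole AES block.
function splitAlignedCiphertext(data: Uint8Array): AlignedInput {
  if (data.length < AES_BLOCK_SIZE) {
    // Not even a complete IV, so nothing can be decrypted.
    return {
      iv: new Uint8Array(0),
      ciphertext: new Uint8Array(0),
      dropped: data.length,
    };
  }
  const iv = data.subarray(0, AES_BLOCK_SIZE);
  const rest = data.subarray(AES_BLOCK_SIZE);
  const dropped = rest.length % AES_BLOCK_SIZE;
  // Keep only the whole blocks; subarray creates a view, not a copy.
  const ciphertext = rest.subarray(0, rest.length - dropped);
  return { iv, ciphertext, dropped };
}
```

With the 7204-byte ciphertext from our error, this would keep 7200 bytes and ignore 4. One caveat: the last full block may not end in valid PKCS#7 padding after truncation, so padding removal would probably need to be lenient as well.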

Perhaps we should take the same approach as PDFBox and continue parsing the PDF even if the ciphertext is misaligned? That seems to align with what you want this library to become.
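A rough sketch of what I mean by "continue parsing", assuming a safety-net wrapper around whatever decrypt function is in use. `decryptLeniently`, `DecryptFn`, and `LenientResult` are names I invented for this example; the real change would presumably live in `decryptObject`, which I have not tried to reproduce here.

```typescript
type DecryptFn = (data: Uint8Array) => Uint8Array;

interface LenientResult {
  bytes: Uint8Array;
  decrypted: boolean;
  warning?: string;
}

// Hypothetical wrapper (not the library's API): if decryption throws,
// keep the original bytes and report a warning instead of aborting the parse.
function decryptLeniently(decrypt: DecryptFn, data: Uint8Array): LenientResult {
  try {
    return { bytes: decrypt(data), decrypted: true };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return {
      bytes: data,
      decrypted: false,
      warning: `Decryption failed, keeping raw bytes: ${message}`,
    };
  }
}

// Usage with a stand-in strict decryptor that mimics the current error.
const strictDecrypt: DecryptFn = (data) => {
  if (data.length % 16 !== 0) {
    throw new Error(`AES ciphertext length must be multiple of 16, got ${data.length}`);
  }
  return data; // placeholder: a real implementation would decrypt here
};

const result = decryptLeniently(strictDecrypt, new Uint8Array(7204));
if (result.warning) console.warn(result.warning);
```

The caller gets the raw bytes back plus a warning it can log, so one bad string does not fail the whole document.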

Thanks for making an awesome PDF lib!!
