Slightly corrupted pdfs throw on parsing #28

@redsuperbat

We would like this library to handle slightly corrupted PDFs more gracefully. We have hit the following error in production twice while parsing PDFs:

{
  "stack": "Error: AES ciphertext length must be multiple of 16, got 7204\n    at aesDecrypt (/dist/main.js:1473705:55)\n    at AES256Handler.decrypt (/dist/main.js:1473956:12)\n    at AES256Handler.decryptString (/dist/main.js:1473674:17)\n    at StandardSecurityHandler.decryptString (/dist/main.js:1474436:40)\n    at decryptObject (/dist/main.js:1475806:74)\n    at decryptObject (/dist/main.js:1475824:74)\n    at Object.getObject (/dist/main.js:1475842:55)\n    at ObjectRegistry.resolver (/dist/main.js:1481323:43)\n    at ObjectRegistry.resolve (/dist/main.js:1472836:24)\n    at PDFContext.resolve (/dist/main.js:1476967:26)",
  "message": "AES ciphertext length must be multiple of 16, got 7204"
}

After inspecting the source code with Claude Code, I got this response:

 Root Cause

  The error originates at src/security/ciphers/aes.ts:74. The aesDecrypt function:

  1. Receives the raw data (IV + ciphertext)
  2. Extracts the 16-byte IV from the front
  3. Strictly validates that the remaining ciphertext is a multiple of 16 bytes (AES block size)
  4. Throws if it's not

  In your case, the ciphertext portion is 7204 bytes (7204 % 16 = 4 extra bytes), so the check fails.
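  The strict check described above can be sketched roughly as follows. This is only an illustration, not the library's actual code: `splitAndValidate` and `AES_BLOCK_SIZE` are hypothetical names, and the sketch assumes the input is laid out as a 16-byte IV followed by the ciphertext.

```typescript
const AES_BLOCK_SIZE = 16;

// Hypothetical helper illustrating the strict check; not the library's real function.
function splitAndValidate(data: Uint8Array): { iv: Uint8Array; ciphertext: Uint8Array } {
  // Assumption: the first 16 bytes are the IV, the rest is ciphertext.
  const iv = data.subarray(0, AES_BLOCK_SIZE);
  const ciphertext = data.subarray(AES_BLOCK_SIZE);

  // Strict validation: any partial trailing block is rejected outright.
  if (ciphertext.length % AES_BLOCK_SIZE !== 0) {
    throw new Error(`AES ciphertext length must be multiple of 16, got ${ciphertext.length}`);
  }
  return { iv, ciphertext };
}

// 7204 = 450 * 16 + 4, so 4 trailing bytes do not fill a complete block.
const remainder = 7204 % AES_BLOCK_SIZE;
```

  With a 7204-byte ciphertext the remainder is 4, which is why the check throws.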

  Why It Happens

  The full call chain has zero error handling:

  document-parser.ts:542  →  securityHandler.decryptString(obj.bytes, ...)   // no try-catch
    → standard-handler.ts:455  →  this.handlers.stringHandler.decryptString(...)
      → abstract.ts:36  →  this.decrypt(...)
        → aes256.ts:24  →  aesDecrypt(this.fileKey, data)
          → aes.ts:74   →  💥 throws Error

  No layer in this chain catches the error, so it propagates straight to the caller.

  Why the Data Is Misaligned

  This typically happens with real-world PDFs when:

  - Buggy PDF generators didn't properly PKCS#7-pad the data before encryption
  - String boundary parsing grabbed slightly too many or too few bytes from the hex/literal string
  - Corruption — a few bytes were lost or added during processing
  - Mixed encryption — the PDF claims AES but some objects were written with RC4 or no encryption

  Comparison to pdf.js / PDFBox

  Both reference libraries handle this gracefully rather than crashing:
  - pdf.js wraps decryption in try-catch and returns the original (undecrypted) bytes on failure
  - PDFBox catches exceptions, logs a warning, and continues parsing

  The Fix

  Two things should happen, following the project's "be super lenient" design principle:

  1. aesDecrypt should truncate to the nearest block boundary instead of throwing — this way the maximum amount of data can still be decrypted
  2. decryptObject should wrap decryption in try-catch as a safety net — if decryption fails for any reason, return the original object with a warning instead of crashing
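  The two proposed changes could look something like the sketch below. All names here (`truncateToBlockBoundary`, `decryptLeniently`, the `decrypt` and `warn` parameters) are hypothetical stand-ins, not the library's API; the real `aesDecrypt` and `decryptObject` may be structured quite differently.

```typescript
const BLOCK_SIZE = 16;

// Hypothetical helper: drop trailing bytes that do not fill a complete AES block.
function truncateToBlockBoundary(ciphertext: Uint8Array): Uint8Array {
  const usable = ciphertext.length - (ciphertext.length % BLOCK_SIZE);
  return ciphertext.subarray(0, usable);
}

// Hypothetical safety net: `decrypt` stands in for whatever the security
// handler does. On failure, warn and return the original bytes unchanged.
function decryptLeniently(
  bytes: Uint8Array,
  decrypt: (data: Uint8Array) => Uint8Array,
  warn: (message: string) => void = console.warn,
): Uint8Array {
  try {
    return decrypt(bytes);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    warn(`Decryption failed, keeping original bytes: ${reason}`);
    return bytes;
  }
}
```

  For the 7204-byte case, `truncateToBlockBoundary` would keep the first 7200 bytes. One caveat, stated as an assumption rather than something verified against the code: after truncation the final block will probably not end in valid PKCS#7 padding, so the unpadding step would likely need to be tolerant as well.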

Perhaps we should take the same approach as PDFBox and continue parsing the PDF even when there is a misalignment? That seems to align with what you want this library to become.

Thanks for making an awesome PDF lib!!
