Description
Is the feature relevant to the Firefox PDF Viewer?
No
Feature description
Background & Motivation
I'm developing a tool to help users translate PDF documents using browser built-in translation features (e.g., Chrome Translate, Edge Translator, Firefox Translations). However, I encountered a critical limitation:
The Problem:
PDF.js renders text content as highly fragmented <span> elements, which breaks browser translation engines.
Here's what happens when you try to translate an MIT course PDF using Edge Translator:
Original text (readable):

Translation result (completely unusable):

When I first encountered these translation issues, the obvious solution would have been to abandon PDF.js and use specialized PDF translation tools (e.g., Adobe Acrobat with built-in translation, dedicated PDF translators, or OCR-based solutions).
However, I made a deliberate choice to invest time in fixing this because:
PDF.js's Layout Fidelity is Unmatched
PDF.js renders pages faithfully to the original layout, preserving:
- Original typography and font rendering
- Precise positioning of diagrams and figures
- Complex multi-column layouts
- Mathematical notation positioning
- Embedded vector graphics
My Solution
After studying the PDF.js architecture, I developed an algorithm that:
- Merges fragmented text items into natural text blocks before DOM rendering
- Detects superscripts/subscripts by analyzing Y-offset and font scale
- Marks them with zero-width Unicode characters (\u200B for superscript, \u200C for subscript)
- Converts the markers to HTML <sup> / <sub> tags during rendering
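To make the steps above concrete, here is a minimal sketch of the idea, not the actual code from my repository. The `Item` shape is a simplified, hypothetical stand-in for a PDF.js text item (real items carry more fields), `mergeItems` and `toHtml` are illustrative names, and the thresholds (10% baseline tolerance, 85% font scale, 25% gap for a space) are assumptions chosen for the example. It also assumes that items arrive in reading order, that y increases upward as in PDF user space, and that each superscript or subscript run is wrapped in a pair of markers.

```typescript
// Hypothetical, simplified text item; real PDF.js items carry more fields.
interface Item {
  str: string;
  x: number;        // left edge of the item
  y: number;        // baseline, increasing upward (assumption)
  width: number;
  fontSize: number;
}

interface Line {
  text: string;
  baseline: number;
  fontSize: number;
  endX: number;     // right edge of the last merged item
}

const SUP = "\u200B"; // zero-width space wraps superscript runs
const SUB = "\u200C"; // zero-width non-joiner wraps subscript runs

// Merge items into line strings, marking raised/lowered small items.
function mergeItems(items: Item[]): string[] {
  const lines: Line[] = [];
  let cur: Line | null = null;
  for (const it of items) {
    if (cur !== null) {
      const dy = it.y - cur.baseline;
      if (Math.abs(dy) <= 0.1 * cur.fontSize) {
        // Same baseline: append, inserting a space across visible gaps.
        const gap = it.x - cur.endX > 0.25 * cur.fontSize ? " " : "";
        cur.text += gap + it.str;
        cur.endX = it.x + it.width;
        continue;
      }
      const smaller = it.fontSize < 0.85 * cur.fontSize;
      const near = Math.abs(dy) < cur.fontSize;
      if (smaller && near) {
        // Shifted and scaled down: treat as superscript or subscript.
        const mark = dy > 0 ? SUP : SUB;
        cur.text += mark + it.str + mark;
        cur.endX = it.x + it.width;
        continue;
      }
    }
    // Otherwise start a new line.
    cur = { text: it.str, baseline: it.y, fontSize: it.fontSize, endX: it.x + it.width };
    lines.push(cur);
  }
  return lines.map((l) => l.text);
}

function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Convert marker pairs to <sup>/<sub> tags at render time.
function toHtml(line: string): string {
  return escapeHtml(line)
    .replace(/\u200B([^\u200B]*)\u200B/g, "<sup>$1</sup>")
    .replace(/\u200C([^\u200C]*)\u200C/g, "<sub>$1</sub>");
}
```

With this sketch, each merged line becomes a single string that can be placed in one span, so a translation engine sees a whole sentence fragment instead of isolated glyph runs. The markers are invisible in the intermediate text and are only turned into tags by `toHtml`.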
After merging spans:
Here is the repository for my project: https://github.com/021gink/PDFtranslate
My project still has several shortcomings, such as text expansion when translating Chinese into English, and changes to styles like hyperlinks and italics after line-level merging. Even so, I believe this line-level span merging algorithm offers a new approach. If span merging fits the requirements of the official library, I plan to fork it, integrate the algorithm in the appropriate place, and submit a PR.
Other PDF viewers
No response
