
[Feature Request] Optimize fragmented text spans for better browser translation compatibility #20533

@021gink


Is the feature relevant to the Firefox PDF Viewer?

No

Feature description

Background & Motivation

I'm developing a tool that helps users translate PDF documents with the translation features built into browsers (e.g., Chrome Translate, Edge Translator, Firefox Translations). However, I ran into a critical limitation:

The Problem:

PDF.js renders text content as highly fragmented <span> elements, which breaks browser translation engines.

Here's what happens when you try to translate an MIT course PDF using Edge Translator:
Original text (readable):
[Image: screenshot of the original text]

Translation result (completely unusable):
[Image: screenshot of the translated text]

When I first encountered these translation issues, the obvious solution would have been to abandon PDF.js and use specialized PDF translation tools (e.g., Adobe Acrobat with built-in translation, dedicated PDF translators, or OCR-based solutions).
However, I made a deliberate choice to invest time in fixing this because:

PDF.js's Layout Fidelity is Unmatched
PDF.js provides pixel-perfect rendering that preserves:

  • Original typography and font rendering
  • Precise positioning of diagrams and figures
  • Complex multi-column layouts
  • Mathematical notation positioning
  • Embedded vector graphics

My Solution
After studying the PDF.js architecture, I developed an algorithm that:

  • Merges fragmented text items into natural text blocks before DOM rendering
  • Detects superscripts/subscripts by analyzing Y-offset and font scale
  • Marks them with zero-width Unicode characters (\u200B for superscript, \u200C for subscript)
  • Converts the marked runs to HTML <sup> / <sub> tags during rendering
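
To make the steps above concrete, here is a minimal TypeScript sketch. It is not the code from the linked repository. The `Fragment` shape, the function names (`classify`, `mergeLines`, `toHtml`), the thresholds, and the convention of wrapping a script run in the marker on both sides are all assumptions made for illustration. Real PDF.js text items carry `str`, `transform`, `width`, and `height` rather than the flattened `x`/`y`/`fontSize` fields used here, and the sketch assumes fragments arrive in reading order.

```typescript
// Hypothetical, simplified stand-in for a PDF.js text item.
interface Fragment {
  str: string;
  x: number;        // left edge
  y: number;        // baseline in PDF coordinates (y grows upward)
  width: number;
  fontSize: number;
}

type Kind = "normal" | "sup" | "sub";

const SUP_MARK = "\u200B"; // zero-width space marks superscripts
const SUB_MARK = "\u200C"; // zero-width non-joiner marks subscripts

// Classify a fragment against the current baseline fragment.
// The 0.85 / 0.2 / 0.1 thresholds are illustrative assumptions.
function classify(frag: Fragment, base: Fragment): Kind {
  const dy = frag.y - base.y;
  const smaller = frag.fontSize < base.fontSize * 0.85;
  const nearLine = Math.abs(dy) < base.fontSize;
  if (smaller && nearLine && dy > base.fontSize * 0.2) return "sup";
  if (smaller && nearLine && dy < -base.fontSize * 0.1) return "sub";
  return "normal";
}

// Merge fragments into one string per visual line, wrapping
// superscript/subscript runs in their zero-width markers.
function mergeLines(frags: Fragment[]): string[] {
  const lines: string[] = [];
  let text = "";
  let base: Fragment | null = null; // last normal fragment on this line
  let prev: Fragment | null = null; // last fragment of any kind

  for (const frag of frags) {
    if (base === null || prev === null) {
      text = frag.str;
      base = frag;
      prev = frag;
      continue;
    }
    const kind = classify(frag, base);
    if (kind === "sup") {
      text += SUP_MARK + frag.str + SUP_MARK;
    } else if (kind === "sub") {
      text += SUB_MARK + frag.str + SUB_MARK;
    } else if (Math.abs(frag.y - base.y) <= base.fontSize * 0.1) {
      // Same baseline: insert a space only across a visible gap.
      const gap = frag.x - (prev.x + prev.width);
      text += (gap > base.fontSize * 0.25 ? " " : "") + frag.str;
      base = frag;
    } else {
      // Baseline moved: close the current line and start a new one.
      lines.push(text);
      text = frag.str;
      base = frag;
    }
    prev = frag;
  }
  if (frags.length > 0) lines.push(text);
  return lines;
}

function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Turn the zero-width markers into <sup>/<sub> tags at render time.
function toHtml(line: string): string {
  return escapeHtml(line)
    .replace(/\u200B([^\u200B]*)\u200B/g, "<sup>$1</sup>")
    .replace(/\u200C([^\u200C]*)\u200C/g, "<sub>$1</sub>");
}

// Usage: six fragments collapse into two lines.
const sample: Fragment[] = [
  { str: "E = mc", x: 0, y: 100, width: 36, fontSize: 12 },
  { str: "2", x: 36, y: 105, width: 4, fontSize: 8 },
  { str: "for H", x: 46, y: 100, width: 28, fontSize: 12 },
  { str: "2", x: 74, y: 98, width: 4, fontSize: 8 },
  { str: "O", x: 78, y: 100, width: 8, fontSize: 12 },
  { str: "Next line", x: 0, y: 84, width: 50, fontSize: 12 },
];
const html: string[] = mergeLines(sample).map(toHtml);
// html is ["E = mc<sup>2</sup> for H<sub>2</sub>O", "Next line"]
```

Emitting one element per merged line gives the translation engine a full sentence-like unit instead of isolated glyph runs. The trade-off is that per-fragment positioning and styling information is lost unless it is tracked separately.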

After merging spans:
[Image: screenshot of the merged spans]

After translation:
[Image: screenshot of the translated result]

Here is the repository for my project: https://github.com/021gink/PDFtranslate

My project still has several shortcomings, such as text expansion when translating Chinese into English and changes to styles like hyperlinks and italics after line-level merging. Even so, I believe this line-level span merging algorithm offers a new approach. If span merging fits the requirements of the official library, I plan to fork it, integrate the algorithm into the appropriate place, and submit a PR.

Other PDF viewers

No response
