[Feature Request] Optimize fragmented text spans for better browser translation compatibility

### Is the feature relevant to the Firefox PDF Viewer?

No

### Feature description

## Background & Motivation
 
I'm developing a tool to help users translate PDF documents using **browser built-in translation features** (e.g., Chrome Translate, Edge Translator, Firefox Translations). However, I encountered a critical limitation:
 
### The Problem:
PDF.js renders text content as highly fragmented `<span>` elements, which **breaks browser translation engines**. 
 
Here's what happens when you try to translate an MIT course PDF using Edge Translator:

**Original text (readable):**
<img width="1147" height="617" alt="Image" src="https://github.com/user-attachments/assets/d30f9305-01fb-4fe3-85f2-be976e27774d" />

**Translation result (completely unusable):**
<img width="1129" height="607" alt="Image" src="https://github.com/user-attachments/assets/89a487d9-9e06-45ed-bc68-6ce278ebe1ed" />

When I first encountered these translation issues, the obvious solution would have been to **abandon PDF.js and use specialized PDF translation tools** (e.g., Adobe Acrobat with built-in translation, dedicated PDF translators, or OCR-based solutions).
However, I made a deliberate choice to invest time in fixing this because:
 
 **PDF.js's Layout Fidelity is Unmatched**
PDF.js provides **pixel-perfect rendering** that preserves:
- Original typography and font rendering
- Precise positioning of diagrams and figures
- Complex multi-column layouts
- Mathematical notation positioning
- Embedded vector graphics
 
### My Solution
After studying the PDF.js architecture, I developed an algorithm that:

- Merges fragmented text items into natural text blocks before DOM rendering
- Detects superscripts/subscripts by analyzing Y-offset and font scale
- Marks them with zero-width Unicode characters (`\u200B` for superscript, `\u200C` for subscript)
- Converts the marked runs to HTML `<sup>`/`<sub>` tags during rendering
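
To make the steps above concrete, here is a minimal sketch of the idea, not the actual code from my repository. The `TextItem` shape is a simplified, hypothetical stand-in for the text items PDF.js produces, and the function names (`groupLines`, `classify`, `mergeLine`, `toHtml`) and all thresholds (0.6, 0.85, 0.2, 0.1) are assumptions chosen only for illustration.

```typescript
// Hypothetical, simplified text item; NOT the real PDF.js item shape.
interface TextItem {
  str: string;      // text of the fragment
  x: number;        // left edge in page units
  y: number;        // baseline in PDF coordinates (y grows upward)
  fontSize: number; // rendered font size
}

const SUP_MARK = "\u200B"; // zero-width space: wraps superscript runs
const SUB_MARK = "\u200C"; // zero-width non-joiner: wraps subscript runs

// Group fragments into lines by baseline proximity (assumed tolerance).
function groupLines(items: TextItem[]): TextItem[][] {
  const sorted = [...items].sort((a, b) => b.y - a.y); // top of page first
  const lines: TextItem[][] = [];
  let current: TextItem[] = [];
  let lastY = 0;
  for (const it of sorted) {
    if (current.length > 0 && Math.abs(lastY - it.y) > 0.6 * it.fontSize) {
      lines.push(current);
      current = [];
    }
    current.push(it);
    lastY = it.y;
  }
  if (current.length > 0) lines.push(current);
  return lines;
}

// Assumed heuristic: a script is clearly smaller than the base font and
// has a baseline shifted up (superscript) or down (subscript).
function classify(it: TextItem, baseY: number, baseSize: number): "normal" | "sup" | "sub" {
  const smaller = it.fontSize < baseSize * 0.85;
  const shift = it.y - baseY;
  if (smaller && shift > baseSize * 0.2) return "sup";
  if (smaller && shift < -baseSize * 0.1) return "sub";
  return "normal";
}

// Merge one line's fragments into a single string with marked scripts.
function mergeLine(items: TextItem[]): string {
  if (items.length === 0) return "";
  const sorted = [...items].sort((a, b) => a.x - b.x); // reading order
  // Base fragment = first fragment with the largest font on the line.
  const base = sorted.reduce((m, it) => (it.fontSize > m.fontSize ? it : m));
  let out = "";
  for (const it of sorted) {
    const kind = classify(it, base.y, base.fontSize);
    if (kind === "sup") out += SUP_MARK + it.str + SUP_MARK;
    else if (kind === "sub") out += SUB_MARK + it.str + SUB_MARK;
    else out += it.str;
  }
  return out;
}

// Escape HTML first, then turn marker pairs into <sup>/<sub> tags.
function toHtml(marked: string): string {
  return marked
    .replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;")
    .replace(/\u200B([^\u200B]*)\u200B/g, "<sup>$1</sup>")
    .replace(/\u200C([^\u200C]*)\u200C/g, "<sub>$1</sub>");
}

const demo: TextItem[] = [
  { str: "E = mc", x: 10, y: 100, fontSize: 12 },
  { str: "2", x: 50, y: 105, fontSize: 8 },
  { str: "H", x: 10, y: 80, fontSize: 12 },
  { str: "2", x: 18, y: 78, fontSize: 8 },
  { str: "O", x: 23, y: 80, fontSize: 12 },
];
console.log(groupLines(demo).map(mergeLine).map(toHtml));
// prints [ 'E = mc<sup>2</sup>', 'H<sub>2</sub>O' ]
```

Each merged line becomes one text run instead of many spans, which is the property the translator needs. One caveat of the marker approach: `\u200B` and `\u200C` can legitimately occur in source text (the zero-width non-joiner is meaningful in Persian, for example), so a more robust version would need to escape or strip pre-existing occurrences before marking.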

**After merging spans:**
 
<img width="1115" height="573" alt="Image" src="https://github.com/user-attachments/assets/548ca6a0-9295-400f-8267-9f67b089b2ef" />

**After translation:**
<img width="1127" height="544" alt="Image" src="https://github.com/user-attachments/assets/f1671e5f-ea36-4e2d-a7b4-b1a4117791bf" />

Here is the repository for my project: https://github.com/021gink/PDFtranslate

My project still has several shortcomings, such as text expansion when translating Chinese into English and changes to styles like hyperlinks and italics after line-level merging. Even so, I believe this line-level span merging algorithm offers a new approach. If span merging fits the requirements of the official library, I plan to fork PDF.js, integrate the algorithm in the appropriate place, and submit a PR.



### Other PDF viewers

_No response_
