
[ICCV2025] A Token-level Text Image Foundation Model for Document Understanding

[📂 Project Pages] [📖 Paper] [🤗 Weights] [🤗 Demo] [🤗 Dataset] [🚀 Quick Start]


📝 Introduction

A simple demo

We are excited to announce the release of TokenFD, the first token-level visual foundation model specifically tailored for text-image-related tasks and designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD, we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation's exceptional image-as-text capability, we seamlessly replace previous visual foundation models (VFMs) with TokenFD to construct a document-level MLLM, TokenVL, for VQA-based document understanding tasks.

In summary:

(1) The first token-level image-text dataset (TokenIT) is proposed;

(2) The first token-level text image foundation model, TokenFD, is proposed to support downstream tasks;

(3) The image-as-text semantic capability inspires us to develop TokenVL, a VQA-based MLLM tailored for document perception, understanding, and reasoning.

🛠️ Installation

conda create -n TokenFD python=3.9
conda activate TokenFD
pip install -r requirements.txt

Install flash-attn==2.3.6 (optional):

pip install flash-attn==2.3.6 --no-build-isolation

Alternatively, you can compile it from source:

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.3.6
python setup.py install

If you don't use flash-attn, please modify the weight configs accordingly, referring to this.
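
A minimal sketch of that change, assuming the checkpoint's config.json exposes a use_flash_attn flag (the field names here are assumptions based on InternVL-style configs; check your downloaded config.json):

import json

# Assumed field names; verify them against the config.json shipped with your checkpoint.
config_path = './TokenFD_4096_English_seg/config.json'
with open(config_path) as f:
    config = json.load(f)

if 'vision_config' in config:
    config['vision_config']['use_flash_attn'] = False
config['use_flash_attn'] = False

with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)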

🚀 Quick Start

import os
import torch
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatModel
from utils import post_process, generate_similiarity_map, load_image

checkpoint = './TokenFD_4096_English_seg'
image_path = './demo_images/0000000.png'
input_query = '11/12/2020'
out_dir = 'results'

os.makedirs(out_dir, exist_ok=True)

"""loading model, tokenizer, tok_embeddings """
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, use_fast=False)
model = InternVLChatModel.from_pretrained(checkpoint, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).eval()
model = model.cuda()

"""loading image """
pixel_values, images, target_aspect_ratio = load_image(image_path)
 

"""loading query texts """
# Tokenize the query; prepend a space unless it starts with punctuation or a digit,
# so the BPE segmentation matches how the text appears inside running text.
if input_query[0] in '!"#$%&\'()*+,-./0123456789:;<=>?@^_{|}~':
    input_ids = tokenizer(input_query)['input_ids'][1:]
else:
    input_ids = tokenizer(' ' + input_query)['input_ids'][1:]
input_ids = torch.Tensor(input_ids).long().to(model.device)
input_embeds = model.tok_embeddings(input_ids).clone()
all_bpe_strings = [tokenizer.decode(input_id) for input_id in input_ids]


"""Obtaining similarity """
with torch.no_grad():
    vit_embeds, _ = model.forward_tokenocr(pixel_values.to(model.device))  # (vit_batch_size, 16*16, 2048)
    vit_embeds_local, resized_size = post_process(vit_embeds, target_aspect_ratio)
    token_features = vit_embeds_local / vit_embeds_local.norm(dim=-1, keepdim=True)
    input_embeddings = input_embeds / input_embeds.norm(dim=-1, keepdim=True)
    similarity = input_embeddings @ token_features.t()
    attn_map = similarity.reshape(len(input_embeddings), resized_size[0], resized_size[1])

"""generate map locally """
generate_similiarity_map(images, attn_map, all_bpe_strings, out_dir, target_aspect_ratio)


"""user command """
# python quick_start.py

✨ Streamlit Demo

We are excited to present an interactive demo of our project using Streamlit. This demo allows users to explore the capabilities of our model, TokenFD.

To run the Streamlit demo, install its dependencies and then launch it:
cd streamlit_demo
pip install -r requirement_app.txt
sh run.sh

Features

  • Interactive Interface: Simply upload an image, enter the BPE token you want to query, and click the RUN button to view TokenFD's results.
  • Real-time Results: Both models (InternViT-based and ResNet50-based) give users instant token-level (BPE) feedback.
  • User-Friendly: Designed to be intuitive, even for users without a technical background.

How to Use

  1. Access the Demo: [Link to your Streamlit demo]
  2. Upload a Document or Image: Use the interface to upload your files.
  3. Text Input: Enter text related to the content of the images.
  4. View Results: See how the models generate BPE visualizations in real time.

A simple web UI then opens for interaction:

[Screenshot: the Streamlit web UI]

Feedback

We welcome any feedback or suggestions to improve the demo. Please feel free to reach out via [contact information or GitHub issues].

📺 BPE Token Visualization

Qualitative visualizations of BPE token maps are provided across categories: Scene, Document, Code, Chart, Table, GUI, Chinese, and Punctuation.

🤗 Weights

We have made customized adaptations on the currently open-sourced base models; the model weights are available at:

TokenFD-ResNet50-bilingual

TokenFD-InternViT2.5-bilingual

TokenFD-InternViT2.5-english

TokenFD-QwenViT2.5-bilingual

🏠 Token Family

TokenIT

In the following figure, we provide an overview of the self-constructed token-level TokenIT dataset, comprising 20 million images and 1.8 billion token-mask pairs.

As depicted in Figure 2 (a), each sample in this dataset includes a raw image, a mask image, and a JSON file. The JSON file provides the question-answer pairs and several BPE tokens randomly selected from the answer, along with the ordinal number of each BPE token in the answer and its corresponding pixel value on the mask image. Consequently, each BPE token corresponds one-to-one with a pixel-level mask. The data ratios are summarized in Figure 2 (b). Figure 2 (c) and (d) further provide the number distribution of tokens per image type and a word cloud of the top 100 tokens, respectively.

[Figure 2: Overview of the TokenIT dataset]
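
To make this layout concrete, the following is a hypothetical sketch of one TokenIT annotation written as a Python dict; the field names are illustrative assumptions, not the released schema:

# Hypothetical TokenIT sample; field names are illustrative, not the released schema.
sample = {
    "image": "0000000.png",        # raw image
    "mask": "0000000_mask.png",    # token-level mask image
    "question": "What is the date shown on the receipt?",
    "answer": "11/12/2020",
    "bpe_tokens": [
        # (BPE token, ordinal position in the answer, pixel value on the mask image)
        {"token": "11", "ordinal": 0, "mask_value": 1},
        {"token": "/",  "ordinal": 1, "mask_value": 2},
        {"token": "12", "ordinal": 2, "mask_value": 3},
    ],
}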

The comparisons with other visual foundation models:

| VFM | Granularity | Dataset | #Images | #Pairs |
| --- | --- | --- | --- | --- |
| CLIP | image-level | WIT400M | 400M | 0.4B |
| DINO | image-level | ImageNet | 14M | - |
| SAM | pixel-level | SA1B | 11M | 1.1B |
| TokenFD | token-level | TokenIT | 20M | 1.8B |

Since the full dataset is very large, we temporarily open-source a subset of it here: TokenIT demo link

TokenFD

Model Architecture

An overview of the proposed TokenFD, where the token-level image features and token-level language features are aligned within the same semantic space. This “image-as-text” alignment seamlessly facilitates user-interactive applications, including text segmentation, retrieval, and visual question answering.

[Figure: TokenFD model architecture and image-as-text alignment]
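
As a small illustration of this image-as-text alignment, the token-to-pixel similarity map produced by the Quick Start script can be thresholded into rough text-segmentation masks. This is a sketch under an arbitrary threshold, not the repository's segmentation pipeline:

import torch

# Sketch: turn a (num_tokens, H, W) token-to-pixel similarity map (e.g., `attn_map`
# from the Quick Start) into binary text-segmentation masks, one per queried BPE token.
# The threshold value is arbitrary and only for illustration.
def similarity_to_masks(attn_map: torch.Tensor, threshold: float = 0.2) -> torch.Tensor:
    return attn_map > threshold

# Example (after running the Quick Start): masks = similarity_to_masks(attn_map)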

Model Cards

The following table provides the 🤗 links for all models in the TokenFD series. You can use the prompt ' ' (a single space) to highlight the background, as shown in the short example after the table.

| Model Name | Description |
| --- | --- |
| TokenFD_2048_Bilingual_seg | Backbone is ViT; feature dimension is 2048; supports interaction with both English and Chinese texts. |
| TokenFD_4096_English_seg (we recommend 👍) | Backbone is ViT; feature dimension is 4096; only supports interaction with English texts. |
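
A minimal usage sketch of the space-prompt tip above, reusing the Quick Start script (only the query changes; the background-highlight behavior follows the note above):

# In quick_start.py, replace the query with a single space to highlight the background regions.
input_query = ' '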

Evaluation on Vision Capability

We present a comprehensive evaluation of the vision encoder's performance across various domains and tasks. The evaluation is divided into three key categories:

(1) text retrieval; (2) image segmentation; (3) visual question answering.

This approach allows us to assess the representation quality of TokenFD. Please refer to our technical report for more details.

text retrieval

[Figure: text retrieval results]

image segmentation

[Figure: image segmentation results]

visual question answering

[Figure: visual question answering results]

TokenVL

We employ TokenFD as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding. Following the previous training paradigm, TokenVL also includes two stages:

Stage 1: LLM-guided Token Alignment Training for text parsing tasks.

[Figure: LLM-guided Token Alignment Training framework]

The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily enhance spatial text-perception capabilities by integrating localization prompts to predict coordinates. However, this implicit method makes it difficult for such models to achieve a precise, pixel-level understanding. In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with the corresponding pixels in the input image, enhancing the MLLM's localization awareness.
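
As a conceptual sketch only (not the authors' exact training objective), this explicit alignment can be thought of as supervising each BPE token's similarity map with that token's ground-truth mask:

import torch
import torch.nn.functional as F

# Conceptual sketch of BPE-token alignment supervision; not the exact TokenVL loss.
# vit_feats:    (N, D) L2-normalized visual token features of one image
# token_embeds: (T, D) L2-normalized embeddings of the answer's BPE tokens
# gt_masks:     (T, N) binary ground-truth masks, one per BPE token
def token_alignment_loss(vit_feats: torch.Tensor,
                         token_embeds: torch.Tensor,
                         gt_masks: torch.Tensor) -> torch.Tensor:
    similarity = token_embeds @ vit_feats.t()  # (T, N) token-to-pixel similarity logits
    return F.binary_cross_entropy_with_logits(similarity, gt_masks.float())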

Stage 2: Supervised Instruction Tuning for VQA tasks.

During the Supervised Instruction Tuning stage, we remove the token alignment branch, since answers may not appear in the image for some reasoning tasks (e.g., "How much taller is the red bar compared to the green bar?"). This also avoids any extra computational overhead during inference while improving document understanding capability. Finally, we inherit the remaining weights from the LLM-guided Token Alignment stage and unfreeze all parameters to enable comprehensive parameter updates.

OCRBench Results

[Figure: OCRBench results]

Document Understanding Results

[Figure: document understanding benchmark results]

🤚 Release Plans

  • ✅ Inference code and weights for TokenFD
  • ✅ Code & model checkpoints for TokenVL

  • Release Character-level Text Image Foundation Model (CharOCR)
  • Data for the Pre-training and Fine-tuning of TokenVL
  • TokenIT data and script

More Applications

Text Tracking

Please refer to https://github.com/Token-family/TokenFD/blob/main/demo_images/text_tracking.mp4

🏛 License

This project is released under the MIT License.

📎 Citation

If you find this project useful in your research, please consider citing:

@article{guan2025token,
  title={A Token-level Text Image Foundation Model for Document Understanding},
  author={Guan, Tongkun and Wang, Zining and Fu, Pei and Guo, Zhengtao and Shen, Wei and Zhou, Kai and Yue, Tiezhu and Duan, Chen and Sun, Hao and Jiang, Qianyi and others},
  journal={arXiv preprint arXiv:2503.02304},
  year={2025}
}
