Exploring efficient projector designs for VLLMs.
Window Token Concatenation for Efficient Visual Large Language Models [Paper]
Authors: Yifan Li, Wentao Bao, Botao Ye, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong
Abstract: To effectively reduce the number of visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group diverse tokens into one and thus obscure some fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens, encouraging those within the same window to exhibit similar features. To further enhance performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in later layers of the LLM. This design enjoys the merits of the large perception field of the LLM for fine-grained visual understanding while keeping a small number of visual tokens for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance compared with existing token reduction projectors.
Figure 1. Framework of our WiCo+. WiCo+ consists of two main components, i.e., a dynamic window token concatenation projector (WiCo) and a token decomposition strategy in the later layers of the LLM decoder. WiCo first learns similar local token representations by fine-tuning the last K_v self-attention layers of a pretrained vision encoder (say CLIP). Then, a sliding window is applied to the 2-D token map to perform concatenation, and an MLP projects these visual tokens into the language space. To further enlarge the perception field of the remaining visual tokens, we decompose the visual tokens in the later layers (say the last K_l layers) of the LLM decoder, which benefits fine-grained understanding tasks.
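To make the projector concrete, below is a minimal PyTorch sketch of the window-token-concatenation idea: tokens inside each window of the 2-D token map are flattened into a single vector and an MLP projects the result into the LLM embedding space. This is an illustrative sketch, not the repository's implementation; the class name, window size, and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowConcatProjector(nn.Module):
    """Sketch: concatenate the tokens in each (window x window) region of the
    2-D token map, then project the concatenated vector into the LLM space."""
    def __init__(self, vision_dim=1024, llm_dim=4096, window=2):
        super().__init__()
        self.window = window
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * window * window, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, tokens):
        # tokens: (B, N, C) visual tokens from the vision encoder, N = H * W
        B, N, C = tokens.shape
        H = W = int(N ** 0.5)
        x = tokens.view(B, H, W, C).permute(0, 3, 1, 2)               # (B, C, H, W)
        # Non-overlapping sliding window: unfold flattens each window's tokens
        # into one vector per window position.
        x = F.unfold(x, kernel_size=self.window, stride=self.window)  # (B, C*w*w, N/w^2)
        x = x.transpose(1, 2)                                         # (B, N/w^2, C*w*w)
        return self.mlp(x)                                            # project into LLM space

# Example: a 24x24 CLIP token map (576 tokens) is reduced to 144 tokens (4x fewer).
proj = WindowConcatProjector()
out = proj(torch.randn(1, 576, 1024))   # -> (1, 144, 4096)
```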
Note: Our repository supports multiple baselines for visual token reduction:
| Name | Type | Paper | Venue |
|---|---|---|---|
| Token-mixer | Mixing tokens | MLP-Mixer: An all-MLP Architecture for Vision | NeurIPS 2021 |
| Concatenation | Concatenate neighbor tokens | MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning | ICLR 2024 |
| Perceiver | Cross-attention | Flamingo: A Visual Language Model for Few-Shot Learning | NeurIPS 2022 |
| C-Abstractor | Convolution + pooling | Honeybee: Locality-Enhanced Projector for Multimodal LLM | CVPR 2024 |
| Tokenfilter | Token selection | TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | CVPR 2024 |
| LLaVA-Prumerge | Selection + merging | LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | arXiv 2024 |
| ToMe | Token Merging | Token Merging: Your ViT But Faster | ICLR 2023 |
Figure 2. (a) Current projector types (left) and ours (right) for VLLM token reduction. Existing token reduction projectors are mainly based on (i) selection, (ii) merging, (iii) concatenation and (iv) cross-attention.
(b) This figure illustrates that the performance of VLLMs is sensitive to the type of downstream task when the number of visual tokens changes. Specifically, when visual tokens are reduced (e.g., by selection/merging), performance drops more significantly on fine-grained understanding tasks than on coarse-grained ones.
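For reference, the "selection" and "merging" families listed in the table above can be summarized in a few lines. The snippet below is an illustrative sketch, not code from this repository; the scoring and grouping schemes are simplified assumptions.

```python
import torch

def select_tokens(tokens, scores, k):
    """Selection-style reduction: keep the k highest-scoring visual tokens."""
    # tokens: (B, N, C); scores: (B, N), e.g. attention-based importance scores
    idx = scores.topk(k, dim=1).indices                                        # (B, k)
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

def merge_tokens(tokens, group=4):
    """Merging-style reduction: average every `group` consecutive tokens into one."""
    B, N, C = tokens.shape
    return tokens.view(B, N // group, group, C).mean(dim=2)
```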
- Clone this repository and navigate to the LLaVA folder:

```bash
git clone https://github.com/JackYFL/WiCo.git
cd WiCo/LLaVA
```

- Execute the install.sh bash file:

```bash
. install.sh
```
Note: The install bash file includes two steps:

Step 1: Install packages

```bash
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

Step 2: Install additional packages for training

```bash
pip install -e ".[train]"
pip install flash-attn==2.5.3 --no-build-isolation
```

The training process of LLaVA includes two steps: pretraining (tuning only the projector) and finetuning (tuning both the projector and the LLM).
Before training, all datasets used in LLaVA should be prepared according to the instructions in the Data document.
We provide bash scripts in scripts/v1_5. For instance, to train WiCo, just execute:

```bash
cd LLaVA
. scripts/v1_5/wico/pretrain.sh
. scripts/v1_5/wico/finetune.sh
```

You can refer to scripts/v1_5/pretrain.sh and scripts/v1_5/finetune.sh for more information.
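The two stages differ mainly in which parameters receive gradients. The snippet below is only a conceptual sketch under assumed attribute names (`projector`, `llm`, `vision_encoder`); the modules actually tuned in each stage, including the last few vision-encoder layers mentioned in the abstract, are defined by the repository's training scripts.

```python
# Conceptual sketch of the two-stage tuning scheme (not the repository's
# training code). Attribute names are assumptions; the stage in which the
# last k_v vision layers are unfrozen follows the actual scripts.
def configure_stage(model, stage, k_v=2):
    for p in model.parameters():               # start with everything frozen
        p.requires_grad = False
    for p in model.projector.parameters():     # the projector is tuned in both stages
        p.requires_grad = True
    if stage == "finetune":
        for p in model.llm.parameters():       # stage 2 also tunes the LLM
            p.requires_grad = True
        for layer in model.vision_encoder.layers[-k_v:]:   # and the last k_v vision layers
            for p in layer.parameters():
                p.requires_grad = True
```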
Before evaluation, all the datasets should be prepared according to the Evaluation document.
We also provide a bash script for convenience. To evaluate WiCo, just execute:

```bash
cd LLaVA
. scripts/v1_5/wico/eval.sh
```

If you find WiCo useful for your research and applications, please cite this BibTeX:
```bibtex
@inproceedings{li2025wico,
  title={Window Token Concatenation for Efficient Visual Large Language Models},
  author={Li, Yifan and Bao, Wentao and Ye, Botao and Tan, Zhen and Chen, Tianlong and Liu, Huan and Kong, Yu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year={2025}
}
```

Thanks for these insightful codebases!
