WiCo: Window Token Concatenation for Efficient Visual Large Language Models

The official implementation of our CVPR Workshop 2025 paper, exploring efficient projector designs for VLLMs.

Window Token Concatenation for Efficient Visual Large Language Models [Paper]
Authors: Yifan Li, Wentao Bao, Botao Ye, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong

Abstract: To effectively reduce the visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group diverse tokens into one, and thus obscure some fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens, encouraging tokens within the same window to exhibit similar features. To further enhance the performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in later layers of the LLM. Such a design enjoys the merits of the large perception field of the LLM for fine-grained visual understanding while keeping a small number of visual tokens for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance compared with existing token reduction projectors.

Figure 1. Framework of our WiCo+. WiCo+ consists of two main components: a dynamic window token concatenation projector (WiCo) and a token decomposition strategy in the later layers of the LLM decoder. WiCo first learns similar local token representations with the last K_v self-attention layers of a pretrained vision encoder (e.g., CLIP). A sliding window is then applied to the 2-D token map to concatenate neighboring tokens, and an MLP projects the concatenated tokens into the language space. To further enlarge the perception field of the remaining visual tokens, we decompose the visual tokens in the later layers (the last K_l layers) of the LLM decoder, which benefits fine-grained understanding tasks.
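For intuition, the snippet below is a minimal PyTorch sketch of the core window-concatenation idea, not the authors' implementation: it uses a fixed, non-overlapping window instead of the dynamic/sliding window described above, and the class name, dimensions, and window size are illustrative assumptions.

```python
# Minimal sketch of window token concatenation (illustrative only, not the
# authors' code): tokens on a 2-D grid are grouped by a fixed, non-overlapping
# window, concatenated along the channel axis, and projected to the LLM width
# by an MLP. Class name, dims, and window size are assumptions.
import torch
import torch.nn as nn


class WindowConcatProjector(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, window=2):
        super().__init__()
        self.window = window
        # The MLP maps each concatenated window (window*window tokens) to the LLM width.
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim * window * window, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, tokens):  # tokens: (B, N, C) with N = H * W and H == W
        B, N, C = tokens.shape
        H = W = int(N ** 0.5)
        w = self.window
        x = tokens.view(B, H // w, w, W // w, w, C)       # split the grid into w x w windows
        x = x.permute(0, 1, 3, 2, 4, 5)                   # (B, H/w, W/w, w, w, C)
        x = x.reshape(B, (H // w) * (W // w), w * w * C)  # one concatenated token per window
        return self.mlp(x)                                # (B, N / w^2, llm_dim)


if __name__ == "__main__":
    vis_tokens = torch.randn(1, 576, 1024)   # e.g., CLIP ViT-L/14 @ 336px -> 24 x 24 tokens
    out = WindowConcatProjector()(vis_tokens)
    print(out.shape)                         # torch.Size([1, 144, 4096]), 4x fewer tokens
```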

Note: Our repository supports multiple baselines for visual token reduction:

| Name | Type | Paper | Venue |
|------|------|-------|-------|
| Token-mixer | Mixing tokens | MLP-Mixer: An all-MLP Architecture for Vision | NeurIPS 2021 |
| Concatenation | Concatenating neighboring tokens | MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning | ICLR 2024 |
| Perceiver | Cross-attention | Flamingo: A Visual Language Model for Few-Shot Learning | NeurIPS 2022 |
| C-Abstractor | Convolution + pooling | Honeybee: Locality-Enhanced Projector for Multimodal LLM | CVPR 2024 |
| Tokenfilter | Token selection | TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | CVPR 2024 |
| LLaVA-PruMerge | Selection + merging | LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | arXiv 2024 |
| ToMe | Token merging | Token Merging: Your ViT But Faster | ICLR 2023 |


Figure 2. (a) Existing projector types (left) and ours (right) for VLLM token reduction. Existing token reduction projectors are mainly based on (i) selection, (ii) merging, (iii) concatenation, and (iv) cross-attention. (b) The performance of VLLMs is sensitive to the type of downstream task when the number of visual tokens changes: performance drops more sharply on fine-grained understanding tasks than on coarse-grained ones as visual tokens are reduced.
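The drop on fine-grained tasks in Figure 2(b) is what motivates the WiCo+ token decomposition described above. For intuition only, here is a hypothetical sketch of such a decomposition step: each compressed visual token is expanded back into several tokens (here via a simple linear layer) before the last K_l decoder layers. The class name, the expansion operator, and the insertion point are assumptions, not the paper's exact design.

```python
# Hypothetical sketch of WiCo+-style token decomposition (assumptions noted above).
import torch
import torch.nn as nn


class TokenDecomposer(nn.Module):
    """Expand each compressed visual token back into `factor` tokens."""

    def __init__(self, llm_dim=4096, factor=4):
        super().__init__()
        self.factor = factor
        self.expand = nn.Linear(llm_dim, llm_dim * factor)

    def forward(self, vis_hidden):  # (B, M, D) compressed visual hidden states
        B, M, D = vis_hidden.shape
        x = self.expand(vis_hidden)           # (B, M, D * factor)
        return x.view(B, M * self.factor, D)  # (B, M * factor, D)


if __name__ == "__main__":
    h = torch.randn(1, 144, 4096)          # e.g., WiCo output inside the LLM
    print(TokenDecomposer()(h).shape)      # torch.Size([1, 576, 4096])
```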

Contents

  • Install
  • Train
  • Evaluation

Install

  1. Clone this repository and navigate to the LLaVA folder:
git clone https://github.com/JackYFL/WiCo.git
cd WiCo/LLaVA
  2. Execute the install.sh bash script:
. install.sh

Note: The install script performs two steps:
Step 1: Install packages

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Step 2: Install additional packages for training

pip install -e ".[train]"
pip install flash-attn==2.5.3 --no-build-isolation
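
Optionally, you can run a quick sanity check inside the activated environment to confirm that the core packages import; this snippet is our suggestion, not part of install.sh:

```python
# Optional sanity check (our suggestion, not part of install.sh).
# Run inside the activated "llava" conda environment.
import torch
import llava        # installed by `pip install -e .`
import flash_attn   # available only after the "[train]" extras step

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```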

Train

The training process of LLaVA includes two stages: pretraining (tuning only the projector) and finetuning (tuning both the projector and the LLM).

Before training, all datasets used in LLaVA should be prepared according to the instructions in the Data document.

We provide bash scripts in scripts/v1_5. For instance, to train WiCo, execute:

cd LLaVA
. scripts/v1_5/wico/pretrain.sh
. scripts/v1_5/wico/finetune.sh

You can refer to scripts/v1_5/pretrain.sh and scripts/v1_5/finetune.sh for more information.

Evaluation

Before evaluation, all the datasets should be prepared according to the Evaluation document.

We also provide a bash script for convenience. To evaluate WiCo, execute:

cd LLaVA
. scripts/v1_5/wico/eval.sh

Citation

If you find WiCo useful for your research or applications, please cite the following BibTeX entry:

@inproceedings{li2025window,
  title={Window Token Concatenation for Efficient Visual Large Language Models},
  author={Li, Yifan and Bao, Wentao and Ye, Botao and Tan, Zhen and Chen, Tianlong and Liu, Huan and Kong, Yu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year={2025}
}

Acknowledgement

Thanks to these insightful codebases!

  • LLaVA: The codebase we built on for the general VQA task.
  • Shikra: The codebase we built on for the grounding task.
  • MiniGPT4: The token concatenation projector.
  • Honeybee: The C-Abstractor projector.
  • Monkey: The tokenfilter projector.
  • LLaVA-PruMerge: The LLaVA-PruMerge projector.
  • ToMe: The ToMe projector.
