Exploring efficient projector designs for VLLMs.
Window Token Concatenation for Efficient Visual Large Language Models [Paper]
Authors: Yifan Li, Wentao Bao, Botao Ye, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong
Abstract: To effectively reduce the number of visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group diverse tokens into one and thus obscure some fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens, encouraging those within the same window to exhibit similar features. To further enhance performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in later layers of the LLM. This design enjoys the merits of the large perception field of the LLM for fine-grained visual understanding while keeping a small number of visual tokens for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance compared with existing token reduction projectors.
Figure 1. Framework of our WiCo+. WiCo+ consists of two main components, i.e., a dynamic window token concatenation projector (WiCo) and a token decomposition strategy in the later layers of the LLM decoder. WiCo first learns similar local token representations by fine-tuning the last K_v self-attention layers of a pretrained vision encoder (say CLIP). Then, a sliding window is applied to the 2-D token map to perform concatenation, and an MLP projects these visual tokens into the language space. To further enlarge the perception field of the remaining visual tokens, we decompose the visual tokens in the later layers (say the last K_l layers) of the LLM decoder, which benefits fine-grained understanding tasks.
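To make the projector concrete, below is a minimal PyTorch sketch of the window-token-concatenation idea: tokens inside each window of the 2-D token map are flattened into a single vector and an MLP projects the result into the LLM embedding space. This is an illustrative sketch, not the repository's implementation; the class name, window size, and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowConcatProjector(nn.Module):
    """Sketch: concatenate the tokens in each (window x window) region of the
    2-D token map, then project the concatenated vector into the LLM space."""
    def __init__(self, vision_dim=1024, llm_dim=4096, window=2):
        super().__init__()
        self.window = window
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * window * window, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, tokens):
        # tokens: (B, N, C) visual tokens from the vision encoder, N = H * W
        B, N, C = tokens.shape
        H = W = int(N ** 0.5)
        x = tokens.view(B, H, W, C).permute(0, 3, 1, 2)               # (B, C, H, W)
        # Non-overlapping sliding window: unfold flattens each window's tokens
        # into one vector per window position.
        x = F.unfold(x, kernel_size=self.window, stride=self.window)  # (B, C*w*w, N/w^2)
        x = x.transpose(1, 2)                                         # (B, N/w^2, C*w*w)
        return self.mlp(x)                                            # project into LLM space

# Example: a 24x24 CLIP token map (576 tokens) is reduced to 144 tokens (4x fewer).
proj = WindowConcatProjector()
out = proj(torch.randn(1, 576, 1024))   # -> (1, 144, 4096)
```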
Note: Our repository supports multiple baselines for visual token reduction:
| Name | Type | Paper | Venue |
|---|---|---|---|
| Token-mixer | Mixing tokens | MLP-Mixer: An all-MLP Architecture for Vision | NeurIPS 2021 |
| Concatenation | Concatenate neighbor tokens | MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning | ICLR 2024 |
| Perceiver | Cross-attention | Flamingo: A Visual Language Model for Few-Shot Learning | NeurIPS 2022 |
| C-Abstractor | Convolution + pooling | Honeybee: Locality-Enhanced Projector for Multimodal LLM | CVPR 2024 |
| Tokenfilter | Token selection | TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | CVPR 2024 |
| LLaVA-Prumerge | Selection + merging | LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | arXiv 2024 |
| ToMe | Token Merging | Token Merging: Your ViT But Faster | ICLR 2023 |
Figure 2. (a) Current projector types (left) and ours (right) for VLLM token reduction. Existing token reduction projectors are mainly based on (i) selection, (ii) merging, (iii) concatenation and (iv) cross-attention.
(b) This figure illustrates that the performance of VLLMs is sensitive to the type of downstream task when the number of visual tokens changes. Specifically, when visual tokens are reduced (e.g., by selection/merging), performance drops more significantly on fine-grained understanding tasks than on coarse-grained ones.
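For reference, the "selection" and "merging" families listed in the table above can be summarized in a few lines. The snippet below is an illustrative sketch, not code from this repository; the scoring and grouping schemes are simplified assumptions.

```python
import torch

def select_tokens(tokens, scores, k):
    """Selection-style reduction: keep the k highest-scoring visual tokens."""
    # tokens: (B, N, C); scores: (B, N), e.g. attention-based importance scores
    idx = scores.topk(k, dim=1).indices                                        # (B, k)
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

def merge_tokens(tokens, group=4):
    """Merging-style reduction: average every `group` consecutive tokens into one."""
    B, N, C = tokens.shape
    return tokens.view(B, N // group, group, C).mean(dim=2)
```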
- Clone this repository and navigate to the LLaVA folder:

```bash
git clone https://github.com/JackYFL/WiCo.git
cd WiCo/LLaVA
```

- Execute the install.sh bash file:

```bash
. install.sh
```
Note: The install bash file includes two steps:

Step 1: Install packages

```bash
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

Step 2: Install additional packages for training

```bash
pip install -e ".[train]"
pip install flash-attn==2.5.3 --no-build-isolation
```

The training process of LLaVA includes two steps: pretraining (tuning only the projector) and finetuning (tuning both the projector and the LLM).
Before training, all datasets used in LLaVA should be prepared according to the instructions in the Data document.
We provide bash scripts in scripts/v1_5. For instance, to train WiCo, just execute:

```bash
cd LLaVA
. scripts/v1_5/wico/pretrain.sh
. scripts/v1_5/wico/finetune.sh
```

You can refer to scripts/v1_5/pretrain.sh and scripts/v1_5/finetune.sh for more information.
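The two stages differ mainly in which parameters receive gradients. The snippet below is only a conceptual sketch under assumed attribute names (`projector`, `llm`, `vision_encoder`); the modules actually tuned in each stage, including the last few vision-encoder layers mentioned in the abstract, are defined by the repository's training scripts.

```python
# Conceptual sketch of the two-stage tuning scheme (not the repository's
# training code). Attribute names are assumptions; the stage in which the
# last k_v vision layers are unfrozen follows the actual scripts.
def configure_stage(model, stage, k_v=2):
    for p in model.parameters():               # start with everything frozen
        p.requires_grad = False
    for p in model.projector.parameters():     # the projector is tuned in both stages
        p.requires_grad = True
    if stage == "finetune":
        for p in model.llm.parameters():       # stage 2 also tunes the LLM
            p.requires_grad = True
        for layer in model.vision_encoder.layers[-k_v:]:   # and the last k_v vision layers
            for p in layer.parameters():
                p.requires_grad = True
```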
Before evaluation, all the datasets should be prepared according to the Evaluation document.
We also provide a bash script for convenience. To evaluate WiCo, just execute:

```bash
cd LLaVA
. scripts/v1_5/wico/eval.sh
```

If you find WiCo useful for your research and applications, please cite this BibTeX:
```bibtex
@inproceedings{li2025wico,
  title={Window Token Concatenation for Efficient Visual Large Language Models},
  author={Li, Yifan and Bao, Wentao and Ye, Botao and Tan, Zhen and Chen, Tianlong and Liu, Huan and Kong, Yu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year={2025}
}
```

Thanks for these insightful codebases!
