[AAAI 2026] Official Code for the Paper "Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning"
We propose a novel Sensitivity-aware Task Vector insertion framework (STV) to figure out *where and what* to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitivity-aware locations, we construct a pre-clustered activation bank for each location by clustering its activation values, and then apply reinforcement learning to choose the most suitable entry to insert.
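As a rough illustration of the activation-bank construction step, here is a minimal sketch (not the code shipped in this repo); `collect_head_activations` and `sensitive_locations` are hypothetical placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_activation_bank(head_activations: np.ndarray, num_clusters: int = 64) -> np.ndarray:
    """Cluster the per-example activations of one sensitive head
    ([n_examples, head_dim]) and return the centroids as candidate task vectors."""
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    kmeans.fit(head_activations)
    return kmeans.cluster_centers_  # [num_clusters, head_dim]

# One bank per sensitivity-aware location, e.g.:
# bank = {loc: build_activation_bank(collect_head_activations(loc)) for loc in sensitive_locations}
```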
More details can be found in our paper.
Overview of the STV framework. It has two stages: (1) identifying context-sensitive heads via activation deltas between query–context and query-only inputs; and (2) selecting task vectors from a pre-computed activation bank using reinforcement learning for those locations.
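For stage (1), the idea can be sketched roughly as follows. This is an illustrative sketch, not the implementation in stv_utils.py; `get_head_activations` stands in for a hook-based helper (the repo relies on baukit tracing):

```python
import torch

def rank_sensitive_heads(model, query_only_inputs, query_context_inputs, topk=96):
    """Return the (layer, head) indices whose activations shift most when
    in-context examples are prepended to the query."""
    deltas = []
    for q_inputs, qc_inputs in zip(query_only_inputs, query_context_inputs):
        a_q = get_head_activations(model, q_inputs)    # hypothetical helper: [layers, heads, head_dim]
        a_qc = get_head_activations(model, qc_inputs)  # same shape, query + context
        deltas.append((a_qc - a_q).norm(dim=-1))       # per-head delta magnitude [layers, heads]
    mean_delta = torch.stack(deltas).mean(dim=0)       # average over query-context pairs
    num_heads = mean_delta.shape[1]
    top = mean_delta.flatten().topk(topk).indices      # top-k most context-sensitive heads
    return [(int(i) // num_heads, int(i) % num_heads) for i in top]
```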
- Python 3.8+
- CUDA-capable GPU (recommended)
- PyTorch 1.12+
- Create a conda environment (recommended):

conda create -n stv python=3.8
conda activate stv
- Install required packages:

pip install torch torchvision
pip install transformers==4.32.0
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4
pip install scipy scikit-learn matplotlib seaborn tqdm pillow numpy tensorboard
pip install git+https://github.com/davidbau/baukit@main#egg=baukit
- Install model-specific dependencies:
  - For Qwen-VL: Follow the installation instructions in the Qwen-VL repository
  - For Idefics2: The model will be automatically downloaded from HuggingFace when first used
- VizWiz & OKVQA: Please follow the instructions in the Qwen-VL repository to prepare the datasets.
- Flower, CUB, and DTD: Download the images from their respective official websites. We provide the 2-way 1-shot text annotations in the data files.
The expected data format is JSONL (for VizWiz and OKVQA) or JSON (for Flower, CUB, DTD) with the following structure:
{
  "image": "path/to/image.jpg",
  "question": "Your question here",
  "answer": "Expected answer",
  "question_id": "unique_id"
}

The framework currently supports:
- Qwen-VL: Qwen/Qwen-VL from HuggingFace
- Idefics2: HuggingFaceM4/Idefics3-8B-Llama3
To use custom models, please refer to models.py and implement a ModelHelper class following the interface defined there.
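As a starting point, a custom helper might look roughly like the skeleton below; the method names here are illustrative assumptions, and the authoritative interface is the one defined in models.py:

```python
class CustomModelHelper:
    """Illustrative skeleton only; mirror the actual interface in models.py."""

    def __init__(self, model, tokenizer, processor=None):
        self.model = model
        self.tokenizer = tokenizer
        self.processor = processor

    def format_prompt(self, question, image_path, context_examples=None):
        # Build the multimodal prompt, optionally prepending in-context examples.
        raise NotImplementedError

    def generate(self, inputs, max_new_tokens=32):
        # Run generation and return the decoded answer string.
        raise NotImplementedError
```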
Run evaluation with the STV framework:
python stv_eval.py \
--model_name Qwen-VL \
--data_name vizwiz \
--train_path /path/to/train.jsonl \
--val_path /path/to/val.jsonl \
--num_example 100 \
--num_shot 8 \
--topk 96 \
--num_clusters 64 \
--n_epochs 600 \
--cur_mode both \
--experiment_name my_experiment

We provide ready-to-use evaluation scripts in eval_scripts/:
# Example: Evaluate on VizWiz
bash eval_scripts/eval_vizwiz.sh
# Example: Evaluate on OKVQA
bash eval_scripts/eval_okvqav3.sh

Make sure to update the paths in the scripts to point to your dataset locations.
STV/
├── stv_eval.py # Main evaluation script
├── stv_utils.py # Core STV utilities and functions
├── models.py # Model helper classes
├── preprocess.py # Data preprocessing and formatting
├── eval_scripts/ # Evaluation scripts for different datasets
│ ├── eval_vizwiz.sh
│ ├── eval_okvqa.sh
│ ├── eval_okvqav3.sh
│ └── ...
└── llava/ # LLaVA model implementation (if needed)
If you find our work useful, please consider starring the repo and citing our paper. Thank you!
@article{ma2025and,
title={Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning},
author={Ma, Ziyu and Gou, Chenhui and Hu, Yiming and Wang, Yong and Chu, Xiangxiang and Zhuang, Bohan and Cai, Jianfei},
journal={arXiv preprint arXiv:2511.08246},
year={2025}
}