[ICCV 2025] HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets 🚀
Authors
Zhixiang Wei¹,², Guangting Wang², et al.
¹ University of Science and Technology of China
² WeChat Vision, Tencent Inc.
- 🏭 Efficient Data Generation Pipeline: a multi-grained annotation pipeline built on Large Vision-Language Models (LVLMs).
- 🗂️ High-Quality Image-Text Datasets: generated by state-of-the-art LVLMs, with positive/negative examples and rich text descriptions.
- 🧠 HQ-CLIP Training Framework: a novel CLIP training paradigm that extends contrastive learning with the following (see the conceptual sketch after this list):
  - Negative description supervision
  - Short tag augmentation
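
As a purely conceptual sketch of the training idea (not the authors' implementation), the snippet below extends a symmetric CLIP-style contrastive loss with an extra term that penalizes similarity between each image and its paired negative description; the function name, tensor layout, and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_negatives(image_feats, pos_text_feats, neg_text_feats,
                                    temperature=0.07, neg_weight=1.0):
    """CLIP-style contrastive loss plus a penalty on paired negative descriptions.

    All inputs are L2-normalized feature tensors of shape (batch, dim).
    Short tags could be handled analogously, as extra positive texts per image.
    """
    batch = image_feats.size(0)
    targets = torch.arange(batch, device=image_feats.device)

    # Standard symmetric image-to-text / text-to-image contrastive loss.
    logits = image_feats @ pos_text_feats.T / temperature
    loss_clip = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

    # Push each image away from its own paired negative description.
    neg_sim = (image_feats * neg_text_feats).sum(dim=-1)
    loss_neg = F.relu(neg_sim).mean()

    return loss_clip + neg_weight * loss_neg
```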
Released models:

| Model | Pretraining Data | ImageNet Top-1 (%) | DataComp Score |
|---|---|---|---|
| CLIP-B-16 | VLM-150M-Medium | 70.6 | 58.6 |
| CLIP-L-14-CLIPA | VLM-1B | 78.6 | 63.8 |
| CLIP-L-14-OPENAI | VLM-1B | 76.5 | 63.7 |
Recaption Model: Qwen2-VL
| Dataset | Samples | URL |
|---|---|---|
| VLM-150M | 147M | https://huggingface.co/datasets/zhixiangwei/VLM-150M |
| VLM-1B | 1.37B | https://huggingface.co/datasets/zhixiangwei/VLM-1B |
- (Optional) Download the CommonPool foundation datasets
  Access the CommonPool Large and XLarge versions via the CommonPool GitHub repository.
- Acquire the DFN base datasets
  Download DFN Large and XLarge from the DFN Hugging Face datasets.
- Download the HQ-CLIP datasets
  Obtain our enhanced datasets:
  - VLM-150M
  - VLM-1B

Each JSON entry in VLM-150M and VLM-1B corresponds directly to a DFN dataset UID through matching filenames. To use our enhanced annotations, choose one of the following (a minimal sketch follows this list):

- Option 1: Direct caption replacement. Substitute the original DFN captions with our generated annotations.
- Option 2: Dynamic data loading. Modify the OpenCLIP dataloader to load our annotations at training time.
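
As a rough illustration of Option 1, the sketch below matches annotation files to DFN samples by filename (UID) and overwrites the original captions. The directory layout and the field names are assumptions for illustration, not the released schema.

```python
import json
from pathlib import Path

# Hypothetical local paths; point these at your DFN metadata and HQ-CLIP annotation files.
DFN_METADATA_DIR = Path("dfn-large/metadata")          # one <uid>.json per sample
HQCLIP_ANNOTATION_DIR = Path("vlm-150m/annotations")   # one <uid>.json per sample

def replace_dfn_captions() -> None:
    """Overwrite DFN captions with HQ-CLIP annotations, matched by filename (UID)."""
    for ann_path in HQCLIP_ANNOTATION_DIR.glob("*.json"):
        dfn_path = DFN_METADATA_DIR / ann_path.name  # identical filename == identical UID
        if not dfn_path.exists():
            continue  # this UID is not part of the downloaded DFN subset

        annotation = json.loads(ann_path.read_text())
        sample = json.loads(dfn_path.read_text())

        # Field names are illustrative; adapt them to the actual annotation schema.
        sample["caption"] = annotation["caption"]
        dfn_path.write_text(json.dumps(sample))

if __name__ == "__main__":
    replace_dfn_captions()
```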
🔜 Detailed implementation guidance will be published in future releases.
Our released weights are compatible with both open_clip and Hugging Face Transformers.
```python
import open_clip

# Initialize the model and image transforms
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
tokenizer = open_clip.get_tokenizer(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
```
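
For example, zero-shot classification with the open_clip model loaded above might look like this (the image path and candidate captions are placeholders):

```python
import torch
from PIL import Image

# Placeholder inputs: any local image and a set of candidate captions.
image = preprocess_val(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Cosine-similarity logits, softmaxed into per-caption probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)
```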
```python
from transformers import AutoModel

# Load the model directly from the Hugging Face Hub
model = AutoModel.from_pretrained(
    'zhixiangwei/vlm150m-hqclip-large-vitb16'
)
```

If you find HQ-CLIP useful, please cite:

```bibtex
@misc{hqclip,
title={HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models},
author={Zhixiang Wei and Guangting Wang and Xiaoxiao Ma and Ke Mei and Huaian Chen and Yi Jin and Fengyun Rao},
year={2025},
eprint={2507.22431},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.22431},
}
```
These works have greatly inspired us, providing us with codebases, data, and support. We thank their authors!
