
[ICCV 2025] HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets 🚀

Paper PDF | Project Page | Demo | Dataset: VLM-150M | Dataset: VLM-1B

Authors
Zhixiang Wei (1,2), Guangting Wang (2), et al.
(1) University of Science and Technology of China
(2) WeChat Vision, Tencent Inc.


🔍 Key Contributions

  • 🏭 Efficient Data Generation Pipeline
    Multi-grained annotation pipeline using Large Vision-Language Models (LVLMs)
  • 🗂️ High-Quality Image-Text Datasets
    Generated by state-of-the-art LVLMs, with positive/negative examples and rich text descriptions
  • 🧠 HQ-CLIP Training Framework
    A novel CLIP training paradigm extending contrastive learning with (see the illustrative sketch after this list):
    • Negative description supervision
    • Short tag augmentation
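
The following is only an illustrative sketch of how such terms might be combined, not the objective defined in the HQ-CLIP paper: the negative descriptions are appended as extra hard-negative text columns, and each image's short tag is treated as an additional positive text.

import torch
import torch.nn.functional as F

def illustrative_hqclip_loss(img_emb, pos_txt_emb, neg_txt_emb, tag_emb, logit_scale):
    # All inputs are L2-normalized (B, D) embeddings for a batch of B pairs.
    # This is an assumption-laden sketch, not the official HQ-CLIP loss.
    B = img_emb.size(0)
    labels = torch.arange(B, device=img_emb.device)

    # Image-to-text term: the batch's negative descriptions are appended as
    # extra hard-negative columns of the similarity matrix.
    text_bank = torch.cat([pos_txt_emb, neg_txt_emb], dim=0)  # (2B, D)
    loss_i2t = F.cross_entropy(logit_scale * img_emb @ text_bank.t(), labels)

    # Text-to-image term over the positive descriptions only.
    loss_t2i = F.cross_entropy(logit_scale * pos_txt_emb @ img_emb.t(), labels)

    # Short-tag augmentation: each image's short tag acts as another positive text.
    loss_tag = F.cross_entropy(logit_scale * img_emb @ tag_emb.t(), labels)

    return (loss_i2t + loss_t2i + loss_tag) / 3.0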

Model Overview


Model Zoo

Model              Pretraining Data    ImageNet Top-1   DataComp Score
CLIP-B-16          VLM-150M-Medium     70.6             58.6
CLIP-L-14-CLIPA    VLM-1B              78.6             63.8
CLIP-L-14-OPENAI   VLM-1B              76.5             63.7

Recaptioning model: Qwen2-VL

Datasets

Dataset     Samples   URL
VLM-150M    147M      https://huggingface.co/datasets/zhixiangwei/VLM-150M
VLM-1B      1.37B     https://huggingface.co/datasets/zhixiangwei/VLM-1B

Dataset Usage Guide

Preparation Steps

  1. (Optional) Download CommonPool Foundation Datasets
    Access CommonPool Large and XLarge versions via:
    CommonPool GitHub Repository

  2. Acquire DFN Base Datasets
    Download DFN Large and XLarge from:
    DFN Hugging Face Datasets

  3. Download HQ-CLIP Datasets
    Obtain our enhanced datasets (a download sketch follows this list):

    • VLM-150M
    • VLM-1B
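
A minimal download sketch for step 3 using huggingface_hub; the local_dir paths are placeholders, and both datasets are large, so you may want to pass allow_patterns to fetch a subset first.

from huggingface_hub import snapshot_download

# Fetch the HQ-CLIP annotation datasets; the local_dir values are placeholders.
snapshot_download(repo_id='zhixiangwei/VLM-150M', repo_type='dataset', local_dir='data/VLM-150M')
snapshot_download(repo_id='zhixiangwei/VLM-1B', repo_type='dataset', local_dir='data/VLM-1B')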

Integration Instructions

Each JSON entry in VLM-150M and VLM-1B corresponds directly to a DFN dataset UID through matching filenames. To utilize our enhanced annotations:

  • Option 1: Direct Caption Replacement
    Replace the original DFN captions with our generated annotations (a minimal sketch follows below)

  • Option 2: Dynamic Data Loading
    Modify the open_clip dataloader to load our annotations at training time

🔜 Detailed implementation guidance will be published in future releases.
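
In the meantime, here is a minimal sketch of Option 1. It assumes one annotation JSON file per sample, named by its DFN UID, with the long description under a 'caption' key; inspect the downloaded files and adjust the join logic (e.g. for per-shard JSON files) accordingly.

import json
from pathlib import Path

def load_vlm_captions(annot_dir):
    # Map DFN UID -> HQ-CLIP caption, using each annotation filename as the UID.
    # The per-sample layout and the 'caption' key are assumptions.
    uid_to_caption = {}
    for json_path in Path(annot_dir).glob('*.json'):
        with open(json_path) as f:
            entry = json.load(f)
        uid_to_caption[json_path.stem] = entry['caption']
    return uid_to_caption

def replace_caption(sample, uid_to_caption):
    # Swap the original DFN caption for the HQ-CLIP one when a match exists.
    # 'uid' and 'text' are assumed keys in the DFN sample metadata.
    uid = sample['uid']
    if uid in uid_to_caption:
        sample['text'] = uid_to_caption[uid]
    return sample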

Model Loading Instructions

Our uploaded weights are compatible with both open_clip and Hugging Face Transformers.

For open_clip users:

import open_clip

# Initialize the model and preprocessing transforms from the Hugging Face Hub
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
tokenizer = open_clip.get_tokenizer(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
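
A quick zero-shot sanity check; the image path and prompts below are placeholders.

import torch
from PIL import Image

image = preprocess_val(Image.open('example.jpg')).unsqueeze(0)
text = tokenizer(['a diagram', 'a dog', 'a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # one probability per prompt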

For Hugging Face Transformers users:

from transformers import AutoModel

# Load the model directly from the Hub
model = AutoModel.from_pretrained(
    'zhixiangwei/vlm150m-hqclip-large-vitb16'
)
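
A hedged usage sketch for the Transformers path, assuming the repository also ships a processor configuration so that AutoProcessor resolves; if it does not, reuse the open_clip transforms above.

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('zhixiangwei/vlm150m-hqclip-large-vitb16')
# Assumption: the repository includes a preprocessor config for AutoProcessor.
processor = AutoProcessor.from_pretrained('zhixiangwei/vlm150m-hqclip-large-vitb16')

# 'example.jpg' and the prompts are placeholders.
inputs = processor(
    text=['a dog', 'a cat'],
    images=Image.open('example.jpg'),
    return_tensors='pt',
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)
# For a CLIP-style checkpoint, the output exposes image-text logits.
print(outputs.logits_per_image.softmax(dim=-1))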

📝 Citation

@misc{hqclip,
      title={HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models}, 
      author={Zhixiang Wei and Guangting Wang and Xiaoxiao Ma and Ke Mei and Huaian Chen and Yi Jin and Fengyun Rao},
      year={2025},
      eprint={2507.22431},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.22431}, 
}

🙏 Acknowledgments

This project builds on prior works that provided codebases, data, and support; we thank their authors!
