Concept-Aware Batch Sampling Improves Language-Image Pretraining

1University of Tübingen, 2University of Cambridge, 3University of Washington, 4University of Trento, 5LAION, 6Stanford University
CVPR 2026
CABS teaser figure

We introduce DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained concept compositions, and CABS, a task-adaptive, steerable batch sampling framework for vision-language pretraining. By modifying a simple scoring function, CABS flexibly adapts to different target tasks — our classification-optimized variant (CABS-DM) and retrieval-optimized variant (CABS-FM) both outperform IID sampling by large margins.

Abstract

What data should a vision-language model be trained on? To answer this question, many data curation efforts center on dataset quality. However, most existing methods are (i) offline, i.e., they produce a static dataset from predetermined filtering criteria, and (ii) concept-agnostic, i.e., they rely on model-based filters that introduce additional data biases.

In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch-sampling framework that flexibly constructs batches on-the-fly based on specific target distributions.

We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that CABS significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.

🎨 DataConcept: Concept-Aware Dataset Augmentation

We introduce DataConcept, a large-scale pretraining dataset with 128M image-text pairs fully annotated with grounded concept information. Each sample comes with: semantic concepts, bounding boxes, per-concept confidence scores, and concept-driven synthetic captions.

Our multi-stage pipeline consists of three steps:

  1. Concept Tagging — Open-set image tagging using RAM++.
  2. Concept Grounding — Grounded object detection using GroundingDINO conditioned on the RAM++ tags, with multi-resolution ensembling via Weighted Box Fusion for high-quality bounding boxes.
  3. Concept-Aware Recaptioning — Recaptioning images using Qwen2-VL-7B, prompted by original alt-texts and detected concepts.

DataConcept is publicly available on HuggingFace.

🚕 CABS: Concept-Aware Batch Sampling

CABS is a parameterized sampling framework. Given a super-batch of size B drawn IID from the data pool, CABS retains a sub-batch of size b = (1 − f)B, where f is the filter ratio. For each sample, it computes a score from the sample's concept annotations using a concept-aware heuristic gain function, and keeps the top-scoring samples for training. Because the gain function is pluggable, practitioners can instantiate different batch-sampling strategies and induce different concept distributions on-the-fly during training. We provide two such instantiations in this work.
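The core selection loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are ours, and concept count (`len`) stands in for an arbitrary gain function.

```python
def cabs_filter(superbatch_concepts, gain_fn, filter_ratio):
    """Keep the top-scoring b = (1 - f) * B samples of a super-batch (sketch).

    superbatch_concepts: per-sample concept annotations (length B).
    gain_fn: pluggable concept-aware scoring heuristic.
    """
    B = len(superbatch_concepts)
    b = round((1.0 - filter_ratio) * B)  # b = (1 - f) * B
    scores = [gain_fn(c) for c in superbatch_concepts]
    # indices of the b highest-scoring samples (stable under ties)
    order = sorted(range(B), key=lambda i: -scores[i])
    return order[:b]

# with f = 0.8 and a super-batch of B = 10, only b = 2 samples survive
concepts = [["a"], ["a", "b"], ["c"]] * 3 + [["a", "b", "c", "d"]]
keep = cabs_filter(concepts, gain_fn=len, filter_ratio=0.8)
```

Swapping `gain_fn` is all it takes to move between the diversity- and frequency-maximizing variants described below.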

CABS Algorithm

CABS-DM: Diversity Maximization

Designed for zero-shot classification, CABS-DM scores samples iteratively so that the filtered batch approximates a uniform concept distribution: it assigns higher scores to samples containing under-represented concepts and selects them greedily. An average CABS-DM sub-batch contains 1.5× more unique concepts than an IID-sampled batch, with a near-flat concept distribution.

Sub-batch compositions: IID vs CABS-DM

CABS-DM induces a near-uniform concept frequency distribution, de-biasing the distributional skew of IID sampling, and incorporates nearly double the unique concepts into the curated sub-batch.
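A hedged sketch of this greedy diversity-maximizing selection: concepts already frequent in the partially built sub-batch contribute less to a sample's score, so under-represented concepts are pulled in first. The discounting scheme (1 / (1 + count)) is our illustrative choice; the paper's exact gain function may differ.

```python
from collections import Counter

def cabs_dm_select(concepts_per_sample, b):
    """Greedily pick b samples whose concepts approach a uniform distribution."""
    counts = Counter()                       # concept frequencies in the sub-batch so far
    remaining = list(range(len(concepts_per_sample)))
    selected = []
    for _ in range(b):
        # under-represented concepts score higher (illustrative discounting)
        def gain(i):
            return sum(1.0 / (1 + counts[c]) for c in concepts_per_sample[i])
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
        counts.update(concepts_per_sample[best])
    return selected

batch = [["dog"], ["dog"], ["cat"], ["dog", "cat"]]
sub = cabs_dm_select(batch, b=2)
```

After the multi-concept sample is taken, both "dog" and "cat" are discounted equally, so the remaining picks no longer favor the over-represented "dog" samples.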

CABS-FM: Frequency Maximization

Designed for image-text retrieval, CABS-FM uses a simple gain function based on concept count: it selects samples with the highest number of annotated concepts, yielding sub-batches enriched with complex, multi-object scenes that mirror the compositional nature of retrieval benchmarks.
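Since the CABS-FM gain is just the concept count, the scoring step reduces to a one-liner. The example data below is invented for illustration.

```python
def cabs_fm_gain(concepts):
    # CABS-FM gain: the number of annotated concepts in the sample
    return len(concepts)

batch = [["dog"], ["dog", "frisbee", "park", "person"], ["cat", "sofa"]]
scores = [cabs_fm_gain(c) for c in batch]
# rank samples so the most multi-object scenes come first
ranked = sorted(range(len(batch)), key=lambda i: -scores[i])
```

With the filter ratio applied on top of this ranking, single-object samples are the first to be dropped.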

Results

CABS-DM: Zero-Shot Classification

CABS-DM consistently delivers substantial improvements over IID sampling across all settings. On ImageNet, CABS-DM yields absolute improvements of +5.0% for CLIP ViT-B/32 and +6.9% for SigLIP ViT-B/16. It also outperforms MetaCLIP-style offline curation and other online batch sampling methods (GRIT-VLP, MAFA).

| Method | Caption | Avg (Clf) | IN-Val | IN-shift | Obj | Scene | Let-It-Wag! |
|---|---|---|---|---|---|---|---|
| **ViT-B/32 CLIP** |  |  |  |  |  |  |  |
| IID | alt | 17.3 | 15.2 | 32.3 | 36.4 | 5.1 | 28.2 |
| CABS-DM | alt | 21.9 | 18.6 | 34.5 | 38.0 | 7.5 | 30.7 |
| IID | recap | 21.7 | 20.8 | 36.4 | 43.1 | 5.9 | 33.0 |
| CABS-DM | recap | 26.7 | 25.4 | 39.6 | 42.8 | 7.1 | 35.5 |
| **ViT-B/16 SigLIP** |  |  |  |  |  |  |  |
| IID | alt | 17.2 | 15.3 | 29.6 | 35.9 | 5.2 | 26.4 |
| CABS-DM | alt | 24.1 | 20.8 | 33.5 | 39.6 | 7.0 | 30.9 |
| IID | recap | 28.8 | 27.4 | 41.5 | 48.9 | 6.6 | 38.6 |
| CABS-DM | recap | 34.7 | 32.3 | 43.2 | 50.6 | 7.6 | 41.1 |

CABS-FM: Image-Text Retrieval

CABS-FM consistently outperforms IID sampling on retrieval benchmarks, yielding gains of up to +9.0% on average retrieval when training with recaptions. It also significantly outperforms existing online batch sampling methods.

| Method | Caption | COCO | Flickr | Avg (Ret) |
|---|---|---|---|---|
| **ViT-B/32 CLIP** |  |  |  |  |
| IID | alt | 9.7 | 16.2 | 12.9 |
| CABS-FM | alt | 11.0 | 21.9 | 16.4 |
| IID | recap | 24.0 | 41.3 | 32.6 |
| CABS-FM | recap | 30.4 | 52.9 | 41.6 |
| **ViT-B/16 SigLIP** |  |  |  |  |
| IID | alt | 11.1 | 18.9 | 15.0 |
| CABS-FM | alt | 12.3 | 23.9 | 18.1 |
| IID | recap | 37.1 | 57.0 | 47.0 |
| CABS-FM | recap | 39.7 | 63.5 | 51.6 |

Model Checkpoints

We release CABS-trained model checkpoints on HuggingFace. All CABS variants are trained with a filter ratio of 0.8.

| Model | Architecture | CABS Variant | Caption | Download |
|---|---|---|---|---|
| CLIP | ViT-B/32 | CABS-DM (0.8) | Alt-text | 🤗 |
| CLIP | ViT-B/32 | CABS-DM (0.8) | Recap | 🤗 |
| CLIP | ViT-B/32 | CABS-FM (0.8) | Alt-text | 🤗 |
| CLIP | ViT-B/32 | CABS-FM (0.8) | Recap | 🤗 |

BibTeX

@article{ghosh2025concept,
  title={Concept-Aware Batch Sampling Improves Language-Image Pretraining},
  author={Ghosh, Adhiraj and Udandarao, Vishaal and Nguyen, Thao and Farina, Matteo and Cherti, Mehdi and Jitsev, Jenia and Oh, Sewoong and Ricci, Elisa and Schmidt, Ludwig and Bethge, Matthias},
  journal={arXiv preprint arXiv:2511.20643},
  year={2025}
}