Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness.
We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning.
Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning.
CIRCLES combines two complementary sources of evidence. Standard image retrieval provides correlational understanding by finding visually similar neighbors, while attribute-guided composed retrieval provides causal understanding by changing the most relevant attribute and retrieving counterfactual-style examples. These two sets of demonstrations are then used together for retrieval-augmented in-context inference.
In practice, the method has three core steps: identify the key attribute, retrieve composed counterfactual examples for that attribute, and combine them with standard neighbors in the final context. This gives the model both high-similarity context and attribute-focused counterfactual evidence. The pipeline below shows how these two retrieval streams are combined in the CIRCLES framework.
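The three steps above can be sketched in embedding space. This is a minimal illustration, not the authors' implementation: the function names, the additive attribute intervention (`attr_delta_emb`), and the cosine scoring are all simplifying assumptions standing in for the actual attribute identification and composed-retrieval models.

```python
import numpy as np

def cosine_topk(query, pool, k):
    """Return indices of the k pool embeddings most similar to the query."""
    pool = np.asarray(pool, dtype=float)
    q = np.asarray(query, dtype=float)
    sims = pool @ q / (np.linalg.norm(pool, axis=1) * np.linalg.norm(q) + 1e-8)
    return np.argsort(-sims)[:k]

def build_demonstrations(img_emb, attr_delta_emb, pool_embs, k_ir=2, k_cir=2):
    """Combine standard nearest neighbors with counterfactual-style
    composed-retrieval results into one demonstration set.
    attr_delta_emb is a hypothetical embedding offset representing an
    intervention on the key attribute."""
    # Step 1 (attribute identification) is assumed done; its output is attr_delta_emb.
    ir_idx = cosine_topk(img_emb, pool_embs, k_ir)               # standard lookalikes
    composed = np.asarray(img_emb, float) + np.asarray(attr_delta_emb, float)
    cir_idx = cosine_topk(composed, pool_embs, k_cir)            # counterfactual-style neighbors
    # Step 3: merge the two streams, preserving order and dropping duplicates.
    seen, demos = set(), []
    for i in list(ir_idx) + list(cir_idx):
        if i not in seen:
            seen.add(int(i))
            demos.append(int(i))
    return demos
```

The merged list would then be formatted as in-context demonstrations for the VLM, giving it both high-similarity context and attribute-focused counterfactual evidence.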
The illustration below captures the intuition: standard retrieval keeps finding lookalikes that preserve local visual context, while composed retrieval adds the contrasting examples that reveal which attribute actually changes the answer, making the decision boundary clearer.
The main benchmark results below show that CIRCLES consistently outperforms zero-shot prompting, random example selection, and prior retrieval-based baselines across CUB, Flowers, OK-VQA, and VizWiz. The improvements are especially notable for smaller vision-language models, indicating that better demonstrations can compensate for weaker internal knowledge. More importantly, the same retrieval recipe transfers across both classification and VQA, suggesting that the benefit comes from better example construction rather than a task-specific trick.
The Magnolia Warbler example below illustrates why CIRCLES helps. Standard retrieval returns lookalike neighbors that push the model toward the wrong species, while CIRCLES retrieves counterfactual examples that explicitly vary discriminative cues such as head markings and wing patches. This makes the decisive evidence much easier to infer from the context.
CIRCLES is especially effective when the training pool becomes small. On the CUB scarcity study shown in the figure below, the gap between CIRCLES and standard retrieval widens as the amount of available training data shrinks, showing that counterfactual examples are most valuable when relevant demonstrations are hardest to find.
We further conduct ablation studies to understand the contribution of different components and design choices in CIRCLES. The analyses below explore how different implementations of composed image retrieval, the inclusion of question-question similarity, and retrieval budget decisions affect performance. These studies provide insights into why CIRCLES works and how to best configure it for different settings.
In CIRCLES, we augment the original CIR method with a question-question similarity component so that retrieval accounts for task similarity as well as visual similarity. We evaluate this enhancement on OK-VQA and VizWiz, where questions are diverse; as the figure below shows, incorporating question-question similarity improves performance on both datasets.
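One simple way to realize a question-question similarity component is to blend it with the image-image score before ranking candidates. The sketch below is illustrative only: the blending weight `alpha`, the cosine metric, and the flat linear combination are assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_with_question_similarity(query_img, query_q, cand_imgs, cand_qs, alpha=0.5):
    """Rank candidate demonstrations by a weighted sum of image-image and
    question-question similarity.  alpha controls how much the question
    channel matters (an illustrative hyperparameter)."""
    scores = [
        (1 - alpha) * cosine(query_img, ci) + alpha * cosine(query_q, cq)
        for ci, cq in zip(cand_imgs, cand_qs)
    ]
    # Indices of candidates, best first.
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

With `alpha = 0`, this degenerates to purely visual retrieval; raising `alpha` lets a candidate whose question matches the query question outrank a purely visual lookalike.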
Our retrieval-budget study on CUB shows how CIRCLES trades off the number of standard retrieval examples, composed retrieval examples, and intervened attributes. The results show that increasing the number of composed retrieval examples (#CIR) consistently improves accuracy across all #IR configurations, highlighting the benefit of CIR for providing targeted and informative context. Under tight CIR budgets, spreading interventions across more attributes works better than focusing on a single attribute. As the composed-retrieval budget grows, the best strategy shifts toward concentrating more compositions on fewer attributes, moving from breadth to depth.
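The breadth-versus-depth trade-off above can be expressed as a small allocation heuristic. This is a hypothetical sketch of the budgeting logic, not the authors' configuration code: `per_attr_cap` is an illustrative knob, where a cap of 1 spreads the composed-retrieval budget across attributes (breadth) and a larger cap concentrates it on the top-ranked attributes (depth).

```python
def allocate_cir_budget(budget, ranked_attributes, per_attr_cap):
    """Distribute a composed-retrieval budget over attributes ranked by
    relevance.  Each attribute, in rank order, receives up to per_attr_cap
    compositions until the budget is exhausted."""
    counts = {}
    for attr in ranked_attributes:
        take = min(per_attr_cap, budget)
        if take == 0:
            break
        counts[attr] = take
        budget -= take
    return counts
```

Under a tight budget, a cap of 1 touches as many attributes as possible; as the budget grows, raising the cap shifts the allocation from breadth to depth, mirroring the trend reported in the ablation.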
@inproceedings{xiong2026retrieving,
title = {Retrieving Counterfactuals Improves Visual In-Context Learning},
author = {Guangzhi Xiong and Sanchit Sinha and Zhenghao He and Aidong Zhang},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2026}
}