Our key innovation lies in defining a task that requires both instance-level image matching and deep scene understanding. As shown above, IDMR carves out a unique and challenging space compared to existing benchmarks: it is the first to combine instance-level precision with complex, text-described contextual requirements, directly addressing the needs of advanced applications such as embodied AI memory.
Current multimodal retrieval systems often fail at a critical real-world task: identifying a specific object instance in a new visual context based on text. Most benchmarks focus on retrieving images that are globally similar or contain objects of the same category, not the exact same instance.
Instance-Driven Multimodal Retrieval (IDMR) introduces a new task to bridge this gap. Given an image of a specific instance (e.g., a particular blue mug) and a text query (e.g., "find it next to the coffee machine"), the goal is to retrieve an image containing that exact same mug in the described scene.
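The task above can be framed as standard embedding-based retrieval over a fused (instance image + text) query. The following is a minimal sketch, not the paper's method: `retrieve`, the embedding dimensionality, and the idea of precomputed candidate embeddings are all illustrative assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb, candidate_embs, top_k=1):
    # query_emb: fused embedding of the (instance crop, text query) pair,
    # produced by some multimodal encoder (hypothetical here).
    # candidate_embs: precomputed embeddings of the gallery images.
    scores = [cosine_sim(query_emb, c) for c in candidate_embs]
    order = np.argsort(scores)[::-1]  # highest similarity first
    return order[:top_k].tolist()
```

In a real system the encoder would be a trained multimodal model; the ranking logic itself stays this simple.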
To overcome the scarcity of instance-level retrieval data, we propose a scalable synthesis pipeline. We leverage large-scale object detection datasets to create training pairs by treating cropped objects as queries and their original images as targets. An MLLM then generates rich, context-aware text descriptions. This allows us to automatically generate over 557K high-quality training samples, while our zero-shot benchmark (IDMR-Bench) is constructed from distinct, real-world video sources to ensure fair evaluation of model generalization.
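The crop-as-query construction described above can be sketched as a small data structure plus a pairing function. This is an illustrative outline only: `TrainingPair`, `make_pair`, and the `describe` callback (a stand-in for the MLLM captioning step) are hypothetical names, not the paper's actual pipeline code.

```python
from dataclasses import dataclass

@dataclass
class TrainingPair:
    query_crop: tuple      # (x1, y1, x2, y2) bounding box of the instance
    target_image_id: str   # the original detection image is the retrieval target
    instruction: str       # context-aware text generated for this pair

def make_pair(image_id, bbox, describe):
    # bbox comes from an existing object-detection annotation; the crop
    # serves as the instance query and the full image as the target.
    # `describe` stands in for the MLLM that writes the text query.
    instruction = describe(image_id, bbox)
    return TrainingPair(query_crop=bbox,
                        target_image_id=image_id,
                        instruction=instruction)
```

Iterating this over a detection dataset's annotations is what makes the synthesis scalable: no new labeling is required beyond the MLLM-generated descriptions.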
IDMR is a foundational technology for embodied AI, enabling agents to build long-term, instance-level visual memory. This capability is crucial for robots and AR assistants to recall where a specific object was seen, in what context, and how it relates to user instructions.
By focusing on fine-grained, instance-level correspondence, IDMR provides the mechanism for precise recall and reasoning in complex, dynamic environments. This complements ongoing research in short-term memory and navigation, paving the way for a comprehensive embodied intelligence system.
@article{liu2025idmr,
author = {Bangwei Liu and Yicheng Bao and Shaohui Lin and Xuhong Wang and Xin Tan and Yingchun Wang and Yuan Xie and Chaochao Lu},
title = {IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval},
journal = {arXiv preprint arXiv:2504.00954},
year = {2025},
}