IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval

Bangwei Liu1, Yicheng Bao1, Shaohui Lin1, Xuhong Wang2, Xin Tan1, Yingchun Wang2, Yuan Xie1, Chaochao Lu2
1East China Normal University, 2Shanghai AI Laboratory
IDMR Benchmark Comparison

Our key contribution is a task that requires both instance-level visual correspondence and deep scene understanding. As shown above, IDMR occupies a distinct and challenging position among existing benchmarks: it is the first to combine instance-level precision with complex, text-described contextual requirements, directly addressing the needs of advanced applications such as embodied AI memory.

A New Frontier for Multimodal Retrieval

Current multimodal retrieval systems often fail at a critical real-world task: finding a specific object instance in a new visual context that is described in text. Most benchmarks focus on retrieving images that are globally similar or that contain objects of the same category, rather than the exact same instance.

Instance-Driven Multimodal Retrieval (IDMR) introduces a new task to bridge this gap. Given an image of a specific instance (e.g., this particular blue mug) and a text query (e.g., "find it next to the coffee machine"), the goal is to retrieve an image containing that exact same mug in the described scene.
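
To make the task concrete, the retrieval step can be sketched as nearest-neighbor search over embeddings. This is only an illustrative sketch, not the model described on this page: it assumes that some multimodal encoder (not specified here) has already mapped the query (instance image plus text) and every candidate image to fixed-length vectors. The names `cosine` and `rank_candidates` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    query_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k candidate names with their similarity to the query.

    query_vec is assumed to be a joint embedding of the instance image and
    the text query, produced by an encoder that is outside this sketch.
    """
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy, hand-written vectors standing in for real embeddings.
    query = [1.0, 0.0, 0.0]
    gallery = {
        "same_mug_by_coffee_machine": [0.9, 0.1, 0.0],
        "different_mug_by_coffee_machine": [0.1, 0.9, 0.0],
        "empty_kitchen": [0.0, 0.0, 1.0],
    }
    for name, score in rank_candidates(query, gallery):
        print(name, round(score, 3))
```

A success under this framing means the image containing the exact same instance in the described scene is ranked first, ahead of images that only share the category or only share the scene.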

Demos

Example 1: Instance retrieval in a kitchen scenario
Example 2: Specific frame retrieval in a movie
Example 3: Precise question answering with video memory
Example 4: Image retrieval in a photo album

Method: A Scalable Data Synthesis Pipeline

To overcome the scarcity of instance-level retrieval data, we propose a scalable synthesis pipeline. We leverage large-scale object detection datasets to create training pairs, treating cropped objects as queries and their original images as targets. A multimodal large language model (MLLM) then generates rich, context-aware text descriptions. With this pipeline we automatically generate over 557K high-quality training samples, while our zero-shot benchmark (IDMR-Bench) is constructed from distinct, real-world video sources to ensure a fair evaluation of model generalization.

IDMR Data Pipeline
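
The pair-construction step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released pipeline: boxes are assumed to be in `[x, y, width, height]` pixel format, images are represented as nested lists so the example needs no image library, the `min_area` filter is an invented detail, and `describe` is a hypothetical stand-in for the MLLM call that writes the context-aware text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image


@dataclass
class TrainingPair:
    query_crop: Image      # cropped object, used as the query image
    target_image_id: str   # original image, used as the retrieval target
    category: str
    text: str              # description produced by the (stubbed) MLLM


def crop(image: Image, bbox: List[int]) -> Image:
    """Cut the [x, y, width, height] box out of the image."""
    x, y, w, h = bbox
    return [row[x:x + w] for row in image[y:y + h]]


def build_pairs(
    image_id: str,
    image: Image,
    annotations: List[Dict],
    describe: Callable[[str, str], str],
    min_area: int = 4,  # assumption: skip boxes too small to be useful
) -> List[TrainingPair]:
    """Turn detection annotations for one image into query/target pairs."""
    pairs = []
    for ann in annotations:
        _, _, w, h = ann["bbox"]
        if w * h < min_area:
            continue
        pairs.append(
            TrainingPair(
                query_crop=crop(image, ann["bbox"]),
                target_image_id=image_id,
                category=ann["category"],
                text=describe(ann["category"], image_id),
            )
        )
    return pairs
```

In a real setting `describe` would prompt an MLLM with the full image and the cropped object; here it can be any function that returns a string, which keeps the data flow testable without a model.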

Core Application: Long-Term Memory for Embodied AI

IDMR is a foundational technology for embodied AI, enabling agents to build long-term, instance-level visual memory. This capability is crucial for robots and AR assistants to recall where a specific object was seen, under what context, and how it relates to user instructions.

By focusing on fine-grained, instance-level correspondence, IDMR provides the mechanism for precise recall and reasoning in complex, dynamic environments. This complements ongoing research in short-term memory and navigation, paving the way for a comprehensive embodied intelligence system.

Embodied AI Memory
Example: An embodied AI agent retrieves a specific instance (e.g., a blue mug) based on a text query, demonstrating long-term memory capabilities.
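
One way to picture such a memory is as a log of observations that can be queried by instance. The sketch below is hypothetical and is not the system shown in the figure: it assumes each observation already carries an instance-level embedding from some encoder, and the class names, the `min_score` threshold, and the "most recent match wins" rule are all illustrative choices.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Observation:
    timestamp: float
    location: str
    embedding: List[float]


@dataclass
class InstanceMemory:
    observations: List[Observation] = field(default_factory=list)

    def remember(self, timestamp: float, location: str,
                 embedding: Sequence[float]) -> None:
        """Store one sighting together with where and when it happened."""
        self.observations.append(Observation(timestamp, location, list(embedding)))

    def recall(self, query_embedding: Sequence[float],
               min_score: float = 0.8) -> Optional[Observation]:
        """Return the most recent sighting that matches the query instance,
        or None if nothing passes the similarity threshold."""
        matches = [
            obs for obs in self.observations
            if _cosine(obs.embedding, query_embedding) >= min_score
        ]
        return max(matches, key=lambda obs: obs.timestamp, default=None)
```

Asking "where did I last see this mug?" then reduces to embedding the mug and calling `recall`, which returns the stored location and timestamp of the latest matching sighting.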

BibTeX

@article{liu2024idmr,
  author    = {Bangwei Liu and Yicheng Bao and Shaohui Lin and Xuhong Wang and Xin Tan and Yingchun Wang and Yuan Xie and Chaochao Lu},
  title     = {IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval},
  journal   = {arXiv preprint arXiv:2504.00954},
  year      = {2025},
}