Our key innovation lies in defining a task that requires both instance-level image matching and deep scene understanding. As shown above, IDMR carves out a unique and challenging space compared to existing benchmarks: it is the first to combine instance-level precision with complex, text-described contextual requirements, directly addressing the needs of advanced applications such as embodied AI memory.
Current multimodal retrieval systems often fail at a critical real-world task: identifying a specific object instance in a new visual context based on text. Most benchmarks focus on retrieving images that are globally similar or contain objects of the same category, not the exact same instance.
Instance-Driven Multimodal Retrieval (IDMR) introduces a new task to bridge this gap. Given an image of a specific instance (e.g., a particular blue mug) and a text query (e.g., "find it next to the coffee machine"), the goal is to retrieve an image containing that exact same mug in the described scene.
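The task above can be framed as standard embedding-based retrieval over a fused (instance image + text) query. The following is a minimal sketch, not the paper's method: `retrieve`, the embedding dimensionality, and the idea of precomputed candidate embeddings are all illustrative assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb, candidate_embs, top_k=1):
    # query_emb: fused embedding of the (instance crop, text query) pair,
    # produced by some multimodal encoder (hypothetical here).
    # candidate_embs: precomputed embeddings of the gallery images.
    scores = [cosine_sim(query_emb, c) for c in candidate_embs]
    order = np.argsort(scores)[::-1]  # highest similarity first
    return order[:top_k].tolist()
```

In a real system the encoder would be a trained multimodal model; the ranking logic itself stays this simple.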
To overcome the scarcity of instance-level retrieval data, we propose a scalable synthesis pipeline. We leverage large-scale object detection datasets to create training pairs by treating cropped objects as queries and their original images as targets. An MLLM then generates rich, context-aware text descriptions. This allows us to automatically generate over 557K high-quality training samples, while our zero-shot benchmark (IDMR-Bench) is constructed from distinct, real-world video sources to ensure fair evaluation of model generalization.
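The crop-as-query construction described above can be sketched as a small data structure plus a pairing function. This is an illustrative outline only: `TrainingPair`, `make_pair`, and the `describe` callback (a stand-in for the MLLM captioning step) are hypothetical names, not the paper's actual pipeline code.

```python
from dataclasses import dataclass

@dataclass
class TrainingPair:
    query_crop: tuple      # (x1, y1, x2, y2) bounding box of the instance
    target_image_id: str   # the original detection image is the retrieval target
    instruction: str       # context-aware text generated for this pair

def make_pair(image_id, bbox, describe):
    # bbox comes from an existing object-detection annotation; the crop
    # serves as the instance query and the full image as the target.
    # `describe` stands in for the MLLM that writes the text query.
    instruction = describe(image_id, bbox)
    return TrainingPair(query_crop=bbox,
                        target_image_id=image_id,
                        instruction=instruction)
```

Iterating this over a detection dataset's annotations is what makes the synthesis scalable: no new labeling is required beyond the MLLM-generated descriptions.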
IDMR is a foundational technology for embodied AI, enabling agents to build long-term, instance-level visual memory. This capability is crucial for robots and AR assistants to recall where a specific object was seen, in what context, and how it relates to user instructions.
By focusing on fine-grained, instance-level correspondence, IDMR provides the mechanism for precise recall and reasoning in complex, dynamic environments. This complements ongoing research in short-term memory and navigation, paving the way for a comprehensive embodied intelligence system.
@article{liu2025idmr,
author = {Bangwei Liu and Yicheng Bao and Shaohui Lin and Xuhong Wang and Xin Tan and Yingchun Wang and Yuan Xie and Chaochao Lu},
title = {IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval},
journal = {arXiv preprint arXiv:2504.00954},
year = {2025},
}