MV-RAG: Retrieval Augmented Multiview Diffusion

Yosef Dayani, Omer Benishu, Sagie Benaim
The Hebrew University of Jerusalem

Abstract

Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail on out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To address this, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.

Teaser visualization
MV-RAG extends the strengths of RAG by addressing challenges such as out-of-domain generations (e.g., ‘Bolognese dog’) and emerging concepts introduced after training (e.g., ‘Labubu doll’).

Method

MV-RAG Method Diagram

MV-RAG advances multiview generation by combining a pretrained multiview model’s internal knowledge with external visual cues retrieved from a large image database. At inference, the retrieved 2D images are encoded into tokens using an image encoder followed by a learned resampler. Within the multiview diffusion model, 3D self-attention layers enforce consistency across the generated views. Each cross-attention layer then operates in two parallel branches: one conditioned on text tokens and the other on retrieved image tokens. Their outputs are fused using a fusion coefficient predicted by the Prior-Guided Attention module.
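
As a rough illustration of this two-branch design, below is a minimal PyTorch sketch of a decoupled cross-attention layer; the class name, token shapes, and the scalar fusion with a coefficient alpha are our assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Two parallel cross-attention branches: one over text tokens and one
    over retrieved-image tokens, fused with a coefficient alpha."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, text_tokens, image_tokens, alpha):
        # x:            (B, N, dim)  latent tokens of the view being denoised
        # text_tokens:  (B, T, dim)  encoded text prompt
        # image_tokens: (B, M, dim)  resampled retrieved-image tokens
        # alpha:        (B, 1, 1)    fusion coefficient from Prior-Guided Attention
        text_out, _ = self.text_attn(x, text_tokens, text_tokens)
        image_out, _ = self.image_attn(x, image_tokens, image_tokens)
        # alpha near 0: lean on the text/prior branch; alpha near 1: lean on retrieval.
        return (1.0 - alpha) * text_out + alpha * image_out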

Prior-Guided Attention: This mechanism adaptively balances what the base model already knows with what the retrieved images contribute in the Decoupled Cross-Attention layers. When the concept is familiar, the model leans more on its internal prior during generation, emphasizing the text tokens; for rare or out-of-distribution concepts, it gives greater weight to the retrieved image tokens. To achieve this, the model first generates a candidate output using its prior knowledge, compares it with the retrieved images to obtain a confidence score, and then fuses the two feature maps accordingly to guide the final multiview reconstruction.
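
The hypothetical sketch below shows one way such a confidence score could be turned into the fusion coefficient; the use of pooled features and cosine similarity here is an assumption made for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def prior_guided_fusion_coefficient(candidate_feats: torch.Tensor,
                                    retrieved_feats: torch.Tensor) -> torch.Tensor:
    """Map the agreement between a prior-only candidate generation and the
    retrieved images to a fusion coefficient in [0, 1].

    candidate_feats: (B, D)    pooled features of the prior-only candidate views
    retrieved_feats: (B, K, D) pooled features of the K retrieved images
    Returns alpha of shape (B, 1, 1): low when the prior already matches the
    retrieval (familiar concept), high when it does not (rare/OOD concept).
    """
    cand = F.normalize(candidate_feats, dim=-1).unsqueeze(1)   # (B, 1, D)
    retr = F.normalize(retrieved_feats, dim=-1)                # (B, K, D)
    similarity = (cand * retr).sum(-1).mean(dim=1)             # (B,) mean cosine similarity
    # High similarity -> confident prior -> small alpha (emphasize text tokens).
    alpha = 1.0 - similarity.clamp(0.0, 1.0)
    return alpha.view(-1, 1, 1)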

Training

Training Schemes

To achieve both fidelity and 3D consistency when conditioning on multiple real-world images, MV-RAG is trained with a hybrid strategy that alternates between two modes.

3D Mode: In 3D mode, the model trains on synthetic 3D datasets. Given several augmented renderings of the same 3D object, it predicts the target multiview images while enforcing consistency across them. This teaches the model to distribute visual features from the retrieved images in a geometrically correct way, resulting in accurate and coherent 3D reconstructions.
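
A minimal sketch of what a 3D-mode training step could look like, assuming an epsilon-prediction diffusion objective and a diffusers-style noise scheduler; the model signature and tensor shapes are hypothetical.

import torch
import torch.nn.functional as F

def train_step_3d_mode(model, scheduler, cond_views, target_views, text_tokens):
    """One 3D-mode step (sketch): condition on augmented renderings of an object
    and denoise its target multiview renderings.

    cond_views:   (B, K, C, H, W)  augmented renderings simulating retrieval variance
    target_views: (B, V, C, H, W)  target multiview renderings of the same object
    """
    noise = torch.randn_like(target_views)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (target_views.shape[0],), device=target_views.device)
    noisy = scheduler.add_noise(target_views, noise, t)
    # The model sees the noisy target views plus the text and retrieval conditions.
    pred = model(noisy, t, text_tokens=text_tokens, retrieved=cond_views)
    return F.mse_loss(pred, noise)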

2D Mode: In 2D mode, the model trains on a real-world image dataset. Here, K+1 images of the same concept are retrieved: K images are provided as the retrieval condition, and the model is tasked with generating the held-out (K+1)th view. This trains the model to generalize from diverse real-world images, a key ability for inference in the retrieval-augmented setting.
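
Analogously, a 2D-mode step could be sketched as follows, holding out one of the K+1 retrieved images as the denoising target; again, the model signature and the diffusers-style scheduler are assumptions, not the official code.

import torch
import torch.nn.functional as F

def train_step_2d_mode(model, scheduler, images, text_tokens):
    """One 2D-mode step (sketch): K+1 retrieved images of the same concept;
    K serve as the retrieval condition and the held-out image is the target.

    images: (B, K + 1, C, H, W) real-world images of one concept per batch item.
    """
    B, K_plus_1 = images.shape[:2]
    held_out = torch.randint(0, K_plus_1, (B,), device=images.device)
    batch_idx = torch.arange(B, device=images.device)
    target = images[batch_idx, held_out]                          # (B, C, H, W)
    mask = torch.ones(B, K_plus_1, dtype=torch.bool, device=images.device)
    mask[batch_idx, held_out] = False
    cond = images[mask].view(B, K_plus_1 - 1, *images.shape[2:])  # (B, K, C, H, W)

    noise = torch.randn_like(target)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (B,), device=images.device)
    noisy = scheduler.add_noise(target, noise, t)
    pred = model(noisy, t, text_tokens=text_tokens, retrieved=cond)
    return F.mse_loss(pred, noise)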

Comparison Results

Comparison of our method against baselines
MV-RAG achieves state-of-the-art results on out-of-distribution (OOD) objects, outperforming both text-conditioned approaches (middle row) and image(s)-conditioned approaches (bottom row).

BibTeX

@misc{dayani2025mvragretrievalaugmentedmultiview,
      title={MV-RAG: Retrieval Augmented Multiview Diffusion},
      author={Yosef Dayani and Omer Benishu and Sagie Benaim},
      year={2025},
      eprint={2508.16577},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.16577},
}