Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance

Abstract

Vision-language models have shown strong performance in 2D image generation, driven by large-scale pretrained foundation models, but comparable models for 3D are lacking, limiting progress in medical imaging. This absence restricts applications such as counterfactual explanations, disease progression simulation, and medical training. We present a framework that generates high-resolution 3D counterfactual medical images from free-form language prompts by adapting state-of-the-art 3D diffusion models with enhanced text conditioning. To our knowledge, this is the first language-guided native-3D diffusion model applied to neurological imaging, enabling faithful representation of brain structures. Experiments on MRI datasets demonstrate the ability to simulate lesion load in Multiple Sclerosis and cognitive states in Alzheimer's disease, producing realistic images while preserving the fidelity of the synthesized subject. Our work establishes a foundation for prompt-driven disease progression analysis in 3D medical imaging.

Method

  • The first framework capable of generating high-resolution, text-guided 3D counterfactual medical images of synthetic subjects
Fig. 1. Proposed Framework. A pretrained BiomedCLIP text encoder encodes the text prompt (e.g. "Subject has high lesion load") as conditioning for the diffusion model. During inference, the model generates counterfactuals by sampling from the same fixed noise while varying the text condition. (A minimal sketch of this sampling loop is given below.)
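
To make the inference procedure concrete, here is a minimal sketch of the counterfactual sampling loop, not the authors' released code: encode_text is a hypothetical wrapper around the frozen BiomedCLIP text encoder, and model.sample stands in for the reverse diffusion sampler of the conditional 3D model.

import torch

def generate_counterfactuals(model, encode_text, prompts, shape, device="cuda"):
    """Sample one volume per prompt from the SAME fixed noise.

    `model` and `encode_text` are hypothetical stand-ins for the
    conditional 3D diffusion model and the frozen BiomedCLIP text
    encoder; only the text condition varies between samples.
    """
    # Fix the initial noise once so that any difference between the
    # generated volumes is attributable to the text condition alone.
    z = torch.randn(1, *shape, device=device)
    volumes = {}
    for prompt in prompts:
        cond = encode_text(prompt)                       # (1, d) text embedding
        volumes[prompt] = model.sample(z.clone(), cond)  # reverse diffusion from z
    return volumes

# Factual vs. counterfactual prompts for the same synthetic subject, e.g.:
# generate_counterfactuals(model, encode_text,
#     ["Subject has low lesion load", "Subject has high lesion load"],
#     shape=(1, 128, 128, 128))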

Results

Conclusion

In this work, we introduced a novel vision-language framework designed specifically for generating high-resolution, text-guided 3D counterfactual medical images of synthetic neurological subjects. Our approach addresses critical limitations of existing methods by integrating advanced diffusion architectures with medically informed semantic embeddings derived from BiomedCLIP. The results demonstrate that our language-guided wavelet-based diffusion model (WDM), operating directly in voxel space, delivers superior subject preservation, image quality, and text alignment compared to conventional latent diffusion approaches. Additionally, the MAISI RFlow model, which incorporates a Rectified Flow noise schedule, significantly improves anatomical consistency and image fidelity while remaining computationally efficient. Qualitative and quantitative analyses indicate the effectiveness of both models in simulating nuanced disease-progression scenarios in synthetic patients, and our ablation studies on classifier-free guidance underscore the explicit trade-off between prompt fidelity and anatomical accuracy.
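
As a rough illustration of the Rectified Flow schedule mentioned above, the following sketch integrates a learned velocity field with plain Euler steps; velocity_model is a hypothetical stand-in, and the exact parameterization in MAISI RFlow may differ.

import torch

@torch.no_grad()
def rectified_flow_sample(velocity_model, cond, shape, steps=30, device="cuda"):
    """Integrate dx/dt = v(x, t, cond) from t = 0 (noise) to t = 1 (data).

    Rectified Flow trains v to follow near-straight paths between the
    noise and data distributions, which is why relatively few Euler
    steps suffice (the source of the efficiency noted above). Some
    formulations reverse this time convention.
    """
    x = torch.randn(1, *shape, device=device)    # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt, device=device)
        x = x + velocity_model(x, t, cond) * dt  # one Euler step along v
    return x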

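The classifier-free guidance trade-off probed in the ablations reduces to a single extrapolation: the guided prediction moves from the unconditional output toward the conditional one, with the guidance scale w controlling how far. A hedged sketch follows; eps_model and null_cond are assumptions, not the paper's exact implementation.

def guided_eps(eps_model, x_t, t, cond, null_cond, w):
    """Classifier-free guidance: eps = eps_u + w * (eps_c - eps_u).

    w = 0 ignores the prompt, w = 1 is plain conditional sampling, and
    larger w strengthens text alignment at the cost of anatomical
    plausibility. `eps_model` and `null_cond` (the null-prompt
    embedding) are hypothetical stand-ins.
    """
    eps_c = eps_model(x_t, t, cond)       # conditional noise prediction
    eps_u = eps_model(x_t, t, null_cond)  # unconditional prediction
    return eps_u + w * (eps_c - eps_u)
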
BibTeX

@misc{mohamed2025imaginingalternativeshighresolution3d,
      title={Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance},
      author={Mohamed Mohamed and Brennan Nichyporuk and Douglas L. Arnold and Tal Arbel},
      year={2025},
      eprint={2509.05978},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      url={https://arxiv.org/abs/2509.05978},
}