Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance

Abstract

Vision-language models have shown strong performance in 2D image generation, driven by large-scale pretrained foundation models, but comparable models for 3D are lacking, limiting progress in medical imaging. This absence restricts applications such as counterfactual explanations, disease progression simulation, and medical training. We present a framework that generates high-resolution 3D counterfactual medical images from free-form language prompts by adapting state-of-the-art 3D diffusion models with enhanced text conditioning. To our knowledge, this is the first language-guided native-3D diffusion model applied to neurological imaging, enabling faithful representation of brain structures. Experiments on MRI datasets demonstrate the ability to simulate lesion load in Multiple Sclerosis and cognitive states in Alzheimer's disease, producing realistic images while preserving the fidelity of the synthesized subject. Our work establishes a foundation for prompt-driven disease progression analysis in 3D medical imaging.

Method

  • The first framework capable of generating high-resolution, text-guided 3D counterfactual medical images of synthetic subjects
Fig. 1. Proposed Framework. A pretrained BiomedCLIP text encoder encodes the text prompt (e.g. "Subject has high lesion load") as conditioning for the diffusion model. During inference, the model generates counterfactuals by sampling from the same fixed noise while varying the text condition. (A minimal sketch of this sampling loop is given below.)
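
To make the inference procedure concrete, here is a minimal sketch of the counterfactual sampling loop, not the authors' released code: encode_text is a hypothetical wrapper around the frozen BiomedCLIP text encoder, and model.sample stands in for the reverse diffusion sampler of the conditional 3D model.

import torch

def generate_counterfactuals(model, encode_text, prompts, shape, device="cuda"):
    """Sample one volume per prompt from the SAME fixed noise.

    `model` and `encode_text` are hypothetical stand-ins for the
    conditional 3D diffusion model and the frozen BiomedCLIP text
    encoder; only the text condition varies between samples.
    """
    # Fix the initial noise once so that any difference between the
    # generated volumes is attributable to the text condition alone.
    z = torch.randn(1, *shape, device=device)
    volumes = {}
    for prompt in prompts:
        cond = encode_text(prompt)                       # (1, d) text embedding
        volumes[prompt] = model.sample(z.clone(), cond)  # reverse diffusion from z
    return volumes

# Factual vs. counterfactual prompts for the same synthetic subject, e.g.:
# generate_counterfactuals(model, encode_text,
#     ["Subject has low lesion load", "Subject has high lesion load"],
#     shape=(1, 128, 128, 128))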

Results

Conclusion

In this work, we introduced a novel vision-language framework designed specifically for generating high-resolution, text-guided 3D counterfactual medical images of synthetic neurological subjects. Our approach addresses critical limitations of existing methods by integrating advanced diffusion architectures with medically informed semantic embeddings derived from BiomedCLIP. The results demonstrate that our language-guided wavelet-based diffusion model (WDM), operating directly in voxel space, delivers superior subject preservation, image quality, and text alignment compared to conventional latent diffusion approaches. Additionally, the MAISI RFlow model, which incorporates a Rectified Flow noise schedule, significantly improves anatomical consistency and image fidelity while remaining computationally efficient. Qualitative and quantitative analyses indicate the effectiveness of both models in simulating nuanced disease-progression scenarios in synthetic patients, and our ablation studies on classifier-free guidance underscore the explicit trade-off between prompt fidelity and anatomical accuracy.
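
As a rough illustration of the Rectified Flow schedule mentioned above, the following sketch integrates a learned velocity field with plain Euler steps; velocity_model is a hypothetical stand-in, and the exact parameterization in MAISI RFlow may differ.

import torch

@torch.no_grad()
def rectified_flow_sample(velocity_model, cond, shape, steps=30, device="cuda"):
    """Integrate dx/dt = v(x, t, cond) from t = 0 (noise) to t = 1 (data).

    Rectified Flow trains v to follow near-straight paths between the
    noise and data distributions, which is why relatively few Euler
    steps suffice (the source of the efficiency noted above). Some
    formulations reverse this time convention.
    """
    x = torch.randn(1, *shape, device=device)    # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt, device=device)
        x = x + velocity_model(x, t, cond) * dt  # one Euler step along v
    return x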

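The classifier-free guidance trade-off probed in the ablations reduces to a single extrapolation: the guided prediction moves from the unconditional output toward the conditional one, with the guidance scale w controlling how far. A hedged sketch follows; eps_model and null_cond are assumptions, not the paper's exact implementation.

def guided_eps(eps_model, x_t, t, cond, null_cond, w):
    """Classifier-free guidance: eps = eps_u + w * (eps_c - eps_u).

    w = 0 ignores the prompt, w = 1 is plain conditional sampling, and
    larger w strengthens text alignment at the cost of anatomical
    plausibility. `eps_model` and `null_cond` (the null-prompt
    embedding) are hypothetical stand-ins.
    """
    eps_c = eps_model(x_t, t, cond)       # conditional noise prediction
    eps_u = eps_model(x_t, t, null_cond)  # unconditional prediction
    return eps_u + w * (eps_c - eps_u)
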
BibTeX

@misc{mohamed2025imaginingalternativeshighresolution3d,
      title={Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance},
      author={Mohamed Mohamed and Brennan Nichyporuk and Douglas L. Arnold and Tal Arbel},
      year={2025},
      eprint={2509.05978},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      url={https://arxiv.org/abs/2509.05978},
}