TY - GEN
T1 - MindDiffuser
T2 - 31st ACM International Conference on Multimedia, MM 2023
AU - Lu, Yizhuo
AU - Du, Changde
AU - Zhou, Qiongyi
AU - Wang, Dianpeng
AU - He, Huiguang
N1 - Publisher Copyright:
© 2023 Owner/Author.
PY - 2023/10/26
Y1 - 2023/10/26
N2 - Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task. Especially, the achievement of precise and controllable image reconstruction bears great significance in propelling the progress and utilization of brain-computer interfaces. Despite the advancements in complex image reconstruction techniques, the challenge persists in achieving a cohesive alignment of both semantic (concepts and objects) and structure (position, orientation, and size) with the image stimuli. To address the aforementioned issue, we propose a two-stage image reconstruction model called MindDiffuser1. In Stage 1, the VQ-VAE latent representations and the CLIP text embeddings decoded from fMRI are put into Stable Diffusion, which yields a preliminary image that contains semantic information. In Stage 2, we utilize the CLIP visual feature decoded from fMRI as supervisory information, and continually adjust the two feature vectors decoded in Stage 1 through backpropagation to align the structural information. The results of both qualitative and quantitative analyses demonstrate that our model has surpassed the current state-of-the-art models on Natural Scenes Dataset (NSD). The subsequent experimental findings corroborate the neurobiological plausibility of the model, as evidenced by the interpretability of the multimodal feature employed, which align with the corresponding brain responses.
AB - Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task. Especially, the achievement of precise and controllable image reconstruction bears great significance in propelling the progress and utilization of brain-computer interfaces. Despite the advancements in complex image reconstruction techniques, the challenge persists in achieving a cohesive alignment of both semantic (concepts and objects) and structure (position, orientation, and size) with the image stimuli. To address the aforementioned issue, we propose a two-stage image reconstruction model called MindDiffuser1. In Stage 1, the VQ-VAE latent representations and the CLIP text embeddings decoded from fMRI are put into Stable Diffusion, which yields a preliminary image that contains semantic information. In Stage 2, we utilize the CLIP visual feature decoded from fMRI as supervisory information, and continually adjust the two feature vectors decoded in Stage 1 through backpropagation to align the structural information. The results of both qualitative and quantitative analyses demonstrate that our model has surpassed the current state-of-the-art models on Natural Scenes Dataset (NSD). The subsequent experimental findings corroborate the neurobiological plausibility of the model, as evidenced by the interpretability of the multimodal feature employed, which align with the corresponding brain responses.
KW - brain-computer interface (bci)
KW - controlled image reconstruction
KW - fmri
KW - probabilistic diffusion model
UR - http://www.scopus.com/inward/record.url?scp=85179558469&partnerID=8YFLogxK
U2 - 10.1145/3581783.3613832
DO - 10.1145/3581783.3613832
M3 - Conference contribution
AN - SCOPUS:85179558469
T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP - 5899
EP - 5908
BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 29 October 2023 through 3 November 2023
ER -