TY - JOUR
T1 - Multi-Scale Spatiotemporal Feature Fusion Network for Video Saliency Prediction
AU - Zhang, Yunzuo
AU - Zhang, Tian
AU - Wu, Cunyu
AU - Tao, Ran
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2024
Y1 - 2024
AB - Video saliency prediction has recently attracted increasing attention, yet its accuracy is still limited by the insufficient use of multi-scale spatiotemporal features. To address this issue, we propose a 3D convolutional Multi-scale Spatiotemporal Feature Fusion Network (MSFF-Net) to fully exploit spatiotemporal features. Specifically, we propose a Bi-directional Temporal-Spatial Feature Pyramid (BiTSFP), the first application of a bi-directional fusion architecture in this field, which adds a flow of shallow location information to the existing flow of deep semantic information. Then, unlike simple addition or concatenation, we design an Attention-Guided Fusion (AGF) mechanism that adaptively learns the fusion weights of adjacent features to integrate them appropriately. Moreover, a Frame-wise Attention (FA) module is introduced to selectively emphasize useful frames, augmenting the multi-scale temporal features to be fused. Our model is simple yet effective and runs in real time. Experimental results on the DHF1K, Hollywood-2, and UCF-sports datasets demonstrate that the proposed MSFF-Net outperforms existing state-of-the-art methods in accuracy.
KW - Video saliency prediction
KW - attention mechanism
KW - feature fusion
KW - multi-scale spatiotemporal features
UR - http://www.scopus.com/inward/record.url?scp=85174803831&partnerID=8YFLogxK
U2 - 10.1109/TMM.2023.3321394
DO - 10.1109/TMM.2023.3321394
M3 - Article
AN - SCOPUS:85174803831
SN - 1520-9210
VL - 26
SP - 4183
EP - 4193
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -