TY - JOUR
T1 - Multi-Scale Spatiotemporal Feature Fusion Network for Video Saliency Prediction
AU - Zhang, Yunzuo
AU - Zhang, Tian
AU - Wu, Cunyu
AU - Tao, Ran
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2024
Y1 - 2024
AB - Video saliency prediction has recently attracted increasing attention, yet its accuracy is still limited by the insufficient use of multi-scale spatiotemporal features. To address this issue, we propose a 3D convolutional Multi-scale Spatiotemporal Feature Fusion Network (MSFF-Net) to fully exploit spatiotemporal features. Specifically, we propose a Bi-directional Temporal-Spatial Feature Pyramid (BiTSFP), the first application of a bi-directional fusion architecture in this field, which adds a flow of shallow location information to the existing flow of deep semantic information. Then, unlike simple addition or concatenation, we design an Attention-Guided Fusion (AGF) mechanism that adaptively learns the fusion weights of adjacent features to integrate them appropriately. Moreover, a Frame-wise Attention (FA) module is introduced to selectively emphasize useful frames, augmenting the multi-scale temporal features to be fused. Our model is simple yet effective and runs in real time. Experimental results on the DHF1K, Hollywood-2, and UCF-sports datasets demonstrate that the proposed MSFF-Net outperforms existing state-of-the-art methods in accuracy.
KW - Video saliency prediction
KW - attention mechanism
KW - feature fusion
KW - multi-scale spatiotemporal features
UR - http://www.scopus.com/inward/record.url?scp=85174803831&partnerID=8YFLogxK
U2 - 10.1109/TMM.2023.3321394
DO - 10.1109/TMM.2023.3321394
M3 - Article
AN - SCOPUS:85174803831
SN - 1520-9210
VL - 26
SP - 4183
EP - 4193
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -