TY - GEN
T1 - SHTVS: Shot-level based Hierarchical Transformer for Video Summarization
T2 - 5th International Conference on Image and Graphics Processing, ICIGP 2022
AU - An, Yubo
AU - Zhao, Shenghui
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/1/7
Y1 - 2022/1/7
AB - In this paper, a Shot-level based Hierarchical Transformer for Video Summarization (SHTVS) is proposed for supervised video summarization. Unlike most existing methods, which employ bidirectional long short-term memory or use self-attention to replace certain components while keeping their overall structure in place, our method shows that a pure Transformer with video feature sequences as its input can achieve competitive performance in video summarization. In addition, to make better use of the multi-shot characteristic of a video, each video feature sequence is first split into shot-level feature sequences with kernel temporal segmentation and then fed into a shot-level Transformer encoder to learn shot-level representations. Finally, the shot-level representations and the original video feature sequence are integrated and fed into the frame-level Transformer encoder to predict frame-level importance scores. Extensive experimental results on two benchmark datasets (SumMe and TVSum) demonstrate the effectiveness of our method.
KW - Hierarchical Transformer
KW - Sequence labeling
KW - Shot-level
KW - Video summarization
UR - http://www.scopus.com/inward/record.url?scp=85127597662&partnerID=8YFLogxK
U2 - 10.1145/3512388.3512427
DO - 10.1145/3512388.3512427
M3 - Conference contribution
AN - SCOPUS:85127597662
T3 - ACM International Conference Proceeding Series
SP - 268
EP - 274
BT - ICIGP 2022 - Proceedings of the 2022 5th International Conference on Image and Graphics Processing
PB - Association for Computing Machinery
Y2 - 7 January 2022 through 9 January 2022
ER -