TY - JOUR
T1 - Combining optical flow and Swin Transformer for Space-Time video super-resolution
AU - Wang, Xin
AU - Wang, Hua
AU - Zhang, Mingli
AU - Zhang, Fan
N1 - Publisher Copyright:
© 2024 Elsevier Ltd
PY - 2024/11
Y1 - 2024/11
N2 - Space–time video super-resolution is a task that aims to interpolate low-frame-rate, low-resolution videos into high-frame-rate, high-resolution ones. While existing Transformer-based methods have achieved results comparable to those of convolutional neural network-based methods, the computational cost of Transformers limits their performance under constrained computational resources. Moreover, the Swin Transformer may fail to fully exploit the spatio-temporal information of video frames due to the limitation of its window size, impeding its effectiveness in handling large motions. To address these limitations, we propose an end-to-end space–time video super-resolution architecture based on optical flow alignment and the Swin Transformer. The alignment module is introduced to extract spatio-temporal information from adjacent frames without significantly increasing the computational burden. Additionally, we design a residual convolution layer that enhances the translational invariance of the features extracted by the Swin Transformer and introduces additional nonlinear transformations. Experimental results demonstrate that our proposed method achieves superior performance on various benchmark datasets compared to state-of-the-art methods. In terms of Peak Signal-to-Noise Ratio, our method outperforms the state-of-the-art methods by at least 0.15 dB on the Vimeo-Medium dataset.
AB - Space–time video super-resolution is a task that aims to interpolate low-frame-rate, low-resolution videos into high-frame-rate, high-resolution ones. While existing Transformer-based methods have achieved results comparable to those of convolutional neural network-based methods, the computational cost of Transformers limits their performance under constrained computational resources. Moreover, the Swin Transformer may fail to fully exploit the spatio-temporal information of video frames due to the limitation of its window size, impeding its effectiveness in handling large motions. To address these limitations, we propose an end-to-end space–time video super-resolution architecture based on optical flow alignment and the Swin Transformer. The alignment module is introduced to extract spatio-temporal information from adjacent frames without significantly increasing the computational burden. Additionally, we design a residual convolution layer that enhances the translational invariance of the features extracted by the Swin Transformer and introduces additional nonlinear transformations. Experimental results demonstrate that our proposed method achieves superior performance on various benchmark datasets compared to state-of-the-art methods. In terms of Peak Signal-to-Noise Ratio, our method outperforms the state-of-the-art methods by at least 0.15 dB on the Vimeo-Medium dataset.
KW - Deep learning
KW - Feature alignment
KW - Optical flow
KW - Residual convolution
KW - Space–time video super-resolution
KW - Swin Transformer
UR - http://www.scopus.com/inward/record.url?scp=85202588100&partnerID=8YFLogxK
U2 - 10.1016/j.engappai.2024.109227
DO - 10.1016/j.engappai.2024.109227
M3 - Article
AN - SCOPUS:85202588100
SN - 0952-1976
VL - 137
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 109227
ER -