Combining optical flow and Swin Transformer for Space-Time video super-resolution

Xin Wang; Hua Wang; Mingli Zhang; Fan Zhang

doi:10.1016/j.engappai.2024.109227

Xin Wang, Hua Wang, Mingli Zhang, Fan Zhang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Space–time video super-resolution is a task that aims to interpolate low-frame-rate, low-resolution videos to high-frame-rate, high-resolution ones. While existing Transformer-based methods have achieved results comparable to convolutional neural network-based methods, the computational cost of the Transformer limits its performance under constrained computational resources. Moreover, Swin Transformer may fail to fully exploit the spatio-temporal information of video frames due to the limitation of window size, impeding its effectiveness in handling large motions. To address these limitations, we propose an end-to-end space–time video super-resolution architecture based on optical flow alignment and Swin Transformer. The alignment module is introduced to extract spatio-temporal information from adjacent frames without significantly increasing the computational burden. Additionally, we design a residual convolution layer to enhance the translational invariance of the features extracted by Swin Transformer and introduce additional nonlinear transformations. Experimental results demonstrate that our proposed method achieves superior performance on various benchmark datasets compared to state-of-the-art methods. In terms of Peak Signal-to-Noise Ratio, our method outperforms the state-of-the-art methods by at least 0.15 dB on the Vimeo-Medium dataset.
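
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how Peak Signal-to-Noise Ratio is commonly computed. It is not taken from the paper: the function name `psnr`, the 8-bit peak value of 255, and the toy pixel lists are all assumptions made for illustration.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Return the PSNR in dB between two equally sized pixel sequences.

    Uses the standard definition 10 * log10(MAX^2 / MSE); max_value=255
    assumes 8-bit pixel intensities (an assumption, not from the paper).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical toy data: one ground-truth row and two reconstructions.
reference = [52, 55, 61, 59]
baseline = [50, 57, 60, 62]   # MSE = 4.5
improved = [51, 56, 61, 61]   # MSE = 1.5

# A PSNR gain between two methods is the difference of their PSNR values.
gain = psnr(reference, improved) - psnr(reference, baseline)
print(f"PSNR gain: {gain:.2f} dB")
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.15 dB corresponds to an MSE ratio of 10^0.015, or roughly 3.4% lower mean squared error.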

Original language	English
Article number	109227
Journal	Engineering Applications of Artificial Intelligence
Volume	137
DOIs	https://doi.org/10.1016/j.engappai.2024.109227
Publication status	Published - Nov 2024
Externally published	Yes

Keywords

Deep learning
Feature alignment
Optical flow
Residual convolution
Space-Time video super-resolution
Swin Transformer

Access to Document

10.1016/j.engappai.2024.109227

Cite this

Wang, X., Wang, H., Zhang, M., & Zhang, F. (2024). Combining optical flow and Swin Transformer for Space-Time video super-resolution. Engineering Applications of Artificial Intelligence, 137, Article 109227. https://doi.org/10.1016/j.engappai.2024.109227

@article{d8b546900fab4e929a97928913db330a,

title = "Combining optical flow and Swin Transformer for Space-Time video super-resolution",

abstract = "Space–time video super-resolution is a task that aims to interpolate low-frame-rate, low-resolution videos to high-frame-rate, high-resolution ones. While existing Transformer-based methods have achieved results comparable to convolutional neural network-based methods, the computational cost of the Transformer limits its performance under constrained computational resources. Moreover, Swin Transformer may fail to fully exploit the spatio-temporal information of video frames due to the limitation of window size, impeding its effectiveness in handling large motions. To address these limitations, we propose an end-to-end space–time video super-resolution architecture based on optical flow alignment and Swin Transformer. The alignment module is introduced to extract spatio-temporal information from adjacent frames without significantly increasing the computational burden. Additionally, we design a residual convolution layer to enhance the translational invariance of the features extracted by Swin Transformer and introduce additional nonlinear transformations. Experimental results demonstrate that our proposed method achieves superior performance on various benchmark datasets compared to state-of-the-art methods. In terms of Peak Signal-to-Noise Ratio, our method outperforms the state-of-the-art methods by at least 0.15 dB on the Vimeo-Medium dataset.",

keywords = "Deep learning, Feature alignment, Optical flow, Residual convolution, Space-Time video super-resolution, Swin Transformer",

author = "Xin Wang and Hua Wang and Mingli Zhang and Fan Zhang",

note = "Publisher Copyright: {\textcopyright} 2024 Elsevier Ltd",

year = "2024",

month = nov,

doi = "10.1016/j.engappai.2024.109227",

language = "English",

volume = "137",

journal = "Engineering Applications of Artificial Intelligence",

issn = "0952-1976",

publisher = "Elsevier Ltd.",

}

TY - JOUR

T1 - Combining optical flow and Swin Transformer for Space-Time video super-resolution

AU - Wang, Xin

AU - Wang, Hua

AU - Zhang, Mingli

AU - Zhang, Fan

PY - 2024/11

Y1 - 2024/11

N2 - Space–time video super-resolution is a task that aims to interpolate low-frame-rate, low-resolution videos to high-frame-rate, high-resolution ones. While existing Transformer-based methods have achieved results comparable to convolutional neural network-based methods, the computational cost of the Transformer limits its performance under constrained computational resources. Moreover, Swin Transformer may fail to fully exploit the spatio-temporal information of video frames due to the limitation of window size, impeding its effectiveness in handling large motions. To address these limitations, we propose an end-to-end space–time video super-resolution architecture based on optical flow alignment and Swin Transformer. The alignment module is introduced to extract spatio-temporal information from adjacent frames without significantly increasing the computational burden. Additionally, we design a residual convolution layer to enhance the translational invariance of the features extracted by Swin Transformer and introduce additional nonlinear transformations. Experimental results demonstrate that our proposed method achieves superior performance on various benchmark datasets compared to state-of-the-art methods. In terms of Peak Signal-to-Noise Ratio, our method outperforms the state-of-the-art methods by at least 0.15 dB on the Vimeo-Medium dataset.

AB - Space–time video super-resolution is a task that aims to interpolate low-frame-rate, low-resolution videos to high-frame-rate, high-resolution ones. While existing Transformer-based methods have achieved results comparable to convolutional neural network-based methods, the computational cost of the Transformer limits its performance under constrained computational resources. Moreover, Swin Transformer may fail to fully exploit the spatio-temporal information of video frames due to the limitation of window size, impeding its effectiveness in handling large motions. To address these limitations, we propose an end-to-end space–time video super-resolution architecture based on optical flow alignment and Swin Transformer. The alignment module is introduced to extract spatio-temporal information from adjacent frames without significantly increasing the computational burden. Additionally, we design a residual convolution layer to enhance the translational invariance of the features extracted by Swin Transformer and introduce additional nonlinear transformations. Experimental results demonstrate that our proposed method achieves superior performance on various benchmark datasets compared to state-of-the-art methods. In terms of Peak Signal-to-Noise Ratio, our method outperforms the state-of-the-art methods by at least 0.15 dB on the Vimeo-Medium dataset.

KW - Deep learning

KW - Feature alignment

KW - Optical flow

KW - Residual convolution

KW - Space-Time video super-resolution

KW - Swin Transformer

UR - http://www.scopus.com/inward/record.url?scp=85202588100&partnerID=8YFLogxK

U2 - 10.1016/j.engappai.2024.109227

DO - 10.1016/j.engappai.2024.109227

M3 - Article

AN - SCOPUS:85202588100

SN - 0952-1976

VL - 137

JO - Engineering Applications of Artificial Intelligence

JF - Engineering Applications of Artificial Intelligence

M1 - 109227

ER -
