Deep video quality assessment using constrained multi-task regression and Spatio-temporal feature fusion

Mingyang Wen; Lixiong Liu; Qingbing Sang; Yongmei Zhang

doi:10.1007/s11042-023-14652-2

Mingyang Wen, Lixiong Liu^*, Qingbing Sang, Yongmei Zhang

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Many popular video quality assessment (VQA) methods build models by simulating human visual perception and adopt a simple regression strategy to predict video quality scores. However, these methods either pay too little attention to the regression stage, which is prone to misprediction, or fail to accurately interpret video content containing changing or sudden motion. To remedy these shortcomings, we propose a full-reference (FR) video quality assessment model that integrates multi-task learning regression with spatio-temporal feature analysis to predict video quality. First, the model divides each frame of the reference and distorted videos into patches and calculates their entropy values to guide patch selection. A 2D Siamese network is then applied to the selected patches to learn spatial information. To capture temporal distortions more effectively, a multi-frame difference map is computed for each distorted video. The multi-frame difference maps are also partitioned into patches, and the half with the highest entropy values is selected as temporal features. Additionally, we incorporate the temporal masking effect to optimize the spatial error and temporal features, and adopt a 3D convolutional neural network (CNN) for spatio-temporal feature fusion. Following recent evidence on quality classification and quality regression, a constrained multi-task learning regression model is designed to aggregate the quality score, using a quality classification subtask to constrain and optimize the main quality regression task. Finally, the video quality score is predicted through the regression branch. We evaluated our algorithm on five public VQA databases. The experimental results show that the proposed algorithm achieves superior performance compared with existing VQA methods.
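The entropy-guided patch selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the non-overlapping patch grid, the histogram-based Shannon entropy, and the two-frame absolute difference (used here as a simple stand-in for the paper's multi-frame difference map, whose exact definition the abstract does not give) are all assumptions made for illustration.

```python
import math
from collections import Counter


def patch_entropy(patch):
    """Shannon entropy (in bits) of the pixel-value histogram of a 2D patch."""
    values = [v for row in patch for v in row]
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def split_into_patches(frame, size):
    """Split a 2D frame into non-overlapping size x size patches.

    Assumption: border pixels that do not fill a whole patch are dropped.
    """
    height, width = len(frame), len(frame[0])
    patches = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            patches.append([row[left:left + size] for row in frame[top:top + size]])
    return patches


def frame_difference(frame_a, frame_b):
    """Absolute per-pixel difference of two frames.

    Assumption: a two-frame stand-in for the multi-frame difference map.
    """
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]


def select_top_half_by_entropy(patches):
    """Return the indices of the half of the patches with the highest entropy."""
    ranked = sorted(range(len(patches)),
                    key=lambda i: patch_entropy(patches[i]),
                    reverse=True)
    return sorted(ranked[:len(patches) // 2])


if __name__ == "__main__":
    previous = [[0, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]]
    current = [[0, 0, 0, 1],
               [0, 0, 2, 3],
               [0, 0, 5, 5],
               [1, 1, 5, 5]]
    diff_map = frame_difference(previous, current)
    patches = split_into_patches(diff_map, 2)
    print(select_top_half_by_entropy(patches))
```

In this toy example the flat patches have zero entropy and are discarded, while the two patches with varied difference values are kept. In the described model, the selected patches would then be passed to the feature-learning networks, which this sketch does not cover.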

Original language	English
Pages (from-to)	28067-28086
Number of pages	20
Journal	Multimedia Tools and Applications
Volume	82
Issue number	18
DOIs	https://doi.org/10.1007/s11042-023-14652-2
Publication status	Published - Jul 2023

Keywords

Deep video quality assessment
Entropy
Multi-frame difference
Multi-task learning regression
Temporal masking

Access to Document

10.1007/s11042-023-14652-2

Cite this

Wen, M., Liu, L., Sang, Q., & Zhang, Y. (2023). Deep video quality assessment using constrained multi-task regression and Spatio-temporal feature fusion. Multimedia Tools and Applications, 82(18), 28067-28086. https://doi.org/10.1007/s11042-023-14652-2

@article{12367bee90634a359141bd312e5935d9,

title = "Deep video quality assessment using constrained multi-task regression and Spatio-temporal feature fusion",

abstract = "Many popular video quality assessment (VQA) methods build models by simulating human visual perception and adopt a simple regression strategy to predict video quality scores. However, these methods either pay too little attention to the regression stage, which is prone to misprediction, or fail to accurately interpret video content containing changing or sudden motion. To remedy these shortcomings, we propose a full-reference (FR) video quality assessment model that integrates multi-task learning regression with spatio-temporal feature analysis to predict video quality. First, the model divides each frame of the reference and distorted videos into patches and calculates their entropy values to guide patch selection. A 2D Siamese network is then applied to the selected patches to learn spatial information. To capture temporal distortions more effectively, a multi-frame difference map is computed for each distorted video. The multi-frame difference maps are also partitioned into patches, and the half with the highest entropy values is selected as temporal features. Additionally, we incorporate the temporal masking effect to optimize the spatial error and temporal features, and adopt a 3D convolutional neural network (CNN) for spatio-temporal feature fusion. Following recent evidence on quality classification and quality regression, a constrained multi-task learning regression model is designed to aggregate the quality score, using a quality classification subtask to constrain and optimize the main quality regression task. Finally, the video quality score is predicted through the regression branch. We evaluated our algorithm on five public VQA databases. The experimental results show that the proposed algorithm achieves superior performance compared with existing VQA methods.",

keywords = "Deep video quality assessment, Entropy, Multi-frame difference, Multi-task learning regression, Temporal masking",

author = "Mingyang Wen and Lixiong Liu and Qingbing Sang and Yongmei Zhang",

note = "Publisher Copyright: {\textcopyright} 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.",

year = "2023",

month = jul,

doi = "10.1007/s11042-023-14652-2",

language = "English",

volume = "82",

pages = "28067--28086",

journal = "Multimedia Tools and Applications",

issn = "1380-7501",

publisher = "Springer",

number = "18",

}

TY - JOUR

T1 - Deep video quality assessment using constrained multi-task regression and Spatio-temporal feature fusion

AU - Wen, Mingyang

AU - Liu, Lixiong

AU - Sang, Qingbing

AU - Zhang, Yongmei

PY - 2023/7

Y1 - 2023/7

AB - Many popular video quality assessment (VQA) methods build models by simulating human visual perception and adopt a simple regression strategy to predict video quality scores. However, these methods either pay too little attention to the regression stage, which is prone to misprediction, or fail to accurately interpret video content containing changing or sudden motion. To remedy these shortcomings, we propose a full-reference (FR) video quality assessment model that integrates multi-task learning regression with spatio-temporal feature analysis to predict video quality. First, the model divides each frame of the reference and distorted videos into patches and calculates their entropy values to guide patch selection. A 2D Siamese network is then applied to the selected patches to learn spatial information. To capture temporal distortions more effectively, a multi-frame difference map is computed for each distorted video. The multi-frame difference maps are also partitioned into patches, and the half with the highest entropy values is selected as temporal features. Additionally, we incorporate the temporal masking effect to optimize the spatial error and temporal features, and adopt a 3D convolutional neural network (CNN) for spatio-temporal feature fusion. Following recent evidence on quality classification and quality regression, a constrained multi-task learning regression model is designed to aggregate the quality score, using a quality classification subtask to constrain and optimize the main quality regression task. Finally, the video quality score is predicted through the regression branch. We evaluated our algorithm on five public VQA databases. The experimental results show that the proposed algorithm achieves superior performance compared with existing VQA methods.

KW - Deep video quality assessment

KW - Entropy

KW - Multi-frame difference

KW - Multi-task learning regression

KW - Temporal masking

UR - http://www.scopus.com/inward/record.url?scp=85148063951&partnerID=8YFLogxK

U2 - 10.1007/s11042-023-14652-2

DO - 10.1007/s11042-023-14652-2

M3 - Article

AN - SCOPUS:85148063951

SN - 1380-7501

VL - 82

SP - 28067

EP - 28086

JO - Multimedia Tools and Applications

JF - Multimedia Tools and Applications

IS - 18

ER -
