Research on Lip Synthesis Incorporating Audio-Visual Synchronization

Cong Jin, Jie Wang, Zichun Guo*, Jing Wang

*Corresponding author of this work

Research output: Journal article › peer-reviewed

Abstract

With the flourishing development of video-based information dissemination, audio-video synchronization has gradually become an important criterion for measuring video quality. Deep synthesis technology has entered the public eye in the field of international communication, and lip-sync technology that integrates audio and video synchronization has attracted growing attention. Existing lip-synthesis models mainly perform lip synthesis on static images and are less effective on dynamic videos; moreover, most are trained on English datasets, which results in poor synthesis quality for Mandarin Chinese. To address these problems, this paper builds on the Wav2Lip lip-synthesis model and conducts optimization experiments in a Chinese-language context, testing the effects of different training routes through multiple sets of experiments, which provides an important reference for subsequent Wav2Lip research. This study extends lip synthesis from speech-driven to text-driven, discusses applications of lip synthesis in fields such as virtual digital humans, and lays a foundation for the broader application and development of lip-synthesis technology.

Translated title of the contribution: Lip synthesis incorporating audio-visual synchronisation
Original language: Traditional Chinese
Pages (from-to): 397-405
Number of pages: 9
Journal: Chinese Journal of Intelligent Science and Technology
Volume: 5
Issue number: 3
DOI
Publication status: Published - 15 Sep 2023

Keywords

  • artificial intelligence
  • computer visualization
  • deep learning
  • lip generation
  • synchronization of audio and video
