Cross-Lingual Knowledge Distillation for Chinese Video Captioning

Jing Yi Hou, Ya Yun Qi, Xin Xiao Wu*, Yun De Jia

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

Video captioning aims to automatically generate natural language descriptions of a video, which requires understanding the visual content and describing it with grammatically accurate sentences. Video captioning has wide applications in video recommendation, vision assistance, human-robot interaction, and many other fields, and has attracted growing attention in computer vision and natural language processing. Although remarkable progress has been made on English video captioning, describing a video in other languages such as Chinese remains under-explored. In this paper, we investigate Chinese video captioning. However, the scarcity of paired videos and Chinese captions makes it difficult to train a powerful model for Chinese video captioning. Since many English video captioning methods and training datasets already exist, a feasible approach is to generate Chinese captions by machine-translating the English ones. However, both the differences between Chinese and Western cultures and the performance of machine translation algorithms affect the quality of the resulting Chinese captions. To this end, we propose a cross-lingual knowledge distillation method for Chinese video captioning. Based on a two-branch structure, our method not only directly generates Chinese captions according to the video content, but also takes full advantage of easily accessible English video captions as privileged information to guide the generation of Chinese captions. Since the Chinese and English captions are semantically correlated with respect to the video content, our method learns cross-lingual knowledge from them and uses knowledge distillation to integrate the high-level semantic information of the English captions into Chinese caption generation.
Meanwhile, consistency between the training target and the captioning target is guaranteed by the end-to-end training strategy, effectively improving the performance of Chinese video captioning. Benefiting from the knowledge distillation mechanism, our method uses English caption data only during training, and after training it directly generates Chinese captions from the input video. To verify the universality and flexibility of our cross-lingual knowledge distillation method, we evaluate it with four mainstream visual captioning models, covering the CNN-RNN structure, the RNN-RNN structure, the CNN-CNN structure, and a model based on the Top-Down attention mechanism. These models are widely used as backbones in a large number of visual captioning methods. Moreover, we extend the English video captioning dataset MSVD into a cross-lingual video captioning dataset with Chinese captions, called MSVD-CN. MSVD-CN contains 1970 video clips collected from the Internet and 11758 Chinese captions, in addition to the original 41 English captions per video in MSVD. To reduce annotation mistakes caused by annotators' typos or misunderstandings of the video content, we propose two automatic inspection methods that perform semantic and syntactic checks, respectively, on the collected manual annotations during data collection. Extensive experiments are carried out on the MSVD-CN dataset using four widely used video captioning evaluation metrics: BLEU, METEOR, ROUGE-L, and CIDEr. The results demonstrate the superiority of the proposed cross-lingual knowledge distillation for Chinese video captioning. Furthermore, we report qualitative results that show the effectiveness of our method.
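The paper's exact objective is not given on this page; as a rough illustration of the distillation idea the abstract describes (a Chinese-caption branch trained on ground-truth labels while also being softly aligned with an English teacher branch), a generic Hinton-style distillation loss can be sketched as follows. All function and parameter names here are hypothetical, not from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic KD objective (illustrative, not the paper's formulation):
    hard cross-entropy on the Chinese ground-truth tokens, plus a
    temperature-softened KL term pulling the student (Chinese branch)
    toward the teacher (English branch) distribution."""
    n = len(labels)
    # hard cross-entropy against ground-truth token ids
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(n), labels] + 1e-12).mean()
    # softened KL divergence KL(teacher || student)
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    # T^2 rescales gradients of the softened term (Hinton et al.)
    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

Note the KL term vanishes when the two branches agree, so with identical logits the loss reduces to the weighted cross-entropy alone.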

Translated title of the contribution: Cross-Lingual Knowledge Distillation for Chinese Video Captioning
Original language: Traditional Chinese
Pages (from-to): 1907-1921
Number of pages: 15
Journal: Jisuanji Xuebao/Chinese Journal of Computers
Volume: 44
Issue number: 9
DOI
Publication status: Published - Sep 2021

Keywords

  • Chinese video captioning
  • Cross-lingual video captioning dataset
  • Knowledge distillation
  • Privileged information
  • Video understanding
