TY - GEN
T1 - A Cross-Modal Classification Dataset on Social Network
AU - Hu, Yong
AU - Huang, Heyan
AU - Chen, Anfan
AU - Mao, Xian Ling
N1 - Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Classifying tweets into general categories, such as food, music and games, is an essential task for social network platforms, and it underpins information recommendation, user profiling and content construction. To the best of our knowledge, nearly all existing general tweet classification datasets contain only textual content. However, the textual content of a tweet may be short, uninformative, or even absent, which harms classification performance. In fact, images and videos are widespread in tweets and can intuitively provide extra useful information. To fill this gap, we construct CMCD, a novel Cross-Modal Classification Dataset built from Weibo. Specifically, we collect tweets spanning three modalities (text, image and video) across 18 general categories, and then filter out tweets that can easily be classified from their textual content alone. The final dataset consists of 85,860 tweets, all of which have been manually labelled. Among them, 64.4% contain images and 16.2% contain videos. We implement classical baselines for tweet classification and report human performance. Empirical results show that classification on CMCD is challenging and requires further effort.
AB - Classifying tweets into general categories, such as food, music and games, is an essential task for social network platforms, and it underpins information recommendation, user profiling and content construction. To the best of our knowledge, nearly all existing general tweet classification datasets contain only textual content. However, the textual content of a tweet may be short, uninformative, or even absent, which harms classification performance. In fact, images and videos are widespread in tweets and can intuitively provide extra useful information. To fill this gap, we construct CMCD, a novel Cross-Modal Classification Dataset built from Weibo. Specifically, we collect tweets spanning three modalities (text, image and video) across 18 general categories, and then filter out tweets that can easily be classified from their textual content alone. The final dataset consists of 85,860 tweets, all of which have been manually labelled. Among them, 64.4% contain images and 16.2% contain videos. We implement classical baselines for tweet classification and report human performance. Empirical results show that classification on CMCD is challenging and requires further effort.
UR - http://www.scopus.com/inward/record.url?scp=85093119214&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-60450-9_55
DO - 10.1007/978-3-030-60450-9_55
M3 - Conference contribution
AN - SCOPUS:85093119214
SN - 9783030604493
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 697
EP - 709
BT - Natural Language Processing and Chinese Computing - 9th CCF International Conference, NLPCC 2020, Proceedings
A2 - Zhu, Xiaodan
A2 - Zhang, Min
A2 - Hong, Yu
A2 - He, Ruifang
PB - Springer Science and Business Media Deutschland GmbH
T2 - 9th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2020
Y2 - 14 October 2020 through 18 October 2020
ER -