A Cross-Modal Classification Dataset on Social Network

Yong Hu, Heyan Huang*, Anfan Chen, Xian Ling Mao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

2 Citations (Scopus)

Abstract

Classifying tweets into general categories, such as food, music and games, is an essential task for social network platforms and forms the basis for information recommendation, user portraits and content construction. As far as we know, nearly all existing general tweet classification datasets contain only textual content. However, the textual content of a tweet may be short, meaningless, or even absent, which harms classification performance. In fact, images and videos are widespread in tweets, and they can intuitively provide extra useful information. To fill this gap, we construct a novel Cross-Modal Classification Dataset from Weibo, called CMCD. Specifically, we collect tweets with three modalities (text, image and video) from 18 general categories, and then filter out tweets that can easily be classified from their textual content alone. Finally, the whole dataset consists of 85,860 tweets, all of which have been manually labelled. Among them, 64.4% contain images and 16.2% contain videos. We implement classical baselines for tweet classification and report human performance. Empirical results show that classification over CMCD is challenging and requires further effort.

Original language: English
Title of host publication: Natural Language Processing and Chinese Computing - 9th CCF International Conference, NLPCC 2020, Proceedings
Editors: Xiaodan Zhu, Min Zhang, Yu Hong, Ruifang He
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 697-709
Number of pages: 13
ISBN (Print): 9783030604493
Publication status: Published - 2020
Event: 9th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2020 - Zhengzhou, China
Duration: 14 Oct 2020 to 18 Oct 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12430 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 9th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2020
Country/Territory: China
City: Zhengzhou
Period: 14/10/20 to 18/10/20
