A Cross-Modal Classification Dataset on Social Network

Yong Hu; Heyan Huang; Anfan Chen; Xian Ling Mao

doi:10.1007/978-3-030-60450-9_55

Yong Hu, Heyan Huang^*, Anfan Chen, Xian Ling Mao

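^*Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

2 Citations (Scopus)

Abstract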

Classifying tweets into general categories, such as food, music and games, is an essential task for social network platforms and underpins information recommendation, user portraits and content construction. As far as we know, nearly all existing general tweet classification datasets contain only textual content. However, the textual content of a tweet may be short, meaningless, or even absent, which harms classification performance. In fact, images and videos are widespread in tweets, and they can intuitively provide extra useful information. To fill this gap, we construct a novel Cross-Modal Classification Dataset from Weibo, called CMCD. Specifically, we collect tweets with three modalities (text, image and video) from 18 general categories, and then filter out tweets that can easily be classified from their textual content alone. The final dataset consists of 85,860 tweets, all of which have been manually labelled. Among them, 64.4% contain images and 16.2% contain videos. We implement classical baselines for tweet classification and report human performance. Empirical results show that classification over CMCD is challenging and requires further effort.

Original language	English
Title of host publication	Natural Language Processing and Chinese Computing - 9th CCF International Conference, NLPCC 2020, Proceedings
Editors	Xiaodan Zhu, Min Zhang, Yu Hong, Ruifang He
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	697-709
Number of pages	13
ISBN (Print)	9783030604493
DOI	https://doi.org/10.1007/978-3-030-60450-9_55
Publication status	Published - 2020
Event	9th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2020 - Zhengzhou, China; Duration: 14 Oct 2020 → 18 Oct 2020

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	12430 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	9th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2020
Country/Territory	China
City	Zhengzhou
Period	14/10/20 → 18/10/20

Cite this

Hu, Y., Huang, H., Chen, A., & Mao, X. L. (2020). A Cross-Modal Classification Dataset on Social Network. In X. Zhu, M. Zhang, Y. Hong, & R. He (Eds.), Natural Language Processing and Chinese Computing - 9th CCF International Conference, NLPCC 2020, Proceedings (pp. 697-709). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12430 LNAI). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-60450-9_55

Hu, Yong ; Huang, Heyan ; Chen, Anfan et al. / A Cross-Modal Classification Dataset on Social Network. Natural Language Processing and Chinese Computing - 9th CCF International Conference, NLPCC 2020, Proceedings. ed. / Xiaodan Zhu ; Min Zhang ; Yu Hong ; Ruifang He. Springer Science and Business Media Deutschland GmbH, 2020. pp. 697-709 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{f56cfdc3193f4292907fb2029e4737ff,

title = "A Cross-Modal Classification Dataset on Social Network",

abstract = "Classifying tweets into general categories, such as food, music and games, is an essential work for social network platforms, which is the basis for information recommendation, user portraits and content construction. As far as we know, nearly all existing general tweet classification datasets only have textual content. However, textual content in tweets may be short, meaningless, and even none, which would harm the classification performance. In fact, images and videos are widespread in tweets, and they can intuitively provide extra useful information. To fill this gap, we construct a novel Cross-Modal Classification Dataset constructed from Weibo called CMCD. Specifically, we collect tweets with three modalities of text, image and video from 18 general categories, and then filter tweets that can easily be classified by only textual contents. Finally, the whole dataset consists of 85,860 tweets, and all of them have been manually labelled. Among them, 64.4% of tweets contain images, and 16.2% of tweets contain videos. We implement classical baselines for tweets classification and report human performance. Empirical results show that the classification over CMCD is challenging enough and requires further efforts.",

author = "Yong Hu and Heyan Huang and Anfan Chen and Mao, {Xian Ling}",

note = "Publisher Copyright: {\textcopyright} 2020, Springer Nature Switzerland AG.; 9th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2020 ; Conference date: 14-10-2020 Through 18-10-2020",

year = "2020",

doi = "10.1007/978-3-030-60450-9_55",

language = "English",

isbn = "9783030604493",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "697--709",

editor = "Xiaodan Zhu and Min Zhang and Yu Hong and Ruifang He",

booktitle = "Natural Language Processing and Chinese Computing - 9th CCF International Conference, NLPCC 2020, Proceedings",

address = "Germany",

}

Hu, Y, Huang, H, Chen, A & Mao, XL 2020, A Cross-Modal Classification Dataset on Social Network. in X Zhu, M Zhang, Y Hong & R He (eds), Natural Language Processing and Chinese Computing - 9th CCF International Conference, NLPCC 2020, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12430 LNAI, Springer Science and Business Media Deutschland GmbH, pp. 697-709, 9th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2020, Zhengzhou, China, 14/10/20. https://doi.org/10.1007/978-3-030-60450-9_55

A Cross-Modal Classification Dataset on Social Network. / Hu, Yong; Huang, Heyan; Chen, Anfan et al.
Natural Language Processing and Chinese Computing - 9th CCF International Conference, NLPCC 2020, Proceedings. ed. / Xiaodan Zhu; Min Zhang; Yu Hong; Ruifang He. Springer Science and Business Media Deutschland GmbH, 2020. pp. 697-709 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12430 LNAI).

TY - GEN

T1 - A Cross-Modal Classification Dataset on Social Network

AU - Hu, Yong

AU - Huang, Heyan

AU - Chen, Anfan

AU - Mao, Xian Ling

PY - 2020

Y1 - 2020

N2 - Classifying tweets into general categories, such as food, music and games, is an essential work for social network platforms, which is the basis for information recommendation, user portraits and content construction. As far as we know, nearly all existing general tweet classification datasets only have textual content. However, textual content in tweets may be short, meaningless, and even none, which would harm the classification performance. In fact, images and videos are widespread in tweets, and they can intuitively provide extra useful information. To fill this gap, we construct a novel Cross-Modal Classification Dataset constructed from Weibo called CMCD. Specifically, we collect tweets with three modalities of text, image and video from 18 general categories, and then filter tweets that can easily be classified by only textual contents. Finally, the whole dataset consists of 85,860 tweets, and all of them have been manually labelled. Among them, 64.4% of tweets contain images, and 16.2% of tweets contain videos. We implement classical baselines for tweets classification and report human performance. Empirical results show that the classification over CMCD is challenging enough and requires further efforts.

UR - http://www.scopus.com/inward/record.url?scp=85093119214&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-60450-9_55

DO - 10.1007/978-3-030-60450-9_55

M3 - Conference contribution

AN - SCOPUS:85093119214

SN - 9783030604493

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 697

EP - 709

BT - Natural Language Processing and Chinese Computing - 9th CCF International Conference, NLPCC 2020, Proceedings

A2 - Zhu, Xiaodan

A2 - Zhang, Min

A2 - Hong, Yu

A2 - He, Ruifang

PB - Springer Science and Business Media Deutschland GmbH

T2 - 9th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2020

Y2 - 14 October 2020 through 18 October 2020

ER -

Hu Y, Huang H, Chen A, Mao XL. A Cross-Modal Classification Dataset on Social Network. In Zhu X, Zhang M, Hong Y, He R, editors, Natural Language Processing and Chinese Computing - 9th CCF International Conference, NLPCC 2020, Proceedings. Springer Science and Business Media Deutschland GmbH. 2020. p. 697-709. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-60450-9_55
