FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain

Hao Wang; Jing Jing Zhu; Wei Wei; Heyan Huang; Xian Ling Mao

doi:10.1007/978-3-031-44696-2_51

Hao Wang, Jing Jing Zhu, Wei Wei, Heyan Huang, Xian Ling Mao^*

^*Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-review

Abstract

As scientific communities grow and evolve, more and more papers are published, especially in the field of computer science (CS). It is important to organize the scientific information extracted from a large corpus of CS papers into structured knowledge bases, which usually requires Information Extraction (IE) of scientific entities and their relationships. To support building high-quality structured scientific knowledge bases through supervised learning, several manually annotated entity-relation datasets for computer science, such as SciERC and SciREX, have to our knowledge been released and are used to train supervised extraction algorithms. However, almost all of these datasets ignore the annotation of the following fine-grained named entities: nested entities, discontinuous entities, and minimal independent semantics entities. To address this problem, this paper presents a novel Fine-Grained entity-relation Extraction dataset in the Computer Science field (FGCS), which contains rich fine-grained entities and their relationships. The proposed dataset includes 1,948 sentences annotated with 6 entity types, with up to 7 layers of nesting, and 5 relation types. Extensive experiments show that the proposed dataset is a good benchmark for measuring an information extraction model’s ability to recognize fine-grained entities and their relations. Our dataset is publicly available at https://github.com/broken-dream/FGCS.

Original language	English
Title of host publication	Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Proceedings
Editors	Fei Liu, Nan Duan, Qingting Xu, Yu Hong
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	653-665
Number of pages	13
ISBN (Print)	9783031446955
DOI	https://doi.org/10.1007/978-3-031-44696-2_51
Publication status	Published - 2023
Event	12th National CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2023 - Foshan, China. Duration: 12 Oct 2023 → 15 Oct 2023

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	14303 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	12th National CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2023
Country/Territory	China
City	Foshan
Period	12 Oct 2023 → 15 Oct 2023

Access to Document

10.1007/978-3-031-44696-2_51

Cite this

Wang, H., Zhu, J. J., Wei, W., Huang, H., & Mao, X. L. (2023). FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain. In F. Liu, N. Duan, Q. Xu, & Y. Hong (Eds.), Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Proceedings (pp. 653-665). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14303 LNAI). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-44696-2_51

Wang, Hao; Zhu, Jing Jing; Wei, Wei et al. / FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain. Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Proceedings. ed. / Fei Liu; Nan Duan; Qingting Xu; Yu Hong. Springer Science and Business Media Deutschland GmbH, 2023. pp. 653-665 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{60e91b0152064ce080c6a006252d9a00,

title = "FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain",

keywords = "Datasets, Fine-grained Entities, Information Extraction",

author = "Hao Wang and Zhu, {Jing Jing} and Wei Wei and Heyan Huang and Mao, {Xian Ling}",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.; 12th National CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2023 ; Conference date: 12-10-2023 Through 15-10-2023",

year = "2023",

doi = "10.1007/978-3-031-44696-2_51",

language = "English",

isbn = "9783031446955",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "653--665",

editor = "Fei Liu and Nan Duan and Qingting Xu and Yu Hong",

booktitle = "Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Proceedings",

address = "Germany",

}

Wang, H, Zhu, JJ, Wei, W, Huang, H & Mao, XL 2023, FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain. in F Liu, N Duan, Q Xu & Y Hong (eds), Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14303 LNAI, Springer Science and Business Media Deutschland GmbH, pp. 653-665, 12th National CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2023, Foshan, China, 12/10/23. https://doi.org/10.1007/978-3-031-44696-2_51

FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain. / Wang, Hao; Zhu, Jing Jing; Wei, Wei et al.
Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Proceedings. ed. / Fei Liu; Nan Duan; Qingting Xu; Yu Hong. Springer Science and Business Media Deutschland GmbH, 2023. pp. 653-665 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14303 LNAI).

TY - GEN

T1 - FGCS

T2 - 12th National CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2023

AU - Wang, Hao

AU - Zhu, Jing Jing

AU - Wei, Wei

AU - Huang, Heyan

AU - Mao, Xian Ling

N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.

PY - 2023

Y1 - 2023

KW - Datasets

KW - Fine-grained Entities

KW - Information Extraction

UR - http://www.scopus.com/inward/record.url?scp=85174684437&partnerID=8YFLogxK

U2 - 10.1007/978-3-031-44696-2_51

DO - 10.1007/978-3-031-44696-2_51

M3 - Conference contribution

AN - SCOPUS:85174684437

SN - 9783031446955

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 653

EP - 665

BT - Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Proceedings

A2 - Liu, Fei

A2 - Duan, Nan

A2 - Xu, Qingting

A2 - Hong, Yu

PB - Springer Science and Business Media Deutschland GmbH

Y2 - 12 October 2023 through 15 October 2023

ER -

Wang H, Zhu JJ, Wei W, Huang H, Mao XL. FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain. In Liu F, Duan N, Xu Q, Hong Y, editors, Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Proceedings. Springer Science and Business Media Deutschland GmbH. 2023. p. 653-665. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-031-44696-2_51
