FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain

Hao Wang, Jing Jing Zhu, Wei Wei, Heyan Huang, Xian Ling Mao*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

As scientific communities grow and evolve, more and more papers are published, especially in the field of computer science (CS). It is important to organize the scientific information extracted from a large corpus of CS papers into structured knowledge bases, which usually requires Information Extraction (IE) of scientific entities and their relationships. To support the construction of high-quality structured scientific knowledge bases through supervised learning, several manually annotated entity-relation datasets for computer science, such as SciERC and SciREX, have been released, as far as we know, and are used to train supervised extraction algorithms. However, almost all of these datasets ignore the annotation of the following fine-grained named entities: nested entities, discontinuous entities, and minimal independent semantics entities. To solve this problem, this paper presents a novel Fine-Grained entity-relation extraction dataset in the Computer Science field (FGCS), which contains rich fine-grained entities and their relationships. The proposed dataset includes 1,948 sentences covering 6 entity types, with up to 7 layers of nesting, and 5 relation types. Extensive experiments show that the proposed dataset is a good benchmark for measuring an information extraction model’s ability to recognize fine-grained entities and their relations. Our dataset is publicly available at https://github.com/broken-dream/FGCS.
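
To make the terms "nested" and "discontinuous" concrete, the following is a minimal illustrative sketch and not the FGCS annotation format. The `Entity` class, the example sentence, and the labels `Method` and `Task` are all assumptions introduced here. An entity is stored as one or more token-offset fragments, so a discontinuous entity simply has more than one fragment, and nesting depth is counted by proper containment of token sets.

```python
from dataclasses import dataclass

# Hypothetical example sentence, tokenized by whitespace (not taken from FGCS).
TOKENS = "convolutional neural network for image and video classification".split()


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity record: a label plus (start, end) token fragments, end exclusive."""
    label: str
    fragments: tuple

    @property
    def tokens(self):
        # All token indices covered by this entity, across every fragment.
        return frozenset(i for start, end in self.fragments for i in range(start, end))

    @property
    def is_discontinuous(self):
        return len(self.fragments) > 1

    def text(self, tokens):
        return " ".join(tokens[i] for i in sorted(self.tokens))


def nesting_depth(entity, entities):
    """Return 1 for an entity with nothing nested inside it, plus 1 per nested layer."""
    # Proper-subset comparison rules out cycles, so the recursion terminates.
    inner = [e for e in entities if e.tokens < entity.tokens]
    return 1 + max((nesting_depth(e, entities) for e in inner), default=0)


# Labels "Method" and "Task" are assumed for illustration only.
cnn = Entity("Method", ((0, 3),))                   # convolutional neural network
nn = Entity("Method", ((1, 3),))                    # neural network, nested in cnn
video_cls = Entity("Task", ((6, 8),))               # video classification
image_cls = Entity("Task", ((4, 5), (7, 8)))        # image ... classification
ENTITIES = [cnn, nn, video_cls, image_cls]

for ent in ENTITIES:
    print(ent.label, repr(ent.text(TOKENS)),
          "discontinuous" if ent.is_discontinuous else "contiguous",
          "depth", nesting_depth(ent, ENTITIES))
```

Comparing token sets rather than outer span boundaries keeps "video classification" from being counted as nested inside the discontinuous "image classification", even though its tokens fall between that entity's first and last fragment.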

Original language: English
Title of host publication: Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Proceedings
Editors: Fei Liu, Nan Duan, Qingting Xu, Yu Hong
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 653-665
Number of pages: 13
ISBN (Print): 9783031446955
DOIs
Publication status: Published - 2023
Event: 12th National CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2023 - Foshan, China
Duration: 12 Oct 2023 to 15 Oct 2023

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14303 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 12th National CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2023
Country/Territory: China
City: Foshan
Period: 12/10/23 to 15/10/23

Keywords

  • Datasets
  • Fine-grained Entities
  • Information Extraction
