基于自动回标的地理实体关系语料库构建方法

Jibu Wang; Feng Lu; Sheng Wu; Li Yu

doi:10.12082/dqxxkx.2018.180032

Translated title of the contribution: Constructing the Corpus of Geographical Entity Relations Based on Automatic Annotation

Jibu Wang, Feng Lu, Sheng Wu, Li Yu^*

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

A corpus of geographical entity relations is a basic data resource for geographical information acquisition and geographical knowledge services, and its scale directly affects how well machine learning models can be trained. Rapidly updated web text continually yields new relation instances, so the corpus must be updated in a timely manner to cover them. Constructing and updating a corpus manually is expensive, so a more efficient construction technique is needed for massive geographical entity relations. In this paper, we propose an efficient method for constructing such a corpus through automatic annotation. First, drawing on encyclopedia resources and referring to classification standards for geographical entities, semantic relations, and spatial relations, we establish an annotation scheme for geographical relations that respects both the linguistic habits of natural language and annotation normalization. Second, we combine exact matching with approximate matching to improve the coverage of object entity finding. Third, we define sentence scoring rules using the optimal sequence diagram method and quantitatively evaluate the results of mapping seed triples to sentences. Finally, a series of experiments on the Chinese encyclopedia BaiduBaike is carried out to verify the effectiveness of the improved automatic annotation. The results show that the average success rate of automatic annotation is 67.83% and the average accuracy of the annotated relations is 76.36%. Compared with manual annotation of a spatial relation corpus, the proposed method constructs a large-scale corpus of geographical entity relations more efficiently, which provides a feasible scheme for automatically expanding geographical entity relation corpora.
Experiments with an LSTM (Long Short-Term Memory) network on the self-built corpus show that the accuracy of geographical relation extraction from web texts is 73.2% and the accuracy on related corpora is 75.2%, which indicates that the corpus is usable. The method also takes into account both the semantic and the spatial relations between geographical entities, and because the relation types are not limited, it can be applied to open relation extraction tasks.
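
The abstract names the steps but not the concrete rules, so the following is only a minimal Python sketch of how they could fit together, not the authors' implementation. Everything specific in it is an assumption made for illustration: the helper names (find_entity, precedence_weights, annotate), the 0.8 similarity threshold, the use of difflib for approximate matching, the three scoring criteria, and the 0.5 diagonal in the precedence chart. The English sentences stand in for encyclopedia text.

```python
from difflib import SequenceMatcher


def find_entity(sentence, entity, threshold=0.8):
    """Locate `entity` in `sentence`; return (start, end, similarity) or None."""
    start = sentence.find(entity)
    if start != -1:  # exact match
        return start, start + len(entity), 1.0
    best = None
    n = len(entity)
    # Approximate match: compare the entity with every window of similar length.
    for size in range(max(1, n - 1), n + 2):
        for start in range(len(sentence) - size + 1):
            window = sentence[start:start + size]
            ratio = SequenceMatcher(None, entity, window).ratio()
            if ratio >= threshold and (best is None or ratio > best[2]):
                best = (start, start + size, ratio)
    return best


def precedence_weights(chart):
    """Turn a pairwise precedence chart into weights that sum to 1.

    chart[i][j] is 1 if criterion i matters more than j, 0 if less, and 0.5
    if they tie. The diagonal is 0.5 here (an assumption of this sketch) so
    that the lowest-ranked criterion keeps a non-zero weight.
    """
    row_sums = [sum(row) for row in chart]
    total = sum(row_sums)
    return [s / total for s in row_sums]


def annotate(triple, sentences, weights, threshold=0.8):
    """Map a seed triple to its best-scoring sentence, or None if none qualifies."""
    subject, relation, obj = triple
    best = None
    for sentence in sentences:
        s = find_entity(sentence, subject, threshold)
        o = find_entity(sentence, obj, threshold)
        if s is None or o is None:
            continue  # both entities must be found in the same sentence
        gap = max(o[0] - s[1], s[0] - o[1], 0)  # characters between the mentions
        features = (s[2], o[2], 1.0 / (1.0 + gap))
        score = sum(w * f for w, f in zip(weights, features))
        if best is None or score > best["score"]:
            best = {"sentence": sentence, "relation": relation,
                    "subject_span": s[:2], "object_span": o[:2], "score": score}
    return best


# Hypothetical criteria: subject match quality, object match quality, closeness.
chart = [
    [0.5, 0.5, 1.0],
    [0.5, 0.5, 1.0],
    [0.0, 0.0, 0.5],
]
weights = precedence_weights(chart)

sentences = [
    "Wuhan is a large city.",
    "The Yangtse River reaches Wuhan.",
    "The Yangtze River passes the city of Wuhan.",
]
result = annotate(("Yangtze River", "flows_through", "Wuhan"), sentences, weights)
print(result)
```

In this sketch the first sentence is skipped because the subject is missing, the second qualifies only through approximate matching of the misspelled river name, and the third contains both entities exactly. Ranking match quality above closeness in the chart is a design choice of the example; a different chart would reorder the candidates.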

Translated title of the contribution	Constructing the Corpus of Geographical Entity Relations Based on Automatic Annotation
Original language	Chinese (Traditional)
Pages (from-to)	871-879
Number of pages	9
Journal	Journal of Geo-Information Science
Volume	20
Issue number	7
DOIs	https://doi.org/10.12082/dqxxkx.2018.180032
Publication status	Published - 25 Jul 2018
Externally published	Yes

Cite this

@article{991d431825174aaf86f5571436c46d17,

title = "基于自动回标的地理实体关系语料库构建方法",

keywords = "Annotation scheme, Automatic annotation, Corpus construction, Geographical information extraction, Geographical relations",

author = "Jibu Wang and Feng Lu and Sheng Wu and Li Yu",

year = "2018",

month = jul,

day = "25",

doi = "10.12082/dqxxkx.2018.180032",

language = "Chinese (Traditional)",

volume = "20",

pages = "871--879",

journal = "Journal of Geo-Information Science",

issn = "1560-8999",

publisher = "Science Press",

number = "7",

}

TY - JOUR

T1 - 基于自动回标的地理实体关系语料库构建方法

AU - Wang, Jibu

AU - Lu, Feng

AU - Wu, Sheng

AU - Yu, Li

PY - 2018/7/25

Y1 - 2018/7/25

KW - Annotation scheme

KW - Automatic annotation

KW - Corpus construction

KW - Geographical information extraction

KW - Geographical relations

UR - http://www.scopus.com/inward/record.url?scp=85089234598&partnerID=8YFLogxK

U2 - 10.12082/dqxxkx.2018.180032

DO - 10.12082/dqxxkx.2018.180032

M3 - Article

AN - SCOPUS:85089234598

SN - 1560-8999

VL - 20

SP - 871

EP - 879

JO - Journal of Geo-Information Science

JF - Journal of Geo-Information Science

IS - 7

ER -
