TY - JOUR
T1 - Active learning strategies for extracting phrase-level topics from scientific literature
AU - Yue, Tao
AU - Li, Yu
AU - Zhang, Runjie
N1 - Publisher Copyright:
© 2020, Chinese Academy of Sciences. All rights reserved.
PY - 2020
Y1 - 2020
N2 - [Objective] This paper explores methods of extracting information from scientific literature with the help of active learning strategies, aiming to address the lack of annotated corpora. [Methods] We constructed our new model based on three representative active learning strategies (MARGIN, NSE, MNLP), one novel LWP strategy, and a neural network model (CNN-BiLSTM-CRF). We then extracted task- and method-related information from texts with far fewer annotations. [Results] We evaluated our model on scientific articles with 10%–30% selectively annotated texts. The proposed model yielded the same results as models trained on 100% annotated texts, significantly reducing the labor costs of corpus construction. [Limitations] The number of scientific articles in our sample corpus was small, which led to low precision. [Conclusions] The proposed model significantly reduces its reliance on the scale of the annotated corpus. Compared with existing active learning strategies, MNLP yielded better results, normalizing sentence length to improve the model’s stability. Meanwhile, MARGIN performs well in the initial iterations at identifying low-value instances, while LWP is suitable for datasets with more semantic labels.
AB - [Objective] This paper explores methods of extracting information from scientific literature with the help of active learning strategies, aiming to address the lack of annotated corpora. [Methods] We constructed our new model based on three representative active learning strategies (MARGIN, NSE, MNLP), one novel LWP strategy, and a neural network model (CNN-BiLSTM-CRF). We then extracted task- and method-related information from texts with far fewer annotations. [Results] We evaluated our model on scientific articles with 10%–30% selectively annotated texts. The proposed model yielded the same results as models trained on 100% annotated texts, significantly reducing the labor costs of corpus construction. [Limitations] The number of scientific articles in our sample corpus was small, which led to low precision. [Conclusions] The proposed model significantly reduces its reliance on the scale of the annotated corpus. Compared with existing active learning strategies, MNLP yielded better results, normalizing sentence length to improve the model’s stability. Meanwhile, MARGIN performs well in the initial iterations at identifying low-value instances, while LWP is suitable for datasets with more semantic labels.
KW - Active Learning
KW - Information Extraction
KW - Neural Network
UR - http://www.scopus.com/inward/record.url?scp=85101630135&partnerID=8YFLogxK
U2 - 10.11925/infotech.2096-3467.2020.0281
DO - 10.11925/infotech.2096-3467.2020.0281
M3 - Article
AN - SCOPUS:85101630135
SN - 2096-3467
VL - 4
SP - 134
EP - 143
JO - Data Analysis and Knowledge Discovery
JF - Data Analysis and Knowledge Discovery
IS - 10
ER -