Active learning sample selection based on multicriteria

Zhonghai He; Kun Shen; Xiaofang Zhang

doi:10.1177/09670335231211618

Zhonghai He, Kun Shen^*, Xiaofang Zhang

^*Corresponding author for this work

School of Optics and Photonics

Research output: Contribution to journal › Article › peer-review

Abstract

In multivariate calibration problems, model performance is affected significantly by the calibration samples used during model building. In recent years, active learning methods have become one of the best methods for sample selection. However, most active learning methods only select instances from prediction uncertainty or sample space distance, and these single-criteria methods tend to select undesired samples. In addition, sample density characterizes the spatial information carried by the sample, but few studies in quantitative analysis utilize sample density alone to select calibration samples. Considering these issues, based on the k-means clustering algorithm, this paper proposes an active learning sample selection method (Diversity Informativeness Density Active Learning, DIDAL), which combines the three criteria of diversity, informativeness and sample density. The most representative sample is iteratively selected for addition to the calibration set for modeling and estimating the chemical concentration of analytes. Soybean meal and soy sauce samples were analyzed by DIDAL and compared with existing sample selection methods. The prediction results show that the DIDAL algorithm significantly outperforms several existing algorithms and is close to the performance of full-sample modeling. A model with high prediction accuracy can be constructed by selecting only a few samples using the DIDAL method.
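
The abstract names three selection criteria but this record does not give the scoring formulas, so the following is only an illustrative sketch of multicriteria selection in general, not the DIDAL algorithm itself. Everything in it is an assumption made for illustration: the function names, the equal-weight sum of min-max-scaled scores, the nearest-neighbour density estimate, and the bootstrap committee of 1-nearest-neighbour predictors used as a stand-in for prediction uncertainty. The k-means step mentioned in the abstract is omitted, and at least one labelled sample is assumed to exist already.

```python
import math
import random


def min_max(values):
    """Rescale a list of scores to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def diversity(pool, i, labelled):
    """Distance from candidate i to its nearest already-selected sample."""
    return min(math.dist(pool[i], pool[j]) for j in labelled)


def density(pool, i, k):
    """Inverse mean distance from candidate i to its k nearest pool neighbours."""
    d = sorted(math.dist(pool[i], pool[j]) for j in range(len(pool)) if j != i)
    nearest = d[:k]
    return 1.0 / (sum(nearest) / len(nearest) + 1e-12)


def informativeness(pool, i, labelled, y, rng, n_models=20):
    """Variance of predictions from a bootstrap committee of 1-NN predictors."""
    preds = []
    for _ in range(n_models):
        # Resample the labelled indices with replacement (bootstrap).
        boot = [rng.choice(labelled) for _ in labelled]
        nearest = min(boot, key=lambda j: math.dist(pool[i], pool[j]))
        preds.append(y[nearest])
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds) / len(preds)


def select_next(pool, labelled, y, k=3, seed=0):
    """Return the index of the unlabelled sample with the highest combined score.

    Assumption: the three criteria are min-max scaled and summed with equal
    weights; the paper may combine them differently.
    """
    rng = random.Random(seed)
    candidates = [i for i in range(len(pool)) if i not in labelled]
    div = min_max([diversity(pool, i, labelled) for i in candidates])
    inf = min_max([informativeness(pool, i, labelled, y, rng) for i in candidates])
    den = min_max([density(pool, i, k) for i in candidates])
    scores = [a + b + c for a, b, c in zip(div, inf, den)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# Hypothetical toy data: six 2-D "spectra", two of which already have
# reference values in y (keyed by pool index).
pool = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.1, 5.0), (4.9, 5.2)]
labelled = [0, 1]
y = {0: 1.0, 1: 2.0}
print(select_next(pool, labelled, y))
```

In an iterative loop, the returned index would be sent for reference analysis, added to `labelled` and `y`, and the selection repeated until the calibration budget is used up.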

Original language	English
Pages (from-to)	289-297
Number of pages	9
Journal	Journal of Near Infrared Spectroscopy
Volume	31
Issue number	6
DOIs	https://doi.org/10.1177/09670335231211618
Publication status	Published - Dec 2023

Keywords

Multivariate calibration
active learning
multicriteria modeling
sample selection

Access to Document

10.1177/09670335231211618

Cite this

He, Z., Shen, K., & Zhang, X. (2023). Active learning sample selection based on multicriteria. Journal of Near Infrared Spectroscopy, 31(6), 289-297. https://doi.org/10.1177/09670335231211618

@article{05a4b9d2f4aa4afabf08cad2fb35f4e6,

title = "Active learning sample selection based on multicriteria",

abstract = "In multivariate calibration problems, model performance is affected significantly by the calibration samples used during model building. In recent years, active learning methods have become one of the best methods for sample selection. However, most active learning methods only select instances from prediction uncertainty or sample space distance, and these single-criteria methods tend to select undesired samples. In addition, sample density characterizes the spatial information carried by the sample, but few studies in quantitative analysis utilize sample density alone to select calibration samples. Considering these issues, based on the k-means clustering algorithm, this paper proposes an active learning sample selection method (Diversity Informativeness Density Active Learning, DIDAL), which combines the three criteria of diversity, informativeness and sample density. The most representative sample is iteratively selected for addition to the calibration set for modeling and estimating the chemical concentration of analytes. Soybean meal and soy sauce samples were analyzed by DIDAL and compared with existing sample selection methods. The prediction results show that the DIDAL algorithm significantly outperforms several existing algorithms and is close to the performance of full-sample modeling. A model with high prediction accuracy can be constructed by selecting only a few samples using the DIDAL method.",

keywords = "Multivariate calibration, active learning, multicriteria modeling, sample selection",

author = "Zhonghai He and Kun Shen and Xiaofang Zhang",

note = "Publisher Copyright: {\textcopyright} The Author(s) 2023.",

year = "2023",

month = dec,

doi = "10.1177/09670335231211618",

language = "English",

volume = "31",

pages = "289--297",

journal = "Journal of Near Infrared Spectroscopy",

issn = "0967-0335",

publisher = "SAGE Publications Inc.",

number = "6",

}

TY - JOUR

T1 - Active learning sample selection based on multicriteria

AU - He, Zhonghai

AU - Shen, Kun

AU - Zhang, Xiaofang

N1 - Publisher Copyright: © The Author(s) 2023.

PY - 2023/12

Y1 - 2023/12

N2 - In multivariate calibration problems, model performance is affected significantly by the calibration samples used during model building. In recent years, active learning methods have become one of the best methods for sample selection. However, most active learning methods only select instances from prediction uncertainty or sample space distance, and these single-criteria methods tend to select undesired samples. In addition, sample density characterizes the spatial information carried by the sample, but few studies in quantitative analysis utilize sample density alone to select calibration samples. Considering these issues, based on the k-means clustering algorithm, this paper proposes an active learning sample selection method (Diversity Informativeness Density Active Learning, DIDAL), which combines the three criteria of diversity, informativeness and sample density. The most representative sample is iteratively selected for addition to the calibration set for modeling and estimating the chemical concentration of analytes. Soybean meal and soy sauce samples were analyzed by DIDAL and compared with existing sample selection methods. The prediction results show that the DIDAL algorithm significantly outperforms several existing algorithms and is close to the performance of full-sample modeling. A model with high prediction accuracy can be constructed by selecting only a few samples using the DIDAL method.

AB - In multivariate calibration problems, model performance is affected significantly by the calibration samples used during model building. In recent years, active learning methods have become one of the best methods for sample selection. However, most active learning methods only select instances from prediction uncertainty or sample space distance, and these single-criteria methods tend to select undesired samples. In addition, sample density characterizes the spatial information carried by the sample, but few studies in quantitative analysis utilize sample density alone to select calibration samples. Considering these issues, based on the k-means clustering algorithm, this paper proposes an active learning sample selection method (Diversity Informativeness Density Active Learning, DIDAL), which combines the three criteria of diversity, informativeness and sample density. The most representative sample is iteratively selected for addition to the calibration set for modeling and estimating the chemical concentration of analytes. Soybean meal and soy sauce samples were analyzed by DIDAL and compared with existing sample selection methods. The prediction results show that the DIDAL algorithm significantly outperforms several existing algorithms and is close to the performance of full-sample modeling. A model with high prediction accuracy can be constructed by selecting only a few samples using the DIDAL method.

KW - Multivariate calibration

KW - active learning

KW - multicriteria modeling

KW - sample selection

UR - http://www.scopus.com/inward/record.url?scp=85176362097&partnerID=8YFLogxK

U2 - 10.1177/09670335231211618

DO - 10.1177/09670335231211618

M3 - Article

AN - SCOPUS:85176362097

SN - 0967-0335

VL - 31

SP - 289

EP - 297

JO - Journal of Near Infrared Spectroscopy

JF - Journal of Near Infrared Spectroscopy

IS - 6

ER -
