TY - JOUR
T1 - TCIC_FS
T2 - Total correlation information coefficient-based feature selection method for high-dimensional data
AU - Qiu, Ping
AU - Niu, Zhendong
N1 - Publisher Copyright: © 2021
PY - 2021/11/14
Y1 - 2021/11/14
N2 - High-dimensional data remain a challenging problem in classification. Feature selection works as a filter to remove irrelevant or redundant features and has made considerable progress. However, the problem is still challenging because current methods consider only the correlation between two variables and leave the correlation among multiple variables largely unaddressed, even though multivariate interactions can contain joint information that cannot be obtained pairwise. Furthermore, many feature selection methods require hyperparameter settings, which demand prior knowledge and lack interpretability. To address these problems, this paper proposes the total correlation information coefficient-based feature selection (TCIC_FS) method, which selects the optimal feature subset without setting hyperparameters and fully considers the correlations among multiple variables. First, based on a Gaussian copula, the total correlation information coefficient (TCIC) is proposed to evaluate the correlations among multiple variables. Compared with existing multivariate correlation measures, TCIC can measure a wider range of multivariate correlations, including linear, nonlinear, functional, and nonfunctional correlations. Second, a novel evaluation mechanism based on TCIC is proposed to measure the relevance between features and classes and the redundancy between a single feature and a selected feature subset. Finally, the TCIC_FS method is constructed from the TCIC and the evaluation mechanism. Compared with the baseline methods, TCIC_FS has the lowest time complexity and obtains the smallest optimal feature subset in a single selection. Therefore, TCIC_FS is more suitable for processing high-dimensional data.
KW - Evaluation mechanism
KW - Feature selection
KW - Gaussian copula
KW - High dimensional data
KW - Multivariate correlation
KW - Recommendation system
UR - http://www.scopus.com/inward/record.url?scp=85114639461&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2021.107418
DO - 10.1016/j.knosys.2021.107418
M3 - Article
AN - SCOPUS:85114639461
SN - 0950-7051
VL - 231
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 107418
ER -