TY - GEN
T1 - An empirical study on preprocessing high-dimensional class-imbalanced data for classification
AU - Yin, Hua
AU - Gai, Keke
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2015/11/23
Y1 - 2015/11/23
N2 - Emerging data types pose tremendous challenges for data mining. High-dimensional class-imbalanced data are abundant in many fields. Traditional classification methods are ill-suited to such data because they tend to favor accuracy on the majority class, and the curse of dimensionality complicates matters further. Designing a more complex classifier is difficult, and such a classifier may overfit the data. Preprocessing the data before classification is a more direct approach. Because high dimensionality and class imbalance interact, it is necessary to understand how preprocessing methods (feature selection and data sampling) affect the final classification. Previous experiments either gave little consideration to the choice of datasets or introduced additional characteristics that complicated the situation. We apply two types of feature selection (wrapper and filter) and two types of data sampling (oversampling and undersampling) to twelve selected datasets from four fields, with differing dimensionality and imbalance levels, and test the effects on the performance of the C4.5 classifier. In our setting, the experiments indicate that (1) feature selection before sampling is mostly better; (2) among the combinations of feature selection and data sampling, undersampling performs better than oversampling when the dataset is highly imbalanced; (3) when the dataset is less imbalanced, preprocessing may not be necessary; and (4) for wrapper-based feature selection, we suggest using a simple search method.
AB - Emerging data types pose tremendous challenges for data mining. High-dimensional class-imbalanced data are abundant in many fields. Traditional classification methods are ill-suited to such data because they tend to favor accuracy on the majority class, and the curse of dimensionality complicates matters further. Designing a more complex classifier is difficult, and such a classifier may overfit the data. Preprocessing the data before classification is a more direct approach. Because high dimensionality and class imbalance interact, it is necessary to understand how preprocessing methods (feature selection and data sampling) affect the final classification. Previous experiments either gave little consideration to the choice of datasets or introduced additional characteristics that complicated the situation. We apply two types of feature selection (wrapper and filter) and two types of data sampling (oversampling and undersampling) to twelve selected datasets from four fields, with differing dimensionality and imbalance levels, and test the effects on the performance of the C4.5 classifier. In our setting, the experiments indicate that (1) feature selection before sampling is mostly better; (2) among the combinations of feature selection and data sampling, undersampling performs better than oversampling when the dataset is highly imbalanced; (3) when the dataset is less imbalanced, preprocessing may not be necessary; and (4) for wrapper-based feature selection, we suggest using a simple search method.
KW - Classification
KW - Feature selection
KW - High-dimensional class-imbalanced data
KW - Preprocessing
KW - Sampling
UR - http://www.scopus.com/inward/record.url?scp=84961696430&partnerID=8YFLogxK
U2 - 10.1109/HPCC-CSS-ICESS.2015.205
DO - 10.1109/HPCC-CSS-ICESS.2015.205
M3 - Conference contribution
AN - SCOPUS:84961696430
T3 - Proceedings - 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security and 2015 IEEE 12th International Conference on Embedded Software and Systems, HPCC-CSS-ICESS 2015
SP - 1314
EP - 1319
BT - Proceedings - 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security and 2015 IEEE 12th International Conference on Embedded Software and Systems, HPCC-CSS-ICESS 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE International Conference on High Performance Computing and Communications, IEEE 7th International Symposium on Cyberspace Safety and Security and IEEE 12th International Conference on Embedded Software and Systems, HPCC-ICESS-CSS 2015
Y2 - 24 August 2015 through 26 August 2015
ER -