An empirical study on preprocessing high-dimensional class-imbalanced data for classification

Hua Yin*, Keke Gai

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

98 Citations (Scopus)

Abstract

Emerging data types bring tremendous challenges to data mining. High-dimensional class-imbalanced data is abundant across many fields. Traditional classification methods are ill-suited to such data because they tend to favor accuracy on the majority class, and the curse of dimensionality complicates matters further. Designing a more complex classifier is not easy, and such a classifier may overfit the data. Preprocessing the data before classification is a more direct approach. Because high dimensionality and class imbalance interact, it is necessary to understand how preprocessing methods (feature selection and data sampling) affect the final classification. Previous experiments either gave limited consideration to the datasets or introduced other data characteristics that complicated the picture. We apply two types of feature selection (wrapper and filter) and two types of data sampling (oversampling and undersampling) to twelve selected datasets from four fields, with varying dimensionality and imbalance levels, and test their effects on the performance of the C4.5 classifier. In our setting, the experiments indicate that (1) applying feature selection before sampling is better in most cases; (2) among the combinations of feature selection and data sampling, undersampling performs better than oversampling when the dataset is highly imbalanced; (3) when the dataset is less imbalanced, preprocessing may not be necessary; and (4) for wrapper-based feature selection, we suggest using a simple search method.
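The ordering in finding (1), feature selection first and sampling second, can be illustrated with a minimal sketch. The abstract does not name the specific filter, wrapper, or sampling algorithms used, so everything here is an assumption made for illustration: the helper names `filter_select`, `random_undersample`, and `preprocess` are hypothetical, the filter score is a simple standardized mean difference, the sampler is plain random undersampling, and the C4.5 training step is left out.

```python
import random
from statistics import mean, pstdev


def filter_select(X, y, k):
    """Return the indices of the k top-scoring features (assumes labels 0/1).

    Assumed score (not from the paper): absolute difference between the
    class means of a feature, divided by the feature's overall std.
    """
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        pos = [v for v, label in zip(col, y) if label == 1]
        neg = [v for v, label in zip(col, y) if label == 0]
        spread = pstdev(col) or 1.0  # constant feature: avoid division by zero
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class has as many rows as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = sorted(i for idx in by_class.values() for i in rng.sample(idx, n_min))
    return [X[i] for i in kept], [y[i] for i in kept]


def preprocess(X, y, k, seed=0):
    """Feature selection first, then sampling (the order in finding 1)."""
    cols = filter_select(X, y, k)
    X_sel = [[row[j] for j in cols] for row in X]
    X_bal, y_bal = random_undersample(X_sel, y, seed)
    return X_bal, y_bal, cols
```

The balanced, reduced data returned by `preprocess` would then be passed to whichever classifier is under study. Swapping the two calls inside `preprocess` gives the sampling-first ordering for comparison.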

Original language: English
Title of host publication: Proceedings - 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security and 2015 IEEE 12th International Conference on Embedded Software and Systems, HPCC-CSS-ICESS 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1314-1319
Number of pages: 6
ISBN (Electronic): 9781479989362
Publication status: Published - 23 Nov 2015
Externally published: Yes
Event: 17th IEEE International Conference on High Performance Computing and Communications, IEEE 7th International Symposium on Cyberspace Safety and Security and IEEE 12th International Conference on Embedded Software and Systems, HPCC-ICESS-CSS 2015 - New York, United States
Duration: 24 Aug 2015 to 26 Aug 2015

Conference

Conference: 17th IEEE International Conference on High Performance Computing and Communications, IEEE 7th International Symposium on Cyberspace Safety and Security and IEEE 12th International Conference on Embedded Software and Systems, HPCC-ICESS-CSS 2015
Country/Territory: United States
City: New York
Period: 24/08/15 to 26/08/15

Keywords

  • Classification
  • Feature selection
  • High-dimensional class-imbalanced data
  • Preprocessing
  • Sampling
