TY - JOUR
T1 - Scalable Model-Free Feature Screening via Sliced-Wasserstein Dependency
AU - Li, Tao
AU - Yu, Jun
AU - Meng, Cheng
N1 - Publisher Copyright: © 2023 American Statistical Association and Institute of Mathematical Statistics.
PY - 2023
Y1 - 2023
AB - We consider the model-free feature screening problem, which aims to discard non-informative features before downstream analysis. Most existing feature screening approaches have at least quadratic computational cost with respect to the sample size n and may therefore suffer from a heavy computational burden when n is large. To alleviate this burden, we propose a scalable model-free sure independence screening approach. This approach is based on the so-called sliced-Wasserstein dependency, a novel metric that measures the dependence between two random variables. Specifically, we quantify the dependence between two random variables by measuring the sliced-Wasserstein distance between their joint distribution and the product of their marginal distributions. For a predictor matrix of size n × d, the computational cost of the proposed algorithm is on the order of (Formula presented.), even when the response variable is multivariate. Theoretically, we show that the proposed method enjoys both sure screening and rank consistency properties under mild regularity conditions. Numerical studies on various synthetic and real-world datasets demonstrate that the proposed method outperforms mainstream competitors while requiring significantly less computational time. Supplementary materials for this article are available online.
KW - Multivariate response model
KW - Nonlinear model
KW - Optimal transport
KW - Sure screening
KW - Variable selection
UR - https://www.scopus.com/pages/publications/85152453992
U2 - 10.1080/10618600.2023.2183213
DO - 10.1080/10618600.2023.2183213
M3 - Article
AN - SCOPUS:85152453992
SN - 1061-8600
VL - 32
SP - 1501
EP - 1511
JO - Journal of Computational and Graphical Statistics
JF - Journal of Computational and Graphical Statistics
IS - 4
ER -