TY - JOUR
T1 - Core-elements for large-scale least squares estimation
AU - Li, Mengyu
AU - Yu, Jun
AU - Li, Tao
AU - Meng, Cheng
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
PY - 2024/12
Y1 - 2024/12
N2 - The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample and has found extensive application in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an O(nnz(X)+rp²) computational cost, where X is an n×p predictor matrix, r is the number of elements selected from each column of X, and nnz(·) denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method’s superior performance compared to mainstream competitors.
AB - The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample and has found extensive application in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an O(nnz(X)+rp²) computational cost, where X is an n×p predictor matrix, r is the number of elements selected from each column of X, and nnz(·) denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method’s superior performance compared to mainstream competitors.
KW - Coresets
KW - Linear model
KW - Sparse matrix
KW - Subset selection
UR - http://www.scopus.com/inward/record.url?scp=85205256100&partnerID=8YFLogxK
U2 - 10.1007/s11222-024-10505-6
DO - 10.1007/s11222-024-10505-6
M3 - Article
AN - SCOPUS:85205256100
SN - 0960-3174
VL - 34
JO - Statistics and Computing
JF - Statistics and Computing
IS - 6
M1 - 190
ER -