TY - JOUR
T1 - Core-elements for large-scale least squares estimation
AU - Li, Mengyu
AU - Yu, Jun
AU - Li, Tao
AU - Meng, Cheng
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
PY - 2024/12
Y1 - 2024/12
N2 - The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample and has found extensive application in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an O(nnz(X)+rp²) computational cost, where X is an n×p predictor matrix, r is the number of elements selected from each column of X, and nnz(·) denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method’s superior performance compared to mainstream competitors.
AB - The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample and has found extensive application in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an O(nnz(X)+rp²) computational cost, where X is an n×p predictor matrix, r is the number of elements selected from each column of X, and nnz(·) denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method’s superior performance compared to mainstream competitors.
KW - Coresets
KW - Linear model
KW - Sparse matrix
KW - Subset selection
UR - http://www.scopus.com/inward/record.url?scp=85205256100&partnerID=8YFLogxK
U2 - 10.1007/s11222-024-10505-6
DO - 10.1007/s11222-024-10505-6
M3 - Article
AN - SCOPUS:85205256100
SN - 0960-3174
VL - 34
JO - Statistics and Computing
JF - Statistics and Computing
IS - 6
M1 - 190
ER -