Optimal subsampling algorithms for big data regressions

Mingyao Ai; Jun Yu; Huiming Zhang; Hai Ying Wang

doi:10.5705/ss.202018.0439

Mingyao Ai, Jun Yu, Huiming Zhang, Hai Ying Wang*

*Corresponding author for this work

School of Mathematics and Statistics

Research output: Contribution to journal › Article › peer-review

88 Citations (Scopus)

Abstract

In order to quickly approximate maximum likelihood estimators from massive data, this study examines the optimal subsampling method under the A-optimality criterion (OSMAC) for generalized linear models. The consistency and asymptotic normality of the estimator from a general subsampling algorithm are established, and optimal subsampling probabilities under the A- and L-optimality criteria are derived. Furthermore, using Frobenius-norm matrix concentration inequalities, the finite-sample properties of the subsample estimator based on optimal subsampling probabilities are also derived. Because the optimal subsampling probabilities depend on the full data estimate, an adaptive two-step algorithm is developed. The asymptotic normality and optimality of the estimator from this adaptive algorithm are established. The proposed methods are illustrated and evaluated using numerical experiments on simulated and real data sets.
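
The two-step idea described in the abstract can be illustrated with a minimal Python sketch for logistic regression, one member of the generalized linear model family. This is not the authors' implementation: the function names `weighted_logistic_mle` and `two_step_subsample`, the pilot size `r0`, the second-stage size `r`, and the simulated data are all hypothetical, and the sampling probabilities proportional to |y_i - p_i| * ||x_i|| are an assumed L-optimality-style choice taken from the wider subsampling literature, not a formula stated on this page.

```python
import numpy as np


def weighted_logistic_mle(X, y, w, n_iter=50, tol=1e-8):
    """Weighted logistic-regression MLE via Newton's method (illustrative helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                       # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta


def two_step_subsample(X, y, r0, r, rng):
    """Hypothetical two-step subsampling estimator (a sketch, not the paper's code)."""
    n = X.shape[0]

    # Step 1: uniform pilot subsample gives a rough estimate of the coefficients.
    idx0 = rng.choice(n, size=r0, replace=True)
    beta_pilot = weighted_logistic_mle(X[idx0], y[idx0], np.ones(r0))

    # Step 2: approximate the sampling probabilities with the pilot estimate.
    # Assumed form: pi_i proportional to |y_i - p_i| * ||x_i||.
    p = 1.0 / (1.0 + np.exp(-X @ beta_pilot))
    score = np.abs(y - p) * np.linalg.norm(X, axis=1)
    pi = score / score.sum()
    idx1 = rng.choice(n, size=r, replace=True, p=pi)

    # Combine both subsamples with inverse-probability weights:
    # the pilot points were drawn with probability 1/n, the rest with pi_i.
    idx = np.concatenate([idx0, idx1])
    w = np.concatenate([np.full(r0, float(n)), 1.0 / pi[idx1]])
    w = w / w.sum()  # rescaling does not change the maximizer
    return weighted_logistic_mle(X[idx], y[idx], w)


# Simulated example data (hypothetical).
rng = np.random.default_rng(0)
n = 20000
beta_true = np.array([0.5, -0.5, 0.5])
X = rng.standard_normal((n, 3))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta_hat = two_step_subsample(X, y, r0=500, r=2000, rng=rng)
print(beta_hat)
```

The second-stage probabilities need one pass over the full data, while both fits only touch the `r0 + r` sampled rows, which is where the computational saving of this kind of scheme would come from.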

Original language	English
Pages (from-to)	749-772
Number of pages	24
Journal	Statistica Sinica
Volume	31
Issue number	2
DOIs	https://doi.org/10.5705/ss.202018.0439
Publication status	Published - Apr 2021

Keywords

Generalized linear models
Massive data
Matrix concentration inequality

Cite this

@article{af663da6acff4775b531888a8d6f1e81,

title = "Optimal subsampling algorithms for big data regressions",

abstract = "In order to quickly approximate maximum likelihood estimators from massive data, this study examines the optimal subsampling method under the A-optimality criterion (OSMAC) for generalized linear models. The consistency and asymptotic normality of the estimator from a general subsampling algorithm are established, and optimal subsampling probabilities under the A- and L-optimality criteria are derived. Furthermore, using Frobenius-norm matrix concentration inequalities, the finite-sample properties of the subsample estimator based on optimal subsampling probabilities are also derived. Because the optimal subsampling probabilities depend on the full data estimate, an adaptive two-step algorithm is developed. The asymptotic normality and optimality of the estimator from this adaptive algorithm are established. The proposed methods are illustrated and evaluated using numerical experiments on simulated and real data sets.",

keywords = "Generalized linear models, Massive data, Matrix concentration inequality",

author = "Mingyao Ai and Jun Yu and Huiming Zhang and Wang, {Hai Ying}",

year = "2021",

month = apr,

doi = "10.5705/ss.202018.0439",

language = "English",

volume = "31",

pages = "749--772",

journal = "Statistica Sinica",

issn = "1017-0405",

publisher = "Institute of Statistical Science",

number = "2",

}

TY - JOUR

T1 - Optimal subsampling algorithms for big data regressions

AU - Ai, Mingyao

AU - Yu, Jun

AU - Zhang, Huiming

AU - Wang, Hai Ying

PY - 2021/4

Y1 - 2021/4

AB - In order to quickly approximate maximum likelihood estimators from massive data, this study examines the optimal subsampling method under the A-optimality criterion (OSMAC) for generalized linear models. The consistency and asymptotic normality of the estimator from a general subsampling algorithm are established, and optimal subsampling probabilities under the A- and L-optimality criteria are derived. Furthermore, using Frobenius-norm matrix concentration inequalities, the finite-sample properties of the subsample estimator based on optimal subsampling probabilities are also derived. Because the optimal subsampling probabilities depend on the full data estimate, an adaptive two-step algorithm is developed. The asymptotic normality and optimality of the estimator from this adaptive algorithm are established. The proposed methods are illustrated and evaluated using numerical experiments on simulated and real data sets.

KW - Generalized linear models

KW - Massive data

KW - Matrix concentration inequality

UR - http://www.scopus.com/inward/record.url?scp=85103114926&partnerID=8YFLogxK

U2 - 10.5705/ss.202018.0439

DO - 10.5705/ss.202018.0439

M3 - Article

AN - SCOPUS:85103114926

SN - 1017-0405

VL - 31

SP - 749

EP - 772

JO - Statistica Sinica

JF - Statistica Sinica

IS - 2

ER -
