A distance metric-based space-filling subsampling method for nonparametric models

Huaimin Diao; Dianpeng Wang; Xu He

doi:10.1214/24-EJS2251

Huaimin Diao, Dianpeng Wang, Xu He^*

^*Corresponding author for this work

School of Mathematics and Statistics

Research output: Contribution to journal › Article › peer-review

Abstract

Subsampling the original data set is an efficient and popular strategy for handling massive data that are too large to be modeled directly. To optimize inference and prediction accuracy, it is crucial to employ a subsampling scheme that collects observations intelligently. In this paper, we propose a space-filling subsampling method that uses distance metric-based strata to select subsamples from high-volume data sets. To minimize the maximal distance between pairs of samples located in the same stratum, Voronoi cells of thinnest covering lattices are used to partition the input space. In addition, subsamples that are space-filling with respect to the response are collected from each stratum. With the help of an algorithm that quickly identifies the cell in which an observation lies, the computational cost of our subsampling method is proportional to the number of observations and independent of the number of cells, which makes our method applicable to extremely large data sets. Results from simulation studies and real data analysis show that the new method performs remarkably better than existing approaches when used in conjunction with Gaussian process models.
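The stratification idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: the input is three-dimensional, where the body-centered cubic (BCC) lattice is the thinnest lattice covering; the cell of an observation is found by rounding to the nearest BCC lattice point, so the cost per observation does not depend on how many cells exist; and "space-filling according to the response" is approximated by picking observations at evenly spaced ranks of the response within each cell. The function names and the `scale` and `per_cell` parameters are hypothetical.

```python
import numpy as np


def bcc_cell_index(x, scale):
    """Map each row of x (shape (N, 3)) to its nearest BCC lattice point.

    The BCC lattice is Z^3 together with Z^3 + (1/2, 1/2, 1/2), stretched
    by `scale`. The result holds doubled lattice coordinates, which are
    integers and serve as the label of the Voronoi cell.
    """
    z = np.asarray(x, dtype=float) / scale
    a = np.round(z)              # nearest point of Z^3
    b = np.round(z - 0.5) + 0.5  # nearest point of the shifted copy
    da = np.sum((z - a) ** 2, axis=1)
    db = np.sum((z - b) ** 2, axis=1)
    nearest = np.where((da <= db)[:, None], a, b)
    return np.round(2 * nearest).astype(np.int64)


def space_filling_subsample(x, y, scale, per_cell):
    """Return sorted indices of a subsample with at most per_cell points per cell.

    Within a cell, observations are ranked by the response y and taken at
    evenly spaced ranks (an assumed stand-in for the paper's criterion).
    """
    y = np.asarray(y, dtype=float)
    keys = bcc_cell_index(x, scale)

    # Group observation indices by cell with a hash map: one pass over the data.
    cells = {}
    for i, key in enumerate(map(tuple, keys.tolist())):
        cells.setdefault(key, []).append(i)

    chosen = []
    for members in cells.values():
        idx = np.asarray(members)
        idx = idx[np.argsort(y[idx])]  # order the cell's points by response
        k = min(per_cell, len(idx))
        ranks = np.unique(np.linspace(0, len(idx) - 1, k).round().astype(int))
        chosen.extend(idx[ranks].tolist())
    return np.sort(np.array(chosen, dtype=np.int64))
```

Calling `space_filling_subsample(x, y, scale=0.5, per_cell=3)` on an `(N, 3)` array `x` and a length-`N` response `y` returns the row indices of the subsample. A smaller `scale` produces more, smaller cells and therefore a larger subsample.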

Original language	English
Pages (from-to)	3247-3273
Number of pages	27
Journal	Electronic Journal of Statistics
Volume	18
Issue number	2
DOIs	https://doi.org/10.1214/24-EJS2251
Publication status	Published - 2024

Keywords

Big data
nonparametric model
space-filling design
tall data

Access to Document

10.1214/24-EJS2251

Cite this

@article{956eb130fbdb42c0b9f3fd376fe07bb5,

title = "A distance metric-based space-filling subsampling method for nonparametric models",

abstract = "Taking subset samples from the original data set is an efficient and popular strategy to handle massive data that is too large to be directly modeled. To optimize inference and prediction accuracy, it is crucial to employ a subsampling scheme to collect observations intelligently. In this paper, we propose a space-filling subsampling method that uses distance metric-based strata to select subsamples from high-volume data sets. To minimize the maximal distance from pairs of samples that locate in the same stratum, Voronoi cells of thinnest covering lattices are used to partition the input space. In addition, subsamples that are space-filling according to the response are collected from each stratum. With the help of an algorithm to quickly identify the cell an observation locates in, the computational cost of our subsampling method is proportional to the number of observations and irrelevant to the number of cells, which makes our method applicable to extremely large data sets. Results from simulated studies and real data analysis show that the new method is remarkably better than existing approaches when used in conjunction with Gaussian process models.",

keywords = "Big data, nonparametric model, space-filling design, tall data",

author = "Huaimin Diao and Dianpeng Wang and Xu He",

year = "2024",

doi = "10.1214/24-EJS2251",

language = "English",

volume = "18",

pages = "3247--3273",

journal = "Electronic Journal of Statistics",

issn = "1935-7524",

publisher = "Institute of Mathematical Statistics",

number = "2",

}

TY - JOUR

T1 - A distance metric-based space-filling subsampling method for nonparametric models

AU - Diao, Huaimin

AU - Wang, Dianpeng

AU - He, Xu

PY - 2024

Y1 - 2024

AB - Taking subset samples from the original data set is an efficient and popular strategy to handle massive data that is too large to be directly modeled. To optimize inference and prediction accuracy, it is crucial to employ a subsampling scheme to collect observations intelligently. In this paper, we propose a space-filling subsampling method that uses distance metric-based strata to select subsamples from high-volume data sets. To minimize the maximal distance from pairs of samples that locate in the same stratum, Voronoi cells of thinnest covering lattices are used to partition the input space. In addition, subsamples that are space-filling according to the response are collected from each stratum. With the help of an algorithm to quickly identify the cell an observation locates in, the computational cost of our subsampling method is proportional to the number of observations and irrelevant to the number of cells, which makes our method applicable to extremely large data sets. Results from simulated studies and real data analysis show that the new method is remarkably better than existing approaches when used in conjunction with Gaussian process models.

KW - Big data

KW - nonparametric model

KW - space-filling design

KW - tall data

UR - http://www.scopus.com/inward/record.url?scp=85201828971&partnerID=8YFLogxK

U2 - 10.1214/24-EJS2251

DO - 10.1214/24-EJS2251

M3 - Article

AN - SCOPUS:85201828971

SN - 1935-7524

VL - 18

SP - 3247

EP - 3273

JO - Electronic Journal of Statistics

JF - Electronic Journal of Statistics

IS - 2

ER -
