Research Articles

Efficient Model-Free Subsampling Method for Massive Data

Pages 240-252 | Received 10 Dec 2022, Accepted 06 Oct 2023, Published online: 27 Nov 2023
 

Abstract

Subsampling plays a crucial role in tackling problems associated with the storage and statistical learning of massive datasets. However, most existing subsampling methods are model-based, which means their performance can drop significantly when the underlying model is misspecified. This issue calls for model-free subsampling methods that are robust under diverse model specifications. Several model-free subsampling methods have been developed recently, but their computing time grows explosively with the sample size, making them impractical for massive data. In this article, an efficient model-free subsampling method is proposed, which segments the original data into regular data blocks and obtains subsamples from each data block by a data-driven subsampling method. Compared with existing model-free subsampling methods, the proposed method has a significant speed advantage and performs more robustly for datasets with complex underlying distributions. As demonstrated in simulation experiments, the proposed method is an order of magnitude faster than other commonly used model-free subsampling methods when the sample size of the original dataset reaches the order of 10^7. Moreover, simulation experiments and case studies show that the proposed method is more robust than other model-free subsampling methods under diverse model specifications and subsample sizes.
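The block-then-subsample idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify how the blocks are formed or which data-driven rule selects points within a block. The sketch assumes, purely for illustration, that blocks are equal-size chunks of the rows sorted by the first column, and that uniform random draws stand in for the within-block selection. The function name `blockwise_subsample` and its parameters are hypothetical.

```python
import numpy as np


def blockwise_subsample(X, n_sub, n_blocks, rng=None):
    """Return sorted row indices of a subsample drawn block by block.

    Assumptions (not from the article): blocks are equal-size chunks of
    the rows ordered by the first column, and each block contributes a
    near-equal share of the subsample via uniform random draws.
    """
    rng = np.random.default_rng(rng)
    # Order rows by the first coordinate, then cut into n_blocks chunks.
    order = np.argsort(X[:, 0], kind="stable")
    blocks = np.array_split(order, n_blocks)
    # Spread the subsample budget across blocks as evenly as possible.
    base, extra = divmod(n_sub, n_blocks)
    chosen = []
    for b, idx in enumerate(blocks):
        quota = min(len(idx), base + (1 if b < extra else 0))
        # Placeholder for the data-driven within-block selection step.
        chosen.append(rng.choice(idx, size=quota, replace=False))
    return np.sort(np.concatenate(chosen))


if __name__ == "__main__":
    data = np.random.default_rng(0).normal(size=(1000, 2))
    picked = blockwise_subsample(data, n_sub=100, n_blocks=10, rng=1)
    print(picked.shape)
```

Because each block is processed independently, the cost of the within-block step depends on the block size rather than on the full sample size, which is one plausible reading of where a speed advantage of this kind could come from.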

Supplementary Materials

The online supplementary materials contain (a) all proofs of the theoretical results, (b) four additional figures for the experiment in Section 5.2, and (c) R code for the modeling experiments in Sections 4 and 5.

Acknowledgments

The authors thank the editor, the associate editor, and two referees for their valuable comments that greatly improved the presentation of the article.

Disclosure Statement

The authors report there are no competing interests to declare.

Additional information

Funding

This work was supported by the National Natural Science Foundation of China (12131001 and 11871288), LPMC, and KLMDASR.
