Technometrics Volume 66, 2024 - Issue 2

Research Articles

Efficient Model-Free Subsampling Method for Massive Data

Zheng Zhou, NITFID, School of Statistics and Data Science, Nankai University, Tianjin, China

Zebin Yang, Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong, China

Aijun Zhang, Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong, China (ORCID: https://orcid.org/0000-0001-9729-9018)

Yongdao Zhou, NITFID, School of Statistics and Data Science, Nankai University, Tianjin, China. Correspondence: [email protected]

Pages 240-252 | Received 10 Dec 2022, Accepted 06 Oct 2023, Published online: 27 Nov 2023

https://doi.org/10.1080/00401706.2023.2271091

References

Abbena, E., Salamon, S., and Gray, A. (2017), Modern Differential Geometry of Curves and Surfaces with Mathematica, Boca Raton, FL: CRC Press.
Ai, M. Y., Wang, F., Yu, J., and Zhang, H. M. (2021), “Optimal Subsampling for Large-Scale Quantile Regression,” Journal of Complexity, 62, 101512. DOI: 10.1016/j.jco.2020.101512.
Aronszajn, N. (1950), “Theory of Reproducing Kernels,” Transactions of the American Mathematical Society, 68, 337–404. DOI: 10.1090/S0002-9947-1950-0051437-7.
Bachem, O., Lucic, M., and Krause, A. (2017), “Practical Coreset Constructions for Machine Learning,” arXiv preprint arXiv:1703.06476.
Baker, D., Braverman, V., Huang, L. X., Jiang, S. F., Krauthgamer, R., and Wu, X. (2020), “Coresets for Clustering in Graphs of Bounded Treewidth,” in International Conference on Machine Learning, pp. 569–579, PMLR.
Blum, M., Floyd, R. W., Pratt, V. R., Rivest, R. L., and Tarjan, R. E. (1973), “Time Bounds for Selection,” Journal of Computer and System Sciences, 7, 448–461. DOI: 10.1016/S0022-0000(73)80033-9.
Breiman, L. (2001), “Random Forests,” Machine Learning, 45, 5–32. DOI: 10.1023/A:1010933404324.
Campbell, T., and Broderick, T. (2018), “Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent,” in International Conference on Machine Learning, pp. 698–706, PMLR.
Chen, Y. T., Welling, M., and Smola, A. (2010), “Super-Samples from Kernel Herding,” in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pp. 109–116.
Drineas, P., Mahoney, M. W., and Muthukrishnan, S. (2006), “Sampling Algorithms for ℓ2 Regression and Applications,” in Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, pp. 1127–1136. DOI: 10.1145/1109557.1109682.
Fang, K. T., Li, R. Z., and Sudjianto, A. (2006), Design and Modeling for Computer Experiments, London: Chapman and Hall/CRC.
Fang, K. T., Liu, M. Q., Qin, H., and Zhou, Y. D. (2018), Theory and Application of Uniform Experimental Designs (Vol. 221), Singapore: Springer.
Fang, K. T., and Wang, Y. (1994), Number-Theoretic Methods in Statistics, London: Chapman and Hall.
Freund, Y., and Schapire, R. E. (1997), “A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting,” Journal of Computer and System Sciences, 55, 119–139. DOI: 10.1006/jcss.1997.1504.
Friedman, J. H. (1991), “Multivariate Adaptive Regression Splines,” The Annals of Statistics, 19, 1–67. DOI: 10.1214/aos/1176347963.
Hanley, J. A., and McNeil, B. J. (1982), “The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve,” Radiology, 143, 29–36. DOI: 10.1148/radiology.143.1.7063747.
Hickernell, F. (1998), “A Generalized Discrepancy and Quadrature Error Bound,” Mathematics of Computation, 67, 299–322. DOI: 10.1090/S0025-5718-98-00894-1.
Hoare, C. A. R. (1962), “Quicksort,” The Computer Journal, 5, 10–16. DOI: 10.1093/comjnl/5.1.10.
Huang, C., and Joseph, V. R. (2022), supercompress: Supervised Compression of Big Data. R package version 1.1.
Joseph, V. R., and Mak, S. (2021), “Supervised Compression of Big Data,” Statistical Analysis and Data Mining: The ASA Data Science Journal, 14, 217–229. DOI: 10.1002/sam.11508.
Joseph, V. R., and Vakayil, A. (2022), “Split: An Optimal Method for Data Splitting,” Technometrics, 64, 166–176. DOI: 10.1080/00401706.2021.1921037.
Kaul, M. (2013), “3D Road Network (North Jutland, Denmark),” UCI Machine Learning Repository. DOI: 10.24432/C5GP51.
Kennard, R. W., and Stone, L. A. (1969), “Computer Aided Design of Experiments,” Technometrics, 11, 137–148. DOI: 10.1080/00401706.1969.10490666.
Kiefer, J. (1959), “Optimum Experimental Designs,” Journal of the Royal Statistical Society, Series B, 21, 272–304. DOI: 10.1111/j.2517-6161.1959.tb00338.x.
Ma, P., Mahoney, M. W., and Yu, B. (2015), “A Statistical Perspective on Algorithmic Leveraging,” Journal of Machine Learning Research, 16, 861–911.
Ma, P., and Sun, X. X. (2015), “Leveraging for Big Data Regression,” Wiley Interdisciplinary Reviews: Computational Statistics, 7, 70–76. DOI: 10.1002/wics.1324.
Mak, S., and Joseph, V. R. (2018), “Support Points,” The Annals of Statistics, 46, 2562–2592. DOI: 10.1214/17-AOS1629.
Meng, C., Xie, R., Mandal, A., Zhang, X. L., Zhong, W. X., and Ma, P. (2021), “Lowcon: A Design-based Subsampling Approach in a Misspecified Linear Model,” Journal of Computational and Graphical Statistics, 30, 694–708. DOI: 10.1080/10618600.2020.1844215.
Royston, P. (1992), “Approximating the Shapiro-Wilk w-test for Non-normality,” Statistics and Computing, 2, 117–119. DOI: 10.1007/BF01891203.
Sener, O., and Savarese, S. (2017), “Active Learning for Convolutional Neural Networks: A Core-Set Approach,” arXiv preprint arXiv:1708.00489.
Shapiro, S. S., and Francia, R. (1972), “An Approximate Analysis of Variance Test for Normality,” Journal of the American Statistical Association, 67, 215–216. DOI: 10.1080/01621459.1972.10481232.
Shapiro, S. S., and Wilk, M. B. (1965), “An Analysis of Variance Test for Normality (Complete Samples),” Biometrika, 52, 591–611. DOI: 10.1093/biomet/52.3-4.591.
Shi, C. L., and Tang, B. X. (2021), “Model-Robust Subdata Selection for Big Data,” Journal of Statistical Theory and Practice, 15, 1–17. DOI: 10.1007/s42519-021-00217-9.
Snee, R. D. (1977), “Validation of Regression Models: Methods and Examples,” Technometrics, 19, 415–428. DOI: 10.1080/00401706.1977.10489581.
Székely, G. J., and Rizzo, M. L. (2013), “Energy Statistics: A Class of Statistics based on Distances,” Journal of Statistical Planning and Inference, 143, 1249–1272. DOI: 10.1016/j.jspi.2013.03.018.
Vakayil, A., and Joseph, V. R. (2022a), “Data Twinning,” Statistical Analysis and Data Mining: The ASA Data Science Journal, 15, 598–610. DOI: 10.1002/sam.11574.
Vakayil, A., and Joseph, V. R. (2022b), twinning: Data Twinning. R package version 1.0.
Vakayil, A., Joseph, V. R., and Mak, S. (2021), SPlit: Split a Dataset for Training and Testing. R package version 1.0.
Wang, H. Y. (2019), “More Efficient Estimation for Logistic Regression with Optimal Subsamples,” Journal of Machine Learning Research, 20, 1–59.
Wang, H. Y., and Ma, Y. Y. (2021), “Optimal Subsampling for Quantile Regression in Big Data,” Biometrika, 108, 99–112. DOI: 10.1093/biomet/asaa043.
Wang, H. Y., Yang, M., and Stufken, J. (2019), “Information-based Optimal Subdata Selection for Big Data Linear Regression,” Journal of the American Statistical Association, 114, 393–405. DOI: 10.1080/01621459.2017.1408468.
Wang, H. Y., Zhu, R., and Ma, P. (2018), “Optimal Subsampling for Large Sample Logistic Regression,” Journal of the American Statistical Association, 113, 829–844. DOI: 10.1080/01621459.2017.1292914.
Whiteson, D. (2014), “Susy,” UCI Machine Learning Repository. DOI: 10.24432/C54606.
Yao, Y. Q., and Wang, H. Y. (2019), “Optimal Subsampling for Softmax Regression,” Statistical Papers, 60, 585–599. DOI: 10.1007/s00362-018-01068-6.
Zhang, A. J., Li, H. Y., Quan, S. J., and Yang, Z. B. (2018), Unidoe: Uniform Design of Experiments. R package version 1.0.2.
Zhang, M., Zhou, Y. D., Zhou, Z., and Zhang, A. J. (2023), “Model-Free Subsampling Method based on Uniform Designs,” IEEE Transactions on Knowledge and Data Engineering. DOI: 10.1109/TKDE.2023.3297167.
Zhou, Y. D., Fang, K. T., and Ning, J. H. (2013), “Mixture Discrepancy for Quasi-Random Point Sets,” Journal of Complexity, 29, 283–301. DOI: 10.1016/j.jco.2012.11.006.
Zhou, Z., Yang, Z. B., Zhang, A. J., and Zhou, Y. D. (2023), PDDS: Parallel Data-Driven Subsampling. R package version 1.0.0. DOI: 10.1080/00401706.2023.2271091.
Zhu, L. P., Li, L. X., Li, R. Z., and Zhu, L. X. (2011), “Model-Free Feature Screening for Ultrahigh-Dimensional Data,” Journal of the American Statistical Association, 106, 1464–1475. DOI: 10.1198/jasa.2011.tm10563.
