Efficient Model-Free Subsampling Method for Massive Data

Pages 240-252 | Received 10 Dec 2022, Accepted 06 Oct 2023, Published online: 27 Nov 2023

