Statistical Learning

Statistical Significance of Clustering with Multidimensional Scaling

Pages 219-230 | Received 24 May 2022, Accepted 19 May 2023, Published online: 20 Jul 2023
 

Abstract

Clustering is a fundamental tool for exploratory data analysis. One central problem in clustering is deciding whether the clusters discovered by clustering methods are reliable, as opposed to being artifacts of natural sampling variation. Statistical significance of clustering (SigClust) is a recently developed cluster evaluation tool for high-dimensional, low-sample-size data. Despite its successful application to many scientific problems, there are cases where the original SigClust may not work well. Furthermore, for specific applications, researchers may not have access to the original data and have only the dissimilarity matrix. In this case, clustering is still a valuable exploratory tool, but the original SigClust is not applicable. To address these issues, we propose a new SigClust method using multidimensional scaling (MDS). The underlying idea behind MDS-based SigClust is that one can obtain low-dimensional representations of the original data via MDS using only the dissimilarity matrix and then apply SigClust in the low-dimensional MDS space. The proposed MDS-based SigClust can circumvent the challenge of parameter estimation that the original method faces in high-dimensional spaces while keeping the essential clustering structure in the MDS space. Both simulations and real data applications demonstrate that the proposed method works remarkably well for assessing the statistical significance of clustering. Supplementary materials for this article are available online.
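
The two-step idea in the abstract (embed with MDS using only the dissimilarity matrix, then test the clustering in the embedded space) can be illustrated with a minimal Python sketch. This is not the authors' implementation from the supplementary code, and it rests on several assumptions: it uses classical (Torgerson) MDS, although the abstract does not say which MDS variant is used; it measures clustering strength with the 2-means cluster index (within-cluster sum of squares divided by total sum of squares); and it simulates the null from a single Gaussian whose variances are the plain sample variances of the MDS coordinates, which is a simplification. The function names classical_mds, two_means_cluster_index, and mds_sigclust_pvalue, and the parameters k, n_sim, and seed, are hypothetical.

```python
import numpy as np


def classical_mds(D, k):
    """Embed n points in k dimensions from an n x n dissimilarity matrix D.

    Assumption: classical (Torgerson) MDS; the abstract does not name a variant.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)            # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]          # indices of the k largest
    vals_k = np.clip(vals[idx], 0.0, None)    # guard against small negatives
    return vecs[:, idx] * np.sqrt(vals_k)


def two_means_cluster_index(X, rng, n_init=5, n_iter=50):
    """2-means cluster index: within-cluster SS / total SS (smaller = stronger)."""
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    best = np.inf
    for _ in range(n_init):
        # Random restart: two distinct data points as initial centers.
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            if labels.min() == labels.max():  # a cluster emptied; stop this run
                break
            new_centers = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d.min(axis=1).sum())
    return best / total_ss


def mds_sigclust_pvalue(D, k=2, n_sim=200, seed=0):
    """Return (observed cluster index, Monte Carlo p-value) from D alone."""
    rng = np.random.default_rng(seed)
    X = classical_mds(D, k)
    observed = two_means_cluster_index(X, rng)
    # Simplified null (an assumption): one Gaussian whose variances are the
    # eigenvalues of the sample covariance of the MDS coordinates.
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    null = np.empty(n_sim)
    for b in range(n_sim):
        Z = rng.standard_normal(X.shape) * np.sqrt(eigvals)
        null[b] = two_means_cluster_index(Z, rng)
    # Small p-value: the observed split is tighter than a single Gaussian gives.
    return observed, float((null <= observed).mean())
```

Calling mds_sigclust_pvalue(D) on a symmetric dissimilarity matrix D returns the observed cluster index together with the fraction of simulated null indices that are at least as small. The original data never enter the computation, which is the situation the abstract describes.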

Supplementary Materials

Appendix: Proofs of all theoretical results for Section 3 and additional numerical results for Sections 4 and 5. (Appendix.pdf, pdf file)

Code and data: Example code and code for each section in the manuscript and the Appendix. Please read the file README contained in the zip file for more details. (code_and_data.zip, zip file)

Acknowledgments

The authors are indebted to the editor, the associate editor, and two reviewers, whose helpful suggestions led to a much improved presentation.

Disclosure Statement

The authors report there are no competing interests to declare.

Additional information

Funding

The authors were supported in part by NSF grants DMS-2113662 (Bhamidi and Shen), DMS-1613072 (Bhamidi), DMS-1606839 (Bhamidi), DMS-2134107 (Bhamidi), DMS-2100729 (Liu), SES-2217440 (Liu); ARO grant W911NF-17-1-0010 (Bhamidi); and NIH grant R01-GM126550 (Liu and Shen).
