A clustering algorithm for overlapping Gaussian mixtures

Article: 2242337 | Received 17 Apr 2023, Accepted 21 Jul 2023, Published online: 15 Aug 2023

Abstract

Gaussian mixture models (GMM) are widely used as a probabilistic model for density estimation for multivariate data and as an unsupervised clustering algorithm that provides a soft clustering of the available data. The GMM relies on the expectation-maximization algorithm for maximizing the likelihood. A new approach is proposed in the present work, which relies on Approximate Bayesian Computation and aims not only to estimate the population parameters but also to assign each observation to a specific subpopulation. The performance of the new approach, in terms of the mean estimates of the parameters of each subpopulation, the number of observations assigned to each cluster, accuracy, and other performance metrics, is compared with that of the expectation-maximization algorithm for the GMM and of the K-means algorithm under several challenging simulation scenarios, and the new approach is also applied to a real data set.

1 Introduction

Cluster analysis is one of the most frequently used processes in unsupervised learning. Clustering aims to determine homogeneous subgroups in a data set by identifying similarities between data points’ characteristics (Ezugwu et al. Citation2022) and has many applications in different fields, like computer science, data science and statistics, healthcare, urban development and transportation, environmental and technological impact and innovations, and pattern recognition and image analysis (Liu, Sun, and Wang Citation2006; Jyoti and Singh Citation2011; Bandyopadhyay et al. Citation2014; Ding et al. 2023).

Clustering algorithms may be classified into two main categories, namely (a) hard, exclusive membership methods and (b) soft (fuzzy), overlapping techniques. In the latter, probability model-based approaches play an important role (Bouveyron and Brunet-Saumard Citation2014; Bouveyron et al. Citation2019). Among the most popular probability model-based approaches are the mixture models, especially the Gaussian Mixture Models (GMM), which not only provide a useful tool for modeling populations consisting of two or more subpopulations but also a powerful approach to revealing underlying structures in bi- and multi-modal data (Quiñones-Grueiro et al. Citation2019). For fitting a mixture model, the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin Citation1977a) is usually used due to the complexity of the likelihood function.

GMMs, and mixture models in general, estimate the probability that a data point belongs to each cluster and thus can be used as soft classifiers whenever clusters overlap. However, the phenomenon of cluster overlapping is not mathematically well characterized, especially in multivariate cases (Sun and Wang Citation2011), and modeling it correctly can, in many cases, be a challenging task.

In this paper, we aim at clustering based on the GMM in the case of not-well-separated, overlapping clusters, and we propose a likelihood-free approach based on the Approximate Bayesian Computation (ABC) algorithm to identify and estimate the parameters of the underlying subpopulations, using a discrepancy measure based on kernel density estimates. Kernel density approaches are often used in classification and clustering problems (see, for example, Taylor Citation1997; Hinneburg and Gabriel Citation2007; Anderson Citation2009), since a cluster (or a class, in classification problems) can be easily identified/defined by a local maximum of the estimated density function. We focus on Gaussian mixtures since they are not only suitable for many clustering algorithms but also some of the most frequently used models in many applications (Yang, Lai, and Lin Citation2012; McLachlan and Rathnayake Citation2014; Ahmed et al. Citation2019). Moreover, we compare the proposed method with the classical EM approach for the GMM and with the K-means clustering algorithm, which is one of the most popular nonparametric, hard, exclusive membership methods.

The rest of the paper is organized as follows: In Section 2, we briefly review the related work and present some motivating examples. In Section 3, the proposed method is presented in detail. In Sections 4 and 5 a simulation study is conducted and the proposed method is applied to a real data set, respectively. Finally, some concluding remarks are given in Section 6.

2 Preliminaries and related work

2.1 Gaussian mixture models as a clustering algorithm

Finite mixture models are often adopted to describe heterogeneity among individuals or groups in different applications and data (Economou and Caroni Citation2009; Economou Citation2013; Punzo Citation2019; McLachlan, Lee, and Rathnayake Citation2019). Under a finite mixture model, the population is assumed to consist of k subpopulations or groups with (probability) density function given by
$$f(\mathbf{x}\mid\boldsymbol{\lambda})=\sum_{j=1}^{k}p_j\, f_j(\mathbf{x}\mid\boldsymbol{\lambda}_j), \qquad (1)$$
where $\mathbf{x}$ is a d-dimensional vector of real-valued random variables, $p_j$, $j=1,2,\ldots,k$, are the subpopulation proportions with $\sum_{j=1}^{k}p_j=1$, $f_j(\mathbf{x}\mid\boldsymbol{\lambda}_j)$ is the (probability) density function of the jth subpopulation, and $\boldsymbol{\lambda}_j$ the corresponding parameter vector. Usually, the $f_j$'s have the same functional form, as in the rest of the paper, but different parameters.

One of the most frequently used mixture models is the so-called GMM. A GMM is a weighted sum of k component Gaussian densities,
$$f(\mathbf{x}\mid\boldsymbol{\lambda})=\sum_{j=1}^{k}p_j\, g(\mathbf{x}\mid\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j),$$
where $g(\mathbf{x}\mid\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)$, $j=1,\ldots,k$, are the k component Gaussian densities given by
$$g(\mathbf{x}\mid\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)=\frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}_j|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_j)'\boldsymbol{\Sigma}_j^{-1}(\mathbf{x}-\boldsymbol{\mu}_j)\right\},$$
with mean vector $\boldsymbol{\mu}_j$ and covariance matrix $\boldsymbol{\Sigma}_j$, and $\boldsymbol{\lambda}$ is the vector of all the parameters. The jth d-dimensional Gaussian distribution is denoted in the following as $MN(\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)$.
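For concreteness, the following minimal R sketch evaluates the density of a two-component bivariate GMM at a few points (illustrative only; it assumes the mvtnorm package, which is not mentioned in the paper, and uses made-up parameter values):

```r
# Minimal sketch: evaluating f(x) = p1*g(x|mu1,Sigma1) + p2*g(x|mu2,Sigma2)
# for a bivariate two-component GMM (illustrative parameter values).
library(mvtnorm)

gmm_density <- function(x, p, mu, Sigma) {
  # x: n x d matrix of evaluation points; p: mixing proportions;
  # mu: list of mean vectors; Sigma: list of covariance matrices
  dens <- 0
  for (j in seq_along(p)) {
    dens <- dens + p[j] * dmvnorm(x, mean = mu[[j]], sigma = Sigma[[j]])
  }
  dens
}

pts <- rbind(c(0, 0), c(2, 2), c(1, 1))
gmm_density(pts,
            p     = c(0.5, 0.5),
            mu    = list(c(0, 0), c(2, 2)),
            Sigma = list(diag(2), 2 * diag(2)))
```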

There are several techniques available for estimating the parameters of a GMM, but the most popular is the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin Citation1977b). During the EM algorithm, the Expectation (E-step) and Maximization (M-step) steps are repeated until convergence is achieved. In the bth iteration, the E-step updates the probabilities $P(\mathbf{x}_i\in j\mid \mathbf{x}_i,\boldsymbol{\mu}_j^{(b-1)},\boldsymbol{\Sigma}_j^{(b-1)},p_j^{(b-1)},\boldsymbol{\lambda}^{(b-1)})$, i.e., the probabilities that a data point $\mathbf{x}_i$, $i=1,\ldots,n$, belongs to component $j=1,\ldots,k$, using Bayes' rule
$$P(\mathbf{x}_i\in j\mid \mathbf{x}_i,\boldsymbol{\mu}_j^{(b-1)},\boldsymbol{\Sigma}_j^{(b-1)},p_j^{(b-1)},\boldsymbol{\lambda}^{(b-1)})=\frac{g(\mathbf{x}_i\mid\boldsymbol{\mu}_j^{(b-1)},\boldsymbol{\Sigma}_j^{(b-1)})\,p_j^{(b-1)}}{f(\mathbf{x}_i\mid\boldsymbol{\lambda}^{(b-1)})},$$
where $\boldsymbol{\mu}_j^{(b-1)}$, $\boldsymbol{\Sigma}_j^{(b-1)}$, $p_j^{(b-1)}$, and $\boldsymbol{\lambda}^{(b-1)}$ are the current values of the parameters, obtained in the previous iteration. These probabilities serve as soft labels for the membership of each data point $\mathbf{x}_i$ in each component j.

In the M-step, the new values of the parameters are determined by maximizing, with respect to the parameters and using $\gamma_{\mathbf{x}_i}(j)=P(\mathbf{x}_i\in j\mid \mathbf{x}_i,\boldsymbol{\mu}_j^{(b-1)},\boldsymbol{\Sigma}_j^{(b-1)},p_j^{(b-1)},\boldsymbol{\lambda}^{(b-1)})$, the expected value of the complete log-likelihood, given by
$$\sum_{i=1}^{n}\sum_{j=1}^{k}\gamma_{\mathbf{x}_i}(j)\left(\log p_j+\log g(\mathbf{x}_i\mid\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)\right).$$
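As an illustration only (not the implementation used in the paper), a compact R sketch of a single EM iteration for a k-component GMM, again assuming the mvtnorm package:

```r
# Illustrative sketch of one EM iteration for a k-component GMM.
# x is an n x d data matrix; p, mu, Sigma hold the current parameter values.
library(mvtnorm)

em_step <- function(x, p, mu, Sigma) {
  n <- nrow(x); k <- length(p)
  # E-step: posterior probabilities (soft labels) gamma[i, j]
  gamma <- sapply(seq_len(k), function(j) p[j] * dmvnorm(x, mu[[j]], Sigma[[j]]))
  gamma <- gamma / rowSums(gamma)
  # M-step: update mixing proportions, mean vectors, and covariance matrices
  for (j in seq_len(k)) {
    nj         <- sum(gamma[, j])
    p[j]       <- nj / n
    mu[[j]]    <- colSums(gamma[, j] * x) / nj
    xc         <- sweep(x, 2, mu[[j]])
    Sigma[[j]] <- crossprod(sqrt(gamma[, j]) * xc) / nj
  }
  list(p = p, mu = mu, Sigma = Sigma, gamma = gamma)
}
# In practice em_step() would be called repeatedly until the log-likelihood
# (or the parameter values) stops changing appreciably.
```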

Remark 1.

As already mentioned, the EM algorithm provides soft labels for each data point. However, the final fitted model can also be used as a hard clustering procedure by assigning each data point to the component yielding the highest posterior probability. Moreover, some modified approaches based on penalized maximum likelihood estimation have been suggested to obtain improved and more robust GMM-based clustering algorithms (Yang, Lai, and Lin Citation2012; He and Ho Citation2019).

2.2 K-means

The K-means algorithm is often considered one of the most frequently used clustering algorithms (Berkhin Citation2006). K-means is an iterative algorithm that divides the available observations into k clusters by (a) assigning each observation to the cluster with the nearest centroid (mean), (b) revising each cluster centroid (mean) based on the points assigned to that cluster, and (c) repeating steps (a) and (b) until convergence is achieved, i.e., until the labels of all the points no longer change.
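A minimal example of hard clustering with the base-R kmeans function (illustrative only; the synthetic data below are not the paper's examples):

```r
# Illustrative use of base-R kmeans on a small synthetic two-cluster data set.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

fit <- kmeans(x, centers = 2, nstart = 25)
fit$centers         # estimated centroids
table(fit$cluster)  # sizes of the hard clusters
```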

Although K-means is a hard clustering method and does not make any assumptions about the distribution of the underlying populations, as the GMM does, it can still be viewed as a limiting case of a GMM (Barber Citation2012) or as a special case of GMMs when truncated variational EM approximations are applied to GMMs with isotropic Gaussians (Lücke and Forster Citation2019). As a result, K-means is good at capturing (hyper)ball-shaped structures (Kłopotek, Wierzchoń, and Kłopotek Citation2020), and for that reason K-means is also considered, for comparison purposes, along with the GMM in the rest of the paper.

Remark 2.

An important extension of the K-means algorithm is the Fuzzy K-means. The Fuzzy K-means algorithm is a soft clustering algorithm that relies on the same principles as the K-means algorithm. In the Fuzzy K-means algorithm, each data point is not assigned to a single cluster, but membership scores are assigned to each data point for each cluster based on the distance between the cluster centroid and the data point (Bezdek Citation1981; Ferraro Citation2021).
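A brief sketch of how such membership scores can be obtained in R, assuming the fuzzy c-means implementation cmeans of the e1071 package (the choice of package is an assumption, not stated in the paper):

```r
# Illustrative fuzzy clustering with e1071::cmeans on synthetic data.
# The n x k membership matrix is used later as a data-driven prior (Section 3).
library(e1071)

set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

ffit <- cmeans(x, centers = 2)
head(ffit$membership)  # membership score of each point for each cluster
ffit$centers           # fuzzy cluster centroids
```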

2.3 Simulated examples

GMM and K-means are powerful clustering algorithms but can still underperform in certain situations in which overlapping subpopulations are present. More specifically, the following examples present cases in which every subpopulation is relatively close to at least one of the other subpopulations. The closeness of the subpopulations is measured by Matusita's measure of overlapping (Matusita Citation1955), and in all cases, every subpopulation presents a small to moderate degree of similarity with at least one of the remaining subpopulations. In all the following simulated examples, the weakness of GMM and K-means is apparent. K-means usually works well for unimodal clusters and well-separated groups, so it is expected to underperform in such situations. Nevertheless, K-means is also used for comparison purposes, since it is one of the most frequently used algorithms for clustering. On the other hand, GMM is more suitable for overlapping subpopulations, even if it is not expected to work perfectly whenever the subpopulations are not well separated (Celeux Citation1985). Of course, as the sample size increases, the GMM is expected to behave more acceptably as long as the mixing weights are not too extreme. For small sample sizes, such as the ones presented next, both GMM and K-means underperform whenever overlapping subpopulations are present, as will be demonstrated.

Example 1.

The first data set was generated from a three-component Gaussian mixture with $\boldsymbol{\mu}_1=(-2,-2)$, $\boldsymbol{\Sigma}_1=\begin{pmatrix}2 & 1\\ 1 & 2\end{pmatrix}$, $\boldsymbol{\mu}_2=(0,0)$, $\boldsymbol{\Sigma}_2=\begin{pmatrix}2 & -1\\ -1 & 2\end{pmatrix}$, $\boldsymbol{\mu}_3=(2,2)$, $\boldsymbol{\Sigma}_3=\begin{pmatrix}2 & 1\\ 1 & 2\end{pmatrix}$, and $p_i=1/3$, $i=1,2,3$. The values of Matusita's measure of overlapping, $\rho_{i,j}$, between the ith and the jth distributions are $\rho_{1,2}=\rho_{2,3}=0.53$ and $\rho_{1,3}=0.26$. These values indicate that the second component presents moderate overlap with each of the other distributions. The simulated sample consisted of 33 observations from the first subpopulation, 46 observations from the second subpopulation, and 41 observations from the third subpopulation. In Fig. 1(a) the contour plots of the three subpopulations, given the labels of the observations, are presented. The plotted lines in each contour plot define points with common values of the corresponding probability density function (pdf), estimated from the observed sample given the label of each data point. Lines closer to the center of each contour plot indicate higher values of the corresponding pdf, i.e., regions with higher probability. The number of lines in each contour plot and their closeness also represent how rapidly these values change; fewer and closer lines indicate a distribution with smaller standard deviations. In Fig. 1(b) the clusters obtained by the K-means are presented, while in Fig. 1(c) the clusters identified by the GMM are depicted. Fig. 1(d) presents the contour plots of the three clusters identified by the proposed method, presented in the following section. To avoid unnecessary repetition, we defer the discussion of the details of this plot to the next section.
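A sketch of how such a sample can be generated in R (assuming the MASS package and the sign conventions for the means and covariance matrices given above; the paper's seed and the realized subpopulation sizes will of course differ):

```r
# Illustrative generation of a sample resembling Example 1
# (assumed parameter values; not the paper's seed or realized counts).
library(MASS)

set.seed(123)
n  <- 120
mu <- list(c(-2, -2), c(0, 0), c(2, 2))
Sigma <- list(matrix(c(2, 1, 1, 2), 2),
              matrix(c(2, -1, -1, 2), 2),
              matrix(c(2, 1, 1, 2), 2))

labels <- sample(1:3, size = n, replace = TRUE, prob = rep(1/3, 3))
x <- t(sapply(labels, function(j) mvrnorm(1, mu = mu[[j]], Sigma = Sigma[[j]])))

table(labels)                                    # realized subpopulation sizes
plot(x, col = labels, pch = 19, xlab = "x", ylab = "y")
```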

Fig. 1 (a) The contour plots of the three subpopulations given the label of the observations generated under the distribution scheme of Example 1. (b) The clusters obtained by the K-means. (c) The clusters obtained by the GMM. (d) The clusters obtained by the proposed algorithm.


From Fig. 1 it is clear that both the K-means and the GMM fail to capture the true underlying distribution of the three clusters (as demonstrated in Fig. 1(a), despite the underrepresentation of the first subpopulation in that particular sample). More specifically, K-means seems to identify the centroid of each cluster correctly but fails to express correctly the variability (i.e., the variance) of each cluster, mainly due to the misclassification of some points (see Fig. 1(b)). On the other hand, the GMM fails completely to identify the centroids and the variance-covariance structure of all three clusters (see Fig. 1(c)).

Example 2.

The second data set was again generated from a three-component Gaussian mixture, but this time with $\boldsymbol{\mu}_1=(-4,0)$, $\boldsymbol{\Sigma}_1=\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$, $\boldsymbol{\mu}_2=(0,0)$, $\boldsymbol{\Sigma}_2=\begin{pmatrix}4 & 0\\ 0 & 4\end{pmatrix}$, $\boldsymbol{\mu}_3=(4,0)$, $\boldsymbol{\Sigma}_3=\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$, and $p_i=1/3$, $i=1,2,3$. In this case the values of Matusita's measure of overlapping between these distributions are $\rho_{1,2}=\rho_{2,3}=0.36$ and $\rho_{1,3}=3.35\cdot 10^{-4}$, indicating that the second component presents moderate to small overlap with each of the other distributions, since their mean vectors are, compared to the first example, further away from the mean vector of the second subpopulation.

Thirty-nine (39) observations were generated from the first subpopulation, 34 observations from the second subpopulation, and 46 observations from the third subpopulation. Their scatterplot, along with the contour plots of the three subpopulations estimated from this sample given the labels of the observations, is presented in Fig. 2(a). Clearly, there is significant overlap of the three clusters. The first (left, red) and the third (right, blue) subpopulations seem to be almost completely contained in the volume covered by the second subpopulation (middle, green). This feature is missed by both the K-means (which was expected, since K-means does not allow overlapping clusters) and the GMM; see Figs. 2(b) and 2(c), respectively. In particular, the clusters identified by the GMM clustering algorithm seem to behave more or less like the ones identified by the K-means, i.e., they present very small to no overlap. This is mainly due to the noticeable elongation that the middle cluster presents.

Fig. 2 (a) The contour plots of the three subpopulations given the label of the observations generated under the distribution scheme of Example 2. (b) The clusters obtained by the K-means. (c) The clusters obtained by the GMM. (d) The clusters obtained by the proposed algorithm.


Example 3.

The third data set was generated from a four-component Gaussian mixture with $\boldsymbol{\mu}_1=(-3,0)$, $\boldsymbol{\Sigma}_1=\begin{pmatrix}2 & 1\\ 1 & 2\end{pmatrix}$, $\boldsymbol{\mu}_2=(0,-1)$, $\boldsymbol{\Sigma}_2=\begin{pmatrix}2 & 0\\ 0 & 2\end{pmatrix}$, $\boldsymbol{\mu}_3=(3,2)$, $\boldsymbol{\Sigma}_3=\boldsymbol{\Sigma}_1$, $\boldsymbol{\mu}_4=(5,-1)$, $\boldsymbol{\Sigma}_4=\boldsymbol{\Sigma}_1$,

and $p_i=1/4$, $i=1,2,3,4$. The values of Matusita's measure of overlapping between these distributions are $\rho_{1,2}=0.45$, $\rho_{1,3}=9.69\cdot 10^{-2}$, $\rho_{1,4}=2.28\cdot 10^{-3}$, $\rho_{2,3}=0.39$, $\rho_{2,4}=0.18$, and $\rho_{3,4}=0.21$.

Thirty-six (36), 46, 43, and 35 observations were generated from the first, second, third, and fourth subpopulations, respectively. It is clear that three of the four clusters, corresponding to the last three subpopulations, were not identified correctly by either the K-means or the GMM algorithm (see Figs. 3(b) and 3(c)). In particular, the cluster of the second subpopulation (green) was completely missed by the GMM, since it shrank and covered a very small volume. Moreover, the fourth cluster, corresponding to the last subpopulation, seems to cover a smaller volume based on the K-means algorithm (see Fig. 3(b)) compared with that indicated in Fig. 3(a). The same cluster based on the GMM seems to cover a larger volume with a small positive correlation (see Fig. 3(c)).

Fig. 3 (a) The contour plots of the four subpopulations given the label of the observations generated under the distribution scheme of Example 3. (b) The clusters obtained by the K-means. (c) The clusters obtained by the GMM. (d) The clusters obtained by the proposed algorithm.

3 Proposed methodology

The proposed clustering method relies heavily on the so-called Approximate Bayesian Computation (ABC) method. ABC is a likelihood-free technique that is used to perform inference in cases in which the evaluation of the likelihood function is computationally costly or even infeasible (Rubin Citation1984; Diggle and Gratton Citation1984). It is used in many different applications and fields (see, for example, Economou et al. Citation2020; Park and Kwon Citation2022). The key idea behind ABC is to produce artificial data sets using some prior distribution and compare them with the observed sample. If a produced data set is “close enough” to the observed data set, then the parameters used to generate it are retained (not rejected). The procedure is repeated until M not-rejected artificial data sets are obtained, where M is a relatively large integer. Then, the M not-rejected parameter values are summarized to describe the posterior distribution of the parameters.

The ABC method for the proposed clustering algorithm is described in the following steps:

  1. (a) Adopt a prior distribution for the cluster membership score of each data point.

(b) Generate a data partition by assigning each data point to one of the k clusters, by generating an integer from 1 to k for each data point. Each integer represents one of the k clusters, and the integers are sampled using the prior distribution adopted for each data point.

  2. Using the nj members allocated to the jth cluster, estimate the sample mean, μ̂j, and the sample variance-covariance matrix, Σ̂j, j = 1,…,k.

  3. For j = 1,…,k, generate nj observations from the MN(μ̂j,Σ̂j) to form an artificial data set D*.

  4. Compute the discrepancy between the artificial data set D* and the observed data set D using a suitably defined discrepancy measure d.

  5. Retain (i) the partition, (ii) the sample means, μ̂j, and (iii) the sample variance-covariance matrices, Σ̂j, only if d<ϵ, for some ϵ>0. Go to step 1(b) and repeat until M not-rejected samples are obtained.
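To make steps 1(b)–3 concrete, the following R sketch draws one candidate partition from per-point prior membership weights and generates the corresponding artificial data set (an illustrative sketch, not the paper's code; the function and argument names are placeholders, and MASS is assumed for multivariate normal generation):

```r
# Sketch of steps 1(b)-3: sample a partition from the per-point prior weights,
# estimate per-cluster means and covariance matrices, and generate an
# artificial data set with the same cluster sizes.
library(MASS)  # mvrnorm

generate_candidate <- function(x, prior_w, k) {
  # x: n x d data matrix; prior_w: n x k matrix of prior membership scores
  labels <- apply(prior_w, 1, function(w) sample(1:k, size = 1, prob = w))
  mu    <- lapply(1:k, function(j) colMeans(x[labels == j, , drop = FALSE]))
  Sigma <- lapply(1:k, function(j) cov(x[labels == j, , drop = FALSE]))
  xstar <- do.call(rbind, lapply(1:k, function(j)
    mvrnorm(sum(labels == j), mu = mu[[j]], Sigma = Sigma[[j]])))
  # Note: degenerate partitions (clusters with fewer than two points) would
  # simply be re-drawn in practice; this detail is omitted here.
  list(labels = labels, mu = mu, Sigma = Sigma, xstar = xstar)
}
```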

The above algorithm describes the procedure to obtain artificial data sets that are similar to the observed data set. In the sequel, a few essential details for applying steps 1(a), 1(b), 4, and 5 of the previous algorithm in practice are discussed.

  1. (a) A flat prior categorical distribution for the cluster membership score can be adopted for each data point. If prior knowledge about the cluster sizes is available, a categorical distribution with different parameters, reflecting the mixing proportions of the underlying subpopulations, can be used instead. In this case, special attention should be given to the labeling of each cluster. A simple approach is to label the clusters according to the order of their mean values with respect to one variable. For example, if one believes that the middle cluster of a three-cluster data set, such as those of the examples in Section 2.3, is twice the size of the other two clusters, then a categorical distribution with parameters 1/4, 2/4, and 1/4 should be used.

Another approach, which is the one adopted in the rest of the paper, is to use a data-driven prior distribution by initially applying a soft (fuzzy) clustering algorithm to obtain an n × k matrix with the membership score of each data point with respect to the k clusters. From the examples presented in Section 2.3, it seems that the GMM fails, at least in some cases, to correctly identify the centroids of the clusters. On the other hand, the K-means algorithm seems to present more robust behavior regarding the centroids of the clusters. Given that Fuzzy K-means relies on the same principles as the K-means algorithm, using the membership score matrix obtained by the Fuzzy K-means as a data-driven prior distribution is recommended when applying the proposed algorithm.

(b) A similar procedure can be found in the Stochastic EM (SEM) algorithm proposed by Celeux (Citation1985). The SEM algorithm incorporates a stochastic step, the S-step, between the E- and M-steps of the classical EM algorithm, in which each observation is assigned to a subpopulation according to a multinomial distribution with parameters equal to the a posteriori probabilities of the E-step. This step prevents the EM algorithm from being trapped in local optima (Novais and Faria Citation2021). Although the idea seems similar, the proposed algorithm does not use this partition to compute the M-step of the EM algorithm; instead, it uses it to generate an artificial data set that is compared with the observed one.

4. A discrepancy measure can be defined based on the integrated squared error
$$T=\int_x\int_y\left(\hat f_D(x,y)-\hat f_{D^*}(x,y)\right)^2\,dy\,dx$$
between the kernel density estimates $\hat f_D(x,y)$ and $\hat f_{D^*}(x,y)$ of the observed sample D and the artificial sample D*. Under the null hypothesis that both samples share the same density, T is asymptotically normally distributed with known mean and standard deviation (for more details, see Duong, Goud, and Schauer Citation2012), so an approximate p-value can be calculated based on d, the standardized version of T.

5. Small absolute values of d, or equivalently large p-values, ensure that the retained partitions of the data points are more likely to resemble the true, unknown labeling of the data points and, as a consequence, that the estimated sample means, μ̂j, and sample variance-covariance matrices, Σ̂j, are closer to their true values. To ensure that only highly similar artificial data sets are retained, we propose adopting a large p-value threshold, for example 0.75, at least for the two-dimensional case and for relatively small sample sizes.
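One way to implement steps 4 and 5 in R is through the kernel-density-based two-sample test of Duong, Goud, and Schauer, available as kde.test in the ks package (using the ks package here is an assumption, consistent with the plug-in bandwidth selectors cited in Section 6; the helper name and threshold argument are illustrative):

```r
# Sketch of steps 4-5: compare the kernel density estimates of the observed
# and artificial samples with ks::kde.test and keep the candidate only if the
# approximate p-value exceeds the chosen threshold (0.75 in the paper).
library(ks)

passes_threshold <- function(x, xstar, p_min = 0.75) {
  test <- kde.test(x1 = x, x2 = xstar)
  test$pvalue >= p_min
}
```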

It is of note that the M not-rejected parameters can be used not only to summarize the posterior distribution of the parameters of the underlying Gaussian distributions but also to assign each data point to a specific subpopulation, i.e., cluster. This can be achieved by simply adopting the majority voting method.
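Putting the pieces together, a minimal sketch of the rejection loop and the majority-vote assignment (reusing the illustrative generate_candidate() and passes_threshold() helpers sketched above; the retained labels are assumed to have already been aligned across samples, as discussed in Remark 3 below):

```r
# Illustrative ABC clustering loop: keep M not-rejected partitions and assign
# each point to the cluster to which it was allocated most often.
abc_cluster <- function(x, prior_w, k, M = 101, p_min = 0.75) {
  kept <- list()
  while (length(kept) < M) {
    cand <- generate_candidate(x, prior_w, k)
    if (passes_threshold(x, cand$xstar, p_min)) kept[[length(kept) + 1]] <- cand
  }
  lab_mat <- sapply(kept, function(cand) cand$labels)  # n x M matrix of labels
  votes <- apply(lab_mat, 1,
                 function(l) as.integer(names(which.max(table(l)))))
  list(cluster = votes, retained = kept)
}
```

The retained means and variance-covariance matrices stored in the not-rejected samples can then be summarized (for example, averaged) to describe the posterior distribution of the parameters of each subpopulation.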

Remark 3.

In mixture models and clustering, “label switching” is always a problem that needs to be resolved before a majority voting method can be applied. A common approach to dealing with this problem is to set an identifiability constraint on the parameter space regarding the order of the mean values in one of the dimensions of the identified clusters. In cases where these mean values are relatively close, the order of the mean values of the identified cluster in a second dimension can also be used to label the identified clusters. This approach is also used in the simulations and in the application presented in Sections 4 and 5, respectively.
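For instance, a small sketch of the identifiability constraint described above, relabeling a retained partition so that the clusters are ordered by their mean value in the first coordinate (illustrative only):

```r
# Relabel a partition so that cluster 1 has the smallest mean of the first
# coordinate, cluster 2 the next smallest, and so on (illustrative sketch).
relabel_by_mean <- function(labels, x, k) {
  mx  <- sapply(1:k, function(j) mean(x[labels == j, 1]))
  ord <- order(mx)           # ord[new label] = old label
  match(labels, ord)         # new label of each point
}
```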

To make the proposed algorithm clearer, the first of the two following examples demonstrates the procedure analytically, while the second reports the results of applying the proposed algorithm to the data sets of Examples 1–3 presented in Section 2.3.

Example 4.

In Example 1, a sample consisting of 33 observations from the first subpopulation, 46 observations from the second subpopulation, and 41 observations from the third subpopulation was generated and analyzed with K-means and GMM. The scatterplot of this sample, using the labels, which in what follows are assumed unknown, is presented in Fig. 4(a). The estimated kernel density plot for the observed data is presented in Fig. 4(d). Regarding the label switching problem mentioned in Remark 3, the three identified clusters can be labeled as 1, 2, and 3 based on the order of their mean values with respect to the horizontal axis (from left to right).

Fig. 4 (a) The scatterplot of the three subpopulations given the label of the observations generated under the distribution scheme of Example 1. (b) A rejected data partition. (c) A not-rejected data partition. (d) The kernel density plot for the observed data. (e) The kernel density plot for the rejected artificial data set generated using the partition presented in subfigure (b). (f) The kernel density plot for the not-rejected artificial data set generated using the partition presented in subfigure (c).


From the analysis in Example 1, it is clear that the GMM underperforms with respect to the K-means, which justifies the use of the Fuzzy K-means algorithm to obtain a 120 × 3 matrix with the membership scores of each of the 120 data points with respect to the k = 3 clusters (step 1(a) of the algorithm). Based on this matrix, several different data partitions can be generated (step 1(b) of the algorithm). Two of them are presented in Figs. 4(b) and 4(c). Obviously, neither of them is 100% correct, but the partition presented in Fig. 4(b), which, as we will see, corresponds to the rejected partition, is less similar to the true partition presented in the scatterplot of Fig. 4(a) (see, for example, the two red points, marked with red circles, in the upper-right part of the plot).

Next (step 2 of the algorithm), the μ̂j and Σ̂j, j = 1, 2, 3, were calculated for each of the two partitions. Using the corresponding sample sizes nj of each partition, random observations were generated from the MN(μ̂j,Σ̂j) distributions in order to produce artificial data sets (step 3 of the algorithm). The kernel density plots for these artificial data sets are presented in Figs. 4(e) and 4(f). From these plots, it is clear that the kernel density plot in Fig. 4(f) more closely resembles the one obtained from the observed data. This is also demonstrated by the p-values obtained by testing the hypothesis that these samples share the same density as the observed sample. More specifically, the p-values are 0.05384673 and 0.7563502 for the artificial data sets of Figs. 4(e) and 4(f), respectively (step 5 of the algorithm). Since the p-value of the first artificial data set is smaller than the threshold of 0.75 set in the fifth step of the proposed algorithm, the corresponding data partition and the associated μ̂j and Σ̂j, j = 1, 2, 3, are dropped. The opposite occurs for the second artificial data set.

Examples 1–3 continued. In Section 2.3 we deferred the discussion of the details regarding the last plots, labeled as plots (d), in Figs. 1–3. These plots present the contour plots of the clusters identified by the proposed method using 101 not-rejected samples. In Examples 1 and 3, it is clear that the proposed method managed to capture the true underlying distribution of the subpopulations. In the second example, the performance of the proposed method is not as good, but it is still acceptable and outperforms that of the GMM and the K-means.

4 Simulation study

Examples 1–3 indicate that the proposed algorithm outperforms the GMM and K-means. However, to evaluate the performance of the proposed algorithm in cases in which the K-means and GMM are likely to underperform, and to compare its performance with the GMM and K-means clustering algorithms, a simulation study was carried out. More specifically, 1000 simulated samples were generated for each of the following scenarios:

Scenario 1: $\boldsymbol{\mu}_1=(-4,0)$, $\boldsymbol{\Sigma}_1=\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$, $\boldsymbol{\mu}_2=(0,0)$, $\boldsymbol{\Sigma}_2=\begin{pmatrix}2 & 0\\ 0 & 2\end{pmatrix}$, $\boldsymbol{\mu}_3=(4,0)$, $\boldsymbol{\Sigma}_3=\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$

Scenario 2: $\boldsymbol{\mu}_1=(-4,0)$, $\boldsymbol{\Sigma}_1=\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$, $\boldsymbol{\mu}_2=(0,0)$, $\boldsymbol{\Sigma}_2=\begin{pmatrix}4 & 0\\ 0 & 4\end{pmatrix}$, $\boldsymbol{\mu}_3=(4,0)$, $\boldsymbol{\Sigma}_3=\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$

Scenario 3: $\boldsymbol{\mu}_1=(-4,0)$, $\boldsymbol{\Sigma}_1=\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$, $\boldsymbol{\mu}_2=(-1,1)$, $\boldsymbol{\Sigma}_2=\begin{pmatrix}2 & 0\\ 0 & 2\end{pmatrix}$, $\boldsymbol{\mu}_3=(1,1)$, $\boldsymbol{\Sigma}_3=\begin{pmatrix}2 & 0\\ 0 & 2\end{pmatrix}$, $\boldsymbol{\mu}_4=(4,0)$, $\boldsymbol{\Sigma}_4=\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$

To make the comparison more straightforward, the sample size of each subpopulation was fixed at 40 observations. Additionally, to investigate the performance of the proposed algorithm with unequal cluster sizes, the sample sizes $(n_1,n_2,n_3,n_4)=(40,50,30,40)$ were also used for the four subpopulations of the third scenario.

For each data set, the K-means, the GMM (as a hard clustering algorithm), and the proposed algorithm (with M = 101) were applied. The value of M was selected through a small initial simulation study in which the membership status of each point in each cluster was carefully monitored. From this simulation, it was clear that for the majority of the cases (and points), the membership status of each point remained unchanged for M larger than 80–90 not-rejected samples. Furthermore, the value 101 was selected for M since 101 is a prime number, which makes it highly improbable that a point is assigned to two or more clusters exactly the same maximum number of times. In the unlikely event that this occurs, the membership status of such a point can be determined by simply increasing the value of M. The evaluation of the performance and the comparison of the three approaches were made in terms of the mean (based on the 1000 simulated samples for each scenario) of the following (see also the illustrative sketch after this list):

  • estimates for the parameters of each subpopulation,

  • number of observations assigned to each cluster

  • accuracy

  • Cohen’s Kappa statistic and

  • sensitivity, specificity, recall, and F1 score per cluster.
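One way to obtain most of these metrics in R, given the true and the predicted labels of a simulated sample, is via the caret package (the use of caret and the toy labels below are illustrative assumptions; any confusion-matrix utility would do):

```r
# Illustrative computation of accuracy, Cohen's kappa, and per-cluster
# sensitivity/specificity/F1 from toy true and predicted labels.
library(caret)

truth <- factor(c(1, 1, 2, 2, 3, 3, 3, 1, 2, 3))  # toy true labels
pred  <- factor(c(1, 1, 2, 3, 3, 3, 2, 1, 2, 3))  # toy predicted labels

cm <- confusionMatrix(data = pred, reference = truth)
cm$overall[c("Accuracy", "Kappa")]                   # overall metrics
cm$byClass[, c("Sensitivity", "Specificity", "F1")]  # per-cluster metrics
```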

In the upper tabular of Tables 1–4, the mean values of the posterior distributions of the parameters of each subpopulation and the mean number of observations assigned to each cluster over the 1000 simulated samples in each case are reported (note that μx and μy in the tables denote the mean values of X and Y, respectively, while σ11, σ12, and σ22 denote the (1,1)th, (1,2)th, and (2,2)th elements of the corresponding variance-covariance matrix). Their standard deviations are reported in the second tabular, while the two bottom tabulars report the performance evaluation metrics for each scenario. In all cases, the best values (closest to the true value, minimum standard deviation, or highest value of the performance evaluation metric) are indicated in bold.

Table 1 Summary results (Mean values-upper tabular, standard deviations-second tabular, and performance evaluation metrics-two bottom tabular) for the 1000 simulated samples from Scenario 1.

Table 2 Summary results (Mean values-upper tabular, standard deviations-second tabular, and performance evaluation metrics-two bottom tabular) for the 1000 simulated samples from Scenario 2.

Table 3 Summary results (Mean values-upper tabular, standard deviations-second tabular, and performance evaluation metrics-two bottom tabular) for the 1000 simulated samples from Scenario 3 with equal cluster sizes.

Table 4 Summary results (Mean values-upper tabular, standard deviations-second tabular, and performance evaluation metrics-two bottom tabular) for the 1000 simulated samples from Scenario 3 with non-equal cluster sizes.

In all the scenarios studied, the proposed method presents, on average, slightly better behavior with respect to the number of observations assigned to each cluster and the overall performance evaluation metrics (accuracy and kappa). Moreover, the mean values of the estimates of the parameters of each subpopulation for the proposed method were, in the majority of the cases, closer to their true values. In the cases in which the mean values of the estimates obtained by the proposed method were not the best, they were still close to the true ones. Furthermore, the estimates obtained by the proposed method seem to be less variable, since they present, in almost all cases, the smallest standard deviations.

Regarding the performance evaluation metrics for each cluster in every scenario, it seems that, with the exception of sensitivity, the proposed method presents the best performance with respect to all metrics in all the simulated scenarios among the three methods tested. Comparing Scenarios 1 and 2, which differ only in the variance of the middle subpopulation, it is clear that all the methods present better behavior in the first scenario. This indicates that when there is significant overlap of the subpopulations, all the methods have greater difficulty in assigning the observations to the correct cluster. Scenario 3 differs from Scenarios 1 and 2 in the number of subpopulations. From the results it seems, as might be expected, that as the number of overlapping subpopulations increases, the values of the performance evaluation metrics decrease, reflecting the more complex structure of the population. Moreover, the case of non-equal cluster sizes (Table 4) seems to suffer more compared with the equal-size clusters (Table 3), mainly due to the mean number of observations assigned to the clusters that are most extreme with respect to their number of observations.

Remark 4.

The proposed algorithm seems to perform better than K-means and GMM in the above-studied cases, i.e., in cases of relatively small sample sizes and overlapping subpopulations. This is probably due to the fact that the proposed algorithm not only permits overlapping of the subpopulations (which K-means does not) but also assigns the observed points to clusters that produce data sets similar to the observed one, and not simply to the multivariate normal components that maximize the component posterior probability given the data, as the GMM does. Very often, in cases of relatively small sample sizes and overlapping subpopulations, the log-likelihood surface is littered with saddle points, plateaux, ridges, and sub-optimal maxima, which often lead the GMM to wrong assignments and incorrect inference. On the other hand, ABC does not make use of the underlying likelihood function and focuses only on the distance between the observed data set and the artificial data sets generated under the assumed model. As a result, the proposed algorithm, which relies on the ABC algorithm, does not suffer as much as the EM in these cases and can be used to carry out not only Bayesian posterior inference, through the lens of density estimation, but also the assignment of each data point to a cluster.

Remark 5.

The proposed algorithm assumes a known number of clusters in the data. If the number of clusters is misspecified, then the performance of the algorithm may be seriously affected. From a parallel simulation study that assumed an incorrect number of clusters for the above-studied scenarios, it seems that whenever the number of clusters is underestimated, the acceptance rate may be small, and a large number of artificial data sets need to be generated before M not-rejected samples are obtained. On the other hand, if the number of clusters is overestimated, the number of rejected samples becomes very small. While this may be an indicator that the acceptance rate can be used as a measure of selecting the optimal number of clusters, the simulations did not highlight a clear, solid rule that can be applied in all cases. Moreover, it is worth mentioning that due to the overlapping of the subpopulations studied, the algorithm does not manage to assign zero (or almost zero) observations to the extra cluster(s), but splits the overlapping region into two or more clusters.

Remark 6.

The average acceptance rates (standard deviation, minimum, and maximum values) for the four simulation scenarios were 0.2014 (0.1263, 0.0002, 0.6392), 0.0928 (0.0770, 0.0002, 0.4529), 0.1503 (0.1015, 0.0005, 0.5050), and 0.2560 (0.1819, 0.0103, 0.6159), respectively. This means, for example, that 501.49 samples were required on average in the first scenario to obtain the desired 101 not-rejected samples. From the results, it is clear that there is such large heterogeneity in the acceptance rate across the simulated samples that no clear pattern can be identified.

5 Application

For illustration purposes, the proposed method is applied to a flea beetle data set that can be found in Lubischew (Citation1962) and Hand et al. (Citation1993). The data consist of 74 beetles of the flea beetle genus Chaetocnema, which contains three species, namely heikertingeri (number of observations: 31, subpopulation 1), heptapotamica (number of observations: 22, subpopulation 2), and concinna (number of observations: 21, subpopulation 3). These three subpopulations and the corresponding label of each observation are used to evaluate the performance of the proposed algorithm. For each observation, two characteristics of the beetle were measured: the maximal width (x) of the aedeagus in the forepart (in microns) and the front angle (y) of the aedeagus (1 unit = 7.5 degrees).

For comparison purposes, the K-means and the GMM are also applied.

5.1 Clustering algorithms

In Fig. 5(a) the contour plots of the three subpopulations given their true labels are presented. The red contour corresponds to heikertingeri, the green to heptapotamica, and the blue to concinna. In Fig. 5(b) the clusters obtained by the K-means are presented, while in Fig. 5(c) the clusters identified by the GMM are depicted. Fig. 5(d) presents the contour plots of the three clusters identified by the proposed method. Again, the approach to resolving the label-switching problem presented in Remark 3 was adopted, and the three identified clusters were labeled as 1, 2, and 3 based on the order of their mean values with respect to the horizontal axis, i.e., the clusters were labeled with respect to their mean value of the maximal width (x) of the aedeagus in the forepart (in microns). In Table 5 the confusion tables of the three approaches are also presented.

Fig. 5 (a) The contour plots of the three subpopulations given the labels of the observations for the Flea Beetle data set. (b) The clusters obtained by the K-means. (c) The clusters obtained by the GMM. (d) The clusters obtained by the proposed algorithm.


Table 5 Confusion tables for the flea beetle data set using the three clustering algorithms.

From Fig. 5, Table 5, and the first two tabulars of Table 6, it is clear that both the K-means and the GMM fail to capture the true underlying distribution of the three clusters/subpopulations. More specifically, K-means failed mainly to correctly identify the concinna species, since it assigned 9 of them (those with large values of the front angle of the aedeagus) to the second cluster, which consists mainly of heptapotamica beetles. On the other hand, the GMM allocated all the heptapotamica and concinna beetles to the third cluster and, at the same time, split the heikertingeri into two clusters. It is worth noticing that the second cluster identified by the GMM covers a very small volume and seems to be shrunken between the other two clusters, a behavior that was also observed in the simulated examples presented in Section 2.3.

The proposed algorithm managed to identify almost perfectly the first two species (heikertingeri and heptapotamica) and seems to suffer less than the K-means in identifying the third species (concinna). However, it is of note that the proposed method still fails to capture the true structure of the variance-covariance matrices of the underlying subpopulations, especially that of heptapotamica, as can also be seen in Table 6, in which the descriptive statistics for the three clusters identified by the three clustering algorithms are presented. Nevertheless, it is worth mentioning that in all cases the proposed algorithm identified the centroids of the subpopulations better. Finally, it is worth mentioning that 2534 samples (acceptance rate: 0.03986) were generated until 101 not-rejected samples were obtained.

Table 6 Descriptive statistics for the three identified clusters by the used clustering algorithms.

6 Conclusions and discussion

Clustering, whether hard or soft, is one of the most frequently used techniques for gaining intuition about the structure of a data set. Among the techniques appearing in the literature, the most frequently used are the K-means and the GMM. The GMM is also one of the most widely used probabilistic models for density estimation. Although K-means and GMM are powerful clustering algorithms, they can still underperform in certain situations, as presented in Section 2. In particular, in some situations the GMM tends to shrink certain subpopulations/clusters, resulting in incorrect partitions of the data.

To overcome these problems, a new algorithm was proposed based on Approximate Bayesian Computation. This method seems to be able to deal better with overlapping clusters and to provide more accurate estimates of the parameters of the underlying subpopulations. More research is required on how the proposed algorithm may be applied to high-dimensional data and/or to cases in which some clusters are well separated from the rest. For example, in the case of a cluster that is well separated from the remaining clusters, other discrepancy measures for comparing the samples, such as the kernel maximum mean discrepancy (Gretton et al. Citation2006), can also be studied. For high-dimensional data, since density estimation is usually a challenging task, (a) the determination of the threshold value with respect to the dimension of the data should also be considered, and (b) the use of different concepts, such as the Friedman-Rafsky test (Friedman and Rafsky Citation1979) and the nearest neighbor test (Henze Citation1988), may be worth studying to better compare the samples.


Moreover, a study of hyper-parameter tuning may be necessary to achieve even better performance of the proposed algorithm. In the present study, the default multivariate plug-in bandwidth selection with unconstrained pilot matrices, as implemented in R, was used (for more details, see Wand and Jones Citation1994; Duong and Hazelton Citation2003; Chacón and Duong Citation2010). Additionally, it would be interesting to develop, as future work, a technique based on the proposed algorithm to also determine the optimal number of underlying subpopulations/clusters in a given data set. Another interesting point for further research could be the combination of the proposed algorithm with other clustering algorithms to develop an ensemble clustering algorithm and/or to generate posterior samples given the best partition in order to provide more insights for a given sample (see, for example, Rastelli and Friel Citation2018).

Acknowledgments

The author thanks all the reviewers and the editor for their valuable comments and suggestions.

Disclosure statement

No potential conflict of interest was reported by the author.

References

  • Ahmed SRA, Al Barazanchi I, Jaaz ZA, Abdulshaheed HR. 2019. Clustering algorithms subjected to k-mean and Gaussian mixture model on multidimensional data set. Period Eng Nat Sci. 7(2):448–457.
  • Anderson TK. 2009. Kernel density estimation and k-means clustering to profile road accident hotspots. Accid Anal Prev. 41(3):359–364.
  • Bandyopadhyay S, Bhadra T, Mitra P, Maulik U. 2014. Integration of dense subgraph finding with feature clustering for unsupervised feature selection. Pattern Recognit Lett. 40:104–112.
  • Barber D. 2012. Bayesian reasoning and machine learning. Cambridge: Cambridge University Press.
  • Berkhin P. 2006. A survey of clustering data mining techniques. In: Kogan J, Nicholas C, Teboulle M, editors. Grouping multidimensional data: recent advances in clustering. Berlin Heidelberg: Springer. p. 25–71
  • Bezdek J. 1981. Pattern recognition with fuzzy objective function algorithms. Advanced applications in pattern recognition. New York: Springer. Reprint 2013.
  • Bouveyron C, Brunet-Saumard C. 2014. Model-based clustering of high-dimensional data: a review. Comput Stat Data Anal. 71:52–78.
  • Bouveyron C, Celeux G, Murphy TB, Raftery AE. 2019. Model-based clustering and classification for data science: with applications in R. Vol. 50. Cambridge: Cambridge University Press.
  • Celeux G. 1985. The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat Q. 2:73–82.
  • Chacón JE, Duong T. 2010. Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices. Test 19:375–398.
  • Dempster AP, Laird NM, Rubin DB. 1977a. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B Stat Methodol. 39(1):1–38.
  • Dempster AP, Laird NM, Rubin DB. 1977b. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B Stat Methodol. 39(1):1–22.
  • Diggle PJ, Gratton RJ. 1984. Monte carlo methods of inference for implicit statistical models. J R Stat Soc Series B Stat Methodol. 46(2):193–227.
  • Ding S, Li C, Xu X, Ding L, Zhang J, Guo L, Shi T. 2023. A sampling-based density peaks clustering algorithm for large-scale data. Pattern Recognit. 136:109238.
  • Duong T, Hazelton M. 2003. Plug-in bandwidth matrices for bivariate kernel density estimation. J Nonparametr Stat. 15(1):17–30.
  • Duong T, Goud B, Schauer K. 2012. Closed-form density-based framework for automatic detection of cellular morphology changes. Proc Natl Acad Sci. 109(22):8382–8387.
  • Economou P. 2013. Modelling survival data using mixtures of frailties. Statistics 47(2):453–464.
  • Economou P, Caroni C. 2009. Fitting parametric frailty and mixture models under biased sampling. J Appl Stat. 36(1):53–66.
  • Economou P, Batsidis A, Tzavelas G, Alexopoulos P, Alzheimers Disease Neuroimaging Initiative. 2020. Berkson’s paradox and weighted distributions: an application to alzheimer’s disease. Biom J. 62(1):238–249.
  • Ezugwu AE, Ikotun AM, Oyelade OO, Abualigah L, Agushaka JO, Eke CI, Akinyelu AA. 2022. A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Eng Appl Artif Intell. 110:104743.
  • Ferraro MB. 2021. Fuzzy k-means: history and applications. Econ Stat. in press.
  • Friedman JH, Rafsky LC. 1979. Multivariate generalizations of the wald-wolfowitz and smirnov two-sample tests. Ann Stat 7(4):697–717.
  • Gretton A, Borgwardt K, Rasch M, Schölkopf B, Smola A. 2006. A kernel method for the two-sample-problem. Adv Neural Inf Process Syst. 19:513–520.
  • Hand DJ, Daly F, McConway K, Lunn D, Ostrowski E. 1993. A handbook of small data sets. Boca Raton, FL: CRC Press.
  • He Z, Ho C-H. 2019. An improved clustering algorithm based on finite Gaussian mixture model. Multimed Tools Appl 78:24285–24299.
  • Henze N. 1988. A Multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann Stat. 16(2):772–783.
  • Hinneburg A, Gabriel H-H. 2007. Denclue 2.0: Fast clustering based on kernel density estimation. In: Advances in Intelligent Data Analysis VII: 7th International Symposium on Intelligent Data Analysis, IDA 2007, Ljubljana, Slovenia, September 6-8, 2007. Proceedings 7. Springer. p. 70–80.
  • Jyoti K, Singh S. 2011. Data clustering approach to industrial process monitoring, fault detection and isolation. Int J Comput Appl 17(2):41–45.
  • Kłopotek MA, Wierzchoń ST, Kłopotek RA. 2020. k-means cluster shape implications. In: Artificial Intelligence Applications and Innovations: 16th IFIP WG 12.5 International Conference, AIAI 2020, Neos Marmaras, Greece, June 5–7, 2020, Proceedings, Part I 16. Springer. p. 107–118.
  • Liu J, Sun J, Wang S. 2006. Pattern recognition: an overview. Int J Comput Sci Netw Secur. 6(6):57–61.
  • Lubischew AA. 1962. On the use of discriminant functions in taxonomy. Biometrics 18:455–477.
  • Lücke J, Forster D. 2019. k-means as a variational EM approximation of Gaussian mixture models. Pattern Recognit Lett. 125:349–356.
  • Matusita K. 1955. Decision rules, based on the distance, for problems of fit, two samples, and estimation. Ann Math Stat. 26:631–640.
  • McLachlan GJ, Rathnayake S. 2014. On the number of components in a Gaussian mixture model. WIREs Data Min Knowl Discov 4(5):341–355.
  • McLachlan GJ, Lee SX, Rathnayake SI. 2019. Finite mixture models. Ann Rev Statistics Appl. 6:355–378.
  • Novais L, Faria S. 2021. Comparison of the EM, CEM and SEM algorithms in the estimation of finite mixtures of linear mixed models: a simulation study. Comput Stat 36(4):2507–2533.
  • Park J, Kwon J. 2022. Wasserstein approximate bayesian computation for visual tracking. Pattern Recognit. 131:108905.
  • Punzo A. 2019. A new look at the inverse Gaussian distribution with applications to insurance and economic data. J Appl Stat. 46(7):1260–1287.
  • Quiñones-Grueiro M, Prieto-Moreno A, Verde C, Llanes-Santiago O. 2019. Data-driven monitoring of multimode continuous processes: a review. Chemometr Intell Lab Syst. 189:56–71.
  • Rastelli R, Friel N. 2018. Optimal bayesian estimators for latent variable cluster models. Stat Comput. 28:1169–1186.
  • Rubin DB. 1984. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann Stat. 12(4):1151–1172.
  • Sun H, Wang S. 2011. Measuring the component overlapping in the Gaussian mixture model. Data Min Knowl Discov. 23:479–502.
  • Taylor C. 1997. Classification and kernel density estimation. Vistas Astron. 41(3):411–417.
  • Wand MP, Jones MC. 1994. Multivariate plug-in bandwidth selection. Comput Stat. 9(2):97–116.
  • Yang MS, Lai C-Y, Lin C-Y. 2012. A robust EM clustering algorithm for Gaussian mixture models. Pattern Recognit. 45(11):3950–3961.