Research Article

Application of Density-Based Clustering Approaches for Stock Market Analysis

Article: 2321550 | Received 25 Aug 2020, Accepted 13 Feb 2024, Published online: 04 Mar 2024

ABSTRACT

The present economy is largely dependent on precise forecasting of business avenues using stock market data. As stock market data falls under the category of big data, handling it becomes complex due to the presence of a large number of investment choices. In this paper, investigations have been carried out on stock market data analysis using various density-based clustering approaches. For experimentation, stock market data from the Quandl stock market was used. The Dynamic Quantum clustering approach was observed to be more effective because of its better ability to adapt to the changing patterns of stock market data. The performances of other density-based clustering approaches, namely Weighted Adaptive Mean Shift Clustering, DBSCAN and Expectation Maximization, as well as partitive clustering methods such as k-means, k-medoids and fuzzy c-means, were also evaluated on the same stock market data. The performance of all the approaches was tested in terms of standard measures. It was found that in the majority of cases, Dynamic Quantum clustering outperforms the other density-based clustering approaches. The algorithms were also subjected to paired t-tests, which confirmed the statistical significance of the results obtained.

Introduction

The stock market guides future business avenues, which, in turn, influence the economy of a country. Stock market data analysis is therefore an important research issue for stock market prediction, and accurate prediction will guide the economy in a healthy direction. Creating portfolios of the various companies in the stock market is a key concern for investors, as investing in companies is quite expensive; it is also a very complex job. Since stock market data falls under the category of big data, conventional data mining tools do not work efficiently for extracting from it the information that can support better stock market prediction. Many machine learning-based techniques have been developed for this purpose. The aim of these techniques is to find effective measures to select stocks having distinguished features (Das and Saha Citation2019; Rasekhschaffe and Jones Citation2019). Various clustering-based algorithms have been developed by past researchers for stock market data analysis (Bini and Mathew Citation2016; Cai, Le-Khac, and Kechadi Citation2016; Chaudhuri and Ghosh Citation2016; Lee et al. Citation2010). But most of them exhibit significant limitations and poor performance owing to the high-dimensional and random nature of time series stock market data. Also, the existence of noise in the data often results in unsatisfactory performance of the traditional methodologies (Nair et al. Citation2017).

Organizational functions nowadays largely depend on stock market data, which falls under the big data category (Sivarajah et al. Citation2017). The information embedded in this type of big data is extensively used by organizations for taking important decisions regarding their future growth (Data and Hub Citation2016). Many organizations have already tuned themselves to this category of big data in order to gain greater competitive advantage (Zillner et al. Citation2016). In particular, financial organizations have turned toward stock market-based big data analytics to make better investment choices (Prabhu et al. Citation2019). As these analyses deal with vast historical data, significant challenges exist in extracting the desired information from it (Alexander et al. Citation2017). The huge volume of this type of data demands more refined statistical methods for achieving precise results.

Several algorithms were tried by past researchers to overcome these limitations by analyzing the effect of changes in the parameters on the clustering mechanism (Dempster, Laird, and Rubin Citation1977; Ester et al. Citation1996; Ren et al. Citation2014; Weinstein and Horn Citation2009). Clustering approaches based on the density of the data points help to create clusters that are homogeneous with respect to their returns and volatilities (Dinandra, Hertono, and Handari Citation2019). Although many researchers have applied density-based clustering approaches for analyzing different aspects of stock market data (Gupta et al. Citation2023; Han, He, and Toh Citation2023; Parvatha et al. Citation2023; Rukmi et al. Citation2019), the techniques are limited to the DBSCAN algorithm only, and the other modern density-based clustering approaches have not yet been tried for stock market data analysis.

Therefore, in this investigation, experiments have been carried out using various density-based clustering approaches (namely Dynamic Quantum Clustering and Weighted Adaptive Mean Shift Clustering, along with DBSCAN and Expectation Maximization) for grouping companies on the basis of their market trends over a two-year period. To achieve this, the historic returns and volatility are measured for each of the stocks and used for clustering. From the clustering results, the returns of the stocks are predicted using the concept of the moving average. Most classification methods for stock market prediction require a large number of training samples, whereas the methods presented in this article are density-based approaches that do not require any training samples.

In this paper, the role of density-based clustering approaches was investigated for the purpose of stock market data analysis. The role of Dynamic Quantum Clustering (DQC) in stock market data analysis has been thoroughly investigated and its performance compared with other density-based clustering approaches, namely Weighted Adaptive Mean Shift Clustering (WAMSC), Density-based spatial clustering of applications with noise (DBSCAN) and Expectation Maximization (EM), as well as with three partitive clustering methods previously applied to stock market data, namely k-means (Nanda, Mahanty, and Tiwari Citation2010), k-medoids (Nakagawa, Imamura, and Yoshida Citation2019) and fuzzy c-means (Esfahanipour and Aghamiri Citation2010). It was found that, when evaluating stock market prediction capability, Dynamic Quantum clustering outperforms all the other density-based clustering approaches.

The advantage of choosing density-based clustering approaches is that the number of clusters need not be specified; it is determined automatically by the algorithms based on the density of the data. Additionally, since the methods presented in this paper are density based, they can detect arbitrarily shaped clusters (both convex and concave), and outliers can also be detected quite effectively. The Dynamic Quantum clustering approach in particular is a very novel approach in this field. Moreover, the future trend of the stocks under investigation can be predicted by applying the concept of moving averages of the respective stocks on top of the clustering results, which are based on the historic returns and volatility factors of the stocks obtained by the different density-based clustering approaches. The objective of the investigation is to devise mechanisms for analyzing the stock market so as to form effective trading decisions.

The rest of the article is arranged in the following manner. Section 2 gives a glimpse of some of the previous clustering-based investigations performed on the stock market data. Section 3 discusses the various algorithms considered for our work along with the information regarding the stock market dataset used. Section 4 describes the experimental evaluation of the algorithms using various performance measures. Section 5 summarizes the experimental results obtained. Finally, Section 6 provides a brief conclusion of our work with a highlight on the subsequent direction of research.

Related Works

This section highlights some of the works done by previous researchers on the forecasting of the stock market along with their advantages and limitations.

Nanda, Mahanty, and Tiwari (Citation2010) used k-means for constructing diversified portfolios of stocks from the Bombay Stock Exchange for the fiscal year 2007–2008, obtained from the Capitaline Databases Plus. The results of their study were compared with SOM and fuzzy c-means, and intraclass inertia was used for performance analysis. One of the major drawbacks of k-means is that it is unable to capture the structure of the market when it becomes concave.

In a similar work (Esfahanipour and Aghamiri Citation2010), a Neuro-Fuzzy Inference System adapted on a Takagi–Sugeno–Kang (TSK)-type Fuzzy Rule-Based System was developed for stock price prediction. In this work, fuzzy c-means clustering was implemented for specifying the number of rules. The dataset used was the Tehran Stock Exchange Indexes (TEPIX), and performance was measured using the mean absolute percentage error (MAPE). The model forecasted the TSE index and some individual Tehran Stock Exchange indexes.

Baser and Saini tried to generate clusters using some well-known methods, namely k-means, k-medoids, and fast k-means (Baser and Saini Citation2015), to help investors frame optimum portfolios and better risk-return profiles. The dataset considered for the purpose comprised the Nifty 50 corporations. The effectiveness of the three approaches was measured using intra-class inertia, which pointed toward the better performance of the k-means algorithm, which produces clusters of compact size in contrast to the clusters produced by the other approaches. The clusters produced by the k-means algorithm were then used for asset building and portfolio study. The study can be further improved by comparison with other clustering-based methods, so that diverse portfolios with optimum risk-returns can be suggested for investors.

In a very preliminary work (Nair et al. Citation2017), a system was proposed that enables investors to determine viable options for profit maximization and to extract appropriate knowledge from the stock market. In this approach, dimensionality reduction is done using regression trees, after which the data is subjected to Self Organizing Maps (SOM) for clustering. The model is evaluated in terms of training error, testing error, number of rules and number of parameters, which are found to be .035, .072, 6 and 67, respectively. The performance of the model is not very promising, and only a small amount of data was used for the experiment.

In another work (Cheong et al. Citation2017), a portfolio optimization strategy is demonstrated in which the clustering approach is assisted by a genetic algorithm (GA) for efficient handling of portfolios from the KOSPI 200. Initially, a group of portfolios is generated using the k-means clustering technique. After that, a GA is applied to optimize the chosen stocks based on their respective weights. Results show that this amalgamated approach of k-means with the GA works efficiently for the Korean stock market. However, since k-means produces convex clusters, the approach may fail to capture the actual shape of the market when it is concave.

The P-spline-based clustering method proposed in Iorio et al. (Citation2018) consists of designing penalized spline smoothers for each individual time series data frame. Each data frame is then subjected to a clustering approach on the spline coefficients thus computed. Segregating the entire dataset in this way makes it possible to choose assets from the generated clusters. The outcome of this research was a non-fluctuating portfolio of stocks from the market.

The research by Nagy and Ormos (Citation2018), using an approach established on the idea of spectral clustering, demonstrates the dependency of stock prices on network-level data in addition to the individual firm. The research comprised an exhaustive investigation of 59 stock indices, and the efficiency was measured by the augmented Dickey-Fuller (ADF) test. The stocks were first clustered and then the equity index graph was generated using the historical daily closing prices.

Nakagawa, Imamura, and Yoshida (Citation2019) used k-medoids algorithms with the Indexing Dynamic Time Warping method to predict the number of price fluctuation clusters. The dataset considered is the TOPIX dataset from the Japanese stock market. Using the performance measures of average accuracy and total returns, it was evident that the approach was effective in predicting monthly stock price changes.

Chandar (Citation2019) introduced a technique based on subtractive clustering-based adaptive neuro-fuzzy systems for forecasting the price of Apple stock. The inputs to the neuro-fuzzy model for forecasting the stock prices were four technical indicators. For evaluating the effectiveness of this approach, standard criteria like training and testing error, number of rules and number of parameters were computed. One of the drawbacks of the system is that every time the model has to be re-trained, it has to pause for a full cycle.

Rukmi et al. (Citation2019) used a density-based clustering approach to detect patterns of stock trading deviation. The dataset considered consists of stock trading transactions. DBSCAN helps in identifying the outlier transactions, which have a very different pattern with respect to conventional transactions. The primary objective of this work is to detect stock price manipulation as outlier transactions done by securities brokers.

Based on the ideas of Partitioning Around Medoids (PAM) and fuzzy theory, D’Urso et al. (Citation2020) developed a clustering technique for stock market data depending on the predicted cepstrum. The cepstrum can be described as the spectrum of the logarithm of the spectral density function. The developed technique has been effective in computing appropriate weights in the clustering approach, thus auto-tuning the importance of individual coefficients in the clustering process. The performance of the model was evaluated using the Sharpe ratio. However, the proposed method cannot handle the impact of outliers.

Parvatha et al. (Citation2023) first use sentiment analysis on raw data from various social media platforms to convert it into valuable data. DBSCAN is then applied to the data to evaluate the best approach for stock prediction based on the parameters of the DBSCAN algorithm. The outcomes obtained are then compared with approaches like k-means, LSTM and CNN using metrics like MAE, MSE and RMSE.

Han, He, and Toh (Citation2023) develop a pairs trading strategy via unsupervised learning and apply it to the US stock market from January 1980 to December 2020. Three common clustering approaches, namely k-means clustering, density-based spatial clustering of applications with noise (DBSCAN), and agglomerative clustering, were implemented, and results show that the long-short portfolio constructed via agglomerative clustering earns a statistically significant annualized mean return. The portfolio of stocks thus obtained tends to have similar future price movements, and possibilities for shorting overpriced stocks and longing underpriced stocks within the same cluster can be found.

Gupta et al. (Citation2023) present a perspective on improving computation time using an unsupervised machine learning approach, namely PCA, followed by DBSCAN for determining the best pair of stocks. The investigation is evaluated on historical National Stock Exchange data, which indicates significant performance of the work with regard to the NIFTY-50 Global benchmark.

Proper analysis of the framework of the global stock market is essential for spreading out risk evenly so that equity portfolios can be constructed effectively. But building a relevant portfolio is fairly sophisticated. The assumption made in most research works that the framework of the stock market is linear is not very useful for obtaining significant outcomes (Erdős, Ormos, and Zibriczky Citation2011; Maldonado and Saunders Citation1981; Song et al. Citation2011). Additionally, external factors have a great influence on the framework of the market, wherein uncorrelated equities can begin to move simultaneously (Heiberger Citation2014). Also, in some cases imbalanced data poses challenges as well as opportunities in the context of credit risk (Mahbobi, Kimiagari, and Vasudevan Citation2021). In such practical scenarios, the minority classes are most likely to provide the required objectives.

The objective of this research was to investigate appropriate approaches for establishing the structure of the stock market based on correlation rather than linearity, so that diversification becomes consistent. The spherical clusters detected by k-means type algorithms (Tan, Steinbach, and Kumar Citation2016) (k-means, k-medoids, fast k-means) may not capture the real shape of the stock market data, which may sometimes be concave in nature. Pre-determining the optimal number of clusters is also a fundamental issue in k-means clustering.

Few investigations using density-based clustering approaches (Gupta et al. Citation2023; Han, He, and Toh Citation2023; Parvatha et al. Citation2023) have been carried out in the past for stock market data analysis, and those are limited to the DBSCAN clustering approach; no other density-based clustering approaches have been applied so far for stock market data analysis. The fundamental idea of the density-based concept is to segregate the sample space based on the density of the data rather than a mean centroid. This concept helps to determine tangled as well as irregularly shaped clusters (Tan, Steinbach, and Kumar Citation2016). Clusters of different shapes (convex and concave) can also be detected by density-based models (Bhattacharjee and Mitra Citation2021; Chowdhury, Bhattacharyya, and Kalita Citation2021). Moreover, density-based techniques have proven able to determine the optimal number of clusters from relevant data (Scitovski and Sabo Citation2020), thus enhancing the performance of clustering. Table 1 summarizes the notable works done toward stock market data analysis.

Table 1. Summary of related works.

Considering these factors, and since no density-based clustering technique except DBSCAN has been applied, in this research we investigate the clustering of stock companies using density-based clustering approaches such as Weighted Adaptive Mean Shift Clustering (WAMSC) and Dynamic Quantum Clustering (DQC), along with the DBSCAN and Expectation Maximization (EM) algorithms, to select the best performing model for diversified stock prediction.

Material and Methods

In this section, we present a short summary of the foundations of our research, which deals with evaluating the best density-based clustering approach for stock market data analysis, a domain exhibiting all the big data characteristics. That is, we seek a density-based clustering approach that is also well suited to big data analysis. We briefly discuss the different density-based clustering algorithms used in the experiments. Section 3.1 gives a concise idea of Density-based spatial clustering of applications with noise (DBSCAN). Section 3.2 covers the technique of Expectation Maximization (EM). Section 3.3 provides the basics of Weighted Adaptive Mean Shift clustering, and section 3.4 outlines Dynamic Quantum clustering, whose performance was found to be the best for the purpose. Section 3.5 provides the details of the datasets used for this work.

Density-Based Spatial Clustering of Applications with Noise

Density-based spatial clustering of applications with noise (DBSCAN) is a clustering mechanism which is significantly robust in detecting irregularly shaped clusters (Ester et al. Citation1996). It is capable of achieving satisfactory clustering accuracy even in the presence of noise (Yue et al. Citation2004).

The two fundamental attributes of DBSCAN are Eps and MinPts. Eps represents the cutoff distance of a point from a core point for it to be considered part of a cluster. MinPts represents the least possible number of data points necessary to form a cluster. The significance of these two attributes is that they are the parameters that decide the number of clusters in the given dataset (Starczewski and Cader Citation2019), which varies according to the data distribution.

Three types of data points are identified after the execution of the DBSCAN algorithm: core points, border points and noise. An instance is labeled as a core point if it has at least MinPts instances within its Eps neighborhood. An instance is termed a border point if it has fewer than MinPts points within its Eps neighborhood but lies within the neighborhood of a core point. The instances which cannot be categorized as core points or border points are considered noise.

The determination of Eps in DBSCAN is an important step for determining the number of clusters. For this purpose, we have used an approach introduced by Rahmah and Sitanggang (Citation2016). An acceptable value of Eps is determined by first computing, for each point, the distance to its nearest points. These distances are then sorted and analyzed to see at which point the most distinct change occurs; that point is selected as Eps.

Procedures A1 and A2 provide the pseudo-code of the DBSCAN approach, and Procedure A3 gives the working mechanism for determining Eps.

Procedure A1 forms the clusters from the dataset by searching the neighborhood of each data point in the entire dataset. When the expansion of a cluster around a core point is exhausted, the next cluster is formed taking a subsequent unvisited data point as its center. Data points which cannot be assigned to any cluster form the noise.

Procedure A1: DBSCAN
Input: Dataset D = {x1, x2, …, xn}; Eps; MinPts
Output: Clusters y(j)
Steps:
  Initialize j = 0
  For each data point xi (i = 1 to n) in dataset D
    Mark xi as visited
    NeighborPts = all points within the Eps neighborhood of xi
    If sizeof(NeighborPts) < MinPts
      Mark xi as NOISE
    Else
      j = j + 1
      fillCluster(xi, NeighborPts, j, Eps, MinPts)
End Procedure

Procedure A2 assigns the points which are reachable from a cluster center as elements of the respective cluster.

Procedure A2: fillCluster(xi, NeighborPts, j, Eps, MinPts)
Input: Data point xi; neighboring points NeighborPts; cluster number j; Eps; MinPts
Output: Labeled dataset p(j)(i), the jth cluster filled with its member data points
Steps:
  Add data point xi to p(j)(i)
  For each point k in NeighborPts
    If k is not visited
      Mark k as visited
      NeighborPts2 = all points within the Eps neighborhood of k
      If sizeof(NeighborPts2) ≥ MinPts
        NeighborPts = NeighborPts concatenated with NeighborPts2
    If k is not yet a member of any cluster
      Add k to cluster p(j)(k)
End Procedure

Procedure A3 helps to determine the optimal value of Eps, which is an essential step for determining the number of clusters produced by the DBSCAN algorithm.

Procedure A3: determineEps(xi)
Input: Data points xi
Output: Eps for each varied density
Steps:
  For each data point xi
    Find the distance d(i, j) to every other point xj
    Keep the minimum of these distances
  End for
  Sort the distances in ascending order and plot them
  Eps corresponds to the point of most critical change in the curve
End Procedure
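To make this workflow concrete, the following is a minimal Python sketch of the DBSCAN step, assuming scikit-learn is available. The synthetic feature matrix, the MinPts value, and the simple quantile-based stand-in for the visual "elbow" of the sorted k-distance curve of Procedure A3 are illustrative assumptions, not the exact configuration used in this study.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((500, 2))  # stand-in for per-stock (return, volatility) pairs

# Eps heuristic in the spirit of Procedure A3: sort each point's distance
# to its MinPts-th nearest neighbor and pick a value near the bend of the curve.
min_pts = 4
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])
eps = k_dist[int(0.95 * len(k_dist))]  # crude stand-in for the visual "elbow"

labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))
```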

Expectation Maximization

The Expectation Maximization (EM) algorithm was originally proposed to find the maximum likelihood estimates of parameters in probabilistic paradigms (Dempster, Laird, and Rubin Citation1977). The two principal steps adopted here are expectation (E) and maximization (M). These two steps are used to discover the variables which are latent in the available observed data (Kurban, Jenne, and Dalkilic Citation2017). The EM algorithm can be used to construct statistical models from huge datasets (Fayyad, Bradley, and Reina Citation2001).

In order to achieve the required convergence while clustering large datasets, the EM approach uses various Gaussian-based models (Verma, Dwivedi, and Sevakula Citation2015). For the initial parameters describing each cluster k, namely the mean μk, variance σk and size πk, the joint probability distribution θ(x) of a data point x can be described by

(1) $\theta(x) = \sum_{k} \pi_k \, N(x;\, \mu_k, \sigma_k)$

where N is the Gaussian probability density function (Aldershof et al. Citation1995). Then the sample x can initially be assigned to the parametric model, in the present case a particular cluster, as

(2) $\theta(x \mid z = k) = N(x;\, \mu_k, \sigma_k) \equiv f_{\theta_k}(x)$

where z is the latent variable for observing x. The presence of the unknown value of z helps to explain the patterns in the value of x, for example, groups or clusters.

EM then proceeds iteratively in two steps. In the first step, the expectation or E-step, given the initial Gaussian parameters mean μk and size πk, the responsibility γi,k of the kth cluster for the ith data point xi is computed as

(3) $\gamma_{i,k} = \dfrac{\pi_k f_{\theta_k}(x_i)}{\sum_{k'=1}^{K} \pi_{k'} f_{\theta_{k'}}(x_i)}, \quad \forall i \in \{1, 2, \ldots, n\},\ \forall k \in \{1, 2, \ldots, K\}$

where fθk(xi) is the initial probability that the point xi belongs to the cluster k.

Equation (3) measures the relative probability of data point xi belonging to cluster k. If xi is very likely under the kth Gaussian, it gets a higher weight. The denominator ensures that the γ's sum to one.

In the second step of EM, the maximization or M-step, the parameters mean μ and size π are updated for the fixed assigned responsibilities γi,k. For each cluster k, its parameters are updated using estimates weighted by γi,k, as given by Equations (4) and (5).

(4) $\pi = \underset{\pi :\, \sum_k \pi_k = 1}{\arg\max}\ \sum_{k=1}^{K} \log[\pi_k] \sum_{i=1}^{n} \gamma_{i,k}$

(5) $\theta_k = \underset{\theta}{\arg\max}\ \sum_{i=1}^{n} \gamma_{i,k} \log f_{\theta_k}(x_i), \quad \forall k \in \{1, 2, \ldots, K\}$

Equation (4) gives the updated cluster weights, while Equation (5) gives the weighted parameter estimates (mean and covariance) of the assigned data. If xi is a strong member of cluster k, the value of its assigned weight will be close to 1.

Procedure B provides the pseudo-code for the two fundamental functions of the EM approach. The first step tries to find the missing variables and the second step tries to optimize the estimates of the parameters used in the procedure so that it describes the data better.

Procedure B: Expectation Maximization
Input: Data points D = {x1, x2, …, xn}; parametric model fθ
Output: Clusters Θt
Steps:
  Choose an initial distribution π0 in the probability simplex and pick a parameter θ0 randomly in its state space
  Denote Θ0 = (π0, θ0)
  For t = 1, 2, … (until convergence)
    E-step: for each observation i = 1 to n, compute the expected responsibilities γi,k(t) as per Equation (3)
    M-step: update the parameters Θt = (πt, θt) as per Equations (4) and (5)
  Return the limiting values in Θt
End Procedure
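As an illustration, the E- and M-steps of Procedure B are run internally by scikit-learn's Gaussian mixture model. The following hedged sketch shows how such a clustering could be performed; the synthetic data and the number of components are illustrative assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.random((500, 2))                  # stand-in (return, volatility) pairs

# GaussianMixture alternates the E-step (responsibilities, Equation (3))
# and the M-step (weight/parameter updates, Equations (4)-(5)) until convergence.
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      max_iter=200, random_state=0).fit(X)

labels = gmm.predict(X)                   # hard cluster assignments
responsibilities = gmm.predict_proba(X)   # gamma_{i,k} from the final E-step
print("mixing proportions pi_k:", gmm.weights_)
```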

Weighted Adaptive Mean Shift Clustering

The objective of the original mean shift clustering (MSC) methodology (Fukunaga and Hostetler Citation1975) was to determine clusters by iteratively updating the candidate centroids. The basis for this algorithm is the kernel density estimation function (Weglarczyk Citation2018), which helps to place each point in its respective cluster. The performance of traditional MSC degrades as the dimensionality of the data increases (Hyrien and Baran Citation2016).

To address this limitation, Weighted Adaptive Mean Shift Clustering (WAMSC) was proposed (Ren et al. Citation2014). This is basically an extension of the MSC algorithm. The adaptation of weights is done by approximating the importance of the features for each data point xi and then computing the distance between xi and the remaining data points in the resulting subset x of important features (Ren et al. Citation2014). Given a subspace weight vector w, the distance of a point xt from the rest of the points in x is calculated as

(6) $D_{w}(x_t, x) = \sum_{l=1}^{d} w_l \dfrac{|x_{tl} - x_l|}{s_l}$

where sl is the average lth attribute distance of all pairs of points in x, which is given by

(7) $s_l = \dfrac{1}{n^2} \sum_{i} \sum_{j} |x_{il} - x_{jl}|$

Equation 8 gives the values of the density estimation points

(8) $y_{t+1} = \dfrac{\sum_{i=1}^{n} \dfrac{x_i}{h_i^{d+2}}\, g\!\left(\left(\dfrac{D_{w}(x_i, y_t)}{h_i}\right)^{2}\right)}{\sum_{i=1}^{n} \dfrac{1}{h_i^{d+2}}\, g\!\left(\left(\dfrac{D_{w}(x_i, y_t)}{h_i}\right)^{2}\right)}$

where $h_i$ denotes the distance between $x_i$ and its kth nearest neighbor and $g(x) = -D'(x)$, provided that the derivative of D exists.

The corresponding weighted mean shift vector is evaluated using Equation 9,

(9) $m(y_t) = \dfrac{\sum_{i=1}^{n} \dfrac{x_i}{h_i^{d+2}}\, g\!\left(\left(\dfrac{D_{w}(x_i, y_t)}{h_i}\right)^{2}\right)}{\sum_{i=1}^{n} \dfrac{1}{h_i^{d+2}}\, g\!\left(\left(\dfrac{D_{w}(x_i, y_t)}{h_i}\right)^{2}\right)} - y_t$

Equation (9) ensures that this weighted mean shift algorithm performs effectively when dealing with high-dimensional data.

Procedure C outlines the steps of WAMSC. It performs clustering by identifying the relevant subspace and then incorporating this information within the mean shift vector. This results in a methodology capable of enhancing the clustering speed without compromising accuracy.

Procedure C: Weighted Adaptive Mean Shift Clustering
Input: Data points D = {x1, x2, …, xn}
Output: Cluster centers y(t)
Steps:
  Initialize i = 0, t = 1 and y0 ← xi
  For t = 1, 2, … (until convergence)
    Calculate the density estimation point yt+1 as per Equation (8)
    Calculate the corresponding mean shift vector m(yt) as per Equation (9)
  Return y(t)
End Procedure
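WAMSC itself is not available in standard libraries. As a rough stand-in, the sketch below runs classical mean shift from scikit-learn, which iterates the same kind of kernel-weighted mean update as Equation (8) but without the per-feature weights of Equation (6); the data and bandwidth settings are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
X = rng.random((500, 2))                          # stand-in feature matrix

bandwidth = estimate_bandwidth(X, quantile=0.2)   # data-driven bandwidth, akin to h_i
ms = MeanShift(bandwidth=bandwidth).fit(X)        # iterates the mean shift update
print("number of clusters found:", len(ms.cluster_centers_))
```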

Dynamic Quantum Clustering

The Quantum clustering approach was introduced by Horn and Gottlieb (Citation2002). This technique is based on the fundamentals of quantum mechanics. The algorithm calculates the potential function from the Schrödinger equation and then places each data point in its respective cluster using a gradient descent mechanism (Horn and Gottlieb Citation2002). The drawback of this algorithm is that it becomes computationally expensive as the volume of the data increases.

To overcome this limitation, the Dynamic Quantum Clustering (DQC) algorithm was proposed by Weinstein and Horn (Citation2009). This approach is effective in handling large datasets. The DQC algorithm uses the time-dependent part of the Schrödinger equation to allocate data points to their corresponding clusters, making it suitable for clustering big data (Weinstein et al. Citation2013).

The algorithm starts by assigning to the n data points a sum of Gaussians ψ of width σ, centered at each data point xi:

(10) $\psi(x) = \sum_{i=1}^{n} e^{-\frac{\|x - x_i\|^2}{2\sigma^2}}$

Now, from the Gaussian kernel ψ assigned to the data points and the width σ, the potential function V(x) is calculated as (Deodhar et al. Citation2008):

(11) $V(x) = \dfrac{1}{2\sigma^2 \psi} \sum_{i=1}^{n} \|x - x_i\|^2\, e^{-\frac{\|x - x_i\|^2}{2\sigma^2}}$

The cluster centers for the set of points xi are then determined from the local minima of the potential function V(x) (Deodhar et al. Citation2008). If ψ(xi) is the Gaussian localized around the ith data point xi, then the relation of xi to its respective cluster center is derived using the following time-dependent trajectory:

(12) $\langle x_i \rangle = \langle \psi(x_i) \mid x \mid \psi(x_i) \rangle$

If ψ(xi) follows a narrow Gaussian, then it can be ascertained that each data point moves toward the nearest minimum of the potential. The main advantage of this methodology is that it is capable of finding the connections among closely related data points even in huge spaces.

Procedure D describes the dynamic quantum clustering approach, which can be used for clustering big data to find associated groups within it. The parameter σ regulates the proximity of the data points in the respective clusters. The smaller the value of σ, the higher the number of clusters and the fewer the instances in each individual cluster. For significantly smaller values of σ, all the instances are grouped into their own distinct clusters. On the other hand, for larger values of σ, all the instances tend to fall in the same cluster.

Procedure D: Dynamic Quantum Clustering
Input: Data points D = {x1, x2, …, xn}
Output: Cluster centers y(i)
Steps:
  n = the number of data points
  Calculate the Gaussian ψ(x) as per Equation (10)
  For each data point i = 1 to n
    Calculate the potential function V(xi) as per Equation (11)
  For i = 1 to n
    Assign the data point to its cluster as per Equation (12)
  Return y(i)
End Procedure
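For illustration, the following numpy sketch evaluates Equations (10) and (11): the wave function ψ and the potential V whose local minima act as cluster centers. The data and σ are illustrative assumptions, and the full DQC evolution under the time-dependent Schrödinger equation is omitted here.

```python
import numpy as np

def potential(x, data, sigma):
    """Evaluate psi (Equation (10)) and V (Equation (11)) at point x."""
    d2 = np.sum((data - x) ** 2, axis=1)      # squared distances ||x - x_i||^2
    g = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian terms of Equation (10)
    psi = g.sum()
    V = d2.dot(g) / (2.0 * sigma ** 2 * psi)  # Equation (11)
    return psi, V

rng = np.random.default_rng(0)
data = rng.random((200, 2))                   # stand-in (return, volatility) pairs
sigma = 0.1                                   # width parameter; illustrative choice

# Points with the lowest potential sit near the local minima that act as
# cluster centers; the time evolution toward them is omitted in this sketch.
V_vals = np.array([potential(x, data, sigma)[1] for x in data])
candidate_centers = data[np.argsort(V_vals)[:5]]
```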

Dataset Description

Here we have used the historical time series data of the S&P 500 obtained from the Quandl stock exchange data (https://www.quandl.com/) (accessed on 4th January 2017). The S&P 500 (https://fred.stlouisfed.org/series/SP500) (accessed on 5th April 2018) contains the most prevailing stocks traded on American stock exchanges. In this investigation, the period of stock market data under consideration spanned from 4th January 2016 to 4th January 2018.

The dataset contains the day-to-day properties of the stock quotes, consisting of:

  • Date of the complete stock exchange time period

  • The starting price at the opening of the time period

  • Maximum price reached during the time period

  • Minimum price attained during the time period

  • Ending price of the given time period

  • Total number of transactions during this time period

  • Adjusted closing price

The fundamental format of the dataset is given in Table 2. A prominent discriminating characteristic of this data is that the ordering of the records matters, on account of their appended time tags.

Table 2. Structure of the stock market dataset.

In the domain of the stock market, volatility (Day and Lewis Citation1992) describes the fluctuations in the prices of the stocks over a specific time span. Volatility is determined by computing the standard deviation of the prices for the time period under consideration. From the dataset described in Table 2, we first calculate the returns of each stock using the standard method of dividing each closing price by the closing price of the day before and subtracting one (Fan and Yao Citation2017). The standard deviation of all the returns of a particular stock over the considered time frame is then computed as the volatility of that stock.
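As a concrete illustration of this feature construction, the following pandas sketch computes the daily returns and volatility per stock; the synthetic price table and the column names are illustrative assumptions, not the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical price table: rows = trading days, columns = tickers.
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    100 * np.cumprod(1 + rng.normal(0, 0.01, size=(504, 3)), axis=0),
    columns=["AAA", "BBB", "CCC"],
)

returns = prices / prices.shift(1) - 1     # daily return: p_t / p_{t-1} - 1
features = pd.DataFrame({
    "return": returns.mean(),              # average historic return per stock
    "volatility": returns.std(),           # Equation (13): std dev of the returns
})
print(features)
```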

Experimental Evaluation

The aim of this work is to cluster companies of the same business stature using density-based clustering methods, based on the data available from the stock market and their market movement. The two important steps in this regard are the assessment of the market movement and the determination of the number of clusters, which are discussed in subsections 4.2 and 4.1, respectively. The experiment is discussed briefly in subsection 4.3.

Determination of Number of Clusters

Since all the clustering methods applied are density based, the number of clusters is automatically determined by the respective algorithms depending upon their parameter values, as described below.

The number of clusters determined by the DBSCAN algorithm depends on the value of Eps, which is determined by the approach given by Rahmah and Sitanggang (Citation2016) and explained in Procedure A3. Based on the average optimal value of Eps, the algorithm produced 4 clusters with outliers at a few points. The number of clusters in the case of the EM clustering algorithm is given by the limiting values of θt as in Equation (5), which is found to be 3. The modes yt+1 calculated by WAMSC as given in Equation (8) give the number of clusters, which is computed to be 3. Finally, for DQC the number of clusters determined by Equation (11) is found to be 5.

For the partitive clustering algorithms, when applied to the stock market dataset considered in this study, the optimal number of clusters is computed to be 4 for the k-means technique described in Nanda, Mahanty, and Tiwari (Citation2010), 6 for k-medoids (Nakagawa, Imamura, and Yoshida Citation2019), and 5 for the fuzzy c-means technique described in Esfahanipour and Aghamiri (Citation2010).

Assessment of the Market Movement

For experimentation purposes, the pricing data for the 3000 stocks from the S&P 500 has been taken into consideration. The stocks are grouped into distinct clusters depending upon their historic returns and volatilities. As a time tag is attached to each observation in the dataset, the clusters are derived depending on volatility (Day and Lewis Citation1992), which basically captures the variations of the stock prices with respect to time.

If rx is the average rate of return of a stock x, then volatility of x is defined as

(13) $\mathrm{volatility}_x = \mathrm{StdDev}(r_x)$

where StdDev(rx) is the standard deviation of the return of x, which basically represents the variation in the historic returns of a stock (Day and Lewis Citation1992).

Experimentation

Thus each stock from the S&P 500 is assigned to its corresponding cluster based on its returns and volatility characteristics. Stocks belonging to clusters having high volatility are chosen as investment choices. This is based on the fundamental rule that the performance of stocks with high volatility is significantly higher than that of stocks with low volatility (Pandey and Sehgal Citation2017).

From the cluster having high volatility, two of the stocks were chosen as investment choices. For evaluation of the results, some of the most commonly used metrics for quantifying clustering applications were used for the two chosen stocks. These validity measures include Normalized Mutual Information (NMI) (Estévez et al. Citation2009), Adjusted Rand Index (ARI) (Yeung and Ruzzo Citation2001), Jaccard Index (JI) (Halkidi, Batistakis, and Vazirgiannis Citation2001) and the Fowlkes-Mallows Index (FM) (Campello Citation2007). The details of these performance measures are discussed in subsection 4.4.

In order to frame the trading strategies, two of the stock companies from the cluster having high volatility were chosen and the concept of moving averages was applied to them. Moving averages help in the identification of stock trends. The moving average (Chang et al. Citation2018) is a technical indicator generally used for computing the future trend of a stock based on its past prices. It is calculated by averaging the closing prices over the one-year period. The moving average values thus calculated are cascaded to create a single flowing line depicting the market trend.

Validity Measures

Normalized Mutual Information

Normalized mutual information (NMI) measures the efficiency of clustering following the principle of entropy-based measures (McDaid, Greene, and Hurley Citation2011). Entropy is useful for evaluating the orderliness of the information in the partitions.

Entropy of a clustering, C, is measured as:

(14) $H(C) = -\sum_{i=1}^{r} p_{C_i} \log p_{C_i}$

where r is the total number of clusters obtained and pCi is the probability of the ith cluster (Ci). pCi is calculated as:

(15) $p_{C_i} = \dfrac{n_i}{n}$

which is basically the number of instances ni in the ith cluster divided by the total number of instances n.

Then entropy of the actual partitioning of the dataset, T, is defined in a similar way:

(16) $H(T) = -\sum_{j=1}^{k} p_{T_j} \log p_{T_j}$

where k is the actual number of clusters.

Here pTj is the probability of the actual jth partition (Tj). pTj is calculated as:

(17) $p_{T_j} = \dfrac{n_j}{n}$

where nj is the number of instances in the actual jth partition and n is the total number of instances.

Exploiting the idea of entropy-based measures, the mutual information I(C,T) quantifies the information shared between the obtained clustering C and the actual partitioning T using the following equation:

(18) $I(C,T) = \sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log \dfrac{p_{ij}}{p_{C_i}\, p_{T_j}}$

where r is the number of clusters obtained and k is the number of actual clusters. pij measures the observed joint probability of C and T.

The normalized mutual information NMI(C,T), which measures how close the obtained outcome is to the actual knowledge, is calculated as:

(19) $NMI(C,T) = \dfrac{I(C,T)}{\sqrt{H(C) \cdot H(T)}}$

Equation (19) returns the normalized mutual information, which is a value ranging between 0.0 and 1.0. NMI(C,T) values close to 1 indicate that the cluster labeling is accurate.

Adjusted Rand Index

The Adjusted Rand Index (ARI) gives an estimate of similarity of the two clustering assignments for the predicted and actual values for the entire set of data points (Yeung and Ruzzo Citation2001).

Let U = {u1, …, uR} be the set of actual clusters and V = {v1, …, vC} be the set of clusters obtained for a set of n objects S = {o1, …, on}.

If nij is the number of objects present in both the actual cluster ui and the obtained cluster vj, and the numbers of objects in the actual cluster ui and obtained cluster vj are ni and nj respectively, then the Adjusted Rand Index is calculated as:

(20) $ARI = \dfrac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{n_i}{2} \sum_j \binom{n_j}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{n_i}{2} + \sum_j \binom{n_j}{2}\right] - \left[\sum_i \binom{n_i}{2} \sum_j \binom{n_j}{2}\right] \Big/ \binom{n}{2}}$

where $\binom{n}{r}$ represents the number of combinations of r objects from n.

Equation (20) gives the adjusted Rand index, which returns a value between −1.0 and 1.0. A value of 1.0 means a perfect match between the clustering labels, values near 0 indicate random labeling, and negative values indicate labeling worse than chance.

Jaccard Index

To calculate the consistency between the results of actual and obtained clustering, the Jaccard Index (JI) (Yang Citation2016) is calculated as

(21) $JI = \dfrac{\sum_{ij} \binom{n_{ij}}{2}}{\sum_i \binom{n_i}{2} + \sum_j \binom{n_j}{2} - \sum_{ij} \binom{n_{ij}}{2}}$

where nij is the number of objects in both the actual cluster ui and the obtained cluster vj, and the numbers of objects in cluster ui and cluster vj are ni and nj, respectively.

Equation (21) returns the Jaccard index, which measures the similarity between the two sets. The range of this index is between 0.0 and 1.0; the greater the value of this indicator, the higher the similarity between the actual and obtained results.

Fowlkes Mallows Index

The Fowlkes-Mallows (FM) index estimates the efficiency of the clustering algorithm. Here the second clustering is assumed to be the ground-truth (Wagner and Wagner Citation2007).

For n instances and k clusters in the clusterings A and B, the Fowlkes-Mallows index is calculated as:

(22) $FM_k = \dfrac{T_k}{\sqrt{P_k Q_k}}$

where

$T_k = \sum_{i=1}^{k} \sum_{j=1}^{k} m_{ij}^2 - n, \qquad P_k = \sum_{i=1}^{k} \left(\sum_{j=1}^{k} m_{ij}\right)^{2} - n, \qquad Q_k = \sum_{j=1}^{k} \left(\sum_{i=1}^{k} m_{ij}\right)^{2} - n$

and mij is the (i, j)th entry of the k×k matrix constructed from the number of instances that lie in the ith cluster of clustering algorithm A and the jth cluster of clustering algorithm B.

In simpler terms, Fowlkes-Mallows index can also be expressed as:

(23) $FM = \dfrac{TP}{\sqrt{(TP + FP)(TP + FN)}}$

where

TP denotes the count of true-positive cases

FP denotes the count of false-positive cases

FN denotes the count of false-negative cases

The value produced by Equation (23) ranges from 0.0 to 1.0. A higher FM score indicates conformity between the ground truth and the obtained results.
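For illustration, the first three of these measures are available directly in scikit-learn, and the pair-counting Jaccard index of Equation (21) can be computed by hand from a contingency table; the label vectors below are illustrative placeholders, not results from this study.

```python
import numpy as np
from scipy.special import comb
from sklearn.metrics import (adjusted_rand_score, fowlkes_mallows_score,
                             normalized_mutual_info_score)

truth = np.array([0, 0, 1, 1, 2, 2, 2])   # placeholder ground-truth labels
pred = np.array([0, 0, 1, 2, 2, 2, 2])    # placeholder obtained labels

print("NMI:", normalized_mutual_info_score(truth, pred,
                                           average_method="geometric"))  # Eq. (19)
print("ARI:", adjusted_rand_score(truth, pred))                          # Eq. (20)
print("FM :", fowlkes_mallows_score(truth, pred))                        # Eq. (23)

# Pair-counting Jaccard index, Equation (21), from the contingency table.
cont = np.zeros((truth.max() + 1, pred.max() + 1))
for t, p in zip(truth, pred):
    cont[t, p] += 1
a = comb(cont, 2).sum()                    # pairs grouped together in both labelings
b = comb(cont.sum(axis=1), 2).sum() - a    # pairs together in the truth only
c = comb(cont.sum(axis=0), 2).sum() - a    # pairs together in the prediction only
print("JI :", a / (a + b + c))
```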

Results and Discussion

The four density-based clustering approaches, namely density-based spatial clustering of applications with noise (DBSCAN), Expectation Maximization (EM), Weighted Adaptive Mean Shift Clustering (WAMSC) and Dynamic Quantum Clustering (DQC), were applied to the time series data. This was found to be very effective in creating portfolios comprising stocks with diverse properties. Figure 1 illustrates the results of clustering the stocks into different similarity groups (represented by different colors) on the basis of their returns and volatilities using the four clustering algorithms. The experimental results suggest that the algorithms can detect the noisy points/outliers correctly, as shown in Figure 1, where the black dots represent the outliers which are situated far away.

Figure 1. Results of clustering of stocks into different similarity groups on the basis of their returns and volatilities formed by (a) DBSCAN (b) EM (c) WAMSC (d) DQC algorithm.


The models are used for executing statistically informed trade decisions. It is clear from the figures that the results obtained from DBSCAN and EM are somewhat skewed, which makes it quite difficult to infer a logical explanation of the market. On the contrary, WAMSC and DQC show stock clusters with both high and low coverage.

From the clusters generated by the Dynamic Quantum clustering approach based on the returns and volatilities, we identified the stocks belonging to the cluster with the highest volatility, the reason being that the greater the volatility, the greater the potential return. From this pool of stocks, the two stocks selected in the present case were Apple (AAPL) and Microsoft (MSFT).

Table 3 summarizes the performance of DBSCAN, expectation maximization (EM), weighted adaptive mean shift clustering (WAMSC) and dynamic quantum clustering (DQC) on the stock returns of AAPL and MSFT based on four performance metrics, namely NMI, ARI, JI and FM. The results are also compared with counterpart techniques from the literature, namely k-means (Nanda, Mahanty, and Tiwari Citation2010), k-medoids (Nakagawa, Imamura, and Yoshida Citation2019) and fuzzy c-means (Esfahanipour and Aghamiri Citation2010), applied to the same stock market dataset. The standard deviations of the accuracies obtained from 10 simulations are also shown using the ± sign corresponding to each percentage accuracy in Table 3. From the table, it is evident that in terms of all the metrics DQC outperformed all the other compared methods.

Table 3. Summary of the average experimental results (in term of accuracy NMI, ARI, JI and FM) on the two stock market dataset, namely, AAPL and MSFT.

Figures 2 and 3 depict the box plots (Cox Citation1987) of the percentage accuracy, NMI, ARI, JI and FM achieved by the different methods over 10 simulations on the two stock market datasets, AAPL and MSFT, respectively. From the figures, it is observed that the median values obtained by DQC are significantly higher than those of the other methods considered. This implies that the Dynamic Quantum Clustering approach performs better than the other approaches in terms of the different evaluation indices.

Figure 2. Box plots of (a) %Accuracy, (b) NMI, (c) ARI, (d) JI and (e) FM index obtained using different methods, namely, DQC, DBSCAN, EM and WAMSC performed on AAPL dataset.


Figure 3. Box plots of (a) %Accuracy, (b) NMI, (c) ARI, (d) JI and (e) FM index obtained using different methods, namely, DQC, DBSCAN, EM and WAMSC performed on MSFT dataset.


Statistical Significance Test

From Table 3, it is seen that the performance of the dynamic quantum approach on stock market data analysis is better than that of the other density-based approaches for big data such as stock market data. The results are statistically verified by performing paired t-tests (Cox Citation1987) of DQC versus the other counterpart methods for the two stocks AAPL and MSFT at the 5% level of significance. The results of the paired t-tests in terms of p-values for AAPL and MSFT are given in Tables 4 and 5, respectively. Results are treated as noteworthy when the p-values are less than or equal to .05 (at the 5% significance level), which are represented in bold font. In summary, out of a total of 30 paired tests shown in Tables 4 and 5, in 27 cases DQC achieved statistically significant improvements compared to its counterpart techniques. In the remaining 3 cases (shown in normal font in the tables), DQC achieved better results than the other methods, but the improvements are not statistically significant.
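As an illustration of this testing procedure, the following sketch runs a paired t-test with scipy; the two accuracy vectors over 10 simulations are illustrative placeholders, not the values reported in the tables.

```python
import numpy as np
from scipy.stats import ttest_rel

# Placeholder accuracy vectors over 10 simulations (not the paper's values).
dqc_acc = np.array([0.91, 0.90, 0.92, 0.89, 0.93, 0.90, 0.92, 0.91, 0.90, 0.92])
other_acc = np.array([0.85, 0.84, 0.86, 0.83, 0.87, 0.85, 0.84, 0.86, 0.85, 0.84])

t_stat, p_value = ttest_rel(dqc_acc, other_acc)   # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant at 5%: {p_value <= 0.05}")
```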

Table 4. Paired t-test results in terms of p score values performed by the best performing method DQC versus other methods for AAPL dataset.

Table 5. Paired t-test results in terms of p score values performed by the best performing method DQC versus other methods for MSFT dataset.

Implication of Results

While building portfolios in the stock market, diversification is a major step, as investing in all the stocks may not be possible due to financial constraints. Also, not all stocks are worth investing in, and thus an efficient plan is required to choose only the stocks that behave differently. This strategy helps in minimizing risk by selecting stocks from different sectors, but in a more data-driven way. The clustering algorithms applied here identify the different clusters of stocks from which the best performing stocks can be chosen.

In this work, the stocks have been divided into distinct groups based upon their returns and volatilities. Dividing stocks into groups with similar characteristics can help in portfolio construction by ensuring that the chosen stocks have sufficient diversification between them. Such diversification prevents the portfolio from suffering greatly from a downturn of a specific type of company during a period. Volatility clustering also has critical implications in financial risk management (Kim and Song Citation2020).

Effectiveness of the Density-Based Clustering Approaches

To justify the performance of the various density-based clustering approaches for identifying stock trends, we applied the concept of moving averages (Chang et al. Citation2018) to the stocks AAPL and MSFT, obtained from the same cluster. The Simple Moving Average (SMA) refers to a stock's average closing price over a specified period (Fifield, Power, and Knipe Citation2008).

The SMA is calculated by taking the arithmetic mean of a given set of values over a specified period (Metghalchi, Marcucci, and Chang Citation2012): a set of stock prices is added together and then divided by the number of prices in the set. The formula for calculating the simple moving average of a stock is as follows:

(24) $SMA = \dfrac{A_1 + A_2 + \cdots + A_n}{n}$

where Ai is the stock price in period i and n is the number of time periods. Moving averages are calculated to identify the trend direction of a stock.
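As a small illustration of Equation (24), the following pandas sketch computes a rolling SMA over a synthetic closing-price series; the series and the 50-day window are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
close = pd.Series(100 + np.cumsum(rng.normal(0, 1, 252)))  # stand-in closing prices

sma_50 = close.rolling(window=50).mean()   # Equation (24) over a 50-day window

# A common (assumed) trend reading: price above its SMA suggests an uptrend.
uptrend = close > sma_50
```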

To visualize the clustering results, the stocks with the highest volatility values are plotted in Figure 4, with the historical returns on the vertical axis against the dates on the horizontal axis. The historic return is basically the difference between the current price of a stock and the previous day's closing price (Fan and Yao Citation2017). Similarly, the historical returns of the stocks from the other clusters can also be visualized.

Figure 4. Historical correlation between the stocks AAPL and MSFT from the same cluster.


Qualitative Analysis of the Results

The results from the experimental evaluation imply that density-based clustering approaches, particularly Dynamic Quantum clustering, are quite effective in creating portfolios from stock market data. From the portfolios thus created, stocks showing different behavior can be selected for planning trading strategies. Results from the statistical significance tests confirm the validity of the Dynamic Quantum clustering approach. It better models the time series stock market data, which results in better stock prediction accuracy and thus provides a useful tool for predicting profitable trades.

Conclusion

Stock market prediction is an interesting topic which uses methodologies related to the extraction of knowledge from big data. With this objective in mind, the present empirical study analyzes the applicability of different density-based clustering techniques for stock market data analysis. Under this category, Dynamic Quantum clustering (DQC), Density-based spatial clustering of applications with noise (DBSCAN), Expectation Maximization (EM) and Weighted Adaptive Mean Shift Clustering (WAMSC) were used. The stock market data used was the S&P 500 trading data for a time span of two years. For comparison purposes, the partitive clustering methods k-means, k-medoids and fuzzy c-means were also implemented on the same stock market data.

The effectiveness of the techniques was validated on two stocks, AAPL and MSFT, in terms of different validity measures, namely NMI, ARI, the Jaccard index and the FM index. From the performance measures, DQC is found to be superior to the other density-based approaches as well as the partitive approaches. The statistical significance of the better performance accomplished by the DQC method in contrast to the other methods is confirmed by the paired t-test results for most of the evaluation indices.

The box plots of the results for the two stocks AAPL and MSFT also show better median values of the evaluation measures for the DQC method in comparison to its counterpart methods, indicating the superiority of DQC over its competitive techniques in stock market data analysis and prediction.

When the concept of the moving average is applied to these stocks, it acts as a technical indicator of price trends and helps in forecasting next-day stock limits. The forecasting ability of DQC turns out to be much better than that of its competitors. Thus it can be inferred with sufficient evidence that DQC is one of the best tools for stock market data analysis. Since stock market data falls under the category of big data, it can be safely inferred that DQC is a suitable tool for other varieties of big data analysis as well.

As the clustering approaches considered here are not limited to spherical shapes but are based on density, they can handle arbitrarily shaped clusters (both concave and convex) and, in turn, handle noisy data quite effectively. However, as the data size increases, it takes a toll on the running time of the algorithms. So, in the future, we plan to increase the efficiency of the algorithms by decreasing the processing time of the data. Also, trend detection with the help of moving averages works fairly well during smooth conditions, but the drawback is that if the prices of the stocks become uneven, multiple trade signals are generated, which creates confusion. These issues can be addressed in future work.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

Publicly available stock market datasets are used in this article and therefore data are already publicly available.

References

  • Aldershof, B., J. Marron, B. Park, and M. Wand. 1995. Facts about the Gaussian probability density function. Applicable Analysis 59 (1–4):289–31. doi:10.1080/00036819508840406.
  • Alexander, L., S. R. Das, Z. Ives, H. Jagadish, and C. Monteleoni. 2017. Research challenges in financial data modeling and analysis. Big Data 5 (3):177–88. doi:10.1089/big.2016.0074.
  • Baser, P., and J. R. Saini. 2015. Agent based stock clustering for efficient portfolio management. International Journal of Computer Applications 975 (3):35–41. doi:10.5120/20317-2381.
  • Bhattacharjee, P., and P. Mitra. 2021. A survey of density based clustering algorithms. Frontiers of Computer Science 15 (1):1–27. doi:10.1007/s11704-019-9059-3.
  • Bini, B., and T. Mathew. 2016. Clustering and regression techniques for stock prediction. Procedia Technology 24:1248–55. doi:10.1016/j.protcy.2016.05.104.
  • Cai, F., N.-A. Le-Khac, and T. Kechadi. 2016. Clustering approaches for financial data analysis: A survey. arXiv preprint arXiv:1609.08520.
  • Campello, R. J. 2007. A fuzzy extension of the rand index and other related indexes for clustering and classification assessment. Pattern Recognition Letters 28 (7):833–41. doi:10.1016/j.patrec.2006.11.010.
  • Chandar, S. K. 2019. Stock market prediction using subtractive clustering for a neuro fuzzy hybrid approach. Cluster Computing 22 (6):13159–66. doi:10.1007/s10586-017-1321-6.
  • Chang, C.-L., J. Ilomäki, H. Laurila, and M. McAleer. 2018. Long run returns predictability and volatility with moving averages. Risks 6 (4):105. doi:10.3390/risks6040105.
  • Chaudhuri, T. D., and I. Ghosh. 2016. Using clustering method to understand Indian stock market volatility. arXiv preprint arXiv:1604.05015.
  • Cheong, D., Y. M. Kim, H. W. Byun, K. J. Oh, and T. Y. Kim. 2017. Using genetic algorithm to support clustering-based portfolio optimization by investor information. Applied Soft Computing 61:593–602. doi:10.1016/j.asoc.2017.08.042.
  • Chowdhury, H. A., D. K. Bhattacharyya, and J. K. Kalita. 2021. Uifdbc: Effective density based clustering to find clusters of arbitrary shapes without user input. Expert Systems with Applications 186:115746. doi:10.1016/j.eswa.2021.115746.
  • Cox, C. P. 1987. A handbook of introductory statistical methods. Washington DC: John Wiley & Sons.
  • Das, T., and G. Saha. 2019. Addressing big data issues using rnn based techniques. Journal of Information and Optimization Sciences 40 (8):1773–85. doi:10.1080/02522667.2019.1703268.
  • Data, I. B., and A. Hub. 2016. Extracting business value from the 4 v’s of big data. Retrieved July 19, 2017.
  • Day, T. E., and C. M. Lewis. 1992. Stock market volatility and the information content of stock index options. Journal of Econometrics 52 (1–2):267–87. doi:10.1016/0304-4076(92)90073-Z.
  • Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1):1–22. doi:10.1111/j.2517-6161.1977.tb01600.x.
  • Deodhar, M., J. Ghosh, G. Gupta, H. Cho, and I. Dhillon. 2008. Hunting for coherent co-clusters in high dimensional and noisy datasets. 2008 IEEE International Conference on Data Mining Workshops, Pisa, Italy, 654–63. IEEE.
  • Dinandra, R., G. Hertono, and B. Handari. 2019. Implementation of density-based spatial clustering of application with noise and genetic algorithm in portfolio optimization with constraint. AIP Conference Proceedings, Depok, Indonesia, 2168, 020026. AIP Publishing LLC.
  • D’Urso, P., L. De Giovanni, R. Massari, R. L. D’Ecclesia, and E. A. Maharaj. 2020. Cepstral-based clustering of financial time series. Expert Systems with Applications 161:113705. doi:10.1016/j.eswa.2020.113705.
  • Erdős, P., M. Ormos, and D. Zibriczky. 2011. Non-parametric and semi-parametric asset pricing. Economic Modelling 28 (3):1150–62. doi:10.1016/j.econmod.2010.12.008.
  • Esfahanipour, A., and W. Aghamiri. 2010. Adapted neuro-fuzzy inference system on indirect approach tsk fuzzy rule base for stock market analysis. Expert Systems with Applications 37 (7):4742–48. doi:10.1016/j.eswa.2009.11.020.
  • Ester, M., H.-P. Kriegel, J. Sander, X. Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd 96 (34):226–31.
  • Estévez, P. A., M. Tesmer, C. A. Perez, and J. M. Zurada. 2009. Normalized mutual information feature selection. IEEE Transactions on Neural Networks 20 (2):189–201. doi:10.1109/TNN.2008.2005601.
  • Fan, J., and Q. Yao. 2017. The elements of financial econometrics. Cambridge, United Kingdom: Cambridge University Press.
  • Fayyad, U., P. S. Bradley, and C. Reina. 2001. Scalable system for expectation maximization clustering of large databases. US Patent 6,263,337.
  • Fifield, S., D. Power, and D. Knipe. 2008. The performance of moving average rules in emerging stock markets. Applied Financial Economics 18 (19):1515–32. doi:10.1080/09603100701720302.
  • Fukunaga, K., and L. Hostetler. 1975. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21 (1):32–40. doi:10.1109/TIT.1975.1055330.
  • Gupta, V., V. Kumar, Y. Yuvraj, and M. Kumar. 2023. Optimized pair trading strategy using unsupervised machine learning. 2023 IEEE 8th International Conference for Convergence in Technology (I2CT), Lonavla, India, 1–5. IEEE.
  • Halkidi, M., Y. Batistakis, and M. Vazirgiannis. 2001. On clustering validation techniques. Journal of Intelligent Information Systems 17 (2–3):107–45. doi:10.1023/A:1012801612483.
  • Han, C., Z. He, and A. J. W. Toh. 2023. Pairs trading via unsupervised learning. European Journal of Operational Research 307 (2):929–47. doi:10.1016/j.ejor.2022.09.041.
  • Heiberger, R. H. 2014. Stock network stability in times of crisis. Physica A: Statistical Mechanics and Its Applications 393:376–81. doi:10.1016/j.physa.2013.08.053.
  • Horn, D., and A. Gottlieb. 2002. The method of quantum clustering. Advances in Neural Information Processing Systems 14:769–76.
  • Hyrien, O., and A. Baran. 2016. Fast nonparametric density-based clustering of large datasets using a stochastic approximation mean-shift algorithm. Journal of Computational and Graphical Statistics 25 (3):899–916. doi:10.1080/10618600.2015.1051625.
  • Iorio, C., G. Frasso, A. D’Ambrosio, and R. Siciliano. 2018. A p-spline based clustering approach for portfolio selection. Expert Systems with Applications 95:88–103. doi:10.1016/j.eswa.2017.11.031.
  • Kim, K., and J. W. Song. 2020. Analyses on volatility clustering in financial time-series using clustering indices, asymmetry, and visibility graph. IEEE Access 8:208779–95. doi:10.1109/ACCESS.2020.3037240.
  • Kurban, H., M. Jenne, and M. M. Dalkilic. 2017. Using data to build a better em: Em* for big data. International Journal of Data Science and Analytics 4 (2):83–97. doi:10.1007/s41060-017-0062-1.
  • Lee, A. J., M.-C. Lin, R.-T. Kao, and K.-T. Chen. 2010. An effective clustering approach to stock market prediction. PACIS 54:54.
  • Mahbobi, M., S. Kimiagari, and M. Vasudevan. 2021. Credit risk classification: An integrated predictive accuracy algorithm using artificial and deep neural networks. Annals of Operations Research 330 (1–2):609–37. doi:10.1007/s10479-021-04114-z.
  • Maldonado, R., and A. Saunders. 1981. International portfolio diversification and the inter-temporal stability of international stock market relationships, 1957-78. Financial Management 10 (4):54–63. doi:10.2307/3665219.
  • McDaid, A. F., D. Greene, and N. Hurley. 2011. Normalized mutual information to evaluate overlapping community finding algorithms. arXiv Preprint arXiv 1110:2515.
  • Metghalchi, M., J. Marcucci, and Y.-H. Chang. 2012. Are moving average trading rules profitable? evidence from the European stock markets. Applied Economics 44 (12):1539–59. doi:10.1080/00036846.2010.543084.
  • Nagy, L., and M. Ormos. 2018. Friendship of stock market indices: A cluster-based investigation of stock markets. Journal of Risk and Financial Management 11 (4):88. doi:10.3390/jrfm11040088.
  • Nair, B. B., P. S. Kumar, N. Sakthivel, and U. Vipin. 2017. Clustering stock price time series data to generate stock trading recommendations: An empirical study. Expert Systems with Applications 70:20–36. doi:10.1016/j.eswa.2016.11.002.
  • Nakagawa, K., M. Imamura, and K. Yoshida. 2019. Stock price prediction using k-medoids clustering with indexing dynamic time warping. Electronics and Communications in Japan 102 (2):3–8. doi:10.1002/ecj.12140.
  • Nanda, S., B. Mahanty, and M. Tiwari. 2010. Clustering Indian stock market data for portfolio management. Expert Systems with Applications 37 (12):8793–98. doi:10.1016/j.eswa.2010.06.026.
  • Pandey, A., and S. Sehgal. 2017. Volatility effect and the role of firm quality factor in returns: Evidence from the Indian stock market. IIMB Management Review 29 (1):18–28. doi:10.1016/j.iimb.2017.01.001.
  • Parvatha, L. S., D. N. V. Tarun, M. Yeswanth, and J. S. Kiran. 2023. Stock market prediction using sentiment analysis and incremental clustering approaches. 2023 9th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 1, 888–93. IEEE.
  • Prabhu, C., A. S. Chivukula, A. Mogadala, R. Ghosh, and L. J. Livingston. 2019. Big data analytics for financial services and banking. In Big data analytics: Systems, algorithms, applications, ed. Aneesh S. C., A. Mogadala, R. Ghosh, L. M. J. Livingston, C. S. R. Prabhu, and A. S. Chivukula, 249–56. Springer, Singapore: Springer.
  • Rahmah, N., and I. S. Sitanggang. 2016. Determination of optimal epsilon (eps) value on dbscan algorithm to clustering data on peatland hotspots in Sumatra. IOP conference series: earth and environmental science, Bogor, Indonesia, 31, 012012. IOP Publishing.
  • Rasekhschaffe, K. C., and R. C. Jones. 2019. Machine learning for stock selection. Financial Analysts Journal 75 (3):70–88. doi:10.1080/0015198X.2019.1596678.
  • Ren, Y., C. Domeniconi, G. Zhang, and G. Yu. 2014. A weighted adaptive mean shift clustering algorithm. Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, PA, USA, 794–802. SIAM.
  • Rukmi, A. M., A. Wahid. 2019. Role of clustering based on density to detect patterns of stock trading deviation. Journal of Physics: Conference Series, Surabaya, Indonesia, 1218, 012053. IOP Publishing.
  • Scitovski, R., and K. Sabo. 2020. Dbscan-like clustering method for various data densities. Pattern Analysis and Applications 23 (2):541–54. doi:10.1007/s10044-019-00809-z.
  • Sivarajah, U., M. M. Kamal, Z. Irani, and V. Weerakkody. 2017. Critical analysis of big data challenges and analytical methods. Journal of Business Research 70:263–86. doi:10.1016/j.jbusres.2016.08.001.
  • Song, D.-M., M. Tumminello, W.-X. Zhou, and R. N. Mantegna. 2011. Evolution of worldwide stock markets, correlation structure, and correlation-based graphs. Physical Review E 84 (2):026108. doi:10.1103/PhysRevE.84.026108.
  • Starczewski, A., and A. Cader. 2019. Determining the eps parameter of the dbscan algorithm. International Conference on Artificial Intelligence and Soft Computing, Chennai, India, 420–30. Springer.
  • Tan, P.-N., M. Steinbach, and V. Kumar. 2016. Introduction to data mining. Noida, Uttar Pradesh, India: Pearson Education India.
  • Verma, N. K., S. Dwivedi, and R. K. Sevakula. 2015. Expectation maximization algorithm made fast for large scale data. 2015 IEEE workshop on computational intelligence: Theories, applications and future directions (WCI), Kanpur, India, 1–7. IEEE.
  • Wagner, S., and D. Wagner. 2007. Comparing clusterings: An overview. Karlsruhe, Germany: Universität Karlsruhe, Fakultät für Informatik Karlsruhe.
  • Weglarczyk, S. 2018. Kernel density estimation and its application. ITM Web of Conferences, Aviemore, Scotland, 23, 00037. EDP Sciences.
  • Weinstein, M., and D. Horn. 2009. Dynamic quantum clustering: A method for visual exploration of structures in data. Physical Review E 80 (6):066117. doi:10.1103/PhysRevE.80.066117.
  • Weinstein, M., F. Meirer, A. Hume, P. Sciau, G. Shaked, R. Hofstetter, E. Persi, A. Mehta, and D. Horn. 2013. Analyzing big data with dynamic quantum clustering. arXiv Preprint arXiv 1310:2700.
  • Yang, Y. 2016. Temporal data mining via unsupervised ensemble learning. Amsterdam, Netherlands: Elsevier.
  • Yeung, K. Y., and W. L. Ruzzo. 2001. Details of the adjusted rand index and clustering algorithms, supplement to the paper an empirical study on principal component analysis for clustering gene expression data. Bioinformatics 17 (9):763–74. doi:10.1093/bioinformatics/17.9.763.
  • Yue, S.-H., P. Li, J.-D. Guo, and S.-G. Zhou. 2004. Using greedy algorithm: Dbscan revisited ii. Journal of Zhejiang University-Science A 5 (11):1405–12. doi:10.1631/jzus.2004.1405.
  • Zillner, S., T. Becker, R. Munné, K. Hussain, S. Rusitschka, H. Lippell, E. Curry, and A. Ojo. 2016. Big data-driven innovation in industrial sectors. In New horizons for a data-driven economy, ed. Cavanillas J. M., C. Edward, and W. Wolfgang, 169–78. Cham: Springer.