Research Article

Sentiment Analysis of Short Texts Using SVMs and VSMs-Based Multiclass Semantic Classification

Article: 2321555 | Received 29 Jul 2023, Accepted 09 Feb 2024, Published online: 14 Mar 2024

ABSTRACT

In this paper, a hybrid machine learning model is proposed that combines an Enhanced Vector Space Model (EVSM) with a Hybrid Support Vector Machine (HSVM) classifier. Social media text is first retrieved using the EVSM, which characterizes text content by mapping it into high-dimensional vector spaces that capture the relationships between words and their contextual meanings. Rigorous feature selection methods designate the texts for review, and a multiclass semantic classification algorithm, the HSVM classifier, performs the categorization; a decision tree algorithm is used alongside the SVM to refine the selection process. To improve sentiment analysis accuracy, sentiment dictionaries are not only applied but also extended with Stanford's GloVe tool, and weight-enhancing methods are introduced for processing the weights of salient terms. Sentiments are classified into positive, negative, and neutral categories. Notably, the achieved results demonstrate improved accuracy, attributed to the incorporation of an emotional sentiment enhancement factor for determining weights and to the use of sentiment dictionaries for word availability. The accuracy obtained is 92.78%, with a 91.33% positive sentiment rate and a 97.32% negative sentiment rate.

Introduction

Sentiment analysis, also called opinion mining, aims to extract an individual's attitudes from text. At present, social networking websites have become the most prominent source of shared opinions owing to their growing popularity (Karamoozian et al. Citation2022), so there is a persistent need to analyze sentiment on social media, as it attracts considerable attention from those attempting to capture people's feelings (Suresh Kumar et al. Citation2022). Among the various opinion-rich resources derived from social media, such as Twitter messages, Facebook messages, Instagram messages, and movie reviews, the latter are especially popular for sharing views on various concerns or on specific films (Shahabi et al. Citation2022). Large numbers of user-generated reviews about products or services are posted online every day. With sentiment analysis of such reviews from e-commerce or social websites, organizations can easily obtain timely feedback about the items they or their rivals offer directly from customers and make improvements accordingly (Juneja and Mitra Citation2021). This abundance of data is of great value to those who can use it: for organizations, it provides consumer feedback that helps improve products and uncover new market opportunities, and it can also be used to monitor public opinion and become acquainted with a wide range of ideas from the general public (Liu et al. Citation2023; Showrav et al. Citation2021). In short, the growing volume of data on the Internet is extremely valuable, and this volume also attracts researchers who attempt to organize the data clearly (Karamoozian et al. Citation2024). Automatic text classification was devised for this purpose. Many researchers focus on topic-based classification, sorting documents into themes such as games and other categories, which is useful in recommendation frameworks. Recently, online shopping has grown rapidly (Showrav et al. Citation2021), so sentiment analysis of reviews and of various other kinds of data has become essential. Sentiment analysis is itself a kind of text classification: positive and negative semantic orientation are the two basic classes, and the task can also be framed with three classes, namely positive, negative, and neutral. In most cases, short texts are used to express an opinion about a product or service (Mullen and Collier Citation2004). With the rapid development of social media, exemplified by microblogging sites used by a growing number of people, users can express their viewpoints and emotions freely. The short content of microblogs naturally carries sentiment leanings, and its analysis can uncover potential social and business value along with other evaluative features (Mc Dermott Citation2015).
A microblog post is limited to 140 characters, so the content must be processed by decomposing irregular expressions and separating different attributes; microblog content is considered problematic because it touches on administrative, educational, financial, social, and other domains. The conventional rules of text processing therefore become increasingly difficult to apply to the microblog sentiment analysis task.

The remainder of this paper is organized as follows. Section 2 reviews existing work by various researchers relevant to this context and highlights recent studies. Section 3 describes the proposed methodology, the dataset used, and the individual techniques chosen for implementation. Section 4 presents the results obtained from the experimentation process conducted for each iteration. Section 5 concludes the work and outlines future directions. This work is poised to have practical applications in various domains where understanding sentiment in short texts is essential.

Related Works

Today, as a standard natural language processing task, sentiment analysis has been addressed with several techniques. Jie Li et al. proposed an approach to the sentiment analysis of short texts that does not consider the relations among sentiment words but simply aggregates the sentiments of the words to obtain the meaning of the short content; the sentiment structure is acquired from dependency parsing with relationship relocation and modified separation, which contributes substantially to understanding the opinion of short texts (Liu et al. Citation2023). Nazma Iqbal et al. worked on tweets and movie reviews, combining several features and feeding the polarity of the resulting opinions into different machine learning algorithms to compare performance (Safari, Mursi, and Zhuang Citation2020). Chinese short texts have also been studied: their sentiment polarity is determined with machine learning techniques by creating a dictionary containing many short texts with their proper local-language meanings using tools provided by Google, and an SVM classifier is used to categorize the sentiments (Sharma and Sharma Citation2020). Another sentiment classification was carried out by Bo Pang, who chose movie reviews as the dataset and used SVM, Naïve Bayes, and Maximum Entropy classifiers, obtaining high accuracy with eight different features (Pang, Lee, and Vaithyanathan Citation2002). Irrelevant and misleading content can be filtered out by labeling every objective sentence through finding minimum cuts in graphs (Pang and Lee Citation2004).

Information is analyzed, measures are selected for adjectives and phrases, the actual resources from the topic under discussion are considered, and a support vector machine is used as the classifier [13]. Sentiment analysis of Chinese texts requires word segmentation, since new words are generated every year; a specific tool is built to identify the meanings of Chinese texts from small combinations of characters. The first step is to consolidate the existing sentiment word dictionaries, which are then extended using the generated tool for word meanings. Next, features are selected from the datasets and converted into vectors, and each selected feature and its word-related weight are looked up in the extended dictionary, which improves precision (Xing et al. Citation2015; Zheng et al. Citation2023). Other work addresses microblogging data, which is characterized by its informal style and short content, deals with the imbalanced class distribution problem in opinion mining, and considers the processing of Arabic texts, where the forward-embedding approach is widely used for predicting sentiment in short texts of various languages (Al-Azani and El-Alfy Citation2017; Alwehaibi and Roy Citation2018). Stanford's GloVe is commonly used across many languages to determine exact meanings, in combination with machine learning classifiers, and the synthetic minority over-sampling technique has been applied to the identification of Arabic texts. Further analysis has been performed with machine learning-based deep neural networks built on long short-term memory recurrent neural networks, considering various pre-trained word embedding models that can affect model accuracy in many respects; the sample datasets are taken from Twitter data (Li et al. Citation2023). Another approach handles unstructured data in Urdu using a rule-based inclusive stemming technique. Semantic text similarity plays a notable role in natural language processing, and researchers have recently given it significant attention; several advances in sentiment analysis have been made for widely used languages such as English (Shancheng, Yunyue, and Fuyu Citation2018). Some drawbacks appear when these sentiment analysis models are applied to other languages such as Arabic and Chinese, and for Indian languages such as Hindi, Malayalam, Telugu, and Tamil the same models face difficulties in analyzing sentiments because of the lack of dictionary entries. Semantic ambiguity is normally treated as manageable by single-model arrangements without considering phenomena such as polysemy (Huang et al. Citation2021). These models also do not consider Chinese stop words, which matter for word segmentation in Chinese, nor aspects such as voice analysis and the use of comprehension in semantic attention. At the initial stage, the primary issues are addressed by using two-fold grouping models for short content with identical LSTMs that prepare the paired text arrangement simultaneously (Zheng et al. Citation2023).
The other issue in Chinese text processing is addressed with the help of datasets structured to train and test the corresponding model, where stop words are used to improve the preparation of the models (Gheisari et al. Citation2023). Finally, the use of convolutional neural networks is contrasted with the semantic text similarity model (Jafari et al. Citation2016). The same applies to the Arabic language. The reported outcomes demonstrate the effectiveness of the model in terms of accuracy and recall, and its generalization capacity is also improved (Rezaeiye et al. Citation2012).

Deep learning techniques have been applied to generate short text summaries automatically by considering keywords of the initial content, improving the results. A long short-term memory network combined with an attention-based grouping model is available for consolidating character- and word-level features of the input (Jamil et al. Citation2011; Karamoozian and Hong Citation2022; Karamoozian and Zhang Citation2023; Tao, Gao, and Zhang Citation2011). The headlines of short messages are used effectively to address the expressions of the model produced with the proper summary.

Another proposed model learns short text representations that can be used for different purposes. The model consists of two convolutional neural networks: one is responsible for extracting semantic representations of short texts in normal word order, and the other learns representations of short texts in reverse order (Song et al. Citation2020). The focus is on limiting the difference between the two representations. Additionally, the posterior of the semantic representations of short texts is assumed to be Gaussian, and the KL-divergence is minimized to map the representations into low-dimensional spaces with Gaussian distributions (Alzubi et al. Citation2018; Joby Citation2020; Motahari Kia et al. Citation2018; Yadav and Kumar Vishwakarma Citation2020).

Short texts are difficult to understand and may not follow any specific syntax, so traditional tools do not yet support this setting well. To understand common language semantics, the well-known knowledge base WordNet is used. In a pre-query phase, citizens submit complaints to the system and receive a quick response to the query with the help of the knowledge base and an AI algorithm (Motahari Kia et al. Citation2018). In the post-query phase, the system examines how the citizen feels, handles the complaint level, and prioritizes citizens accordingly through sentiment analysis; such a system can help many organizations ensure quality service provision and consumer satisfaction with less human effort (Alzubi et al. Citation2018; Fasihfar et al. Citation2023; Ghaderzadeh et al. Citation2022; Karamoozian et al. Citation2022; Karamoozian and Zhang Citation2023). In this work, the sentiment analysis of short texts using SVMs and VSMs is performed based on multiclass semantic classification. Here, SVM is combined with a decision tree algorithm for smooth classification, and the Vector Space Model is enhanced by incorporating the most needed fine-grained attributes.

Proposed Work

This work addresses the challenges faced in analyzing sentiments in short texts by implementing Support Vector Machines (SVMs) and Enhanced Vector Space Models (EVSMs) for multiclass semantic classification. The architecture of the proposed model is illustrated in Figure 1.

Figure 1. Architecture of the proposed system.

Dataset Description

The dataset, available at https://www.kaggle.com/code/paoloripamonti/twitter-sentiment-analysis, together with the spam messages, is described here, including the significance of each feature, the target variable where relevant, and specific factors that should be taken into consideration when using the data. The collection of SMS-tagged messages was gathered for the purpose of researching SMS spam. There are 7645 Twitter and SMS messages written in English that have been categorized as either spam or comments.

Dataset Selection and Preprocessing

From the dataset, the short text messages are identified and the relevant knowledge is acquired. For preprocessing, this work uses two stages, namely initial preprocessing and final preprocessing. Initial preprocessing handles issues such as spelling-error identification, tokenization, data splitting, and noise removal. The final preprocessing step involves scaling and normalization, encoding, text vectorization, imbalanced-data handling, and dimensionality reduction.

Feature Extraction Using EVSMs

Vector Space Models are used to convert brief texts into numerical feature vectors. It is important to investigate various feature extraction methodologies, such as TF-IDF and word embeddings, in order to capture semantic connections successfully. The data for this process can be retrieved from social media such as Twitter and Facebook using vector space model-based retrieval systems built on open-source technology; the database was created using Oracle. The vector space model is enhanced by combining it with a probabilistic model, which provides a semantic-based approach for retrieving the most relevant content from bulk storage. A basic assessment was carried out using recall and precision to gauge the performance of the retrieval framework: recall measures the ability to retrieve as many records as possible that match or relate to a query, while precision measures the ability to retrieve the most accurate results. Five chosen queries were tested against relevance judgment records created for the purpose of evaluation. Recall and precision may be improved by linguistic processing such as lemmatization, spell-checking, and synonym expansion, and other indexing strategies may also be investigated for better storage and retrieval results. Semantic-based query expansion is one of the most active research areas in modern information retrieval, yet it still lacks a well-recognized solution. The conventional vector space model succeeds in handling the problem of relevance matching between two documents, but it disregards the semantic relationships among the basic language elements that constitute the vector. The relation between two vectors can take many forms, for example the inner product and the cosine coefficient, among others; the cosine coefficient is chosen here to describe and elucidate the relation:

Y(a_i, a_j) = \frac{\sum_{k=1}^{n} M_{ki} M_{kj}}{\sqrt{\sum_{k=1}^{n} M_{ki}^{2}} \, \sqrt{\sum_{k=1}^{n} M_{kj}^{2}}}

Here, Y is the similarity coefficient, a_i and a_j are the two documents being compared, and M_{ki} denotes the weight of the fundamental language unit L_k in document a_i. The vector space model bases the calculation of the similarity (cosine) coefficient on the weight M of the basic language unit L in document D; the weight M is generally computed with the TF-IDF formula.

M_{ki} = \frac{tf_{ki} \cdot \log_2(N/df_k)}{\sqrt{\sum_{k=1}^{N} \left( tf_{ki} \cdot \log_2(N/df_k) \right)^{2}}}

M_{ki} denotes the TF-IDF weight.

tf_{ki} denotes the frequency with which the linguistic unit t_k occurs in document a_i.

df_k denotes the number of documents that contain the basic language unit t_k.

N denotes the total number of documents.
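
As a minimal illustration of this cosine-coefficient retrieval (not the exact EVSM implementation used in this work), the following sketch builds TF-IDF weight vectors for a small, made-up document collection and scores a query against them; note that scikit-learn's TfidfVectorizer uses a smoothed natural-logarithm idf rather than the log2 form above.

# Minimal VSM retrieval sketch: TF-IDF weights + cosine coefficient.
# The documents and query below are illustrative examples only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "the movie was a wonderful experience",
    "terrible plot and poor acting",
    "an average film with a few good moments",
]
query = ["wonderful acting in a good film"]

vectorizer = TfidfVectorizer()                  # smoothed natural-log idf, not log2
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform(query)

# Cosine coefficient Y(a_i, a_j) between the query and every document
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(documents, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")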

Feature Selection

When the sentiment lexicon is considered, its sentiment words must be included in the word-segmentation lexicon so that the result of word segmentation contains the sentiment words from the lexicon and can thereby assist the sentiment analysis. In the data samples, token classes such as articles, punctuation marks, and certain symbols do not convey any sentiment and are difficult to score. Therefore, the chi-square value is computed, a threshold is set on that value, and more words can be selected to obtain values similar to those in the dictionary. A constraint is also imposed to prevent the sentiment words from being discarded.
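
As an illustration of this scoring step, a minimal sketch follows; the toy texts, labels, and threshold value are illustrative assumptions rather than the settings used in this work.

# Chi-square feature scoring with a threshold (illustrative sketch).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

texts = ["good great movie", "bad awful movie", "great acting", "awful plot"]
labels = [1, 0, 1, 0]                       # 1 = positive, 0 = negative (toy labels)

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)

chi2_scores, _ = chi2(counts, labels)
threshold = 1.0                             # assumed threshold, tuned empirically in practice
selected = [term for term, score in zip(vectorizer.get_feature_names_out(), chi2_scores)
            if score >= threshold]
# Sentiment-dictionary words can additionally be force-kept regardless of their score.
print("selected features:", selected)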

Enhanced Vector Space Model Implementation

Variant TF-IDF features are used to determine the feature weights, and the vector of the i-th review X_i is defined as follows.

\bar{X}_i = (\gamma_{i1}, \gamma_{i2}, \gamma_{i3}, \gamma_{i4}, \ldots, \gamma_{i(n-1)}, \gamma_{in})

To calculate the TF-IDF value, the following computation estimates the vector components for the selected features.

\gamma_{ij} = \frac{tf_{ij} \times idf_j}{\sqrt{\sum_{j} \left( tf_{ij} \times idf_j \right)^{2}}} = \frac{tf_{ij} \times \log\left( B / df_j \right)}{\sqrt{\sum_{j} \left( tf_{ij} \times \log\left( B / df_j \right) \right)^{2}}}

The weight of each element is adjusted so that the vector becomes a unit vector.

Here, \gamma_{ij} is the resulting weight of the j-th feature of the i-th review, tf_{ij} is the term frequency of the j-th feature in the i-th review, idf_j is the inverse document frequency of the j-th feature, df_j is the number of reviews that contain the j-th feature, and B is the total number of reviews.
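
A small NumPy sketch of this unit-vector normalization is given below; the term-frequency matrix is a made-up example, not data from the study.

# Sketch of the normalised TF-IDF review vectors gamma_ij (toy numbers).
import numpy as np

tf = np.array([[2, 0, 1],       # rows: reviews, columns: selected features
               [0, 3, 1],
               [1, 1, 0]], dtype=float)
B = tf.shape[0]                               # total number of reviews
df = np.count_nonzero(tf, axis=0)             # reviews containing each feature
idf = np.log(B / df)                          # idf_j = log(B / df_j)

weights = tf * idf                            # tf_ij * idf_j
norms = np.linalg.norm(weights, axis=1, keepdims=True)
gamma = np.divide(weights, norms, out=np.zeros_like(weights), where=norms > 0)
print(gamma)                                  # each row is a unit vector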

Weight Enhancing Methods

After feature selection and word segmentation are complete, vectors are used to represent the reviews, with all words contributing to the semantic orientation. Different classes of words make distinct contributions to the semantic orientation of a review, and the sentiment words are most closely tied to the expressed sentiment. All words present in the dictionary are considered, and their weights are strengthened to obtain better accuracy. The TF-IDF value of each word is calculated, and the weight of each word is improved by applying an appropriate emotional enhancement factor to increase the term frequency of that word. The new weight is then estimated using the formula

\gamma^{b}_{ij} = \frac{tf^{b}_{ij} \times idf^{b}_{j}}{\sqrt{\sum_{j} \left( tf^{b}_{ij} \times idf^{b}_{j} \right)^{2}}} = \frac{L \times tf_{ij} \times idf^{b}_{j}}{\sqrt{\sum_{j} \left( K \times tf_{ij} \times idf^{b}_{j} \right)^{2}}} = \frac{L \times tf_{ij} \times \log\left( \frac{M}{df_j + 1} \right)}{\sqrt{\sum_{j} \left( K \times tf_{ij} \times \log\left( \frac{M}{df_j + 1} \right) \right)^{2}}}

The parameter L represents the emotional enhancement factor. Finding an appropriate value of L requires a number of iterations. This factor is not the same as the factor for the degree words that appear next to the sentiment words; hence, the value of L is obtained separately.
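
The weight-enhancement step can be sketched as follows, assuming a toy sentiment dictionary and an illustrative value for the enhancement factor L (in this work, L is tuned iteratively).

# Sketch: boost term frequencies of sentiment-dictionary words by a factor L
# before TF-IDF weighting (illustrative values; L is tuned empirically).
import numpy as np

features = ["plot", "excellent", "boring", "camera"]
sentiment_dictionary = {"excellent", "boring"}          # assumed dictionary entries
L = 2.0                                                 # assumed enhancement factor

tf = np.array([[1, 2, 0, 1],                            # toy review-by-feature counts
               [2, 0, 3, 0],
               [0, 1, 0, 0],
               [0, 0, 1, 0]], dtype=float)
boost = np.array([L if f in sentiment_dictionary else 1.0 for f in features])
tf_boosted = tf * boost                                 # tf^b_ij = L * tf_ij for sentiment words

M = tf.shape[0]
df = np.count_nonzero(tf, axis=0)
idf = np.log(M / (df + 1))                              # log(M / (df_j + 1)) as in the formula
weights = tf_boosted * idf
norms = np.linalg.norm(weights, axis=1, keepdims=True)
gamma_b = np.divide(weights, norms, out=np.zeros_like(weights), where=norms > 0)
print(gamma_b)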

Word Embedding

The high dimensionality of the bag-of-words model, as well as its lack of contextual relationships between individual words, are two of its prominent limitations. We employ the Word2Vec word embedding model to learn the contextual relationships between the words in the training data, which lets us represent the condensed content of short texts more accurately. The computational efficiency of a word embedding model can be further improved by fixing the number of dimensions in advance. The Word2Vec architecture is built on the Continuous Bag-of-Words (CBOW) and Skip-gram models. Word vectors trained with the Word2Vec Skip-gram model are fed into the subsequent classification stage, because the skip-gram model performs better in semantic analysis.
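
A brief sketch of training skip-gram vectors with gensim is shown below (gensim 4.x API assumed; the corpus and dimensionality are illustrative).

# Skip-gram Word2Vec sketch (gensim 4.x API assumed; toy corpus).
from gensim.models import Word2Vec

tokenised_texts = [
    ["the", "film", "was", "excellent"],
    ["boring", "plot", "and", "weak", "acting"],
    ["excellent", "acting", "great", "film"],
]

model = Word2Vec(
    sentences=tokenised_texts,
    vector_size=100,      # embedding dimensionality (fixed in advance)
    window=5,
    min_count=1,
    sg=1,                 # sg=1 selects the skip-gram architecture
)

vector = model.wv["film"]                  # dense vector fed to the classifier stage
print(model.wv.most_similar("film", topn=3))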

Hybrid Support Vector Machine (HSVM)

The classification of sentiments is made possible using the Hybrid Support Vector Machine (HSVM). Initially, the kernel function for sentiment prediction on short text data is selected by trying various functions such as linear, polynomial, and radial basis functions. A multiclass sentiment classification methodology is used to label the sentiments as positive, negative, or neutral, and the SVM optimizes the obtained results while addressing class imbalance.
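
A minimal sketch of the kernel comparison is given below, assuming scikit-learn's SVC and a toy labelled corpus; the actual kernel choice in this work is made through experimentation on the real dataset.

# Comparing SVM kernels for multiclass sentiment labels (illustrative sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

texts = ["great film", "awful film", "it was fine", "loved it", "hated it", "just okay"]
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral"]

X = TfidfVectorizer().fit_transform(texts)
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel)                # one-against-one multiclass by default
    scores = cross_val_score(clf, X, labels, cv=2)
    print(kernel, scores.mean())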

Stanford's GloVe (Joby Citation2020) is a tool used to find related words. It provides an efficient implementation for computing continuous vector representations of words. These representations are used across natural languages and in further research. A quantity of text is taken as input and word vectors are produced as output: the tool first constructs expressions from the samples and then learns representations for the sequences of text, which can then be processed by machine learning applications. This is accompanied by finding the closest word that may suit a given word, calculated by determining word distances and identifying the most relevant words. In natural language processing, computing word distance is a routine task and the most affordable way to obtain such metrics. In earlier approaches, the distance between word embeddings and the distance between strings are calculated from the number of deletions, insertions, and substitutions, that is, the cost of changing one word into another.
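
The nearest-word lookup over pretrained vectors can be sketched as below; the file glove.6B.100d.txt is an assumed local copy of a standard pretrained GloVe release, and the loader is a minimal manual parser rather than the exact tooling used in this work.

# Nearest-word lookup over pretrained GloVe vectors (illustrative sketch).
# Assumes a standard GloVe text file such as glove.6B.100d.txt is available locally.
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors

def nearest_words(vectors, word, topn=5):
    target = vectors[word]
    def cosine(vec):
        return float(np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec)))
    scored = ((w, cosine(v)) for w, v in vectors.items() if w != word)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

glove = load_glove("glove.6B.100d.txt")      # assumed local copy of pretrained vectors
print(nearest_words(glove, "good"))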

One of the most prominent ways to measure word distance is the Levenshtein distance, introduced by Vladimir Levenshtein in 1965. The value refers to the minimum number of operations (deletion, insertion, and substitution) required to change one word into another.

\mathrm{lev}_{x,y}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j)=0, \\ \min \begin{cases} \mathrm{lev}_{x,y}(i-1,j)+1 \\ \mathrm{lev}_{x,y}(i,j-1)+1 \\ \mathrm{lev}_{x,y}(i-1,j-1)+1_{(x_i \neq y_j)} \end{cases} & \text{otherwise.} \end{cases}
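
The recurrence translates directly into a dynamic-programming routine:

# Levenshtein distance: dynamic-programming version of the recurrence above.
def levenshtein(x, y):
    m, n = len(x), len(y)
    # dist[i][j] = edit distance between x[:i] and y[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # i deletions
    for j in range(n + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution_cost = 0 if x[i - 1] == y[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # substitution / match
            )
    return dist[m][n]

print(levenshtein("kitten", "sitting"))     # prints 3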

A simple supervised machine learning model with associated learning algorithms that analyze data and recognize patterns is used for classification and regression analysis. A set of training examples is given, each marked as belonging to one of two classes. Multiclass classification is performed with the Support Vector Machine (SVM) classifier and a decision tree together with a regression analyzer. From the given set of samples, the algorithm builds a model that assigns new examples to one class or the other, making it a non-probabilistic binary linear classifier; examples of the categories are divided by as wide a gap as possible, and new examples are then mapped into the same space for prediction using the machine learning library. LIBSVM is an open-source library (Jamil et al. Citation2011) that provides an executable file for the Windows platform. TF-IDF is used to reflect the importance of a word within a collection of texts; it is a numerical statistic often needed to reflect how important a word is to a document in a collection. It is regularly used as a weighting factor in information retrieval and text mining and is given as

w_i = tf_i \times idf_i = tf_i \times \log\left( \frac{N}{df_i} \right)

This value increases with the number of times the word appears in the document. The open-source library used is LIBSVM, developed at National Taiwan University and maintained by Chih-Jen Lin. It provides compiled executables for Windows as well as source code for other platforms. LIBSVM implements the one-against-one approach for multiclass classification.
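
A hedged sketch of the SVM-plus-decision-tree combination over TF-IDF features is given below. scikit-learn's SVC wraps LIBSVM and uses the same one-against-one multiclass scheme; the hard-voting combination shown here is one plausible reading of the hybrid step, not necessarily the exact procedure used in this work.

# Sketch of a hybrid SVM + decision-tree classifier over TF-IDF features.
# SVC is backed by LIBSVM and uses one-against-one multiclass decomposition.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

texts = ["what a wonderful film", "dreadful and dull", "it was acceptable",
         "really enjoyed this", "truly awful movie", "nothing remarkable"]
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral"]

hybrid = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("svm", SVC(kernel="linear")),
                    ("tree", DecisionTreeClassifier(max_depth=5))],
        voting="hard",                       # majority vote over predicted labels
    ),
)
hybrid.fit(texts, labels)
print(hybrid.predict(["an acceptable but dull film"]))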

Experimental Result

The experiment is carried out in the Python programming language on the TensorFlow platform and implemented using the Spyder IDE. The pseudocode corresponding to the proposed approach is given in Appendix A. The dataset is taken from Kaggle and contains spam messages and Twitter comments consisting of short texts. Overall, 7645 tweets or messages are taken into consideration; a part is used as the test dataset and another part as the training dataset. The effectiveness of the proposed methodology was evaluated through three separate experimental trials. We began by reading Twitter reviews written in English. The dataset comprises 7645 review comments, of which about 3234 are reserved for training and the remaining 4411 for testing. We placed a limit of 46 reviews per film in order to lessen the influence of pre-existing views. From the analysis, the experimental outcome provides better results in terms of both positive and negative sentiment values, as shown in Table 1.

Table 1. Experimental results with Positive/Negative sentiment-value corpus.

The table depicts the performance evaluation metrics for various sentiment analysis techniques on the given dataset. Each row represents a distinct method, while the columns contain the different evaluation metrics. From these values one can infer how accurately each sentiment analysis method differentiates between positive and negative perspectives. For a comprehensive analysis of model performance, metrics such as recall, accuracy, and precision for both positive and negative sentiments are indispensable, and the remaining rows, which illustrate other approaches to sentiment analysis, can be analyzed in the same manner.

The enhanced vector space model is then used to retrieve the most valuable and suitably matching content. The information is processed with a TF-IDF weighting scheme, and the TF-IDF value is provided as one of the feature weights. The rate of retained features is determined through suitable experiments; the process is described in Figure 2. From Figure 2, it is clear that retaining 10% of the features yields the best overall result. The vector contains more features when more dimensions are retained, but less prevalent arguments are then weighed during the judging procedure, and since the reviews are not very long, retaining too many features may have a negative effect on the outcome. After this computation, the accuracy obtained is 92.15%. To improve on this accuracy, an emotional sentiment enhancement factor is considered for determining the weights of the words available in the dictionary.

Figure 2. Tuning the rate.

The emotional enhancement factor L is refined as R1 and R2 to denote the factor. The weights of certain words are altered and the feature rate is retained again to obtain an improvement in the results. The following figure shows the overall score at various rates.

Figure 3. Tuning the rate.

From Figure 3, it is clear that the rate should be set high to obtain a high overall score; more words are then selected as features, and better results may be obtained. This tendency indicates that most of the words in a review are related to its semantic orientation. Thus, the accuracy improves considerably when proper values of R1 and R2 are selected. The tuned information is then fed to the HSVM classifier for sentiment classification, and some additional retrieval is possible during classification. The sentiments are classified as positive, negative, strongly positive, strongly negative, and neutral. The proposed work shows improvements in accuracy compared with other systems: previous works predict only the positive and negative emotion rates, whereas the proposed system analyzes all possible sentiment cases accurately. The negative sentiment results obtained by the proposed work are also comparatively better than those of existing systems. The experimental results prove that the proposed method performs much better than the existing methods; the comparison for both positive and negative sentiments is shown below in Table 2.

Table 2. Experimental results for negative and positive sentiments.

The graph comparing the positive and negative sentiment rates of the various algorithms is shown below in Figure 4.

Figure 4. Graph representing the comparison of positive and negative sentiment rates.

The graph in Figure 4 compares the positive and negative sentiment rates of the proposed system and other existing systems. The x-axis lists the existing systems included in the experimentation, and the y-axis represents the positive and negative sentiment rates.

The evaluation score used in the analysis of the Twitter data is set to the average F1 score of the positive and negative classes for the datasets taken. Table 3 illustrates the results of the experiments after removing lexical content, in experiments that evaluate the exact sentiment value based on the PMI metrics used.

Table 3. F scores for the training sets and testing sets of twitter.

The Twitter training dataset and testing dataset are considered. The F scores provide valid references for the training datasets. The corresponding feature setting removes the lexical parameters from the whole feature space. The various sentiment-related features are taken into consideration, and the evaluation score used for the Twitter dataset (both training and testing) is the averaged F1 score. The last row corresponds to the proposed system, which, according to the analysis, shows an improvement in performance. Based on the comparison, it is clear that the proposed system provides better results than the existing works in terms of accuracy and performance.

The performance metrics of various models for sentiment analysis of short texts using HSVMs and EVSMs based on multiclass semantic classification are evaluated on four key metrics: accuracy, precision, recall, and F score, and the obtained results are illustrated in Figure 5. The HSVM+EVSM model stands out with the highest accuracy of 96.45%, indicating that the combination of the hybrid SVM and the enhanced VSM provides excellent performance in sentiment analysis. The precision, recall, and F score values give a more detailed understanding of model performance, especially in scenarios with imbalanced class distributions.

Figure 5. Performance measures of various methods against different performance indicators.

In the future, various changes can be made to enhance the model's performance, flexibility, and applicability. More advanced feature engineering techniques can be explored for the EVSM, for example experimenting with different word embeddings, contextual embeddings (e.g., BERT, GPT), or incorporating domain-specific knowledge to improve the feature representation. Improved deep learning models such as LSTMs or other transformer architectures could be used to push classification accuracy further by capturing the intricate patterns and dependencies in sequential data. Ensemble methods such as bagging or boosting could further enhance the overall performance and robustness of the sentiment analysis system. A hyperparameter tuning process should be conducted for both the SVM and decision tree models to identify the optimal settings, for instance using techniques such as grid search or random search to explore the hyperparameter space efficiently; a brief sketch follows this paragraph. Domain adaptation techniques can be implemented to achieve a more robust model across other domains, especially in multilingual contexts. Proper optimization mechanisms can enable real-time processing, making the system suitable for applications requiring immediate sentiment analysis, such as social media monitoring or customer support. The methodology could also be extended to handle multimodal data, incorporating information from text and other modalities (e.g., images, audio), to provide a more comprehensive understanding of sentiment in contexts where multiple types of data are available.
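
For instance, such a search could be sketched as follows (the parameter grids and toy data are illustrative, not the settings used in this work).

# Illustrative grid search over SVM and decision-tree hyperparameters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

texts = ["superb storyline", "painfully bad", "reasonably watchable",
         "brilliant direction", "complete waste of time", "fairly ordinary"]
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
X = TfidfVectorizer().fit_transform(texts)

svm_search = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}, cv=2)
tree_search = GridSearchCV(DecisionTreeClassifier(), {"max_depth": [3, 5, None]}, cv=2)
svm_search.fit(X, labels)
tree_search.fit(X, labels)
print(svm_search.best_params_, tree_search.best_params_)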

Conclusion and Future Work

In recent times, researchers have taken note of the widespread use of artificial intelligence algorithms across various fields, and numerous studies have demonstrated that these applications significantly boost productivity (Alzubi et al. Citation2018; Fasihfar et al. Citation2023; Karamoozian and Hong Citation2022; Karamoozian and Zhang Citation2023). This work proposes a way of expanding the dictionary of sentiment words with the widely used Stanford GloVe tool in combination with sentiment analysis. The weights of sentimental words that lie close to ordinary sentiment words are enhanced with the appropriate model. The experimental results show the progress of the work in terms of accuracy, making this approach an effective one. To improve the overall score, an emotional sentiment factor is considered, and a better value is chosen in accordance with the controlled-variables method; in addition, more features are introduced to improve accuracy further. Suitable information retrieval methods are imposed and implemented using the Vector Space Model, which ensures a further gain in accuracy. Compared with state-of-the-art methods, a considerable increase in accuracy is achieved, reaching 92.78%, and with respect to sentiment polarity the maximum is achieved for both positive and negative sentiments: the positive and negative sentiment rates reach 91.33% and 97.32%, respectively.

Acknowledgments

We thank the scholars for their expertise and assistance throughout all aspects of our research and for their help in writing the manuscript.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The work was supported by the Islamic Azad University with grant number 1337132813612259031118.

References

  • Al-Azani, S., and E.-S. M. El-Alfy. 2017. Using word embedding and ensemble learning for highly imbalanced data sentiment analysis in short Arabic text. Procedia Computer Science 109:359–21.
  • Alwehaibi, A., and K. Roy. 2018. Comparison of pre-trained word vectors for Arabic text classification using deep learning approach. 17th IEEE International Conference on Machine Learning and Applications (ICMLA), California, USA, IEEE.
  • Alzubi, J. A. 2018. Improve heteroscedastic discriminant analysis by using CBP algorithm. In Algorithms and architectures for parallel processing. ICA3PP 2018. Lecture notes in computer science, ed. J. Vaidya and J. Li, vol. 11335, 130–144. Cham: Springer.
  • Fasihfar, Z., H. Rokhsati, H. Sadeghsalehi, M. Ghaderzadeh, and M. Gheisari. 2023. AI-driven malaria diagnosis: Developing a robust model for accurate detection and classification of malaria parasites. Iranian Journal of Blood and Cancer 15 (3):112–124. doi:10.61186/ijbc.15.3.112.
  • Ghaderzadeh, M., A. Hosseini, F. Asadi, H. Abolghasemi, D. Bashash, A. Roshanpoor, and S. Nazir. 2022. Automated detection model in classification of B-lymphoblast cells from normal B-lymphoid precursors in blood smear microscopic images based on the majority voting technique. Scientific Programming 2022:1–8. doi:10.1155/2022/4801671.
  • Gheisari, M., T. Taami, M. Ghaderzadeh, H. Li, H. Sadeghsalehi, H. Sadeghsalehi, and A. Afzaal Abbasi. 2023. Mobile applications in COVID-19 detection and diagnosis: An efficient tool to control the future pandemic; a multidimensional systematic review of the state of the art. JMIR MHealth and UHealth. doi:10.2196/44406.
  • Huang, C., Z. Han, M. Li, X. Wang, and W. Zhao. 2021. Sentiment evolution with interaction levels in blended learning environments: Using learning analytics and epistemic network analysis. Australasian Journal of Educational Technology 37 (2):81–95. doi:10.14742/ajet.6749.
  • Jafari, M., J. Wang, Y. Qin, M. Gheisari, A. S. Shahabi and X. Tao. 2016. Automatic text summarization using fuzzy inference. 2016 22nd International Conference on Automation and Computing (ICAC), 256–60. Colchester.
  • Jamil, N., N. A. Jamaludin, N. A. Rahman, and N. Sabari. 2011. Implementation of vector-space online document retrieval system using open source technology. 2011 IEEE Conference on Open Systems, Langkawi, Malaysia. IEEE.
  • Joby, P. P. 2020. Expedient information retrieval system for web pages using the natural language modeling. Journal of Artificial Intelligence 2 (2):100–10.
  • Juneja, P., & T. Mitra (2021, May). Auditing e-commerce platforms for algorithmically curated vaccine misinformation. Proceedings Of The 2021 Chi Conference On Human Factors In Computing Systems, Yokohama, Japan, 1–27.
  • Karamoozian, M., and Z. Hong. 2022. Using a decision-making tool to select the optimal industrial housing construction system in Tehran. Journal of Asian Architecture and Building Engineering 22 (4):2189–208. doi:10.1080/13467581.2022.2145205.
  • Karamoozian, A., C. A. Tan, D. Wu, A. Karamoozian, and S. Pirasteh. 2024. COVID-19 automotive supply chain risks: A manufacturer-supplier development approach. Journal of Industrial Information Integration 38:100576. doi:10.1016/j.jii.2024.100576.
  • Karamoozian, M., and H. Zhang. 2023. Obstacles to green building accreditation during operating phases: Identifying challenges and solutions for sustainable development. Journal of Asian Architecture and Building Engineering 1–17. doi:10.1080/13467581.2023.2280697.
  • Li, S., H. Chen, Y. Chen, Y. Xiong, and Z. Song. 2023. Hybrid method with parallel-factor theory, a support vector machine, and particle filter optimization for intelligent machinery failure identification. Machines 11 (8):837. doi:10.3390/machines11080837.
  • Liu, X., S. Wang, S. Lu, Z. Yin, X. Li, L. Yin, J. Tian, and W. Zheng. 2023. Adapting feature selection algorithms for the classification of Chinese texts. Systems 11 (9):483. doi:10.3390/systems11090483.
  • Liu, X., G. Zhou, M. Kong, Z. Yin, X. Li, L. Yin, and W. Zheng. 2023. Developing multi-labelled corpus of twitter short texts: A semi-automatic method. Systems 11 (8):390. doi:10.3390/systems11080390.
  • Mc Dermott, K. 2015. Towards a pedagogy of short story writing. English in Education 49 (2):130–149. doi:10.1111/eie.12062.
  • Motahari Kia, M. M., J. A. Alzubi, M. Gheisari, X. Zhang, M. Rahimi and Y. Qin. 2018. A novel method for recognition of Persian alphabet by using fuzzy neural network. IEEE Access, 6: 77265–71. doi:10.1109/ACCESS.2018.2881050.
  • Mullen, T., and N. Collier. 2004 July. Sentiment analysis using support vector machines with diverse information sources. EMNLP 4:412–18.
  • Pang, B., & L. Lee. 2004, July. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain, 271.
  • Pang, B., L. Lee, & S. Vaithyanathan. 2002. Thumbs up?: Sentiment classification using machine learning techniques. Proceedings of the ACL-02 Conference On Empirical Methods In Natural Language Processing, United States 10: 79–86.
  • Rezaeiye, P. P., M. Fazli, M. Sharifzadeh, H. Moghaddam, and M. Gheisari. 2012. Creating an ontology using protege: Concepts and taxonomies in brief. Advances in Mathematical and Computational Methods 1 (3):115–20.
  • Safari, Z., K. T. Mursi, and Y. Zhuang. 2020. Fast automatic determination of cluster numbers for high dimensional big data. Proceedings of the 2020 the 4th International Conference on Compute and Data Analysis, Silicon Valley, CA, USA.
  • Shahabi, A. S., M. Reza Kangavari, and A. Masoud Rahmani. 2022. A method for multi-text summarization based on multi-objective optimization use imperialist competitive algorithm. Journal of Computer & Robotics 15 (1): 9–17.
  • Shancheng, T., B. Yunyue, and M. Fuyu. 2018. A semantic text similarity model for double short Chinese sequences. 2018 International Conference on Intelligent Transportation, Xiamen, China, Big Data & Smart City (ICITBS). IEEE, 736–739. doi:10.1109/ICITBS.2018.00190.
  • Sharma, P., and A. K. Sharma. 2020. Experimental investigation of automated system for twitter sentiment analysis to predict the public emotions using machine learning algorithms. Materials Today: Proceedings.
  • Showrav, D. G. Y., M. A. Hassan, S. Anam, and A. K. Chakrabarty. 2021. Factors influencing the rapid growth of online shopping during COVID-19 pandemic time in Dhaka City, Bangladesh. Academy of Strategic Management Journal 20 (2):1–13.
  • Song, C., X. K. Wang, P. F. Cheng, J. Q. Wang, and L. Li. 2020. SACPC: A framework based on probabilistic linguistic terms for short text sentiment analysis. Knowledge-Based Systems 105572.
  • Suresh Kumar, K., C. Helen Sulochana, A. S. Radhamani, and T. Ananth Kumar. 2022. Sentiment lexicon for cross-domain adaptation with multi-domain dataset in Indian languages enhanced with BERT classification model. Journal of Intelligent & Fuzzy Systems 43 (5):6433–6450. doi:10.3233/JIFS-220448.
  • Tao, Q., L. Gao, and Z. Zhang. 2011. The study of semantic query expansion based on improved vector space model. 2011 IEEE 3rd International Conference on Communication Software and Networks, Xi’an, China, IEEE.
  • Xing, L., L. Yuan, W. Qinglin, and L. Yu. 2015. An approach to sentiment analysis of short Chinese texts based on SVMs. 2015 34th Chinese Control Conference (CCC), China, IEEE.
  • Yadav, A., and D. Kumar Vishwakarma. 2020. Sentiment analysis using deep learning architectures: A review. Artificial Intelligence Review 53 (6):4335–85.
  • Zheng, C., Y. An, Z. Wang, X. Qin, B. Eynard, M. Bricogne, J. Le Duigou, and Y. Zhang. 2023. Knowledge-based engineering approach for defining robotic manufacturing system architectures. International Journal of Production Research 61 (5):1436–54. doi:10.1080/00207543.2022.2037025.
  • Zheng, W., S. Lu, Z. Cai, R. Wang, L. Wang, and L. Yin. 2023. PAL-BERT: An improved question answering model. Computer Modeling in Engineering & Sciences 1–10. doi:10.32604/cmes.2023.046692.

Appendix A

Pseudocode for “Sentiment Analysis of Short Texts Using SVMs and VSMs-Based Multiclass Semantic Classification”

# Illustrative Python realisation of the proposed pipeline (simplified):
# preprocessing, EVSM-style TF-IDF feature extraction, SVM and decision tree
# training, combined prediction, and evaluation.
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def preprocess_text(text):
    # Tokenization-style cleanup: lower-case, strip URLs, punctuation, and noise.
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_features_evsm(texts, vectorizer=None):
    # Simplified stand-in for the Enhanced Vector Space Model: unit-normalised
    # TF-IDF vectors (the full EVSM additionally applies weight enhancement).
    if vectorizer is None:
        vectorizer = TfidfVectorizer(norm="l2")
        return vectorizer.fit_transform(texts), vectorizer
    return vectorizer.transform(texts), vectorizer


def train_svm_classifier(features, labels):
    # LIBSVM-backed SVM; the one-against-one scheme handles the multiclass case.
    return SVC(kernel="linear").fit(features, labels)


def train_decision_tree_classifier(features, labels):
    return DecisionTreeClassifier(max_depth=10).fit(features, labels)


def combine_predictions(svm_prediction, dt_prediction):
    # Simple combination rule: keep labels the two classifiers agree on and
    # fall back to the SVM label otherwise (with only two voters this reduces
    # to trusting the SVM on disagreements).
    return [svm_label if svm_label == dt_label else svm_label
            for svm_label, dt_label in zip(svm_prediction, dt_prediction)]


def predict_sentiment(models, vectorizer, texts):
    features, _ = extract_features_evsm([preprocess_text(t) for t in texts], vectorizer)
    svm_prediction = models["svm"].predict(features)
    dt_prediction = models["decision_tree"].predict(features)
    return combine_predictions(svm_prediction, dt_prediction)


def evaluate_model(model, test_features, test_labels):
    predictions = model.predict(test_features)
    return accuracy_score(test_labels, predictions), classification_report(
        test_labels, predictions, zero_division=0)


# Main program (toy data shown here in place of the Kaggle corpus used in the paper).
texts = ["I loved this film", "Worst movie ever", "It was okay I guess",
         "Absolutely wonderful acting", "Terrible and boring", "Nothing special really"]
sentiment_labels = ["positive", "negative", "neutral",
                    "positive", "negative", "neutral"]

clean_texts = [preprocess_text(t) for t in texts]
train_texts, test_texts, train_labels, test_labels = train_test_split(
    clean_texts, sentiment_labels, test_size=0.5, stratify=sentiment_labels, random_state=0)

train_features, vectorizer = extract_features_evsm(train_texts)
test_features, _ = extract_features_evsm(test_texts, vectorizer)

svm_model = train_svm_classifier(train_features, train_labels)
decision_tree_model = train_decision_tree_classifier(train_features, train_labels)
trained_models = {"svm": svm_model, "decision_tree": decision_tree_model}

print("SVM Evaluation Results:", evaluate_model(svm_model, test_features, test_labels))
print("Decision Tree Evaluation Results:",
      evaluate_model(decision_tree_model, test_features, test_labels))

print("Sentiment Prediction:",
      predict_sentiment(trained_models, vectorizer, ["This is a positive example."]))