
A comprehensive comparison and analysis of machine learning algorithms including evaluation optimized for geographic location prediction based on Twitter tweets datasets

Article: 2232602 | Received 17 Mar 2022, Accepted 25 Jun 2023, Published online: 04 Aug 2023

Abstract

Given a tweet, a machine learning model that has undergone the training, development, and testing phases is able to predict where the author of the tweet is situated. Our premise is that a user's tweet may contain location-specific content, such as names or phrases related to the user's geographic location. The primary purpose of this research was to identify a suitable algorithm for accurate geographic location prediction based on a Twitter tweets dataset. Geolocation prediction of Twitter users can be immensely helpful for demographic analysis, targeted advertising, location-based recommendation paving the way to an enhanced user experience, location prediction during a crisis or disaster, and more. Knowing the location of users can also be helpful in finding employment, as it would help determine the accessibility of certain users to their potential employers. Moreover, such a model would also help personalize a user's newsfeed and tweets with greater accuracy, helping users find what they really need. In today's world, increasing amounts of data result in more precise location estimation, giving users confidence in the soundness and continued refinement of location prediction from user tweet data. By augmenting the enormous human-powered sensing capabilities of Twitter and associated microblogging services with content-derived location data, the algorithms can overcome the sparsity of geo-enabled features in these services and bring greater scope and breadth to location-based personalized information services. With these objectives in mind, we propose and evaluate several machine learning algorithms and models for predicting a tweeter's geographical location. This paper analyzes the efficacy of these algorithms on the problem of determining a tweeter's location through a comparative analysis based on performance and evaluation metrics such as accuracy, precision, recall, and f1-score, which indicate which algorithm would be most suitable for a specific dataset. On analyzing the results, it was found that for the chosen dataset the random forest classifier produced the best performance metrics and was the most suitable for predicting user geographic location.

PUBLIC INTEREST STATEMENT

Machine learning models can be employed to make predictions on Twitter datasets consisting of user tweets, enabling us to identify where the author of a tweet is located. Identifying the most suitable algorithm for geolocation prediction is essential in order to obtain accurate locations. Geolocation prediction of Twitter users has many different applications, such as demographic analysis, targeted advertising, location-based recommendation, and disaster and crisis recovery. Such prediction is feasible largely because of the large amount of user data present on the Internet, which allows continued refinement of location prediction. Through this paper, various machine learning algorithms were evaluated for predicting a tweeter's geographic location, using different performance metrics to determine which model produces the most accurate predictions for a dataset.

1. Introduction

The focus of our study is to compare the performance of machine learning algorithms in an efficient and sound manner, guiding users on how to choose the most appropriate classification technique according to their investigation goals. This involves a comprehensive analysis of existing machine learning algorithms, namely kNN, logistic regression, and the random forest classifier, for Twitter geolocation prediction using different performance and evaluation metrics. User geolocation is crucial to applications such as event spotting. A multiview geolocation model based on Graph Convolutional Networks (GCN) can make use of both text and network information to predict a user's location; comparisons between the state-of-the-art graph convolutional network and associated baselines show that such a model is very efficient on three different benchmark datasets when sufficient supervision is given. Artificial intelligence has become a major technology, enabling computing power to attain what human minds could not achieve in such a short period of time. Artificial intelligence can be considered a technology that analyses data from large databases for calculations and then makes further predictions. From such a large collection of data, AI learns about the world, its surroundings, different aspects of day-to-day life, and the system to which it is introduced, and simulates and develops human intelligence based on these. AI is a concept that has existed for a long time, right from when it was initially introduced by Alan Turing. In computing terms, AI can be considered software with the capability to undertake cognitive tasks such as recognizing objects in images, recognizing handwritten text, and making predictions. These days AI has become an integral part of different industries. Companies employ AI to automate processes, reduce costs, and make predictions. This can in turn improve a business's profitability, perform calculations beyond a bare human mind, make weather predictions that help save human lives, enable self-driving cars and enhanced transportation, and produce robots that manufacture goods more quickly and efficiently; AI can also make predictions that help solve social and economic challenges. As a matter of fact, artificial intelligence is designed to perform tasks that are not usually possible for humans (Frermann & Lapata, Citation2016; Rahimi et al., Citation2018).

In today's world, we can find many different applications of artificial intelligence (AI). AI and machine learning in sentiment analysis can be employed to spot fatigue and prevent overwork. Artificial neural networks are used as clinical decision support systems for medical diagnosis, such as in the concept-processing technology in EMR software. The Da Vinci Surgical System is a robotic surgical system that uses a minimally invasive surgical approach; it is used for prostatectomies and, increasingly, for cardiac valve repair and renal and gynecologic surgical procedures. We also live in a world dominated by social media, which has become ubiquitous and highly prevalent. With such a large number of users, many social media platforms have become very influential: they help connect people, let them share updates, help people find employment, enable virtual socializing, and can improve mental health. Truth be told, these days many major life decisions are made based on the content in people's news feeds. Social media has become a trillion-dollar-a-year industry built on valuable user data. Social media are omnipresent and have the scope to create great results with little effort; through a stable network they provide opportunities for different forms of communication, and these platforms allow organizations and individuals to communicate in different ways (Cheng et al., Citation2010; DiMaio et al., Citation2011; Ismail et al., Citation2016; Rahimi et al., Citation2018).

Twitter's popularity has grown in the last few years, influencing the social, political, and business aspects of life. Therefore, sentiment analysis research has put a special focus on Twitter. Training machine learning classifiers on tweet data often faces a data sparsity problem due to the large variety of tweets. The rapid growth of geotagged social media raises numerous computational prospects for analyzing geographic linguistic variation. When a multi-level generative model is applied to a dataset of geotagged microblogs, the model is capable of recovering coherent topics and their regional variants while identifying geographic regions of linguistic consistency (DiMaio et al., Citation2011; Eisenstein et al., Citation2010; Ismail et al., Citation2016; Pavalanathan & Eisenstein, Citation2015).

In an earlier work by Afshin Rahimi et al., “A label propagation approach to geolocation prediction based on modified adsorption” was introduced, with major enhancements including the incorporation of text-based geolocation priors for the test users. Experiments over the Twitter benchmark datasets achieved state-of-the-art results and demonstrated the effectiveness of the enhancements (Rahimi et al., Citation2018).

In another work by Lianhua Chi et al., “Geolocation prediction in Twitter using location indicative words and textual features”, an algorithm was proposed to predict the location of Twitter users and tweets using a multinomial naive Bayes classifier trained on location-indicative words and various textual features, including city and country names, #hashtags, and @mentions. Their approach was compared against baselines based on each of these feature sets individually, and the experimental results showed that it outperformed the baselines in terms of classification accuracy and mean and median error distance (Chi et al., Citation2016; Pennington et al., Citation2014).

In a similar work by Bo Han et al. on “Text-Based Twitter User Geolocation Prediction”, the task of geolocation prediction of Twitter users was investigated and improved. They presented an integrated geolocation prediction framework and investigated which factors impact prediction accuracy. Initially, a range of feature selection methods was investigated to obtain “location indicative words”. Subsequent experiments investigated the impact of non-geotagged tweets, language, and user-declared metadata on geolocation prediction. Furthermore, the impact of temporal variance on model generalization was evaluated, and the ways in which users differ in terms of their geolocatability were discussed (Han et al., Citation2014; Lim et al., Citation2018).

Through this project, our work evaluates a novel approach that compares and contrasts different classification algorithms suited to geolocation prediction based on a Twitter tweets dataset. A comparative analysis of different classification algorithms for Twitter geolocation prediction had not been performed before, and identifying the right classification algorithm to predict the geolocation of users based on a Twitter tweets dataset would produce accurate results and help predict the location of users without bias or error. This can be immensely helpful for knowing the location of users, for surveillance, for recommendation of location-based items or services, for locality prediction and detection of people in times of crisis or disaster, for demographic analysis, for targeted advertising, and more. Such work is essential to determine the optimal algorithm for predicting accurate user locations from different user data, as accurate results produce the exact locations of users while reducing bias, variance, and thus error in the result set. Our main contributions include a detailed analysis and report on various algorithms and their usability for geolocation prediction, choosing the optimal classifier based on the results of the performance metrics. Further, we experimented on the glove300 feature-engineered dataset with various classifiers to estimate their performance and to determine an optimal classifier for location prediction. Through our experiment, we determined that the random forest classifier was the best candidate for geolocation prediction based on Twitter tweets, producing the best results with higher accuracy, precision, and recall and an optimal f1-score (Eisenstein et al., Citation2010; Pavalanathan & Eisenstein, Citation2015; Pennington et al., Citation2014).

2. Related work

Z. Cheng et al., in their work “You Are What You Tweet: A Content Based Approach for Geolocating Twitter Users”, introduced a method to estimate a Twitter user’s city-level location on the basis of the content of the user’s tweets, even in the absence of other geospatial cues. By combining the human-centered sensing capabilities of the Twitter social network and related blogging services with content-derived location information, this framework overcomes the sparsity of geo-enabled features in these services and paves the way to new location-based personalized information services, regionally targeted advertisements, and more (Cheng et al., Citation2010).

Afshin Rahimi et al., in their work “Semi-supervised user geolocation via Graph Convolutional Networks”, describe a multiview geolocation model based on Graph Convolutional Networks that exploits both text and network context. The graph convolutional network was compared to the state of the art and to two proposed baselines, and it was shown that the model achieves or is competitive with state-of-the-art performance on three benchmark geolocation datasets when sufficient supervision is provided (Rahimi et al., Citation2018).

Jeaseock Yun et al., in their work “A comparative analysis of Deep Learning and Machine Learning on Detecting Movement Directions using PIR sensors”, discuss machine learning and its role in building intelligent systems. They conducted a quantitative comparative analysis of the performance of classical machine learning and deep learning algorithms on a human movement direction detection application based on analog pyroelectric infrared (PIR) sensor signals. Their results showed that the classical machine learning methods gave better performance in real-time detection. In addition, the simple deep learning approach proposed in their research achieved an accuracy of around 90 percent in detecting movement directions, even with data from only three subjects and a single PIR sensor (Bozkurt et al., Citation2015; Yun et al., Citation2020).

Pin Wang et al., in their work “A comparative analysis of Image classification Algorithms based on traditional machine learning and Deep Learning”, describe image classification as an important field of image processing research. Taking SVM and CNN as examples, their work compares and analyzes traditional machine learning and deep learning image classification algorithms. They found that when using a large sample of the MNIST dataset the accuracy of the CNN was higher, while with a smaller sample of the dataset the accuracy of the SVM classifier was higher. Their work concluded that traditional machine learning algorithms have a more pronounced effect on smaller sample datasets, while deep learning frameworks achieve higher recognition accuracy on larger sample datasets (Wang et al., Citation2020).

Sinem Bozkurt et al., in their work “A comparative study of Machine Learning Algorithms for Indoor positioning”, conducted a comparative study of different machine learning algorithms for indoor positioning, comparing selected algorithms in terms of positioning accuracy and computation time using the “UJIIndoorLoc” indoor positioning database. Their experimental results showed that the k-Nearest Neighbor (kNN) algorithm is the most suitable for positioning. Additionally, ensemble algorithms such as AdaBoost and Bagging were applied to enhance decision tree classifier performance to nearly that of kNN, which produced the best classifier for indoor positioning (Bozkurt et al., Citation2015).

Saba Inam et al., in their work “Multisource Data Integration and Comparative Analysis of Machine Learning Models for On-Street Prediction”, conducted a comparative analysis of well-known machine learning and deep learning techniques, including the multi-layer perceptron, random forest, decision tree, k-nearest neighbor, gradient boosting, adaptive boosting, and linear SVC, for predicting on-street parking space availability. Their results showed that less complex algorithms such as random forest, decision tree, and k-nearest neighbor outperformed complex algorithms such as the multi-layer perceptron in terms of prediction accuracy. Through their experiment, they found that all four data sources positively impacted the prediction, and their proposed solution could determine the best possible parking slot based on weather conditions, traffic flow, and pedestrian volume. Their study is also scalable to larger time frames and faster predictions and could be implemented in IoT-based, big-data-driven environments for on-street and off-street parking (Inam et al., Citation2022).

Tansu Alpcan et al., in their work “Large-scale Strategic Games and Adversarial Machine Learning”, explain that decision-making in modern large-scale and complex systems, such as communication networks, smart electric grids, and cyber-physical systems, motivates novel game-theoretic approaches. Their work investigates big strategic games with a finite number of individual players, each having a large number of continuous decision variables and input data points. Motivated by the computational and information limitations that constrain the direct solution of big strategic games, their investigation centers on reductions using linear transformations, such as random projection methods, and their effect on Nash equilibria. Specific analytical results are provided for quadratic games and approximations, and an adversarial learning game is also presented in which random projection and sampling schemes are investigated (Alpcan et al., Citation2016).

S. Procter et al., in their work “Design of a Biomimetic BLDC Driven Robotic Arm for Teleoperation & Biomedical Applications”, describe robotics research and development, specifically the development of a robotic arm built from 3D-printed parts and other components. The arm was designed to be biomimetic, with features and capabilities very similar to a real human elbow, and has many applications including the testing of teleoperative control (Procter et al., Citation2021; Rahimi et al., Citation2015).

3. Research method

3.1. Twitter geolocation prediction

Computing the location of a social media user and where a message was sent from is central to a Twitter geolocation prediction system and to crisis identification, crisis management, and demographic analysis. According to statistics, Twitter generates more than 550 million tweets on a daily basis. Tweets are fundamentally small messages of 140 characters or less, which may include #hashtags that stipulate the topic of the tweet and @mentions for reference. Within this large body of tweets, many tweets convey the geographical location of the user, either explicitly or as part of the message. Once these tweets are prepared and cleaned, datasets of tweet strings can be fed into different machine learning algorithms which, after training, development, and testing, have the potential to predict the location of the user based on textual features such as city or country names, region names, and location-indicative words. Given geolocation-labelled datasets of tweets, we can train and test classifiers that later predict the region precisely. Within the datasets, each tweet is labelled with one of four possible regions, MIDWEST, NORTHEAST, SOUTH, or WEST; these labelled datasets are used to train the classifiers, which then become capable of predicting the geographical location of the user (Chi et al., Citation2016; DiMaio et al., Citation2011; Krantiwadmare, Citation2021; Pennington et al., Citation2014; Zhao et al., Citation2020).

3.2. Machine learning classifiers

Several machine learning classifiers have been proposed for geolocation prediction based on Twitter tweets; most of these methods can be categorized as supervised, semi-supervised, or unsupervised. In this paper, we mostly use supervised machine learning algorithms, namely logistic regression, random forest, and k-nearest neighbor (kNN).

3.2.1. Logistic regression

Logistic regression, in its fundamental form, is a statistical model that uses a logistic function to model a binary dependent variable. In regression analysis, logistic regression estimates the parameters of a logistic model, a form of binary regression. Mathematically, a binary logistic model has a dependent variable with two possible values, such as on or off, represented by an indicator variable whose two values are labelled “0” and “1”. It is an extension of the linear regression model to classification problems. In machine learning, logistic regression is a supervised classification algorithm used to predict the probability of a target variable. The target or dependent variable is dichotomous, meaning there are only two feasible classes in the output. In general, logistic regression refers to binary logistic regression with a binary target variable; in such a classification, the dependent variable has only two possible categories, taking a value of 0 or 1 (Cheng et al., Citation2010; DiMaio et al., Citation2011; Ismail et al., Citation2016; Krantiwadmare, Citation2021; Samadi, Citation2011).
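As a minimal illustrative sketch (not the code used in our experiments), a binary logistic regression model can be fitted with scikit-learn as follows; the toy feature matrix and labels are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 6 samples with 2 features each,
# and a dichotomous target variable labelled 0 or 1.
X = np.array([[0.2, 1.1], [0.4, 0.9], [1.8, 0.3],
              [2.0, 0.1], [0.3, 1.3], [1.9, 0.2]])
y = np.array([0, 0, 1, 1, 0, 1])

# Fit the logistic model; predict_proba returns the estimated
# probability of each class for a new point.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.7, 0.4]]))        # predicted class label
print(clf.predict_proba([[1.7, 0.4]]))  # class probabilities
```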

3.2.2. K nearest-neighbor algorithm

The k-nearest neighbours algorithm (kNN) is a non-parametric technique employed for classification and regression. In kNN classification, the output is a class membership. In the kNN algorithm, the function is only approximated locally, and all computation is deferred until function evaluation. Euclidean distance is one of the most commonly used distance metrics for the kNN algorithm. As the algorithm depends on distances for classification, if the features are in different physical units or on different scales, they should be normalized to improve accuracy. kNN is one of the simplest supervised machine learning algorithms to implement and can be used to solve both classification and regression problems. A supervised machine learning algorithm, in contrast to an unsupervised one, depends on labeled input data to learn a function that produces a suitable output when given new unlabeled data (Cheng et al., Citation2010; DiMaio et al., Citation2011; Ismail et al., Citation2016; Logunova et al., Citation2022; Sharrab et al., Citation2021).

Algorithm 1

Pseudocode for kNN Algorithm

  • 1. Load the input data.

  • 2. Initialize K to the selected number of neighbours.

  • 3. For each instance in the data:

    • a. Calculate the distance between the query instance and the current instance.

    • b. Add the distance and the index of the instance to a collection.

  • 4. Sort the collection of distances and indices from smallest to largest.

  • 5. Pick the first K entries from the sorted collection.

  • 6. Obtain the labels of the selected K entries.

  • 7. If the adopted method is regression, return the mean of the K labels; if it is classification, return the mode of the K labels.
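A runnable Python sketch of this pseudocode is given below; it is illustrative only, with hypothetical toy data, and sorts the distances once after the loop as in step 4:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k, mode="classification"):
    """Predict the label of `query` from its k nearest neighbours."""
    # Steps 3-4: Euclidean distance to every training instance, then sort.
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(distances)[:k]          # Step 5: first K entries
    labels = y_train[nearest]                    # Step 6: their labels
    if mode == "regression":
        return labels.mean()                     # Step 7: mean of K labels
    return Counter(labels).most_common(1)[0][0]  # Step 7: mode of K labels

# Hypothetical toy usage: three 2-D points with class labels.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
y = np.array([0, 0, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=2))  # -> 0
```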

3.2.3. Random forest

Random forest is a supervised learning algorithm employed for both classification and regression, although it is most often used for classification. It is based on the concept of ensemble learning, the process of combining multiple classifiers to solve a complex problem and improve the performance of the model. A random forest is a classifier that contains a number of decision trees built on different subsets of the given dataset and averages their outputs to improve predictive accuracy. Instead of relying on one decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote of the predictions. A larger number of trees in the forest leads to higher accuracy and helps prevent overfitting. Random forest works in two phases: the first phase creates the random forest by combining N decision trees, and the second makes predictions with each tree created in the first phase. The working process of the random forest algorithm can be explained in the following steps:

  1. Select K random data points from the training set.

  2. Build a decision tree associated with the selected data points (subset).

  3. Choose the number N of decision trees to be built.

  4. Repeat steps 1 and 2 until N trees have been built.

  5. For a new data point, obtain the prediction of each decision tree and assign the new data point to the category that wins the majority vote.

Random forest is widely used in the banking sector for the identification of loan risk. In supervised learning, the classifier is trained using data that is well “labelled”: the classifier has prior knowledge of a closed set of classes and sets out to discover and categorize new instances according to those classes (Cheng et al., Citation2010; Han et al., Citation2014; Ismail et al., Citation2016; Lim et al., Citation2018; Zhao et al., Citation2020).
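As a minimal sketch of this procedure using scikit-learn (illustrative only; the random toy data stands in for the tweet feature vectors):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 100 samples, 5 features, 4 region labels (0-3).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 4, size=100)

# n_estimators is the number N of trees; each tree is trained on a
# bootstrap sample of the data, and predictions are majority votes.
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X, y)
print(rfc.predict(X[:3]))  # majority-vote class for the first 3 samples
```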

3.3. Datasets

The dataset used in this work was obtained from tweets gathered from Twitter’s streaming API. Initial filtering was done to remove retweets and repetitions of previously posted messages, and tweets containing URLs were removed to eliminate spam and automated accounts. Model evaluation and model selection were performed using a holdout strategy. During model evaluation, the primary motive is to ensure that the predictions made are correct. The holdout method can be used for both model evaluation and model selection. For model evaluation, the entire dataset is split into training and testing sets; the model is trained on the training set and then tested on the testing set to obtain the most optimal model. A 50–50% partition was used to split the dataset into training and testing datasets. After model evaluation, model selection was performed: a holdout method for model selection, or hyperparameter tuning, was adopted, which involved splitting the data into one set for training and other sets for validation and testing, ensuring that the training, validation, and test sets were representative of the entire dataset. During feature selection, different feature representations of the tweets were created from the provided training, development, and test sets, of which the glove300 representation was selected. In the glove300 feature-engineered dataset, each word is mapped to a 300-dimensional GloVe “embedding vector”; these vectors capture the meaning of each word. The vectors of the words in a tweet are then averaged to obtain a single 300-dimensional representation of the tweet. For example, a row pairs a User_ID with a 300-dimensional list of numbers whose entries look like 2.05549970e-02 (Frermann & Lapata, Citation2016; Pavalanathan & Eisenstein, Citation2015; Pennington et al., Citation2014).
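A minimal sketch of this averaging step, assuming a hypothetical dictionary mapping words to 300-dimensional GloVe vectors (not the exact pipeline used to produce glove300):

```python
import numpy as np

def tweet_to_glove300(tweet, embeddings, dim=300):
    """Average the GloVe vectors of a tweet's words into one vector."""
    vectors = [embeddings[w] for w in tweet.lower().split() if w in embeddings]
    if not vectors:                  # tweet with no known words
        return np.zeros(dim)         # -> an "empty" glove vector
    return np.mean(vectors, axis=0)  # single 300-dim representation

# Hypothetical usage with a toy embeddings dictionary.
embeddings = {"boston": np.random.rand(300), "snow": np.random.rand(300)}
vec = tweet_to_glove300("Boston snow again", embeddings)
print(vec.shape)  # (300,)
```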

Lexical blending is a highly productive and frequent process by which new words enter a language. A blend is formed when two or more source words are combined, with at least one of them shortened, as in brunch (breakfast + lunch). Datasets of blend words can also be used for geolocation prediction of tweets. We use the existing Twitter geolocation dataset GEOTEXT together with the blend datasets; they are all pre-partitioned into training, development, and test sets. To aid initial experiments, many different feature engineering methods were applied to the raw tweets. Within the datasets, each tweet is labelled with one of the four possible regions: MIDWEST, NORTHEAST, SOUTH, or WEST (Eisenstein et al., Citation2010; Pavalanathan & Eisenstein, Citation2015; Pennington et al., Citation2014).

3.4. Hyperparameters

Model parameters are configuration parameters internal to the model that are learned from the historic training data; their values are estimated from the input data. Model parameters specify how input data is transformed into the desired output, while hyperparameters define the structure of the model. In machine learning, hyperparameter optimization, or tuning, is the process of choosing an optimal set of hyperparameters for a learning algorithm, which is where the holdout method for model selection is used. A three-way holdout method was chosen, which involves splitting the data into three sets: one for training, one for validation, and one for testing. The holdout method can thus be used for both model evaluation and model selection (Eisenstein et al., Citation2010; Pavalanathan & Eisenstein, Citation2015; Pennington et al., Citation2014).
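A sketch of such a three-way holdout split with scikit-learn; the 60/20/20 proportions and the toy data are illustrative assumptions, since our datasets came pre-partitioned:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the tweet feature vectors and region labels.
X = np.random.rand(1000, 300)
y = np.random.randint(0, 4, size=1000)

# First carve out the test set, then split the remainder into
# training and validation sets (60/20/20 overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 of 80% = 20%
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```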

3.5. Feature selection

A feature is an attribute or property shared by all of the independent units on which the analysis or prediction is to be done. Feature selection is the process of selecting a subset of the original features so that the feature space is optimally reduced with respect to the evaluation criterion; it selects a subset of relevant features. What makes a feature good is an interesting area of research in its own right. Techniques such as spatial data mining and text mining can be applied to the datasets to extract the required features (Cheng et al., Citation2010; Ismail et al., Citation2016).

3.6. Proposed methodology, experimental setup, design, and performance evaluation

In the comparative study, we consider multiple models of three different classifiers, kNN, logistic regression, and random forest, and apply them to the dataset of choice, glove300. Initially, we can see that within the provided datasets there is a large class imbalance between the tweets made in the Northeast and South regions and those made in the Midwest and West regions. To deal with this class imbalance, we frequently apply upsampling and downsampling to the training set. Upsampling is a procedure where synthetically generated data points corresponding to the minority class are injected into the dataset; this equalization prevents the model from leaning toward the majority class. Downsampling, or undersampling, reduces the count of training samples falling under the majority class. The study was carried out by performing a comparative analysis of the different classifiers based on performance metrics such as accuracy, f1-score, precision, and recall, with the code developed in Python in a Jupyter notebook.

For performance evaluation we use precision, recall, accuracy, the macro-averaged f1-score, and the confusion matrix. Precision, also called positive predictive value, is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that were retrieved; both are calculated on the basis of relevance. Accuracy is defined as the percentage of correct predictions on the test data. The macro-averaged f1-score is primarily used to measure quality on problems with many binary labels or multiple classes; it is the harmonic mean of the macro-averaged precision and macro-averaged recall. A confusion matrix, where N is the number of target classes, can also be employed to assess a classification model by comparing the actual target values with those predicted by the model. K-fold cross-validation can be used to approximate the skill of a model; its objective is to test how well a model trained on given data performs on unseen data. The project was set up in Python and Jupyter notebook, and all code references are to Python code. As we begin analyzing the dataset, something to note is that some tweets, or their glove vector representations, are duplicated across the dataset (Abdul Lateef & Eranna, Citation2022; Alpcan et al., Citation2016; Burton et al., Citation2021; Dingler & Pielot, Citation2019; Samadi, Citation2011; Sharrab et al., Citation2021; Wang et al., Citation2018).
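A minimal sketch of the up/downsampling step (illustrative; rebalance is a hypothetical helper, and df is a hypothetical pandas DataFrame holding the feature vectors with a region label column):

```python
import pandas as pd
from sklearn.utils import resample

def rebalance(df, label_col="region", target_size=None):
    """Up-sample minority classes and down-sample majority classes
    so that every class ends up with roughly target_size rows."""
    groups = [g for _, g in df.groupby(label_col)]
    if target_size is None:
        target_size = int(sum(len(g) for g in groups) / len(groups))
    balanced = [resample(g, replace=len(g) < target_size,  # replace=True only when up-sampling
                         n_samples=target_size, random_state=0)
                for g in groups]
    return pd.concat(balanced).sample(frac=1, random_state=0)  # shuffle rows
```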

Figure 1 shows the distribution of the target class labels among rows with duplicated tweets. Within the dataset, there is no need to reduce the number of training data points for the Western and Midwestern regions, so a good idea is to drop the duplicates of rows in the Northeastern and Southern regions. This slightly eases the class imbalance while removing the duplicates.

Figure 1. The frequency distribution for classes of duplicated tweets.

This is followed by shuffling the data to ensure there is no order to the target class labels and that they are not clumped together (Figure 2 shows the resulting histogram of the target labels). While preprocessing the development data, it can be observed that there are not many data points in the development set with empty glove vectors. Once we implement the one-rule baseline, a function that returns the label of the class appearing most often in the training data, the baseline accuracy is about 39 percent on the training dataset and about 37 percent on the development dataset, while the baseline macro-averaged f1-score is about 14 percent for both sets (as shown in Figure 3).

Figure 2. Histogram of the target labels.

Figure 3. A comparison between training set performance and the development set performance.

Naive Bayes is based on applying Bayes’ theorem with strong independence assumptions between the features. As the training feature is the averaged word vector of the tweet, the individual variables are not independent of one another. Logistic regression handles features that are not fully independent better, and it also allows negative feature values, so using it as the first sample model is a better idea. It converges to its best performance more slowly than Naïve Bayes on larger data, so we will likely have to increase the maximum number of training iterations. When the logistic regression classifier is run with a maximum of 1000 iterations, the training and test performance are as shown in Figure 4. Throughout this paper, the confusion matrix and performance metrics such as precision, recall, f1-score, and support were used to estimate performance (Ismail et al., Citation2016; Mohammed et al., Citation2020).
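A sketch of this step (illustrative; the random data stands in for the glove300 features and region labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the glove300 features and region labels.
X = np.random.rand(1000, 300)
y = np.random.randint(0, 4, size=1000)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, random_state=0)

# Raise max_iter so the solver has room to converge on the
# 300-dimensional averaged GloVe features.
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Precision, recall, f1-score and support per class, plus the
# confusion matrix of actual vs. predicted labels.
y_pred = lr.predict(X_dev)
print(classification_report(y_dev, y_pred))
print(confusion_matrix(y_dev, y_pred))
```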

Figure 4. Training and development set confusion matrix.

The LR model gives an accuracy of 45 percent on the training set and 44 percent on the development set. It barely captures any of the data points with MIDWEST and WEST as target labels, which are the labels that are not well represented in the training data. The class imbalance can be rectified by upsampling the minority classes (as shown in Figure 5).

Figure 5. Development set performance.

3.6.1. The LR Classifier

The logistic regression classifier can be retrained to see whether the upsampling helped. Analyzing the training and development set performance after retraining, it can be observed that the accuracy of the LR classifier is still not in an appreciable range. It is well known that logistic regression cannot handle a large number of categorical features or variables. Additionally, the LR classifier does not perform well when the dataset contains independent variables that are uncorrelated with the target variable, or variables that are closely correlated with each other (as shown in Figures 6 and 7).

Figure 6. Training set performance after upsampling.

Figure 7. Training and development set confusion matrix.

Upsampling helps a little, but not nearly enough: the majority of data points are still getting a prediction of Northeast or South. We therefore downsample the majority classes to see whether that helps.

After this process is complete, the logistic regression classifier is retrained to see whether the downsampling helped. The class distribution and training set performance after upsampling and downsampling are shown in Figures 8 and 9.

Figure 8. Class distribution after upsampling and downsampling.

Figure 9. Class distribution after upsampling and downsampling.

Upsampling and downsampling reduced the overall accuracy of the LR model but made the recall across the four target classes more even, giving a higher macro-averaged f1-score; this makes the model more suitable. However, we need to analyze other possibilities as well. The confusion matrices show that many of the class labels are still not being predicted correctly. Running a one-vs-rest logistic regression classifier to reduce the class bias is something to be investigated at this stage (as shown in Figures 10, 11, 12, and 13).
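A sketch of the one-vs-rest variant, reusing the hypothetical X_train, y_train, X_dev, y_dev stand-ins from the logistic regression sketch above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Fit one binary logistic regression per region; each classifier
# separates its region from all the others, which can reduce the
# bias toward the majority classes.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)
print(ovr.score(X_dev, y_dev))  # development set accuracy
```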

Figure 10. Confusion matrix after upsampling and downsampling.

Figure 11. Training and development set performance after upsampling and downsampling.

Figure 12. Development set performance after upsampling and downsampling.

Figure 13. Confusion matrix after another round of upsampling and downsampling.

3.6.2. Comparison between the Logistic regression classifiers using performance metrics

Figure 14. Development set performance on different classifiers (precision, recall and f1-score).

The different logistic regression classifiers that were built did not differ to a great extent performance-wise and produced almost identical results (as shown in Figure 14). At this stage, further classifiers had to be explored.

3.6.3. The K Nearest neighbour classifier

When compared to the other classifiers, the kNN classifier that was built took a very long time to label a data point and also produced less accurate predictions. Shown below are the development set performances for the kNN classifier. It is well known that kNN consumes more memory than many other popular classification algorithms, as it requires loading the entire dataset to run the computation, which increases computation time and cost. This was evident in that the running time for the kNN classifier was higher than for the other classifiers (as shown in Figures 15, 16, and 17). Additionally, the kNN algorithm is known to perform worse on more complex tasks such as text classification (Rahimi et al., Citation2018; Sharrab et al., Citation2021).

Figure 15. The development set performance for k equal to 3.

Figure 16. The development set performance for k equal to 5.

Figure 17. Development set performance for k equal to 10.

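A sketch of the sweep over k behind Figures 15 to 17, again reusing the hypothetical stand-in data from the logistic regression sketch:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# Evaluate the classifier on the development set for each k of interest.
for k in (3, 5, 10):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    y_pred = knn.predict(X_dev)
    print(k, knn.score(X_dev, y_dev),                # accuracy
          f1_score(y_dev, y_pred, average="macro"))  # macro-averaged f1
```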

3.6.4. Random Forest Classifier

Based on different parameters, we prepare different random forest classifiers to see which one yields the best results.

Model: rfc_1

For the model rfc_1, the random forest classifier performs almost perfectly on the training set, but not so well on the test set. This implies that it is overfitting the data, or starting to memorize it as opposed to learning generalized rules about it. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data (as shown in Figure 18).

Figure 18. Training and development set performance for rfc_1.

Model: rfc_2

It looks like rfc_2 is still overfitting, so we can run a small grid search to find the best parameters. Grid searching is the process of scanning over parameter settings to find the optimal parameters for a given model; depending on the type of model utilized, certain parameters are necessary, and grid searching can be applied across machine learning to find the best parameters for any given model. With three folds for each of the 12 candidates, there are 36 fits in total (as shown in Figure 19). The training and development set performance of further models was also assessed (as shown in Figures 20, 21, and 22).
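A sketch of such a grid search; the parameter grid below is an illustrative assumption, chosen so that 12 candidates with 3 folds give 36 fits:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 4 x 3 = 12 candidate parameter combinations; with cv=3 folds,
# GridSearchCV performs 36 fits in total.
param_grid = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [10, 15, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, verbose=1, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
```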

Figure 19. Training and development set performance of rfc_2.

Figure 20. Training and development set performance and confusion matrix for rfc_3 (after upsampling and downsampling have been applied).

Figure 21. Training and development set performance for rfc_4.

Figure 22. Training and development set performance evaluation for rfc_5.

Model: rfc_3

Upsampling and downsampling were performed, and improvements in the performance metrics were observed (as shown in Figure 20).

Model: rfc_4

After upsampling and downsampling the feature array, a forest with a maximum depth of 12 was trained and its training and development set performance assessed (as shown in Figure 21).

Model: rfc_5

It seems that the Midwest and West regions are getting confused with the Northeast and South. Upsampling can be performed after this to analyze what is happening with the provided dataset. The class distribution at this stage is shown in Figure 23.

Figure 23. Class distribution with upsampling.

Model: rfc_6

Figure 24. Confusion matrix after upsampling and downsampling.

The Northeastern class also seems to dominate the predictions, so we downsample it slightly.

Model: rfc_7

Figure 25. Training and development set performance for the rfc_7 random forest classifier.

The difference between the training and development metrics is large; in such a case we may need to tune some additional hyperparameters to make our forest work best. In addition, oob_score can be set to True so that the generalization score of the model can be estimated after fitting, and max_features can be set to None so that the forest considers all of the features when looking for where to split while training its trees. This is done because each feature of an averaged word vector carries a lot of latent information and the features are not directly comparable, so using all of them need not make the model overfit. The out-of-bag score is a way of validating the random forest model: it is the average error for each training sample, calculated using predictions from the trees whose bootstrap (bagging) samples did not contain that sample.
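A sketch of this configuration (illustrative, with the hypothetical stand-in data from earlier):

```python
from sklearn.ensemble import RandomForestClassifier

# oob_score=True estimates generalization accuracy from the samples
# each tree never saw; max_features=None considers every feature at
# each split rather than a random subset.
rfc = RandomForestClassifier(n_estimators=300, oob_score=True,
                             max_features=None, random_state=0)
rfc.fit(X_train, y_train)
print(rfc.oob_score_)  # out-of-bag estimate of accuracy
```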

Model: rfc_8

For the model rfc_8, we build the random forest classifier with 100 estimators (trees); the training and development set performance is shown in Figures 26 and 27.

Figure 26. Training and development set performance for rfc_8.

Figure 27. Confusion matrix after two rounds of upsampling and downsampling.

Model: rfc_9

A random forest classifier with the parameters n_estimators=500, max_depth=15, n_jobs=-1, and verbose=2 is shown in Figures 28 and 29.

Figure 28. Confusion matrix for training and development set (after upsampling and downsampling).

Figure 29. Training and development set performance for rfc_9.

Model: rfc_10

A random forest classifier with the parameters n_estimators=500, max_depth=15, n_jobs=-1, and verbose=2 was also estimated (as shown in Figures 30 and 31).

Figure 30. Training and development set performance after upsampling and downsampling.

Figure 31. Confusion matrix for the training and development sets, after upsampling and downsampling.

Model: rfc_11

A further random forest classifier with the parameters n_estimators=500, max_depth=15, n_jobs=-1, and verbose=2 was also estimated (as shown in Figures 32 and 33).
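A sketch of this repeated configuration (illustrative, with the hypothetical stand-in data from earlier):

```python
from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 parallelizes training across all CPU cores;
# verbose=2 prints progress as each of the 500 trees is built.
rfc = RandomForestClassifier(n_estimators=500, max_depth=15,
                             n_jobs=-1, verbose=2, random_state=0)
rfc.fit(X_train, y_train)
```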

Figure 32. Training and development set performance for rfc_11.

Figure 33. Confusion matrix for training and development set (after upsampling and downsampling).

4. Results and discussions

Through the comparative study, we considered multiple models of the kNN, logistic regression, and random forest classifiers on the provided glove300 dataset. Upsampling and downsampling were performed frequently on the training datasets to deal with the class imbalance problem. Metrics including precision, recall, accuracy, and the macro-averaged f1-score were considered for performance evaluation, and a confusion matrix was also used to assess each classification model by comparing the actual target values with those predicted by the model. K-fold cross-validation performed on the dataset produced the performance of the different models. Both precision and recall were calculated on the basis of relevance. Table 1 shows the experimental results of the different classifiers, including the kNN classifier, logistic regression, random forest, and Naïve Bayes.

Table 1. Optimal training set performance

The optimal training set performance is given in Table 1, while the optimal development set performance is shown in Table 2.

Table 2. Optimal development set performance

The results show that the accuracy and the other metrics, precision, recall, and f1-score, for certain random forest classifiers are higher than those of the other types of classifiers. The training time for the kNN classifier was higher than for the other classifiers, which was expected: in general, the kNN algorithm does more computation at test time than at training time, which was observed on the provided dataset, largely because kNN requires a large amount of memory to store the entire dataset for prediction. Precision can be seen as a measure of quality and recall as a measure of quantity: higher precision means that an algorithm returns more relevant results than irrelevant ones, and high recall means that an algorithm returns most of the relevant results, whether or not irrelevant ones are also returned. Both precision and recall were calculated on the basis of relevance. Accuracy was also considered as a performance metric, being the percentage of correct predictions on the test data, while the macro-averaged score was primarily used to measure quality on problems with different binary labels or multiple classes. A confusion matrix, where N is the number of target classes, was also employed to compute the performance of each model. Initially, to address the class imbalance problem, duplicates were removed, followed by shuffling of the data to ensure that the classes were not clumped together. The Naïve Bayes classifier is quite often used with larger text datasets; here its accuracy turned out to be 45 percent, which was expected, as the dataset of interest had correlated features and the continuous variables were not handled well. Additionally, hyperparameter choice and tuning did not improve the performance or accuracy of the Naïve Bayes classifier, as it has a very limited parameter set. Data preprocessing and feature selection had already been performed to produce the glove300 dataset on which all the different classifiers were run.

Such methods did not improve the Naïve Bayes classifier’s accuracy significantly. The LR model gave an accuracy of 45 percent on the training set and 44 percent on the development set; the class imbalance was further rectified by upsampling the minority classes. The logistic regression classifier was retrained to observe whether the upsampling helped, followed by downsampling the majority classes to observe whether that would help. As a matter of fact, upsampling and downsampling did not improve the accuracy of the LR model, but they did improve recall and, subsequently, the macro-averaged f1-score. Compared to the other classifiers, kNN took a long time to label a data point and did not produce accurate predictions: as the number of neighbours was increased over 3, 5, and 10, the accuracy fell from 29 to 24 percent.

For the kNN algorithm, to choose the optimal K value we ran the algorithm multiple times with different K values, using accuracy as the metric for evaluating each K. The more features and groups in the dataset, the larger the search that must be made to find an appropriate value of K. For K equal to 1, predictions are usually less stable; for better results, the value of K was increased until the F-measure exceeded a threshold. However, in our experiment with the chosen dataset, although K was varied from 2 to 10, the f1-score did not improve much and remained almost the same, indicating that the model performance did not improve significantly as K increased; a more optimal K value might have been expected to give higher accuracy. The random forest classifier performed the best, producing the best results with regard to the performance metrics, including accuracy, f1-score, precision, and recall.

The random forest classifier gave an average of 93 percent for precision, recall, and f1-score. When the initial random forest model was prepared, it fitted almost perfectly on the training set but not so well on the test set, implying that the model was overfitting the data, or starting to memorize it as opposed to learning generalized rules about it. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts performance on new data. As the model still appeared to be overfitting, a small grid search was run to find the best parameters; grid searching is the process of scanning over parameter settings to find the optimal parameters for a given model, and with 3 folds for each of the 12 candidates there were 36 fits in total. After upsampling and downsampling the feature array, a forest with a maximum depth of 12 was trained and its training and development set performance assessed. As the Midwest and West regions were getting confused with the Northeast and South, upsampling was performed to investigate further; the Northeastern class also dominated the predictions, so it was downsampled slightly. Where the training and development metrics differed greatly, some additional hyperparameters had to be tuned for the forest to work best (Ren, Citation2021).

In addition, oob_score was set to True in the source code so that the generalization score of the model could be investigated after fitting, and max_features was set to None so that the forest used all of the features when looking for where to split while training its trees. This was done because each feature of an averaged word vector carries a lot of latent information and the features are not directly comparable, so using all of them need not make the model overfit. The out-of-bag score is a way of validating the random forest model: it is the average error for each training sample, calculated using predictions from the trees whose bootstrap (bagging) samples did not include that sample (Logunova, Citation2022).

While building models with the random forest classifier, for the model rfc_8 an initial forest of 100 trees was built, while the training set performance was mostly estimated by considering accuracy for different random forest models with parameters such as n_estimators=500, max_depth=15, n_jobs=-1, and verbose=2. Random forest classifiers with these parameters were investigated further, including variants validated with out-of-bag predictions, that is, predictions from trees whose bootstrap (bagging) samples did not contain the point being predicted. A prerequisite for a random forest to perform well is that the predictions, and the errors, made by the individual trees have low correlations with each other. The metrics calculated on the training set indicate the in-sample performance of the model, while the metrics calculated on the test set indicate the out-of-sample performance. If performance on the seen data is poor, a model may be underfitting; if the model performs well on the training set but poorly on the test set, it is overfitting the training data, so it is essential to consider the metrics on both sets. Training or calibration data provides useful information regarding an optimal fit. Performing the analysis with Python libraries in a Jupyter notebook, the comparative study found that most of the algorithms performed well at prediction; however, the random forest classifier performed the best and produced accurate predictions.

Figure 34. The concept of precision and recall.

A higher value of recall and precision implies a higher f1-score, and such a model can generally be preferred. Given that the dataset used for creating the model is imbalanced, the f1-score is a good metric for estimating classifier performance. The f1-score is a measure of a model's accuracy on a dataset, originally formulated for binary classification systems that classify samples as either positive or negative; it is the harmonic mean of the two, f1 = 2 × (precision × recall) / (precision + recall). One of the major advantages of the f1-score is that it combines precision and recall into a single metric, which makes it useful during grid search or automated optimization and gives better information about which classifier is most optimal for a particular dataset. The comparison between the different machine learning classifiers is given as follows (as shown in Figure 35).
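As an illustration of how these metrics are obtained, the following sketch reuses the fitted rfc_8 and the development split from the sketch above; classification_report prints per-class precision, recall, and f1, while the macro-averaged f1 is the single number used to compare classifiers.

```python
# Hedged sketch: per-class and macro-averaged metrics on the development set.
from sklearn.metrics import classification_report, f1_score

dev_preds = rfc_8.predict(X_dev)
print(classification_report(y_dev, dev_preds))
print("macro f1:", f1_score(y_dev, dev_preds, average="macro"))
```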

Figure 35. Final comparative results of performance among different classifiers.

The random forest model rfc_5 had the highest precision and recall, and thus the highest macro score, on the development set, so it was used to make predictions on the test set; rfc_4 exhibits very similar characteristics but differs in several parameters. Our results show that the Random Forest classifier outperformed the other classifiers, with the best accuracy and the best macro-averaged f1-score in the first phase over the LR and kNN models. Hence it was chosen as the model for making predictions and can be considered the optimal classifier for the provided dataset. Moreover, the results also show that all of the classifiers give good accuracy; however, in a comparative analysis the random forest classifier is a good choice for performing geolocation predictions based on the Twitter tweets dataset.
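The final selection step can be sketched as follows; the model names mirror the text, but the exact candidate parameters are illustrative assumptions. Each candidate is scored by macro-f1 on the development set and the best one is kept for test-set prediction.

```python
# Hedged sketch: pick the classifier with the best macro-f1 on the dev set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.25,
                                                  random_state=0)

candidates = {
    "rfc_4": RandomForestClassifier(n_estimators=300, max_depth=12, random_state=0),
    "rfc_5": RandomForestClassifier(n_estimators=500, max_depth=15, random_state=0),
    "lr": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=15),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_dev, model.predict(X_dev), average="macro")

best = max(scores, key=scores.get)
print(scores)
print("best model on the development set:", best)
```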

5. Conclusion

Twitter is one of the most widely used social networks, where users can tweet about discrete topics within the 140-character limit. Having Twitter datasets with location-tagged tweets enables geolocation prediction of users. The results gathered from applying the machine learning classifiers to geolocation prediction suggest that the random forest classifier is better in terms of performance and is capable of providing a quick response and more accurate results. For future work, we would like to broaden the scope of our experiment and run the classifiers on more than one type of dataset to obtain more reliable results (Cheng et al., Citation2010; Han et al., Citation2014; Ismail et al., Citation2016; Khoudry et al., Citation2020; Sharrab et al., Citation2021; Zalk et al., Citation2011).

Many experiments have been conducted before in the area of machine learning with regard to geolocation prediction based on Twitter tweets datasets; this project helps identify which classifier is best for performing geolocation prediction on such a dataset. Following the latest research trends, many different lines of research on Twitter geolocation prediction were examined for this project. Afshin Rahimi et al., in their work "Twitter User Geolocation Using a Unified Text and Network Prediction Model", describe a label propagation approach to geolocation based on modified adsorption, with major enhancements including the incorporation of text-based geolocation priors for test users. They explain that the geolocation obtained from text-based profile locations is often very noisy and that only a small proportion of tweets are geotagged, which means that geolocation has to be extracted from other sources such as the tweet text and network relationships (Cheng et al., Citation2010; Heiser et al., Citation2012; Logunova, Citation2022; Rahimi et al., Citation2015; Ranjan et al., Citation2018; Sharrab et al., Citation2021).

Major limitations include privacy concerns and user data not being obtainable due to insufficient permissions, which limits the amount of data that can be gathered from a particular source; as a result, scraping data to prepare datasets takes longer, and at times it can be difficult to obtain the required data without compromising user confidentiality. The scope of the project includes user geolocation prediction, which helps in knowing the location of users, surveillance, recommendation of location-based items or services, locality prediction and detection of people in times of crisis or disaster, demographic analysis, targeted advertising, and more. Our main contributions include a detailed analysis of, and report on, various algorithms and their usability for geolocation prediction, choosing the optimal classifier based on the results of the performance metrics. Further, we experimented on the glove300 feature-engineered dataset with various classifiers to estimate their performance and to determine an optimal classifier for location prediction. Through our experiment, it could be determined that the random forest classifier was the best candidate for performing geolocation prediction based on Twitter tweets. The scope of the project is limited to extracting limited information from the Twitter tweets dataset (Bagchi et al., Citation2020; Dingler & Pielot, Citation2019; Goudarzi et al., Citation2021; Moumen et al., Citation2021; Sharrab et al., Citation2021; Wang et al., Citation2020; Yun & Woo, Citation2020).

Author contributions

The project was designed and developed by the first and second authors. The authors confirm their contribution to the paper and sole responsibility for data collection, analysis and interpretation of results, and draft manuscript preparation. The authors have carefully reviewed the results and approved the final version of the manuscript.

Acknowledgments

I thank God and my father, mother, brothers, and sisters for their continued support throughout this journey, as well as my dearest teachers. In particular, I would like to express special thanks to Hasti Samadi, Lea Frermann, and the Introduction to Machine Learning teaching team at the University of Melbourne. In addition, I express special gratitude towards my university for providing all the resources required to complete this project.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The work was supported by the University of Melbourne.

Notes on contributors

Hasti Samadi

Hasti Samadi has been working as a Consultant at The University of Melbourne for the past 8 years. She is a researcher and a subject tutor for Introduction to Machine Learning (COMP90049) at the CIS, University of Melbourne. Her main areas of research include machine learning and deep learning. Earlier, she worked as a business analyst and in many other roles for different companies based in Australia.

Mohammed Ahsan Kollathodi

Mohammed Ahsan Kollathodi is a Master of Information Technology student (with a specialisation in Cyber Security) at the University of Melbourne. He has published papers in different Scopus-indexed journals and has worked in companies and start-ups based in Australia, Asia, and the Americas. He has also worked for the Australian Government as a software project team lead intern and with the Department of General Practice on the Future Health Today project at the University of Melbourne. He is truly passionate about computer science and is involved in different projects all the time. His areas of interest include cyber security, machine learning, the Internet of Things, network security, and software development. Research-Gate: https://www.researchgate.net/profile/Mohammed-Ahsan-Kollathodi Linked-In: https://au.linkedin.com/in/mohammed-ahsan-kollathodi-128b72116

References

  • Abdul Lateef, P. S., & Eranna, U. (2022). A simplified machine learning approach for recognizing human activity. International Journal of Electrical and Computer Engineering (IJECE), 9(5), 3465–3473. https://doi.org/10.11591/ijece.v9i5.pp3465-3473
  • Alpcan, T., Rubinstein, B. I. P. and Leckie, C. (2016). Large-scale strategic games and adversarial machine learning. 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 4420–4426, https://doi.org/10.1109/CDC.2016.7798940.
  • Bagchi, S., Tay, K. G., Huong, A., & Debnath, S. K. (2020). Image processing and machine learning techniques used in computer-aided detection system for mammogram screening-A review. International Journal of Electrical and Computer Engineering (IJECE), 10(3), 2336–2348. https://doi.org/10.11591/ijece.v10i3.pp2336-2348
  • Bozkurt, S., Elibol, G., Gunal, S. and Yayan, U. (2015). A comparative study on machine learning algorithms for indoor positioning. 2015 International Symposium on Innovations in Intelligent Systems and Applications (INISTA), Madrid, Spain, pp. 1–8, https://doi.org/10.1109/INISTA.2015.7276725.
  • Burton, W. S., Myers, C. A., & Rullkoetter, P. J. (2021). Machine learning for rapid estimation of lower extremity muscle and joint loading during activities of daily living. Journal of Biomechanics, 123, 110439. https://doi.org/10.1016/j.jbiomech.2021.110439
  • Cheng, Z., Caverlee, J., and Lee, K. (2010). You are where you tweet: A content- based approach to geo-locating twitter users. Proceedings of the 19th ACM international conference on Information and knowledge management Toronto, Canada, (pp. 759–768).
  • Chi, L., Lim, K. H., Alam, N., & Butler, C. J. (2016). Geolocation prediction in Twitter using location indicative words and textual features. Proceedings of the 2nd Workshop on Noisy User-generated Text, Osaka, Japan (pp. 227–234).
  • DiMaio, S., Hanuschik, M., & Kreaden, U. (2011). The da Vinci Surgical System. In J. Rosen, B. Hannaford, & R. Satava (Eds.), Surgical Robotics (pp. 199–217). Springer US. https://doi.org/10.1007/978-1-4419-1126-1_9
  • Dingler, T., & Pielot, M. (2019). I’ll be there for you: Quantifying Attentiveness towards Mobile Messaging. Proceedings of the MobileHCI'15, Copenhagen, Denmark, 15, 144–154. ACM. https://doi.org/10.1145/2785830.2785840
  • Eisenstein, J., O’Connor, B., Smith, N. A., and Xing, E. (2010). A latent variable model for geographic lexical variation. Proceedings of the 2010 conference on empirical methods in natural language processing Cambridge, United States of America, (pp. 1277–1287).
  • Frermann, L., & Lapata, M. (2016). A bayesian model of diachronic meaning change. Transactions of the Association for Computational Linguistics, 4, 31–45. https://doi.org/10.1162/tacl_a_00081
  • Goudarzi, M., Wu, H., Palaniswami, M., & Buyya, R. (2021). An application placement technique for concurrent IoT applications in edge and fog computing environments. IEEE Transactions on Mobile Computing, 20(4), 1298–1311. https://doi.org/10.1109/TMC.2020.2967041
  • Han, B., Cook, P., & Baldwin, T. (2014). Text-based twitter user geolocation prediction. The Journal of Artificial Intelligence Research, 49, 451–500. https://doi.org/10.1613/jair.4200
  • Heiser, G., Murray, T., & Klein, G. (2012). It’s time for trustworthy systems. IEEE Security & Privacy, 10(2), 67–70. https://doi.org/10.1109/MSP.2012.41
  • Inam, S., Mahmood, A., Khatoon, S., & Nawaz, N. (2022). Multisource data integration and comparative analysis of machine learning models for on-street parking prediction. Sustainability, 14(12), 7317. https://doi.org/10.3390/su14127317
  • Ismail, H. M., Harous, S., & Belkhouche, B. (2016). A comparative analysis of machine learning classifiers for twitter sentiment analysis. Research in Computing Science, 110(1), 71–83. https://doi.org/10.13053/rcs-110-1-6
  • Khoudry, E., Belfqih, A., Ouaderhman, T., Boukherouaa, J., & Elmariami, F. (2020). A real-time fault diagnosis system for high-speed power system protection based on machine learning algorithms. International Journal of Electrical and Computer Engineering (IJECE), 10(6), 6122–6138. https://doi.org/10.11591/ijece.v10i6.pp6122-6138
  • Krantiwadmare. Logistic regression in machine learning. Medium, (2021, May 30). [Retrieved: February 4, 2023] [Online]. Available: https://medium.com/analytics-vidhya/logistic-regression-in-machine-learning-f3a90c13bb41.
  • Lim, K. et al., (2018). “The Grass is Greener on the Other Side”, Companion of the The Web Conference 2018 on The Web Conference 2018 - WWW ’18. Available: https://doi.org/10.1145/3184558.3186337 [Retrieved July 10, 2021].
  • Logunova, I. K-Nearest neighbors (KNN) algorithm for machine learning. Serokell Software Development Company, (2022, September 20). [Retrieved: July 10, 2023] [Online]. Available: https://serokell.io/blog/knn-algorithm-in-ml.
  • Mohammed, Z., Hanae, C., & Larbi, S. (2020). Comparative study on machine learning algorithms for early fire forest detection system using geodata. International Journal of Electrical and Computer Engineering (IJECE), 10(5), 5507–5513. https://doi.org/10.11591/ijece.v10i5.pp5507-5513
  • Moumen, R., Chiheb, R., & Faizi, R. (2021). Real-time Arabic scene text detection using fully convolutional neural networks. International Journal of Electrical and Computer Engineering (IJECE), 11(2), 1634–1640. https://doi.org/10.11591/ijece.v11i2.pp1634-1640
  • Pavalanathan, U. and Eisenstein, J. (2015). Confounds and consequences in tagged twitter data. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal (pp. 2138–2148).
  • Pennington, J., Socher, R., and Manning, C. D. (2014). Glove: Global vectors for word representation. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha, Qatar, (pp. 1532–1543).
  • Procter, S., & Secco, E. (2021). Design of a Biomimetic BLDC Driven Robotic Arm for Teleoperation & Biomedical Applications. Journal of Human, Earth, and Future, 2(4), 345–354. https://doi.org/10.28991/HEF-2021-02-04-03
  • Rahimi, A., Cohn, T., & Baldwin, T. (2015) Twitter user geolocation using a unified text and network prediction model. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), Beijing, China. Association for Computational Linguistics.
  • Rahimi, A., Cohn, T., and Baldwin, T. (2018). Semi-supervised user geolocation via graph convolutional Networks. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia (pp. 2009–2019).
  • Ranjan, R., Rana, O., Nepal, S., Yousif, M., James, P., Wen, Z., Barr, S., Watson, P., Jayaraman, P. P., Georgakopoulos, D., Villari, M., Fazio, M., Garg, S., Buyya, R., Wang, L., Zomaya, A. Y., & Dustdar, S. (2018). The next grand challenges: Integrating the internet of things and data science. IEEE Cloud Computing, 5(3), 12–26. https://doi.org/10.1109/MCC.2018.032591612
  • Ren, Y. (2021). Python machine learning: Machine learning and deep learning with python (Book Review). International Journal of Knowledge-Based Organizations, 11(1), 67–70. https://doi.org/10.4018/IJKBO
  • Samadi, M. (2011). Problems in the translation of legal terms from Persian into English. The International Journal - Language Society and Culture, 3(33), 108–114.
  • Sharrab, Y. O., Alsmirat, M., Hawashin, B., & Sarhan, N. (2021). Machine learning- based energy consumption modeling and comparing of H.264 and google VP8 encoders. International Journal of Electrical and Computer Engineering (IJECE), 11(2), 1303–1310. https://doi.org/10.11591/ijece.v11i2.pp1303-1310
  • Wang, Y., Dai, B., Kong, L., Erfani, S. M., Bailey, J., & Zha, H. (2018). Learning deep hidden nonlinear dynamics from aggregate data. Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, USA, 1, 83–92.
  • Wang, P., Fan, E., & Wang, P. (2020). Comparative Analysis of Image Classification Algorithms Based on Traditional Machine Learning and Deep Learning. Pattern recognition letters, 1. https://doi.org/10.1016/j.patrec.2020.07.042
  • Yun, J., & Woo, J. (2020). A comparative analysis of deep learning and machine learning on detecting movement directions using PIR Sensors. IEEE Internet of Things Journal, 7(4), 2855–2868. https://doi.org/10.1109/JIOT.2019.2963326
  • Zalk, M., Bosua, R., & Sharma, R. (2011). Improving knowledge sharing behaviour within organizations: Towards a mode. Proceedings of the 19th European Conference on Information Systems, ECIS 2011, Helsinki, Finland. AISeL. http://aisel.aisnet.org/ecis2011/212
  • Zhao, P., Adnan, K. A., Lyu, X., Wei, S. and Sinnott, R. O. (2020). Estimating the size of crowds through deep learning. 2020 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), pp. 1–8, https://doi.org/10.1109/CSDE50874.2020.9411377.