Research Article

Spatio-temporal intention learning for recommendation of next point-of-interest

Pages 384-397 | Received 12 Aug 2022, Accepted 07 Feb 2023, Published online: 05 Apr 2023

ABSTRACT

Next point-of-interest (POI) recommendation has been applied by many internet companies to enhance the user travel experience. Recent research advocates deep-learning methods to model long-term check-in sequences and mine mobility patterns of people to improve recommendation performance. Existing approaches model general user preferences based on historical check-ins and can be termed as preference pattern models. The preference pattern is different from the intention pattern, in that it does not emphasize the user mobility pattern of revisiting POIs, which is a common behavior and kind of intention for users. An effective module is needed to predict when and where users will repeat visits. In this paper, we propose a Spatio-Temporal Intention Learning Self-Attention Network (STILSAN) for next POI recommendation. STILSAN employs a preference-intention module to capture the user’s long-term preference and recognizes the user’s intention to revisit some specific POIs at a specific time. Meanwhile, we design a spatial encoder module as a pretrained model for learning POI spatial feature by simulating the spatial clustering phenomenon and the spatial proximity of the POIs. Experiments are conducted on two real-world check-in datasets. The experimental results demonstrate that all the proposed modules can effectively improve recommendation accuracy and STILSAN yields outstanding improvements over the state-of-the-art models.

1. Introduction

In the past few decades, we have seen the rapid development of location-based social networks (LBSNs). Users on LBSNs post their check-ins and share life experiences through mobile devices. A check-in record usually includes a point of interest (POI) visited in a specific spatio-temporal context, which reflects the user’s mobility pattern. Next POI recommendation is one of the most important applications of LBSNs (Li, Westerholt, and Zipf 2018; Aslam et al. 2021). It recommends the next POI for a user to visit based on historical check-in records.

Recent research advocates deep learning methods to model long-term check-in sequences and mine mobility patterns of people to improve recommendation performance. POI recommendation is different from the general recommendation task in that it typically considers the spatial, temporal, preference, and intention influence. Existing approaches model general user preferences based on historical check-ins and can be termed as preference pattern models. For example, the self-attention mechanism (Vaswani et al. 2017) has been used to model user historical behavior sequences (Kang and McAuley 2018). The model calculates attention weights between the next item and other items in the historical sequence, and thus, long-term dependencies can be captured effectively. As a result, existing research shows that self-attention can achieve good performance on the next POI recommendation (Guo and Qi 2020; He, Qi, and Ramamohanarao 2020; Huang et al. 2020; Lian et al. 2020; Luo, Liu, and Liu 2021).

However, the preference pattern differs from the intention pattern in that it does not emphasize the user mobility pattern of revisiting POIs, which is a common behavior and a kind of intention for users. RepeatNet (Ren et al. 2019) incorporates the repeat-explore mechanism into gated recurrent units (GRUs) for session-based recommendation. The purpose of that work is to remind users of items that they have seen or consumed. Similarly, revisiting the same place is a common behavior of users, yet existing approaches seem to ignore the learning of such a mobility intention pattern. Compared with other recommendation fields, such as product recommendation, repeated behaviors are more obvious in POI recommendation scenarios and have spatio-temporal characteristics. We regard revisiting a previously visited POI as an intention behavior. For example, users arrive at work or home during the morning and evening rush hours. Meanwhile, exploring new POIs that may be of interest is regarded as a preference behavior. For instance, users are willing to go to new gourmet restaurants during the weekend. Therefore, when recommending the next POI, a deep learning model needs not only to learn user preferences for POIs but also to consider that users with travel intentions may return to places visited before.

Although long-term dependencies have been captured by the self-attention mechanism, the spatial influence could be improved in the modeling. For example, Tobler’s first law of geography (Tobler 1970) points out that nearby things are more closely related. Existing research has demonstrated that spatial proximity affects check-in behavior. Users tend to visit places nearby. Therefore, it is necessary to encode geo-coordinates of POI based on distance. However, embedding learning for each POI position will greatly increase the cost of model training. How to balance training cost with effective modeling of POI spatial features is a major challenge.

To address the aforementioned issues, we propose the Spatio-Temporal Intention Learning Self-Attention Network (STILSAN) with spatial clustering-based embedding for next POI recommendation, which yields outstanding improvements over the state-of-the-art models. STILSAN consists of two phases. The first phase captures user revisiting intention: a spatial-temporal revisiting mechanism learns this intention and calculates the recommendation probability of revisiting POIs in the intention pattern. The second phase improves POI recommendation performance through the spatial factor. We first construct a set of virtual centroids by clustering POI locations, so that each POI is associated with a virtual centroid. Then, we pretrain a spatial feature model for virtual centroids via variational graph auto-encoders (VGAE) (Kipf and Welling 2016), whereby spatial embeddings are learned for each POI.

The contributions of this paper can be summarized as follows:

  • We propose the STILSAN for the next POI recommendation. To the best of our knowledge, STILSAN is the first model that explicitly models the user mobility pattern of intention.

  • A spatial feature pretrained model based on VGAE is proposed to learn POI spatial embedding. This model simulates the spatial clustering phenomenon and preserves the spatial proximity of the POIs.

  • An optimal embedding combination and form are explored in the embedding module. This can provide heuristics for self-attention-based models in next POI recommendation task.

  • Tested on two real-world POI recommendation datasets, STILSAN outperforms the state-of-the-art baseline methods.

2. Related work

In this section, we summarize related work from two aspects, which are next POI recommendation and self-attention for recommendation.

2.1. Next POI recommendation

Early next POI recommendation models mainly include Markov chains (Rendle 2010; Rendle, Freudenthaler, and Schmidt 2010) and their extended models (Cheng et al. 2013; Feng et al. 2015), but Markov chains cannot capture contextual information well. Later, the recurrent neural network (RNN) became the mainstream method for modeling POI check-in sequences (Sun et al. 2020; Zhao et al. 2020a). ST-RNN (Liu et al. 2016) applies an extended RNN to the location prediction problem. ST-LSTM (Huang et al. 2021) proposes an attention-based LSTM network to model sequential patterns of check-in behavior. Although RNN effectively improves the recommendation performance, it still suffers from the long-term dependency problem (Chung et al. 2014). Compared to other scenarios, POI recommendation usually utilizes temporal and spatial relations to assist in modeling user dynamic preferences. Nitu, Coelho, and Madiraju (2021) take into account a user’s most recent interest by incorporating a time-sensitive recency weight into the model. Liu et al. (2016) model spatio-temporal contexts to improve model performance using time-specific transition matrices and distance-specific transition matrices. Time-LSTM (Zhu et al. 2017), ST-LSTM (Zhao et al. 2018), STGN (Zhao et al. 2020b), and RTPM (Liu et al. 2022) adopt similar ideas, adding a time gate and a distance gate to the structure of LSTM to better capture user preferences in different spatio-temporal contexts. HGMAP (Zhong et al. 2020), GARG (Wu et al. 2020), GGLR (Chang et al. 2020), and DGCN (Wang et al. 2022b) take a similar approach, independently learning geographic influences by graph neural network or representation learning. Others, for example Yu et al. (2020), consider the POI categorical influence. Yang, Liu, and Zhao (2022) develop a novel time-aware category context embedding to capture the diverse temporal patterns of POI categories. The aforementioned models all demonstrate that the combination of various influences can effectively improve the recommendation performance.

2.2. Self-attention for recommendation

In recent years, self-attention has been introduced into the field of sequential recommendation and achieved state-of-the-art results (Kang and McAuley 2018; Huang et al. 2020; Li, Westerholt, and Zipf 2018). Self-attention has been shown to be effective in next POI recommendation. Most of the studies integrate the spatio-temporal patterns of user check-ins into the self-attention mechanism to improve POI recommendation. Considering the spatio-temporal effect of non-adjacent locations and nonconsecutive check-ins, Luo, Liu, and Liu (2021) propose a spatio-temporal self-attention-based model called STAN. Lian et al. (2020) design a geography-aware POI sequential recommendation model based on self-attention so that some spatial phenomena can be effectively captured. Wang et al. (2022a) measure the spatial-temporal interval as users’ acceptance of distance and time with self-attention. He, Qi, and Ramamohanarao (2020) focus on the impact of temporal contexts and propose a time-aware self-attention network for next POI recommendation. Compared to RNN, self-attention usually has better performance because it calculates the correlation of each check-in with other check-ins in the sequence. Therefore, the self-attention mechanism has obvious advantages in learning user long-term preferences.

Our approach to next POI recommendation differs from the works listed earlier. STILSAN considers not only the spatio-temporal pattern but also the user behavior pattern of revisiting POIs. At the same time, STILSAN adopts advanced spatial feature pretrained model to achieve further improvements in recommendation accuracy.

3. Problem formulation

Let $U = \{u_1, u_2, \ldots, u_{|U|}\}$ and $L = \{l_1, l_2, \ldots, l_{|L|}\}$ denote a set of users and a set of POIs, where $|U|$ and $|L|$ are the numbers of users and POIs. Given a user $u \in U$, the historical check-in sequence is denoted as $S_u = \{(l_1, c_1, t_1), (l_2, c_2, t_2), \ldots, (l_n, c_n, t_n)\}$, where $c_i = (lon_i, lat_i)$ is the geo-coordinates of POI $l_i$, $t_i$ is the timestamp, and $i$ is the position of the check-in in the sequence. The next POI recommendation task is formulated as calculating the recommendation score of the $(n+1)$-th POI, which is defined as $P(l_{n+1} \mid S_u)$. By introducing the spatial-temporal revisiting mechanism, which assumes that the user behavior pattern is controlled by a preference pattern and a revisiting intention pattern, the conditional probability is factorized as

(1) $P(l_{n+1} \mid S_u) = P(p \mid S_u)\,P(l_{n+1} \mid p, S_u) + P(r \mid S_u)\,P(l_{n+1} \mid r, S_u)$

where $p$ and $r$ denote the check-in preference pattern and the revisiting intention pattern, respectively. $P(p \mid S_u)$ and $P(r \mid S_u)$ represent the probabilities of spatial-temporal transition. $P(l_{n+1} \mid p, S_u)$ and $P(l_{n+1} \mid r, S_u)$ represent the recommendation probabilities of the $(n+1)$-th POI in the preference pattern and the revisiting intention pattern.
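
The factorization in Eq. (1) is a two-component mixture over the candidate POIs. The following minimal sketch illustrates it with toy numbers; the function name `combine_patterns` and all values are illustrative assumptions, not taken from the paper.

```python
def combine_patterns(p_pref, p_rev, pref_probs, rev_probs):
    """Mix two per-POI distributions as in Eq. (1).

    p_pref, p_rev: switch probabilities P(p|S_u) and P(r|S_u), summing to 1.
    pref_probs, rev_probs: per-POI probabilities under each pattern.
    """
    if abs(p_pref + p_rev - 1.0) > 1e-9:
        raise ValueError("switch probabilities must sum to 1")
    if len(pref_probs) != len(rev_probs):
        raise ValueError("both patterns must score the same candidate set")
    return [p_pref * a + p_rev * b for a, b in zip(pref_probs, rev_probs)]


# Toy example with three POIs. The last POI was never visited,
# so it receives no probability mass from the revisiting pattern.
scores = combine_patterns(0.6, 0.4, [0.2, 0.3, 0.5], [0.7, 0.3, 0.0])
```

In this toy case the first POI ends up ranked highest even though the preference pattern alone would favor the third, which is the effect the soft switch is meant to allow.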

4. Approach

In this section, we introduce our model in detail. Figure 1 shows the overall architecture of STILSAN, including the embedding module, preference-intention module, and spatial encoder module. First, in the embedding module, we concatenate POI ID embedding, user embedding, and spatial embedding by the noninvasive method as the input and feed it into the preference-intention module. We get the spatial embedding using a pretrained model in the spatial encoder module. Then, we calculate the recommendation probabilities for all POIs in the preference-intention module. Finally, we recall the POIs from the candidate set that are most likely to be recommended. Each module will be elaborated in the following sections.

Figure 1. The architecture of our STILSAN model.


4.1. Embedding module

4.1.1. Noninvasive embedding

We transform the check-in sequences of all users into sequences of fixed length $m$, where $m$ represents the maximum sequence length that the model can handle. If the number of check-ins in the original sequence, $n_u$, is greater than $m$, we select the $m$ most recent check-ins. If not, we pad zeros on the left side of the sequence until the sequence length is $m$. We randomly initialize the feature embeddings of POI ID and user ID, defined as $e_l \in \mathbb{R}^{d}$ and $e_u \in \mathbb{R}^{d}$, where $d$ is the feature dimension.
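
The truncation and left-padding step can be sketched as follows. This is a minimal illustration that assumes sequences are in chronological order and that POI IDs start at 1, so 0 can be reserved as the padding value; the helper name `to_fixed_length` is hypothetical.

```python
def to_fixed_length(seq, m, pad=0):
    """Keep the m most recent check-ins, or left-pad with `pad` up to length m."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    if len(seq) >= m:
        return list(seq[-m:])
    return [pad] * (m - len(seq)) + list(seq)


short = to_fixed_length([11, 12, 13], 5)          # padded on the left
long_ = to_fixed_length([1, 2, 3, 4, 5, 6], 4)    # oldest check-ins dropped
```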

To incorporate the representation of POI into the spatial feature, we combine the POI spatial representation Z from the spatial encoder module. In existing work, the ways to combine multiple features mainly include feature concatenation. It may be inappropriate to input the combined results directly into the self-attention module. This behavior is considered as an invasive method, which may confuse the feature space of POI ID and reduce the recommendation performance. We interpret the reasons for confusing the feature space of POI ID as: (1) combining additional features into the feature of POI ID will overwhelm the POI ID information; and (2) feeding the combined features into self-attention will output a mixed feature space. This output is combined with the POI candidate set that represents the hypothesis space to compute the recommendation ranking. The feature space of the POI candidate set is consistent with POI ID but inconsistent with output by the invasive method. Therefore, it may be confusing and difficult to calculate the recommendation ranking.

To solve this problem, we design a noninvasive embedding module for self-attention-based models in next POI recommendation. The noninvasive embedding module largely keeps the feature space of the output and that of the candidate set the same. Specifically, we redefine the query $Q$, key $K$, and value $V$ in self-attention. The embedding of $Q$ and $K$ is defined as $e_{qk} = \mathrm{concat}(e_l, e_Z, e_u)\,W_{qk}$, where $\mathrm{concat}(\cdot)$ represents the concatenation operation and $e_Z$ is the spatial embedding corresponding to POI $l$. The embedding of $V$ is defined as $e_v = e_l W_v$. $W_{qk} \in \mathbb{R}^{3d \times d}$ and $W_v \in \mathbb{R}^{d \times d}$ are learnable parameters. Figure 2 compares the traditional (invasive) self-attention structure with the noninvasive self-attention structure.

Figure 2. Comparison of self-attention by invasive method and noninvasive method.
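
To make the noninvasive idea concrete, the sketch below builds the query/key input from the concatenated POI ID, spatial, and user embeddings, while the value input is built from the POI ID embedding alone. It uses plain Python lists and toy weights; the helper names and all numbers are assumptions for illustration only.

```python
def project(x, W):
    """Multiply a row vector x (length a) by a matrix W (a x b)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def noninvasive_inputs(e_l, e_z, e_u, W_qk, W_v):
    """Return (e_qk, e_v) for one check-in.

    e_qk mixes POI ID, spatial, and user embeddings and feeds Q and K;
    e_v is built from the POI ID embedding only and feeds V.
    """
    e_qk = project(e_l + e_z + e_u, W_qk)  # list concatenation, length 3d
    e_v = project(e_l, W_v)
    return e_qk, e_v


# Toy example with d = 2 (illustrative values, not trained weights).
d = 2
W_qk = [[0.1 * (i + j) for j in range(d)] for i in range(3 * d)]
W_v = [[1.0, 0.0], [0.0, 1.0]]
e_qk, e_v = noninvasive_inputs([1.0, 2.0], [0.5, 0.5], [0.0, 1.0], W_qk, W_v)
```

Changing the spatial embedding changes `e_qk` (and therefore the attention weights) but leaves `e_v` untouched, so the attended output stays in the POI ID feature space.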


Some work considers temporal features in the embedding module. For example, Luo, Liu, and Liu (2021) divide the week into 7 × 24 = 168 h. Therefore, the index size of temporal embedding is 168. However, we drop the temporal feature, because we find that it will reduce the performance of the model in experiments.

4.1.2. Spatio-temporal intervals embedding

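In this paper, we calculate the time intervals and distance intervals between check-ins in the sequence, which are regarded as the explicit spatial-temporal relation. For $S_u$, $\Delta^s \in \mathbb{R}^{n \times n}$ and $\Delta^t \in \mathbb{R}^{n \times n}$ are the matrices whose entries $\Delta^s_{ij}$ and $\Delta^t_{ij}$ represent the distance interval and time interval between the $i$-th and $j$-th check-ins, respectively. Then, we directly adopt the method proposed by STAN (Luo, Liu, and Liu 2021), which learns spatio-temporal transition matrices based on linear interpolation. For $S_u$, the temporal and spatial interval embeddings $T(\Delta^t), S(\Delta^s) \in \mathbb{R}^{n \times n}$ are calculated as:

(2) $\begin{aligned} T(\Delta^t) &= \frac{T_L\,[U(\Delta^t) - \Delta^t] + T_U\,[\Delta^t - L(\Delta^t)]}{U(\Delta^t) - L(\Delta^t)} \\ S(\Delta^s) &= \frac{S_L\,[U(\Delta^s) - \Delta^s] + S_U\,[\Delta^s - L(\Delta^s)]}{U(\Delta^s) - L(\Delta^s)} \\ \Delta &= T(\Delta^t) + S(\Delta^s) \end{aligned}$

where $\Delta$ is the spatio-temporal intervals embedding, $U(\cdot)$ and $L(\cdot)$ denote the upper bound and lower bound of $\Delta^{t,s}$, and $T_L$, $T_U$ (likewise $S_L$, $S_U$) are learnable parameters that denote the unit embeddings of the lower bound and upper bound, respectively.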
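
The linear interpolation in Eq. (2) can be illustrated for a single pair of check-ins. The sketch below assumes scalar unit embeddings and known interval bounds; these simplifications, the function names, and the numbers are assumptions for illustration rather than the authors' implementation.

```python
def interpolate_embedding(delta, lower, upper, emb_lower, emb_upper):
    """Interpolate between the lower- and upper-bound unit embeddings."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    if not lower <= delta <= upper:
        raise ValueError("interval must lie within its bounds")
    return (emb_lower * (upper - delta) + emb_upper * (delta - lower)) / (upper - lower)


def interval_embedding(dt, ds, t_bounds, s_bounds, T_L, T_U, S_L, S_U):
    """Entry of the spatio-temporal matrix: temporal part plus spatial part."""
    t_part = interpolate_embedding(dt, *t_bounds, T_L, T_U)
    s_part = interpolate_embedding(ds, *s_bounds, S_L, S_U)
    return t_part + s_part


# Toy example: a 5-hour gap within [0, 10] hours, a 2 km gap within [0, 4] km.
value = interval_embedding(5.0, 2.0, (0.0, 10.0), (0.0, 4.0), 1.0, 3.0, 0.0, 2.0)
```

An interval at its lower bound maps exactly to the lower-bound embedding, and one at the upper bound maps to the upper-bound embedding, so only two unit embeddings per interval type need to be learned.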

4.2. Preference-intention module

We integrate the spatial-temporal revisiting mechanism into the self-attention model. Specifically, we subdivide user mobility behaviors into check-in preference pattern and revisiting intention pattern. The preference pattern models the user preferences based on historical check-ins. The revisiting intention pattern enables the model to recognize the user's intention to revisit some specific POIs. We design a preference encoder and an intention encoder to calculate the recommendation probabilities for POIs in the two patterns, respectively. At the same time, the spatio-temporal transition probabilities are proposed as a soft switch to decide which pattern the next check-in is. Then, we can get the final recommendation scores by combining the spatio-temporal transition probabilities and the recommendation probabilities. Finally, we rank the recommendation scores. The POIs with higher scores in the candidate set are more likely to be accessed by users in the next moment.

4.2.1. Preference pattern encoder

We design a preference pattern encoder based on self-attention to model the check-in sequences. The input is the latent representation of the check-in sequence produced by the embedding module, defined as $E = [e_1, e_2, \ldots, e_m] \in \mathbb{R}^{m \times d}$. Here, each $e$ is a tuple $(e_{qk}, e_v)$. Then, $E$ is linearly transformed into $EW_Q$, $EW_K$, and $EW_V$, which represent the query, key, and value, respectively. $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learnable parameters. Based on the self-attention mechanism, the new representation of the check-in sequence, denoted as $E' = [e'_1, e'_2, \ldots, e'_m] \in \mathbb{R}^{m \times d}$, is calculated through self-attention (SA) and a feed-forward network (FFN). SA is defined as follows:

(3) $O = \mathrm{SA}(Q, K, V) = \mathrm{softmax}\left(\dfrac{QK^{T}}{\sqrt{d}}\right)V$

Then, the output $O$ is fed into a point-wise two-layer FFN, which is defined as follows:

(4) $E' = \mathrm{FFN}(O) = \mathrm{ReLU}(OW_1 + b_1)\,W_2 + b_2$

where $W_1, W_2 \in \mathbb{R}^{d \times d}$ are the learnable parameters, and $b_1$, $b_2$ are the biases.

Next, we employ the dot product to compute the recommendation probability of the $i$-th POI in the candidate set as follows:

(5) $P(l_{n+1} \mid p, S_u) = \dfrac{\exp(c_i^{T} e'_m)}{\sum_{j=1}^{|L|} \exp(c_j^{T} e'_m)}$

where $c_i$ is the embedding of the $i$-th POI ID in the candidate set and $e'_m$ is the last embedding in $E'$.

It is worth noting that the preference pattern encoder needs two masks to shield the sequence positions that would otherwise affect the attention calculation. The first mask adapts to user check-in sequences of different lengths. Its size is $m - n_u$, which is the length of the zero padding on the left side of the sequence. The other mask shields future check-in behaviors related to the predicted POI.
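
The interaction of Eq. (3) with the two masks can be sketched in plain Python. The version below assumes the second mask is a standard causal mask (position $i$ attends only to positions up to $i$) and that padded rows produce zero output; both are assumptions for illustration, and the function name is hypothetical.

```python
import math


def masked_self_attention(Q, K, V, n_pad):
    """Scaled dot-product attention with a left-padding mask and a causal mask.

    Q, K: m x d lists; V: m x d_v list; n_pad: number of padded positions
    at the start of the sequence (m - n_u in the text).
    """
    m, d = len(Q), len(Q[0])
    out = []
    for i in range(m):
        # Position i may attend only to real (non-padded) positions j <= i.
        allowed = list(range(n_pad, i + 1))
        if not allowed:
            out.append([0.0] * len(V[0]))
            continue
        logits = [sum(Q[i][k] * K[j][k] for k in range(d)) / math.sqrt(d)
                  for j in allowed]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        out.append([sum(w / total * V[j][c] for w, j in zip(weights, allowed))
                    for c in range(len(V[0]))])
    return out


# Toy sequence of length 3 whose first position is padding.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[9.0], [1.0], [3.0]]
O = masked_self_attention(Q, K, V, n_pad=1)
```

The padded value 9.0 never leaks into the output: the second position sees only itself, and the third position mixes only the two real check-ins.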

4.2.2. Revisiting intention pattern encoder

We adopt additive attention to calculate the recommendation probabilities of revisiting POIs in the intention pattern encoder. The probabilities corresponding to POIs that do not appear in historical check-ins are set to 0. Therefore, the recommendation probability of revisiting the $(n+1)$-th POI is defined as:

(6) $a^{r}_{\tau} = V_r^{T} \tanh\left(W_1^{r} e'_m + W_2^{r} e'_{\tau} + \Delta\right)$
(7) $P(l_{n+1} \mid r, S_u) = \begin{cases} \lambda\,\dfrac{\sum_{l_{n+1}} \exp(a^{r}_{\tau})}{\sum_{\tau=1}^{m} \exp(a^{r}_{\tau})}, & l_{n+1} \in S_u \\ 0, & \text{otherwise} \end{cases}$

where $V_r$, $W_1^{r}$, $W_2^{r}$ are the learnable parameters, and $e'_{\tau}$ ($\tau \in [1, m]$) is the $\tau$-th embedding in $E'$, the output of the preference pattern encoder. $\lambda$ is a penalty coefficient to prevent overfitting. $\sum_{l_{n+1}} \exp(a^{r}_{\tau})$ sums over all positions $\tau$ at which POI $l_{n+1}$ appears, because the same revisited POI may appear multiple times in the historical sequence.
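
The aggregation in Eq. (7) can be sketched as follows: attention scores are normalized over the sequence positions, then positions holding the same POI are summed. The scores are taken as given inputs here, padding is assumed to have been removed, and `lam=1.0` is an illustrative default rather than the paper's setting.

```python
import math


def revisit_probs(history, scores, num_pois, lam=1.0):
    """Per-POI revisit probabilities in the spirit of Eq. (7).

    history: POI index at each sequence position.
    scores: additive attention score a_tau for each position.
    POIs absent from `history` keep probability 0.
    """
    if len(history) != len(scores):
        raise ValueError("one score per position is required")
    top = max(scores)
    weights = [math.exp(a - top) for a in scores]
    total = sum(weights)
    probs = [0.0] * num_pois
    for poi, w in zip(history, weights):
        probs[poi] += lam * w / total  # repeated POIs accumulate mass
    return probs


# POI 0 was visited twice and POI 2 once; POIs 1 and 3 were never visited.
probs = revisit_probs([0, 2, 0], [0.0, 0.0, 0.0], num_pois=4)
```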

4.2.3. Recommendation with preference and revisiting intention

We consider the probabilities of spatial-temporal transition, which measure whether the next visit is more biased toward the preference pattern or the intention pattern. We compute these probabilities through a spatial-temporal attention mechanism. First, we obtain the last embedding $e'_m$ in $E'$, the output of the preference pattern encoder. Next, we compute the additive attention scores between the last embedding and the other embeddings in $E'$. At the same time, we introduce the spatial-temporal relation matrix into the additive attention mechanism, so that the spatial-temporal transition fully considers the influence of different spatio-temporal contexts, as shown in Eq. (8). Finally, we transform the additive attention scores into probabilities of spatial-temporal transition by a linear transformation and a softmax operation, as shown in Eq. (9).

(8) $P_p = V^{T} \tanh\left(W_1^{p} e'_m + W_2^{p} e'_{\tau} + \Delta\right)$
(9) $\left[P(p \mid S_u),\ P(r \mid S_u)\right] = \mathrm{softmax}\left(W_3^{p} P_p\right)$

where $V, W_1^{p}, W_2^{p} \in \mathbb{R}^{d \times d}$ and $W_3^{p} \in \mathbb{R}^{2 \times d}$ are learnable parameters, and $\Delta$ is the spatio-temporal intervals embedding.

4.3. Spatial encoder module

The modeling of spatial dependencies is an important influencing factor for next POI recommendation. Learning the spatial embedding of each POI directly may lead to excessive model training cost. To this end, we first map POIs to virtual centroids through a spatial clustering algorithm and then learn the spatial embedding of each virtual centroid based on VGAE. We treat the spatial encoder module as a pretrained model for POI spatial embedding. This supports downstream recommendation tasks.

4.3.1. Virtual centroids clustering

We adopt the K-means clustering algorithm to aggregate the POIs according to their spatial distribution. Each POI is associated with a cluster. Then, we compute the mean of the geo-coordinates of all POIs within each cluster and define it as the virtual centroid, giving $V = \{v_1, v_2, \ldots, v_{|V|}\}$, where $|V|$ is the prespecified optimal number of clusters. In particular, we try different clustering algorithms to achieve the same goal, including K-means, DBSCAN (Ester et al. 1996), and OPTICS (Ankerst et al. 1999). In contrast to K-means, which is a distance-based clustering algorithm and tends to generate spherical clusters, DBSCAN and OPTICS are density-based clustering algorithms that can find clusters of arbitrary shapes and identify noisy points. Although DBSCAN and OPTICS have the advantage of not needing to set the number of clusters, it is difficult to obtain satisfactory clustering results with them. One reason is that DBSCAN and OPTICS do not work well for clusters of different densities, and the spatial distribution of POI locations is usually inhomogeneous. The other reason is that virtual centroids are more suitable to reflect the spatial feature of POIs within spherical clusters.
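
The construction of virtual centroids can be sketched with a small Lloyd-style K-means. This version treats (lon, lat) pairs as planar coordinates and initializes centroids from the first k points; both choices are simplifying assumptions for illustration, and a library implementation would normally be used in practice.

```python
def kmeans(points, k, iters=20):
    """Cluster 2-D points; return (centroids, assignment of each point)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, assign


# Two well-separated groups of toy POIs.
pois = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centroids, assign = kmeans(pois, k=2)
```

Each POI then inherits the spatial embedding of its assigned centroid, which is why the number of embeddings to learn is the number of clusters rather than the number of POIs.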

4.3.2. Spatial embedding learning

Spatial distance is the most important factor for modeling correlations between POIs. We propose a spatial embedding learning model using the VGAE. VGAE is a framework for unsupervised learning on graph-structured data based on a variational auto-encoder. Based on VGAE, the spatial encoder module trains a spatial latent embedding influenced by location distance. Specifically, we propose the spatial embedding learning model including three steps of graph construction, encoder by Graph Convolutional Network (GCN), and decoder by reconstruction.

First, we use the distance to construct the virtual centroids graph $G = (V, E)$, where $E$ is the set of edges in $G$. An edge $e_{i,j}$ denotes the distance feature between virtual centroids $i$ and $j$. We calculate the distance feature between virtual centroids as follows:

(10) $e_{i,j} = \dfrac{1}{\mathrm{Distance}}$

where $\mathrm{Distance}$ stands for the distance between virtual centroids $i$ and $j$. The closer the distance, the greater the value of $e_{i,j}$. Then, we construct the adjacency matrix $A \in \mathbb{R}^{|V| \times |V|}$ from the virtual centroids graph $G$.
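
A minimal sketch of building the inverse-distance adjacency matrix of Eq. (10) follows. The text does not state which distance is used or how the diagonal is handled, so the haversine distance in kilometers and a zero diagonal are assumptions made here for illustration.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def inverse_distance_adjacency(centroids):
    """Symmetric matrix with entries 1 / distance and a zero diagonal."""
    n = len(centroids)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = haversine_km(centroids[i], centroids[j])
            A[i][j] = A[j][i] = 1.0 / dist if dist > 0 else 0.0
    return A


# Three toy centroids given as (lon, lat); the first two are close together.
A = inverse_distance_adjacency([(139.70, 35.66), (139.71, 35.66), (139.80, 35.70)])
```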

Second, the encoder of VGAE is denoted by $Z = f(X, A)$, where $X \in \mathbb{R}^{|V| \times d}$ is a randomly initialized node embedding matrix and $d$ is the size of the node embedding. $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the normalized adjacency matrix, where $D$ is the degree matrix of $A$. Here, $Z \in \mathbb{R}^{|V| \times d}$ is the output of the encoder, representing the node embeddings. $f$ calculates the mean $\mu$ and variance $\sigma$ with a two-layer GCN as follows:

(11) $\mu = \mathrm{GCN}_{\mu}(X, \tilde{A})$
(12) $\log \sigma = \mathrm{GCN}_{\sigma}(X, \tilde{A})$
(13) $\mathrm{GCN}(X, \tilde{A}) = \tilde{A}\,\mathrm{ReLU}(\tilde{A} X W_0)\,W_1$

where $W_0$ is shared by $\mathrm{GCN}_{\mu}$ and $\mathrm{GCN}_{\sigma}$, but $W_1$ is different.

Next, we sample $Z$ from the mean $\mu$ and variance $\sigma$ using reparameterization as follows:

(14) $Z = \mu + \epsilon\,\sigma$

where $\epsilon$ is the noise variable that follows a standard normal distribution.

Third, the decoder of VGAE uses the output $Z$ of the encoder to reconstruct the virtual centroids graph through the inner product as follows:

(15) $\hat{A} = \sigma(ZZ^{T})$

where $\hat{A}$ is the reconstructed adjacency matrix, and $\sigma(\cdot)$ in Eq. (15) is the logistic sigmoid function.

To better understand the spatial pretrained model, the pseudocodes are shown in Pseudocode 1.

Pseudocode 1. The spatial pretrained model
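
The body of Pseudocode 1 is not reproduced here. As a rough stand-in, the following sketch runs one forward pass through Eqs. (11) to (15) on a tiny graph using plain Python lists. It is not the authors' pseudocode: the weights are random rather than trained, the optimization loop is omitted, and all helper names are hypothetical.

```python
import math
import random


def matmul(A, B):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def normalize_adjacency(A):
    """Compute D^(-1/2) A D^(-1/2), where D is the degree matrix of A."""
    inv = [1.0 / math.sqrt(s) if s > 0 else 0.0 for s in (sum(row) for row in A)]
    n = len(A)
    return [[inv[i] * A[i][j] * inv[j] for j in range(n)] for i in range(n)]


def gcn(X, A_norm, W0, W1):
    """Two-layer GCN of Eq. (13): A_norm ReLU(A_norm X W0) W1."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(A_norm, matmul(X, W0))]
    return matmul(A_norm, matmul(hidden, W1))


def vgae_forward(X, A, W0, W1_mu, W1_sigma, rng):
    """Encode (Eqs. 11-12), sample (Eq. 14), and decode (Eq. 15)."""
    A_norm = normalize_adjacency(A)
    mu = gcn(X, A_norm, W0, W1_mu)            # W0 is shared by both heads
    log_sigma = gcn(X, A_norm, W0, W1_sigma)
    Z = [[m + rng.gauss(0.0, 1.0) * math.exp(s) for m, s in zip(mr, sr)]
         for mr, sr in zip(mu, log_sigma)]
    logits = matmul(Z, [list(col) for col in zip(*Z)])  # Z Z^T
    A_hat = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]
    return mu, log_sigma, Z, A_hat


# Toy graph with 3 virtual centroids and embedding size d = 2.
rng = random.Random(0)
A = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.2], [0.5, 0.2, 0.0]]


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


X, W0, W1_mu, W1_sigma = rand_matrix(3, 2), rand_matrix(2, 2), rand_matrix(2, 2), rand_matrix(2, 2)
mu, log_sigma, Z, A_hat = vgae_forward(X, A, W0, W1_mu, W1_sigma, rng)
```

After pretraining, the rows of the mean matrix (or of `Z`) would serve as the spatial embeddings of the virtual centroids that are looked up for each POI.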

4.4. Model training

4.4.1. Pretraining for spatial encoder

For VGAE, the training objective is to make the reconstructed graph as similar as possible to the original graph. The loss function is defined as follows:

(16) $L_{spatial} = L_{bce} + L_{KL}$

where $L_{bce}$ is the binary cross-entropy loss function that is used to measure the difference between the reconstructed graph and the original graph as follows:

(17) $L_{bce} = -\mathbb{E}_{q(Z \mid X, A)}\left[\log p(A \mid Z)\right]$

and $L_{KL}$ is the Kullback–Leibler (KL) divergence, which measures the difference between the node embedding distribution $q(Z \mid X, A)$ and the normal distribution $p(Z)$ as follows:

(18) $L_{KL} = \mathrm{KL}\left[q(Z \mid X, A)\,\|\,p(Z)\right]$
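
The two loss terms in Eqs. (16) to (18) can be evaluated as in the sketch below. It assumes adjacency targets scaled into [0, 1], a standard normal prior, the closed-form Gaussian KL term, and simple averaging over entries and nodes; the relative weighting of the two terms is not specified in the text and is an assumption here.

```python
import math


def spatial_loss(A, A_hat, mu, log_sigma, eps=1e-12):
    """Binary cross-entropy reconstruction loss plus Gaussian KL divergence."""
    n = len(A)
    bce = 0.0
    for i in range(n):
        for j in range(n):
            p = min(max(A_hat[i][j], eps), 1.0 - eps)  # avoid log(0)
            bce -= A[i][j] * math.log(p) + (1.0 - A[i][j]) * math.log(1.0 - p)
    bce /= n * n
    kl = 0.0
    for mu_row, ls_row in zip(mu, log_sigma):
        for m, s in zip(mu_row, ls_row):
            # KL(N(m, sigma^2) || N(0, 1)) with sigma = exp(s)
            kl += -0.5 * (1.0 + 2.0 * s - m * m - math.exp(2.0 * s))
    kl /= n
    return bce + kl


A = [[0.0, 1.0], [1.0, 0.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
perfect = spatial_loss(A, A, zeros, zeros)                        # ideal case
shifted = spatial_loss(A, A, [[1.0, 0.0], [0.0, 0.0]], zeros)     # mean moved away from 0
```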

4.4.2. Training for STILSAN

STILSAN is trained cyclically, and the number of training cycles for a sequence equals the length of its training data $l_{train}$. Each cycle uses the first $n$ POIs to predict the $(n+1)$-th POI until $n + 1 = l_{train}$. Therefore, there is only one ground truth per cycle. We take the final recommendation scores $r_{n+1,1}, r_{n+1,2}, \ldots, r_{n+1,|L|}$ as the output and the ground-truth POI $l_q$ as the expected output. For the positive sample $l_q$, we randomly sample $j$ POIs that do not appear in the historical sequence as negative samples. These POIs are defined as $N = \{N_1, N_2, \ldots, N_j\}$. The cross-entropy loss function of POI is defined as:

(19) $-\sum_{S_u \in S}\left[\log \sigma(r_{n+1,q}) + \sum_{N_i \in N} \log\left(1 - \sigma(r_{n+1,N_i})\right)\right]$
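
One training cycle of the loss in Eq. (19) can be sketched as follows. The sketch additionally excludes the ground-truth POI from the negatives, which is an assumption, and uses the softplus identity for numerical stability; the function names and toy scores are hypothetical.

```python
import math
import random


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_negatives(num_pois, history, target, j, rng):
    """Draw j distinct POIs that are neither in the history nor the target."""
    excluded = set(history) | {target}
    candidates = [p for p in range(num_pois) if p not in excluded]
    return rng.sample(candidates, j)


def step_loss(scores, target, negatives):
    """-log sigmoid(r_target) - sum over negatives of log(1 - sigmoid(r_neg))."""
    loss = softplus(-scores[target])      # equals -log(sigmoid(x))
    for neg in negatives:
        loss += softplus(scores[neg])     # equals -log(1 - sigmoid(x))
    return loss


rng = random.Random(0)
negatives = sample_negatives(num_pois=20, history=[1, 2, 3], target=4, j=5, rng=rng)
good = step_loss([5.0, -5.0, -5.0], target=0, negatives=[1, 2])
bad = step_loss([-5.0, 5.0, 5.0], target=0, negatives=[1, 2])
```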

5. Experiments

5.1. Experimental setup

5.1.1. Datasets

We use two real-world check-in datasets from New York City (NYC) and Tokyo (TKY), collected from 12 April 2012 to 16 February 2013 and provided by Yang et al. (2015). We filter out POIs visited fewer than 10 times and delete users who check in at fewer than 10 POIs in their sequence. Table 1 shows the number of users, POIs, and check-ins in each dataset after preprocessing. We take the first 80% of check-ins as the training set, the following 10% as the validation set, and the remaining check-ins as the test set.
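
The 80/10/10 split can be sketched as a chronological cut. The text does not say whether the split is applied per user or over the whole dataset, so the sketch below simply splits one time-ordered list of check-ins; the tuple layout and function name are assumptions.

```python
def chronological_split(checkins, train_pct=80, val_pct=10):
    """Split check-ins ordered by timestamp into train, validation, and test parts.

    checkins: list of (poi_id, coords, timestamp) tuples.
    Integer percentages avoid floating-point rounding at the cut points.
    """
    ordered = sorted(checkins, key=lambda c: c[2])
    n = len(ordered)
    a = n * train_pct // 100
    b = n * (train_pct + val_pct) // 100
    return ordered[:a], ordered[a:b], ordered[b:]


# Ten toy check-ins with shuffled timestamps.
toy = [(i, (0.0, 0.0), t) for i, t in enumerate([5, 3, 9, 1, 7, 2, 8, 0, 6, 4])]
train, val, test = chronological_split(toy)
```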

Table 1. Statistics of datasets.

5.1.2. Baselines

We select several representative models as the baselines from multiple subdivisions, including RNN-based model, self-attention-based model, category-aware model, and time-aware model. We summarize the baseline models and their characteristics as follows:

  • STRNN (Liu et al. 2016) extends the RNN method to take into account local temporal and spatial contexts.

  • LSTPM (Sun et al. 2020) is a long-term and short-term sequence model that uses a nonlocal network and geo-dilated RNN to model user long-term and short-term preferences.

  • CatDM (Yu et al. 2020) incorporates POI category and geographical influence.

  • SASRec (Kang and McAuley 2018) uses the self-attention mechanism to model a user’s historical behavior sequence and does not consider temporal and spatial factors.

  • RepeatNet (Ren et al. 2019) incorporates the repeat-explore mechanism into the GRU. RepeatNet implements the general session-based recommendation task without considering the spatial-temporal features.

  • STAN (Luo, Liu, and Liu 2021) is the first model to apply self-attention to POI recommendations, which considers spatio-temporal features.

5.1.3. Parameter and metrics

We set the maximum sequence length $m$ to 100, which represents the maximum sequence length that the model can handle. We observe that the median numbers of check-ins per user in the NYC and TKY datasets are 97 and 133, respectively. Therefore, it is reasonable to set the parameter $m$ to 100, which balances the impact of the truncation and padding operations in Section 4.1.1. There are two kinds of hyperparameters: (1) common hyperparameters that are shared by all models; (2) unique hyperparameters that are specific to our proposed STILSAN. For the common hyperparameters, we tune them on a simple self-attention network and then apply them to our model, which helps reduce the training burden and find the optimal hyperparameter settings quickly. The learning rate, drop-out rate, penalty coefficient, number of negative samples, and batch size are set to 0.001, 0.2, 0.02, 10, and 25, respectively. The POI, spatial, user, and other embedding dimensions are set to 40 and 25 in NYC and TKY, respectively. The unique hyperparameter in our model is the number of virtual centroids. We set the numbers of virtual centroids for TKY ($V_{TKY}$) and NYC ($V_{NYC}$) to 150 and 200, respectively. We discuss the choice of $V_{NYC}$ and $V_{TKY}$ for the different datasets and the corresponding influences in Section 5.3. We use the Adam optimizer to optimize our model. For the baseline models, we follow the optimal hyperparameter settings suggested by the authors. In the experiment, Recall@k and NDCG@k are used to evaluate the recommendation performance of the model. Recall@k evaluates whether the correct POI appears in the top k recommendation results. Normalized Discounted Cumulative Gain (NDCG) is a measure of ranking quality. We set k = 5, and 20 in the experiment.
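
Because each prediction has a single ground-truth POI, both metrics reduce to simple per-prediction formulas, sketched below. The function names are illustrative; the final reported values would be averages of these quantities over all test predictions.

```python
import math


def recall_at_k(ranked, target, k):
    """1.0 if the ground-truth POI is among the top-k recommendations, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k):
    """NDCG@k with one relevant item: 1 / log2(rank + 1) for 1-based rank."""
    for position, poi in enumerate(ranked[:k]):
        if poi == target:
            return 1.0 / math.log2(position + 2)
    return 0.0


ranked = [7, 3, 9, 4, 1]   # POI IDs sorted by descending recommendation score
hit_top1 = recall_at_k(ranked, target=3, k=1)
hit_top5 = recall_at_k(ranked, target=3, k=5)
gain = ndcg_at_k(ranked, target=3, k=5)
```

With one relevant item the ideal DCG is 1, so no separate normalization step is needed.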

5.2. Overall performance

On the TKY and NYC datasets, our proposed model outperforms all baseline models on recall, as shown in Table 2. Compared with the best baseline, STILSAN improves Recall@5 by 5.5% and 8% on the two datasets, respectively. This improvement is significant. First, based on the experimental results, we find that all self-attention-based models are better than non-self-attention models. The reason is that self-attention can better model long-term dependence than RNN. LSTPM is better than STRNN because LSTPM considers the impact of user short-term and long-term preferences on the next POI. Meanwhile, we observe that STRNN performs worst. As STRNN is an RNN-based POI recommendation model, the reason may lie in its insufficient modeling of spatio-temporal information. Although CatDM additionally introduces POI category information, it is lower than STILSAN in all indicators. The possible reason is that the user revisiting behavior is more obvious during check-in activities. In addition, we also observe that a large number of check-ins in the two datasets are concentrated in specific POI categories, such as home and subway. This uneven data distribution makes it difficult to model category preferences. RepeatNet achieves considerable performance because it models repeat behavior. However, due to the lack of spatio-temporal modeling and the insufficient ability of the GRU to capture long-term preferences, RepeatNet does not achieve optimal performance. Moreover, we conclude from comparative experiments that STAN is better than STRNN and LSTPM, but its recommendation performance is lower than that of STILSAN. For example, in terms of Recall@k, STILSAN is approximately 5.8%–8% higher than STAN on the NYC dataset and 4.5%–5.5% higher than STAN on the TKY dataset. We attribute the improvement in recommendation performance to the embedding module, preference-intention module, and spatial encoder module that we designed.

Table 2. Comparison of STILSAN with baselines in Recall@k and NDCG@k.

5.3. Stability study

The K-means algorithm requires the number of virtual centroids to be set in advance, so it is necessary to study how this number affects model performance. We set the number of virtual centroids VNYC and VTKY to 50, 100, 150, 200, and 250 while keeping the other optimal hyperparameters unchanged. Figure 3 presents the changes in model performance as the number of virtual centroids changes. First, the performance on the two datasets differs greatly. This is due to the difference in the spatial distribution and density of POIs between the TKY and NYC datasets. The TKY dataset contains many more POIs than the NYC dataset, while its spatial extent is smaller; the POI density in the TKY dataset is therefore greater than that in the NYC dataset. Second, the model shows low performance when VTKY and VNYC are small. This is because when the number of virtual centroids is small, POIs within a large area are aggregated into one cluster, which makes the spatial features of distant POIs hard to distinguish. Third, the performance of the model peaks when VTKY = 150 and VNYC = 200. As the number of virtual centroids increases further, the performance begins to decline. This is because when the number of virtual centroids is large, adjacent POIs are separated into different clusters, which makes the model underfit and yield suboptimal performance.
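The trade-off discussed above, where few centroids give large and coarse clusters and many centroids split neighbouring POIs, can be explored with a plain K-means sweep. The sketch below is only an illustration: it assumes POIs are given as 2-D coordinate pairs and uses squared Euclidean distance on those pairs, and the names `kmeans`, `mean_cluster_radius`, and `sweep_centroids` are hypothetical. It is not the spatial encoder itself.

```python
import random


def kmeans(points, n_centroids, n_iters=20, seed=0):
    """Plain Lloyd's K-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, n_centroids)  # initialise from distinct POIs
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each POI joins its nearest virtual centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(n_centroids),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(n_centroids):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


def mean_cluster_radius(points, centroids, labels):
    """Average distance from a POI to its assigned virtual centroid."""
    total = 0.0
    for (x, y), c in zip(points, labels):
        cx, cy = centroids[c]
        total += ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
    return total / len(points)


def sweep_centroids(points, candidates):
    """Map each candidate number of centroids to the mean cluster radius."""
    results = {}
    for v in candidates:
        centroids, labels = kmeans(points, v)
        results[v] = mean_cluster_radius(points, centroids, labels)
    return results
```

A small mean radius indicates fine-grained clusters and a large one indicates coarse clusters; the radius alone does not identify the best setting, which in our experiments is chosen by recommendation performance.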

Figure 3. The impact of the number of clusters.

5.4. Ablation study

5.4.1. Embedding combinations and forms

The embedding module is considered in almost all recommendation models. However, few studies have specifically compared model performance across different embedding combinations and forms. Here, embedding forms refer to the invasive method and the noninvasive method. To find the optimal embedding combination and form, we conduct a comparison experiment. We expect it to provide general heuristics for other self-attention-based models for next POI recommendation. We define the Original Model (OM) as the model without the preference-intention module. The OM only uses POI ID embedding in the embedding module. We consider several variants of the embedding module based on the OM, as shown in Table 3. For example, variant (VIII) concatenates spatial embedding and user embedding with POI ID embedding by the noninvasive method. Table 4 shows the experimental results for all variants of the embedding module.

  • Finding 1: The noninvasive method tends to perform better than the invasive method. For example, comparing variants (I, II), the noninvasive method achieves improvements of 1.2%–2.8% and 2.4%–3.7% over the invasive method on TKY and NYC, respectively. Comparing variants (III, IV) and variants (V, VI), the noninvasive method also outperforms the invasive method. The reasons have been explained in Section 4.1.1.

  • Finding 2: Adding user embedding and time embedding by the invasive method degrades model performance. By comparing OM with variants (I, III), we observe that directly fusing other embeddings into the POI ID embedding leads to performance degradation. A similar conclusion has been reported by Lian et al. (Citation2020), who propose the GeoSAN model, whose embedding module adopts the invasive method. They also demonstrated that adding user embedding or time embedding did not lead to performance improvement. We observe that adding user embedding by the invasive method has the greatest negative impact on performance, while adding spatial embedding by the invasive method has a minimal negative impact. A possible reason is that the spatial embedding represents an inherent property of the POI.

  • Finding 3: Adding user embedding and spatial embedding by the noninvasive method yields a significant performance boost. Although user embedding performs poorly with the invasive method, it performs well with the noninvasive method. Adding user embedding by the noninvasive method not only avoids the mismatch of feature spaces but also enables personalized modeling of check-in sequences. We find that it produces an effect similar to that of the revisiting mechanism. Adding spatial embedding by the noninvasive method also brings a clear improvement. The reason might be that POIs that are spatially close to the POIs of historical check-ins are given higher attention scores in the self-attention model.

  • Finding 4: The optimal embedding combination is ASUEN (VIII). Comparing variants (VIII, IX), we do not suggest adding time embedding to our combination. Comparing variants (VII, VIII), we again observe that the performance gap between the noninvasive and invasive methods is significant. Our embedding combination by the noninvasive method dramatically improves the accuracy of recommendation.
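To make the two embedding forms compared in Findings 1 to 4 concrete, the following minimal sketch contrasts them on plain Python lists. It assumes that the invasive method sums the side embeddings element-wise into the POI ID embedding, and that the noninvasive method concatenates them, as described for variant (VIII); Section 4.1.1 gives the exact definitions. The function names are hypothetical.

```python
def invasive_fuse(poi_vec, side_vecs):
    """Assumed invasive form: side embeddings are added into the POI ID embedding."""
    fused = list(poi_vec)
    for vec in side_vecs:
        if len(vec) != len(poi_vec):
            raise ValueError("invasive fusion needs equal embedding dimensions")
        fused = [a + b for a, b in zip(fused, vec)]
    return fused


def noninvasive_fuse(poi_vec, side_vecs):
    """Noninvasive form: concatenation, so each embedding keeps its own coordinates."""
    fused = list(poi_vec)
    for vec in side_vecs:
        fused.extend(vec)
    return fused
```

In the summed form the POI ID coordinates can no longer be separated from the user or spatial signal, whereas in the concatenated form they remain intact in the first positions of the fused vector, which is one way to read the feature-space mismatch mentioned in Finding 3.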

Table 3. The variants of embedding module.

Table 4. Comparison of the OM with their variants in Recall@k and NDCG@k.

5.4.2. Preference-intention module and spatial encoder module

To verify the effectiveness of the key parts of our model, namely the preference-intention module and the spatial encoder module, we conduct an ablation study. Our base model is ASUEN (VIII), which has the optimal embedding combination and form. We consider the following variants of the model:

  • X. Add Preference-intention Module (APM): Based on the OM, we add the preference-intention module.

  • XI. STILSAN: The variant (XI) is the complete model we proposed called STILSAN, which adds the preference-intention module to ASUEN.

  • XII. Adopt Randomly Initialized Embedding (ARIE): Based on STILSAN (XI), we replace pretrained spatial embedding with randomly initialized embedding.

  • XIII. Intention Learning Self-Attention Network (ILSAN): Based on STILSAN (XI), we remove the spatio-temporal interval embedding so that the revisiting mechanism does not consider spatio-temporal relations.

Table 5 shows the results of the ablation study. We obtain the following findings.

  • Finding 5: The preference-intention module leads to performance improvement. Comparing models (OM, APM) and variants (ASUEN, STILSAN), the models with the preference-intention module all show higher performance. We conclude that the preference-intention module is beneficial and plays an indispensable role in the model, because it captures the human mobility pattern of revisiting POIs. Comparing ILSAN (XIII) and STILSAN (XI), we find that the model with the spatio-temporal revisiting mechanism performs better than the model with the plain revisiting mechanism. This is because the spatio-temporal intervals are explicitly incorporated into the preference-intention module, which helps to better mine user mobility patterns for revisiting.

  • Finding 6: The spatial encoder module boosts performance. Take Recall@20 as an example. STILSAN outperforms APM (X) by about 2.5% and 1.3% on TKY and NYC, respectively. Meanwhile, STILSAN outperforms ARIE (XII) by about 0.98% and 1.7% on TKY and NYC, respectively. Therefore, spatial embedding that is aware of the spatial clustering phenomenon and of distance is crucial for improving recommendation performance. Randomly initialized embedding cannot effectively represent the spatial distance features of POIs.

Table 5. Comparison of the base model with their variants in Recall@k and NDCG@k.

5.5. Interpretability study

To understand the mechanism of the preference-intention module, we conduct an interpretability study to illustrate the necessity and correctness of the module. Figure 4 is based on a slice of a real user trajectory. The check-in sequence of this trajectory is shown in Figure 4(a). POI l6 and POI l1 are the ground truths, visited by the user at t7 and t8, respectively. We visualize the trajectory of the user on OpenStreetMap. Due to the large number of candidate POIs, we are unable to show them all on the map. Therefore, we select some typical candidate POIs, including historical and non-historical POIs. POI l1 to POI l5 are historical POIs, and POI l6 to POI l10 are non-historical POIs, i.e. POIs that the user has not visited. Moreover, we collect the recommendation probabilities of intention and preference generated by STILSAN when the model reaches convergence. For better visualization, we standardize the recommendation probabilities of these 10 candidate POIs and take the standardized results as the scores. Then, we display the preference scores and intention scores using vertical bars in the lower part of Figure 4. The height of each vertical bar denotes the size of the recommendation score, i.e. a candidate POI with a higher vertical bar is more likely to be visited by the user at the next moment.

Figure 4. The interpretability study of STILSAN. (a) A slice of a real user trajectory example. (b) The recommendation scores for 10 candidate POIs at time t7. (c) The recommendation scores for 10 candidate POIs at time t8. A POI with a higher vertical bar is more likely to be visited by the user at the next moment.

By querying the exact GPS coordinates, we find that POI l1 and POI l5 are different gyms. POI l2, l4, and l6 are a drink shop, a delicatessen, and an Italian restaurant, respectively, which are all related to dining. The rest of the POIs belong to other categories. It is important to note that POI categories are only used here to better understand the user movement scenarios. The proposed STILSAN emphasizes the intention to revisit historical POIs, not the intention toward POI categories. We can observe that the user often goes to POI l1, which is a gym, around noon, as shown in Figure 4(a). After a trip to the gym, the user is more likely to go to a dining-related location. We can observe from the user trajectory segment that the user explored different restaurants. The vertical bars in Figure 4(b) show that POI l6, the ground truth, gets a higher preference score and is recommended preferentially. At the same time, the user usually revisits the gym he has visited before. For example, we can observe that the user has the travel intention to go to POI l1 at the corresponding time. The vertical bars in Figure 4(c) show that POI l1, the ground truth, gets a high intention score and a low preference score. The score is not purely intention for two reasons: preference and intention are often intertwined and both are reflected in human mobility, and it is difficult for the model to accurately distinguish the user's movement pattern at each step. By comparison, we can observe that the proportion of the intention score in Figure 4(c) is larger than that in Figure 4(b). This means that the model successfully identifies that the user's movement behavior is more inclined to preference at time t7 and more inclined to intention at time t8. These examples further illustrate that STILSAN can capture both the intention and the preference patterns of user mobility and can automatically recognize different scenarios.
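The score construction used in this study can be sketched as follows. The sketch assumes that "standardizing" means dividing each probability by the total over the selected candidates, so that all preference and intention scores sum to 1; other normalizations are possible. The function names and the dictionary layout are hypothetical, and any numbers passed to them in an example are illustrative rather than values read from Figure 4.

```python
def standardize_scores(preference_probs, intention_probs):
    """Rescale candidate probabilities so that all scores together sum to 1.

    Both arguments map a POI identifier to a probability; a POI may appear
    in one or both dictionaries.
    """
    total = sum(preference_probs.values()) + sum(intention_probs.values())
    if total <= 0:
        raise ValueError("at least one probability must be positive")
    pref_scores = {poi: p / total for poi, p in preference_probs.items()}
    int_scores = {poi: p / total for poi, p in intention_probs.items()}
    return pref_scores, int_scores


def intention_share(pref_scores, int_scores):
    """Fraction of the total score mass that comes from the intention part."""
    int_mass = sum(int_scores.values())
    return int_mass / (int_mass + sum(pref_scores.values()))
```

Comparing `intention_share` between two time steps gives a single number for the comparison made between Figure 4(b) and Figure 4(c): a larger share indicates a step that the model treats as more intention-driven.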

6. Conclusions

In this paper, we propose a model called STILSAN for the next POI recommendation task. In this model, spatio-temporal intention learning is integrated into the self-attention network, so that the user mobility pattern of revisiting POIs can be modeled explicitly. This work also puts forward a new spatial encoder to model spatial features, so that the spatial clustering phenomenon and distance can be considered in modeling. Experimental results indicate that STILSAN achieves more competitive recommendation performance than existing state-of-the-art models on two real-world datasets. Besides, different embedding combinations and forms are evaluated to examine their influence on the recommendation performance. The evaluation reveals that adding spatial embedding and user embedding by the noninvasive method is the optimal embedding combination and form. In addition, through the ablation study, we show that the preference-intention module and the spatial encoder module improve recommendation performance significantly. One shortcoming of our model is that it has difficulty dealing with the extreme cold-start issue in next POI recommendation. In the future, more context information, such as POI category and social network, is worth incorporating into our model to mitigate the cold-start problem.

Acknowledgements

The authors appreciate the efforts of the anonymous reviewers and the editor. The authors thank Jian Li, Kai Yan, and Fan Yu for their contributions to the revision of the manuscript. The authors thank Chongqing Changan Automobile Co., Ltd., Dongfeng Motor Corporation, and Dongfeng Changxing Tech. Co., Ltd. for their technical guidance.

Disclosure statement

No potential conflict of interest was reported by the authors.

Data availability statement

The data that support the findings in this study, as well as the code to process it with the methods presented in this paper, are available on GitHub at https://github.com/leehommlee/STILSAN.

Additional information

Funding

This work is supported by Chongqing Technology Innovation and Application Development Project [grant number cstc2021jscx-dxwtBX0023], and funding from Chongqing Changan Automobile Co., Ltd., Dongfeng Motor Corporation, and Dongfeng Changxing Tech Co., Ltd.

Notes on contributors

Hao Li

Hao Li is a PhD candidate at the School of Remote Sensing and Information Engineering, Wuhan University. His research interest is POI recommendation and knowledge graph.

Peng Yue

Peng Yue is a Professor at the School of Remote Sensing and Information Engineering, Wuhan University. He serves as the deputy dean at the School of Remote Sensing and Information Engineering, the director at the Hubei Province Engineering Center for Intelligent Geoprocessing (HPECIG), and the director at the Institute of Geospatial Information and Location Based Services (IGILBS), Wuhan University. His research interests include Earth science data and information systems, Web GIS and GIServices, and GIS software and engineering.

Shangcheng Li

Shangcheng Li is an engineer at Dongfeng Changxing Tech. Co., Ltd. His research interest is travel data mining.

Chenxiao Zhang

Chenxiao Zhang is an associate research fellow in the School of Remote Sensing and Information Engineering at Wuhan University. He received a PhD from the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University. His research interests include geographic information system, remote sensing, and deep learning.

Can Yang

Can Yang is a postdoctoral researcher in the School of Remote Sensing and Information Engineering, Wuhan University. He received a PhD from KTH, the Royal Institute of Technology. His research interests include trajectory data mining and movement analysis.

References

  • Ankerst, M., M. M. Breunig, H. -P. Kriegel, and J. Sander. 1999. “OPTICS: Ordering Points to Identify the Clustering Structure.” In ACM SIGMOD International Conference on Management of Data, Philadelphia, USA, May 31, 49–60.
  • Aslam, N., M. Ibrahim, T. Cheng, H. Chen, and Y. Zhang. 2021. “ActivityNet: Neural Networks to Predict Public Transport Trip Purposes from Individual Smart Card Data and POIs.” Geo-Spatial Information Science 24 (4): 711–721. doi:10.1080/10095020.2021.1985943.
  • Chang, B., G. Jang, S. Kim, and J. Kang. 2020. “Learning Graph-Based Geographical Latent Representation for Point-Of-Interest Recommendation.” In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, Ireland, October 19-23, 135–144.
  • Cheng, C., H. Yang, M. R. Lyu, and I. King. 2013. “Where You Like to Go Next: Successive Point-Of-Interest Recommendation.” In Twenty-Third international joint conference on Artificial Intelligence, Beijing, China, August 3-9, 2605–2611.
  • Chung, J., C. Gulcehre, K. Cho, and Y. Bengio. 2014. “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.” arXiv preprint: 1412.3555. doi:10.48550/arXiv.1412.3555.
  • Ester, M., H. Kriegel, J. Sander, and X. Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining, Portland, USA, August 2-4, 226–231.
  • Feng, S., X. Li, Y. Zeng, G. Cong, Y. Chee, and Q. Yuan. 2015. “Personalized Ranking Metric Embedding for Next New POI Recommendation.” In Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, July 25-31, 2069–2075.
  • Guo, Q., and J. Qi. 2020. “SANST: A Self-Attentive Network for Next Point-Of-Interest Recommendation.” arXiv preprint: 2001.10379. doi:10.48550/arXiv.2001.10379.
  • He, J., J. Qi, and K. Ramamohanarao. 2020. “TimeSan: A Time-Modulated Self-Attentive Network for Next Point-of-Interest Recommendation.” In 2020 International Joint Conference on Neural Networks, Glasgow, United Kingdom, July 19-24, 1–8.
  • Huang, L., Y. Ma, Y. Liu, and K. He. 2020. “DAN-SNR: A Deep Attentive Network for Social-Aware Next Point-Of-Interest Recommendation.” ACM Transactions on Internet Technology 1 (2020): 1–27. doi:10.1145/3430504.
  • Huang, L., Y. Ma, S. Wang, and Y. Liu. 2021. “An Attention-Based Spatiotemporal LSTM Network for Next POI Recommendation.” IEEE Transactions on Services Computing 14 (6): 1585–1597. doi:10.1109/TSC.2019.2918310.
  • Kang, W., and J. McAuley. 2018. “Self-Attentive Sequential Recommendation.” IEEE International Conference on Data Mining, Singapore, November 17-20, 197–206.
  • Kipf, T. N., and M. Welling. 2016. “Variational Graph Auto-Encoders.” arXiv preprint: 1611.07308. doi:10.48550/arXiv.1611.07308.
  • Lian, D., Y. Wu, Y. Ge, X. Xie, and E. Chen. 2020. “Geography-Aware Sequential Location Recommendation.” In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, Virtual Event, USA, July 6-10, 2009–2019.
  • Liu, Q., S. Wu, L. Wang, and T. Tan. 2016. “Predicting the Next Location: A Recurrent Model with Spatial and Temporal Contexts.” In Thirtieth AAAI conference on artificial intelligence, Phoenix, Arizona, USA, February 12–17, 194–200.
  • Liu, X., Y. Yang, Y. Xu, F. Yang, Q. Huang, and H. Wang. 2022. “Real-Time POI Recommendation via Modeling Long- and Short-Term User Preferences.” Neurocomputing 467: 454–464. doi:10.1016/j.neucom.2021.09.056.
  • Li, M., R. Westerholt, and A. Zipf. 2018. “Do People Communicate About Their Whereabouts? Investigating the Relation Between User-Generated Text Messages and Foursquare Check-In Places.” Geo-Spatial Information Science 21 (3): 159–172. doi:10.1080/10095020.2018.1498669.
  • Luo, Y., Q. Liu, and Z. Liu. 2021. “STAN: Spatio-Temporal Attention Network for Next Location Recommendation.” In Proceedings of the 2021 world wide web conference, Ljubljana, Slovenia, April 19-23, 2177–2185.
  • Nitu, P., J. Coelho, and P. Madiraju. 2021. “Improvising Personalized Travel Recommendation System with Recency Effects.” Big Data Mining and Analytics 4 (3): 139–154. doi:10.26599/BDMA.2020.9020026.
  • Ren, P., Z. Chen, J. Li, Z. Ren, J. Ma, and M. Rijke. 2019. “RepeatNet: A Repeat Aware Neural Recommendation Machine for Session-Based Recommendation.” In Proceedings of the AAAI Conference on Artificial Intelligence, Hawaii, USA, January 27-February 1, 4806–4813.
  • Rendle, S. 2010. “Factorization Machines.” IEEE International Conference on Data Mining, Nevada, USA, July 12-15, 995–1000.
  • Rendle, S., C. Freudenthaler, and L. Schmidt. 2010. “Factorizing Personalized Markov Chains for Next-Basket Recommendation.” In Proceedings of the 19th international conference on World wide web, Raleigh North Carolina, USA, April 26-30, 811–820.
  • Sun, K., T. Qian, T. Chen, Y. Liang, and Q. Nguyen. 2020. “Where to Go Next: Modeling Long- and Short-Term User Preferences for Point-Of-Interest Recommendation.” In Proceedings of the AAAI Conference on Artificial Intelligence, New York, USA, February 7–12, 214–221.
  • Tobler, W. R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46 (sup1): 234–240. doi:10.2307/143141.
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. “Attention is All You Need.” In Advances in Neural Information Processing Systems, Long Beach, CA, USA, December 4-9. 5998–6008.
  • Wang, X., G. Sun, X. Fang, J. Yang, and S. Wang. 2022a. “Modeling Spatio-Temporal Neighbourhood for Personalized Point-Of-Interest Recommendation.” In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, July 23-29, 3530–3536.
  • Wang, Z., Y. Zhu, H. Liu, and C. Wang. 2022b. “Learning Graph-Based Disentangled Representations for Next POI Recommendation.” In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11–15, 1154–1163.
  • Wu, S., Y. Zhang, C. Gao, K. Bian, and B. Cui. 2020. “GARG: Anonymous Recommendation of Point‑of‑interest in Mobile Networks by Graph Convolution Network.” Data Science and Engineering 5 (4): 433–447. doi:10.1007/s41019-020-00135-z.
  • Yang, S., J. Liu, and K. Zhao. 2022. “GETNext: Trajectory Flow Map Enhanced Transformer for Next POI Recommendation.” In Proceedings of the 45th International ACM SIGIR Conference on research and development in information retrieval, Madrid, July 11-15, 1144–1153.
  • Yang, D., D. Zhang, V. W. Zheng, and Z. Yu. 2015. “Modeling User Activity Preference by Leveraging User Spatial Temporal Characteristics in LBSNs.” IEEE Transactions on Systems, Man, and Cybernetics: Systems 45 (1): 129–142. doi:10.1109/TSMC.2014.2327053.
  • Yu, F., L. Cui, W. Guo, X. Lu, Q. Li, and H. Lu. 2020. “A Category-Aware Deep Model for Successive POI Recommendation on Sparse Check-In Data.” In Proceedings of the web conference, Ljubljana, Slovenia, April 19-23, 1264–1274.
  • Zhao, P., A. Luo, Y. Liu, F. Zhuang, J. Xu, Z. Li, V. Sheng, and X. Zhou. 2020a. “Where to Go Next: A Spatio-Temporal Gated Network for Next POI Recommendation.” IEEE Transactions on Knowledge and Data Engineering 34 (5): 2512–2524. doi:10.1109/TKDE.2020.3007194.
  • Zhao, K., Y. Zhang, H. Yin, J. Wang, K. Zheng, X. Zhou, and C. Xing. 2020b. “Discovering Subsequence Patterns for Next POI Recommendation.” In Proceedings of the Twenty-Ninth international joint conference on artificial intelligence, Yokohama, Japan, January 7-15, 3216–3222.
  • Zhao, P., H. Zhu, Y. Liu, Z. Li, J. Xu, and V. S. Sheng. 2018. “Where to Go Next: A Spatio-Temporal LSTM Model for Next POI Recommendation.” arXiv preprint: 1806.06671. doi:10.48550/arXiv.1806.06671.
  • Zhong, T., S. Zhang, F. Zhou, K. Zhang, G. Trajcevski, and J. Wu. 2020. “Hybrid Graph Convolutional Networks with Multi-Head Attention for Location Recommendation.” World Wide Web 23 (6): 3125–3151. doi:10.1007/s11280-020-00824-9.
  • Zhu, Y., H. Li, Y. Liao, B. Wang, Z. Guan, H. Liu, and D. Cai. 2017. “What to Do Next: Modeling User Behaviors by Time-LSTM.” In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 19–25, 3602–3608.