585
Views
0
CrossRef citations to date
0
Altmetric
Research Article

Length-squared L-function for identifying clustering pattern of network-constrained flows

ORCID Icon, , ORCID Icon, , , , ORCID Icon, ORCID Icon & show all
Pages 4191-4211 | Received 02 Mar 2023, Accepted 24 Sep 2023, Published online: 11 Oct 2023

ABSTRACT

The network-constrained flow is defined as the movement between two locations along road networks, such as the residence-workplace flow of city dwellers. Among flow patterns, clustering (i.e. the origins and destinations are aggregated simultaneously) is one of the cities’ most common and vital patterns, which assists in uncovering fundamental mobility trends. The existing methods for detecting the clustering pattern of network-constrained flows do not consider the impact of road network topology complexity on detection results. They may underestimate the flow clustering between networks of simple topology (roads with simpler shapes and fewer links, e.g. straight roads) but with high network intensity (i.e. flow number per network flow space), and determining the actual scale of an observed pattern remains challenging. This study develops a novel method, the length-squared L-function, to identify clustering patterns of network-constrained flows. We first use the L-function and its derivative to examine the clustering scales. Further, we calculate the local L-function to ascertain the locations of the clustering patterns. In synthetic and practical experiments, our method can identify flow clustering patterns of high intensities and reveal the residents’ main travel behavior trends with taxi OD flows, providing more reasonable suggestions for urban planning.

1. Introduction

Spatial movement, such as population migration, economic mobility, and disease spread, can be naturally abstracted into a simple origin-to-destination (OD) model called geographical flow (hereafter, flow) (Guo et al. Citation2012; Zhu and Guo Citation2014; Zhao et al. Citation2020). Flow can be classified into two types concerning space. One is free flow, whose origin and destination can be located anywhere within the planar (see a), e.g. the OD flow of a typhoon. The other is constrained flow, whose origin and destination can be located only at restricted locations (see b). Many social and economic activities in urban areas are subject to location restrictions imposed by road structures; thus, the network-constrained flow can better depict the movement of goods and people in cities than the free flow. A network-constrained flow is noted as (XO,YO,XD,YD), it can be treated as a point in a network-constrained flow space (or network flow space), which is seen as the Cartesian product of two networks (i.e. the O network and D network) (Gao et al. Citation2018; Jiang et al. Citation2022). Studying how people and things move within urban networks and analysing how these flows are distributed across space is crucial for gaining insights into spatial interactions and human activities.

Figure 1. The clustering patterns in free flow space (located in yellow planar area pairs) and in network flow space (located in yellow network segment pairs). (a) In free flow space, cluster ① has higher intensity (flow number per flow space) than ②. (b) In network flow space, the flow cluster ③, which is from and to complex roads, has a lower intensity than cluster ④, which is between the simple roads.

Figure 1. The clustering patterns in free flow space (located in yellow planar area pairs) and in network flow space (located in yellow network segment pairs). (a) In free flow space, cluster ① has higher intensity (flow number per flow space) than ②. (b) In network flow space, the flow cluster ③, which is from and to complex roads, has a lower intensity than cluster ④, which is between the simple roads.

Clustering (or aggregation) is one of the most significant flow patterns. Identifying flow clustering in network flow space is essential in quantifying urban research. For instance, the cluster of transportation flows with a small scale might represent the strong traffic demand between the two locations, which may help to customize a more optimized traffic plan (Chen et al. Citation2013); detecting aggregated stolen-recovery flow of crimes would provide a perspective for understanding the scope and pattern of criminal events, thus achieving better crime prevention (Tao and Thill Citation2016); examining the variation of aggregated human flow distribution before and after extreme weather events could analyse the impact of the events on the residents’ life (Chen et al. Citation2020).The size of the aggregation is referred to as the clustering scale (or aggregation scale); it is the basis for identifying the underlying clusters (Kiskowski, Hancock, and Kenworthy Citation2009; Shu et al. Citation2021). The clustering scale is usually measured based on the radius of the cluster, namely the distance between the flow at the center of the cluster and the flow farthest from it; this kind of radius-based scale is also called the planar scale (Fang et al. Citation2023). For example, the planar scales of the four clusters in are r under the maximum distance of flows (i.e. the maximum value of the distance between O points and the distance between D points of the two flows). Similar to detecting aggregated free flows, if the number of flows within the distance r of a particular flow is more than that expected under completely random assumption, clustering pattern centered at this flow is considered to be significant, and the corresponding planar scale is r. For instance, the flow sets in can be recognized as clustering patterns centered at central flows (colored in red) (Shu et al. Citation2021).

Progresses have been achieved in identifying free flow clustering by extending the existing methods of detecting point clustering in planar. Existing methods can be grouped into three categories (Song et al. Citation2019). The first is the hierarchical clustering method proposed for cluster identification and visualization (Doantam et al. Citation2005; Lee, Han, and Whang Citation2007; Wood, Dykes, and Slingsby Citation2010; Adrienko and Adrienko Citation2011). This approach can be used to classify flows into clusters with multiple scales according to the parameters set for cluster numbers (Zhu and Guo Citation2014). However, the optimal parameter selection is worth further studying, and even an outlier flow would be included in a certain cluster. The second is the density-based method (Pei et al. Citation2015; Nanni and Pedreschi Citation2006; Zhu et al. Citation2013), which is typically developed by altering the distance, density, and reachability using the density-based spatial clustering of applications with noise (DBSCAN) algorithm. DBSCAN can detect arbitrarily shaped clusters and reject outlier flows. However, determining the parameters MinPts and ε is realized in an interactive way. The third method, namely the spatial statistic-based method aiming to analyse heterogeneity, can overcome the previous problems to a certain extent. It extended existing spatial statistics of point data to flow data, such as Moran’s I (Liu, Tong, and Liu Citation2015) and Getis-Ord G statistics (Berglund and Karlström Citation1999). Though this kind of method can be seen as a nonparametric method, it mainly focuses on the spatial association of proximate flows. In conclusion, all the methods mentioned above may suffer the same weakness: they are not concerned with the scale characteristics of clustering. So the existing methods cannot accurately estimate the intensity of the clustering pattern, which is the key to pattern detection.

Among existing spatial statistic-based methods for free flows, the flow K-function is regarded as the most effective one. K-function method is able to detect clustering patterns and examine the scale properties (Tao and Thill Citation2016). In the K-function family, the incremental K-function was proposed to detect the clustering scale (Yamada and Thill Citation2007; Tao, Thill, and Yamada Citation2015). Further, the flow K-function was normalized by its expected value under CSR to flow L-function, which further magnifies the difference between observed distribution and random distribution. L-function’s first derivative can effectively reveal the clustering scale without manually setting parameters (Sterner, Ribic, and Schatz Citation1986; Haase Citation1995; Shu et al. Citation2021). Though the K-function for network-constrained flows (network flow K-function) (Kan, Kwan, and Tang Citation2021) was proposed based on studies about the K-function for network-constrained points (Yamada and Thill Citation2004; Lu and Chen Citation2007; Okabe and Yamada Citation2001), it does not consider the effect of network topology complexity. This defect makes it unable to detect the flow clustering pattern of high intensity in network flow space.

To detect clustering patterns with high intensity in the network, the existing methods still have some limitations: First, the network flow K-function detects clustering patterns under the radius-based planar scale originally designed for planar space, leading to the under or over-estimation of flow intensity in networks (Fang et al. Citation2023). This is because the planar scale r indicates the same spatial size in planar flow space (see the yellow planar area pairs in a), but the spatial size in network flow space indicated by this scale is different and is dependent on the complexity of the network topology (see the yellow network segment pairs in b). Specifically, the flow set located at road networks with complex topology (cluster ③ in b) has larger flow space and thus has the chance to possess more flows at the scale r than the one between simple roads (cluster ④ in b). So, the former flow set is more likely to be identified as a clustering pattern, even though the latter one has a higher intensity in networks. Second, flaws exist in the definition of the flow volume, i.e. the size of the network flow space. Considering that the flow space is the Cartesian product of two networks, simply measuring the size of the network flow space by the length of the network where the flows are located is broad and inaccurate. This definition needs to be rectified for a sounder mathematical foundation for network-constrained flows. Third, although there exist methods of detecting the clustering scale in planar space using the incremental K-function or the first derivative of the L-function, these methods have yet to be extended to determine the clustering scale of the network-constrained flow.

To overcome the aforementioned challenges, we proposed the length-squared K and L-function inspired by the idea of the length-based network scale (Fang et al. Citation2023). Unlike the radius-based planar scale used in the existing methods, where the clustering scale is represented by the radius of the cluster, the network scale considers the network lengths (i.e. the length of an imaginary straight line that is extended from the network) of the O and D points ranges. For example, the clustering scales under the network scale for cluster ③ and ④ in b should be related to 4r and 2r, respectively. Hence, the methods based on the network scale would obtain the actual spatial size, helping detect the true-to-life intensity of the flow set.

In the rest of this paper, we correct the defects in the concepts of network flow space, helping define the K-function under the network scale, namely the length-squared K-function. Then, the expected value of this K-function under complete spatial randomness (CSR) is derived, which is the base of the length-squared L-function. The length-squared L-function could determine clustering patterns at a series of scales, its first derivative can be used to detect the scales of clusters, and its local version help identify the location of clustering patterns. In both synthetic and practical experiments, we found that the existing techniques using the radius-based planar scale may focus on the clustering pattern between complex road networks but with lower intensity, while the proposed length-squared L-function is able to detect clusters between relatively simple roads but with higher intensity, helping to supplement a new perspective to flow clustering detection and find location pairs with close interaction in the city.

2. Network flow space under the length-based network scale

Network flow space is defined as the Cartesian product of two road networks where flow events happen, which is similar to the free flow space definition (Shu et al. Citation2021). The network where O points of flows occur is the O network, and the network where D points occur is the D network. A flow in network, fi, can be represented by the ordered combination of rectangular coordinates of point O(xiO,yiO) and point D(xiD,yiD), i.e. (xiO,yiO,xiD,yiD).

The following subsections will first define the distance and metrics of flows, then introduce the two kinds of neighborhoods under different metrics, which are designed to detect planar and network scales of clustering patterns, respectively. At last, the volume of the network flow space and the subset of it, i.e. h2-neighborhood, are deduced to help derive the flow intensity in the network flow space. These definitions support the subsequent derivation for the L-function.

Definition 1. Flow distance and metrics: In networks, the network distance between point pi(xi,yi) and point pj(xj,yj) is denoted as dij: the shortest path distance between two points along the road network, calculated with the aid of Dijkstra’s algorithm (Dijkstra Citation1959). For flow fi and fj, we calculate and reserve both the network distance between O points, i.e. dijO, and the network distance between D points, i.e. dijD, as the distance matrix, i.e. Dij=[dijO,dijD] (see ). In previous studies, metrics related to flows typically synthesize the distribution of O and D points. The most common ones are the additive metric and the maximum metric, which represent the linear combination and maximum value of a certain measurement in the O and D networks, respectively (Tao and Thill Citation2016). For example, the distance with additive metric (i.e. additive distance) is denoted as Dijadd=dijO+dijD, and the distance with maximum metric (i.e. maximum distance) is denoted as Dijmax=max(dijO,dijD). The former distance can comprehensively consider both the proximity between the O points and the D points, while the latter focuses on the scope of the flows.

Figure 2. Network distance between flows. The path colored in orange is the shortest path distance between O points of flow fi and fj, i.e. dijO; the path colored in yellow is the shortest path distance between D points of flow fi and fj. i.e. dijD. The distance between flows has usually been designed to be the linear combination or extreme value of dijO and dijD.

Figure 2. Network distance between flows. The path colored in orange is the shortest path distance between O points of flow fi and fj, i.e. dijO; the path colored in yellow is the shortest path distance between D points of flow fi and fj. i.e. dijD. The distance between flows has usually been designed to be the linear combination or extreme value of dijO and dijD.

Definition 2. R2-neighborhood and h2-neighborhood: they are both subsets of the network flow space, which center around a certain flow, namely, the central flow. They are designed to detect the planar and network scales of clustering patterns, respectively. For r2-neighborhood, it is a flow space where the distance between an arbitrary flow and the central flow is not larger than r. For example, the yellow network segment pairs in b refer to r2-neighborhoods under the maximum metric in networks. For h2-neighborhood, we first introduce the maximum-metric-based h2-neighborhood as an example (see a). It follows the following two points: First, the total length of the road network occupied by its projections in the O and D network should be the same as h. Second, the radius r (i.e. the distance between each endpoint and the central point) should be the same in the corresponding projection. However, due to the complex configuration of road networks, r will change with the central flow’s position. For example, as the h2-neighborhood shown in a, the central flow is from a straight road to a place close to crossroads. The neighborhood has rO (radius of its projection in O network) and rD (radius of its projection in D network) as h2 and 38h, respectively. Although the shapes and structures of the projections in the O and D network vary with the position of central flows, their total length are consistent as h. This fundamental and vital property provides a guarantee for the subsequent derivation.

Figure 3. The maximum-metric-based h2-neighborhood and the method for counting flows in it. (a) An example of h2-neighborhood. When counting flows in the h2-neighborhood of fi, for each flow fj (j=1,,n,ji,n is the number of flows in the research area), calculate the length of hr=dijOO in the O network and hr=dijDD in the D network, flow fj counts when both hr=dijOO and hr=dijDD are within the extent of h2-neighborhood. For example, (b) flow fp is within while (c) the flow fq is without the h2-neighborhood of flow fi.

Figure 3. The maximum-metric-based h2-neighborhood and the method for counting flows in it. (a) An example of h2-neighborhood. When counting flows in the h2-neighborhood of fi, for each flow fj (j=1,…,n,j≠i,n is the number of flows in the research area), calculate the length of hr=dijOO in the O network and hr=dijDD in the D network, flow fj counts when both hr=dijOO and hr=dijDD are within the extent of h2-neighborhood. For example, (b) flow fp is within while (c) the flow fq is without the h2-neighborhood of flow fi.

Counting flows in the neighborhood is one of the most important steps in K and L-function. For the maximum-metric-based h2-neighborhood centered on a certain flow fi, the method to count flow in it is as follows: First, for each flow, i.e. fj(j=1,,n,ji), where n is the flow numbers in the research area, calculate the length of h2-neighborhood’s projection in the O and D network based on dijO and dijD, i.e. hr=dijOO and hr=dijDD. Then, flow fj counts when both hr=dijOO and hr=dijDD are less than h. For example, in b and c, hr=dipOO and hr=dipDD, represented by the yellow path in the O and D network, respectively, are both within the extent of the projections of h2-neighborhood centered on flow fi (road segments colored in blue). Thus, flow fp will be counted within this h2-neighborhood. However, the flow fq in c will not be included since hr=diqDD exceeds the projection of the h2-neighborhood in the D network. Similarly, the corresponding flow counting method for the additive-metric-based h2-neighborhood can be deduced in the same manner as above.

Definition 3. Network flow volume: The previous study proposed a method that uses the total length of the road network in the experimental area to represent the volume of the network flow space (Kan, Kwan, and Tang Citation2021). This definition has defects because the network flow space should be regarded as the Cartesian product of the O and D network. Suppose the total length of the O network to be LO and the total length of the D network to be LD; the volume of this network flow space should be V=LOLD. The volume of h2-neighborhood under different metrics can be derived by the shell integration method (Shu et al. Citation2021). This method consists of the inner and the outer integrations, which depicts the range of h2-neighborhood under certain metric. The inner integration integrates the sub-network from the O network to the D network, while the outer integration integrates the results in inner integration. The relevant integration processes under the maximum and the additive metric are detailed in . Suppose a maximum-metric-based h2-neighborhood in network flow space centered on a central flow fc is Ω={fi|max(hr=dicOO,hr=dicDD)h}. Based on the metric defined in Definition 1, the volume of a h2-neighborhood under the maximum metric is: (1) Vh2(max)=0hdu0hdv=h2(1) When using the additive metric, the h2-neighborhood based on a central flow fc is Ω={fi|hr=dicOO+hr=dicDDh}, and the volume of it is: (2) Vh2(add)=0hdu0hudv=12h2(2)

Figure 4. Calculating flow volume of h2-neighborhood under maximum metric and additive metric. The inner integration integrates the sub-network from the O network to the D network, while the outer integration integrates the results in inner integration.

Figure 4. Calculating flow volume of h2-neighborhood under maximum metric and additive metric. The inner integration integrates the sub-network from the O network to the D network, while the outer integration integrates the results in inner integration.

Definition 4. Flow process intensity: Flow process intensity is the expected number of flows per unit volume in the research flow space. It can usually be calculated as λ^f=nLOLD, where LO and LD are the length of road networks covered by O points and D points, respectively. In the real world, however, it is sometimes difficult to precisely identify the covering network flow space of the flow process; thus, the intensity parameter can also be estimated using the second-order properties of flows (Pei et al. Citation2015; Pei et al. Citation2007; Shu et al. Citation2021). Due to the limited length of the paper, the detailed deduction is not mentioned here.

As introduced in Definition 1, the maximum metric can better measure the spatial extent of a flow cluster since it binds the distribution range of O and D points, satisfying the demand to quantify the clustering scale of flows. Therefore, we use the maximum metric in the following derivation and experiments.

3. Network flow K-function and its incremental function

As with Ripley’s K-function in point-pattern analysis (Ripley Citation1976), network flow K-function should be defined as the expected number of additional flows within the r2-neighborhood of an arbitrary flow normalized by flow intensity in the network (Kan, Kwan, and Tang Citation2021). The derived formula for the network flow K-function K(r) is: (3) K(r)=λf1ijσijf(r)n,(i,j=1,2,,nij)(3) (4) σijf(r)={1,Dijmaxr0,Dijmax>r(4) where r is a designated planar scale, n is the number of flows in the research area, λf is the flow intensity in the network, Dijmax is the maximum distance between two flows, and σijf(r) is used to count flows within the r2-neighborhood centered at flow fi.

The local network K-function for each flow is defined as. (5) Ki(r)=λf1jσijf(r)n,(j=1,2,,nji)(5) where r, λf, n, and σijf(r) in EquationEquation (5) have the same meanings as in Equations (Equation3) and (Equation4).

The incremental network K-function for constrained points was proposed based on the network K-function to weaken the cumulative effect and thus help quantify clustering scales, which examines the number of points by unit increments of planar scale r rather than the total number of points within r-neighborhood (Yamada and Thill Citation2007). Therefore, the incremental network flow K-function Kincre(rt) can also be derived as the difference between the network flow K-function value at two consecutive scales: (6) Kincre(rt)=K(rt)K(rt1)t=2,3,,Kincre(rt)=K(rt)t=1(6) where K(r) is calculated as EquationEquation (3) and rt represents the tth scale in ordered scales.

4. Length-squared K-function and L-function

Length-squared K-function should be defined as the expected number of additional flows within h2-neighborhood of an arbitrary flow normalized by flow intensity in the network. The derived formula for the length-squared K-function K(h) is: (7) K(h)=λf1ijσijf(h)n,(i,j=1,2,,nij)(7) (8) σijf(h)={1,max(hr=dijOO,hr=dijDD)h0,max(hr=dijOO,hr=dijDD)>h(8) where h is a designated network scale, n is the number of flows in the research area, λf is the flow intensity in the network, and σijf(h) is used to count flows within the h2-neighborhood centered at flow fi.

In point pattern analysis, CSR describes a point process whereby point events occur within a given study area in a completely random distribution. CSR is often applied as the standard or benchmark against which datasets are tested (Tao, Thill, and Yamada Citation2015; Yamada and Thill Citation2007; Okabe and Yamada Citation2001). In the present study, inferences about CSR flow assist in deriving the L-function from the K-function. CSR flows are defined as a homogeneous spatial Poisson process. Given a subset S of flow space, the number of flows in S obeys the Poisson distribution as λfVS, where λf is the intensity of this flow process in networks and VS is the volume of flow space S.

As the definition of K-function, the expected number of additional flows within an h2-neighborhood is λfK(h). Combining that with the property of CSR flows, we can conclude that the expected value of the length-squared K-function under CSR equals the neighborhood’s volume at each scale. When using the maximum metric for flows, we get that λfK(h)=λfh2. Thus, the expectation for the length-squared K-function with a homogeneous Poisson flow process under the maximum metric is: (9) E(K(h))=λfh2λf=h2(9) By normalizing the length-squared K-function with its expectation, we can obtain the length-squared L-function, the expectation of which is 0 at any scale for CSR flows. In this way, we can amplify the deviation from the flow CSR (Besag Citation1977): (10) L(h)=K(h)h=λf1ijσijf(h)nh,(i,j=1,2,,nij)(10) The local length-squared L-function for each flow is defined as. (11) Li(h)=Ki(h)h=λf1jσijf(h)nh,(j=1,2,,nji)(11) where h, λf, n and σijf(h) in Equations (Equation9), (Equation10), and (Equation11) have the same meanings as in Equations (Equation7) and (Equation8).

The benchmark we use in this study is flow CSR, which assumes unbounded space (e.g. the Poisson point process), while our study is performed in the context of bounded networks, which leads to the problem of edge effects. In the context of networks, edge effects do not represent a critical issue as the data are always located in a reduced area in the networks, making the edge correction not necessary (Okabe and Sugihara Citation2012).

5. Simulation experiment

In this section, we first test the correctness of the benchmark (i.e. expectation value under CSR) for length-squared L-function using Monte Carlo simulation. Then, methodologies of using the existing planar-scale-based network K-function and the proposed length-based length-squared L-function to identify clustering are introduced. At last, we generate flow datasets with designed clustering patterns to examine and compare the ability of network flow K-function and length-squared L-function.

5.1. Verifying the correctness of Null models

As deduced before, the expectation of K(h) is h2, and the expectation of L(h) is 0 at any scale under CSR, which are also called the theoretical curves for the corresponding functions. To prove that the complexity of the road network does not influence these null models, we designed two synthetic networks with different structure complexities. Monte Carlo simulation repeated 99 times was used to test the null models. In each simulation, the flow dataset contains 400 CSR flows. For these two synthetic networks, the data example and the results are shown in , where the blue line with stars in b and e, the purple line with stars in c and f correspond to the theoretical K(h) and L(h) curves; the black lines correspond to the simulated means of K(h) and L(h) with 99 runs. The grey belts are the 95% confidence bands based on the Monte Carlo simulations. Both the theoretical K(h) and L(h) curves fall between the 95% confidence intervals with 99 runs, and the simulated means match the theoretical curves. Therefore, the benchmark of K(h) and L(h) under CSR for any network can be the derived expectations (i.e. K(h)=h2 and L(h)=0), which proves that the derivation of L-function is correct and provides a reference for the significance test of L-function.

Figure 5. Monte Carlo simulation with CSR flows of the length-squared K-function and the length-squared L-function in the context of the (a) simple and (d) complex road networks: (b) and (e) are the theoretical curves and the simulated curves of the length-squared K-function, (c) and (f) are the theoretical curves and the simulated curves of the length-squared L-function.

Figure 5. Monte Carlo simulation with CSR flows of the length-squared K-function and the length-squared L-function in the context of the (a) simple and (d) complex road networks: (b) and (e) are the theoretical curves and the simulated curves of the length-squared K-function, (c) and (f) are the theoretical curves and the simulated curves of the length-squared L-function.

5.2. Detecting clustering scale and identifying clustering

To test the ability of length-squared L-function to detect clustering patterns in the network, we designed three flow datasets. As the OD maps (Wood, Dykes, and Slingsby Citation2010) show in , Case 1 contains 60 aggregated flows from a straight road to another straight road, the network scale is h=1.0, the planar scale is r=0.5, and the intensity is I=601×1=60; 400 randomly distributed flows. Case 2 contains 140 aggregated flows from an intersection with seven exits to another, the network scale is h=3.5, the planar scale is r=0.5, and the intensity is I=1403.5×3.5=11.43; 400 randomly distributed flows. Case 3 contains the clustering pattern in Cases 1 and 2 (note that the clustering pattern in Case 1 has a higher intensity than the clustering pattern in Case 2) and 400 randomly distributed flows.

Figure 6. Synthetic flow data. (a) Case 1 consists of a flow cluster between straight roads and 400 CSR flows. (b) Case 2 consists of a flow cluster between seven-fork intersections and 400 CSR flows. (c) Case 3 consists of the flow clusters in Cases 1 and 2, and 400 CSR flows.

Figure 6. Synthetic flow data. (a) Case 1 consists of a flow cluster between straight roads and 400 CSR flows. (b) Case 2 consists of a flow cluster between seven-fork intersections and 400 CSR flows. (c) Case 3 consists of the flow clusters in Cases 1 and 2, and 400 CSR flows.

The methodology is divided into three steps. The first step is to use the network K-function and the length-squared L-function to judge whether there is a clustering pattern at a certain scale. The second step is to use the derivative of these two functions to discern the clustering scales. The third step is to identify the location of the cluster, which is achieved by calculating the local function values of these two functions based on the detected clustering scales. The following paragraphs will introduce these three steps in detail.

In the first step, we calculate the network K-function and the length-squared L-function at a series of scales. For these two functions, if the function value of the observed data is greater than the benchmark on a certain scale, then there exhibits a significant clustering pattern at this scale. We use 99 times Monte Carlo simulation to get the benchmark (i.e. 95% confidence interval) for these two functions. For each simulation, randomly distributed flows are generated, and the number of the generated simulated flows is consistent with the number of observed data.

In the second step, we use the incremental network K-function to detect the planar scales. We calculate the first derivative of the L-function, namely the L’-function, to discern the network scales of the clustering pattern. The maxima of the incremental network flow K-function can be regarded as a planar clustering scale if the function value at this scale is higher than its benchmark (which is also stemmed from the Monte Carlo simulation method as in the K-function). The transformation from L to L’-function is also about quantifying the L-function’s increment rate. Specifically, the scale h that minimizes L’-function indicates a clustering scale and h2 is the network scale of the clustering pattern (Shu et al. Citation2021).

In the third step, we calculate the local function values for each flow at the clustering scales r and h, respectively. Local function value reveals the clustering pattern’s location; if a flow shows a high local function value at a certain clustering scale, then there is a flow cluster with the size of the detected scale.

For Case 1, the network K-function detected clustering scale at r=0.4 and 0.6 (a). The other maxima of incremental network flow K-function are unable to reflect clustering scales since the function values corresponding to them are smaller than the benchmark. The length-squared L-function detected the clustering scale at h=0.22=0.1 and h=2.02=1.0 (b). The other larger scales by the minima of the L’-function are omitted since they may represent clustering patterns at larger scales but consist of several independent ones, which could be verified by calculating the local function value at these scales (Shu et al. Citation2021). c and d show the local function value at detected scales; the smaller ones, namely r=0.4 and h=0.1 are omitted since they can only reveal subsets of the clustering. In this case, both methods can identify the clustering between straight roads. However, the length-squared L-function has the advantage that it could amplify the difference between the observed data with CSR data as the L-function is larger than the benchmark (theoretical curve and the simulated envelope). In contrast, the difference between the network K-function for the observed and simulated data is slight. For Case 2, the detected clustering scale is r=0.7 by network K-function and h=4.82=2.4 by length-squared L-function (a and b). Local function values at these detected scales illustrate that these two methods can also identify the clustering pattern between complex roads (c and d). For Case 3, the detected clustering scale is r=0.7 by network K-function, h=1.92=0.95 and h=4.62=2.3 by length-squared L-function (a and b). Thus, the network flow K-function only detected the clustering pattern between intersections (c) and ignored the one between straight roads with higher intensity. However, our method identified the two clustering patterns with two different network scales (d and e).

Figure 7. Results of the network K-function and the length-squared L-function for Case 1. The first row consists of (a) the network K-function and the incremental network K-function (the scale that maximizes the incremental network K-function is represented by grey dotted lines), (b) the length-squared L-function and the L’-function (the minimum of L’-function is represented by the grey dotted line). The second row consists of (c) local network K-function and (d) local length-squared L-function values for each flow at the detected clustering scales.

Figure 7. Results of the network K-function and the length-squared L-function for Case 1. The first row consists of (a) the network K-function and the incremental network K-function (the scale that maximizes the incremental network K-function is represented by grey dotted lines), (b) the length-squared L-function and the L’-function (the minimum of L’-function is represented by the grey dotted line). The second row consists of (c) local network K-function and (d) local length-squared L-function values for each flow at the detected clustering scales.

Figure 8. Results of the network K-function and the length-squared L-function for Case 2 (see for the explanation for the legend).

Figure 8. Results of the network K-function and the length-squared L-function for Case 2 (see Figure 7 for the explanation for the legend).

Figure 9. Results of the network K-function and the length-squared L-function for Case 3 (see for the explanation for the legend).

Figure 9. Results of the network K-function and the length-squared L-function for Case 3 (see Figure 7 for the explanation for the legend).

Cases 1 and 2 demonstrate the ability of both the existing and proposed methods to identify clustering patterns in road networks. Case 3 demonstrates the superiority of our method in detecting high-intensity clusters. The network K-function failed to detect the high-intensity cluster because the r2-neighborhood is based on the radius of the flow set. Specifically, the r2-neighborhood starting and ending at crossings (relatively complex road networks) would have the chance to include more flows than that between straight roads (relatively simple road networks). The flows between crossings would have the priority to be identified as clustering patterns by network K-function.

6. Experiment with real data

The behavior mode depicts the main human daily activities in cities, such as modes between residential, working, medical, entertainment and transit areas. Understanding the behavior mode in a region helps perceive the travel demand and support better urban planning. By analysing the distribution range of O points and D points of the taxi OD flow clustering in areas, we can understand the behavior mode behind it. To study the clustering pattern of residents’ travel demand, we use the length-squared L-function to identify the clustering pattern compared with the network flow K-function. This study helps analyse the behavior mode in regions combined with the facilities around the detected clustering patterns.

In this case study, we conducted experiments using taxi OD flow. We chose Wangjing Area and Financial Street as study areas (see ) because these zones are in the main urban area of Beijing and contain facilities with distinctive functions; they can better reflect the practicability of our method in analysing the dynamics of urban residents. All experimental OD data are matched to the road network according to the travel trajectory of the taxis. Due to the large size of the taxi dataset, we randomly sampled 2000 flows of each study area from October 20th to 24th, 2014, which were workdays.

Figure 10. Sub-areas and selected flow data in Beijing. (a) Study areas; (b)selected flows in Wangjing Area and (d) the corresponding OD map; (c)selected flows in Financial Street and (e) the corresponding OD map.

Figure 10. Sub-areas and selected flow data in Beijing. (a) Study areas; (b)selected flows in Wangjing Area and (d) the corresponding OD map; (c)selected flows in Financial Street and (e) the corresponding OD map.

The results of the real data using network K-function and length-squared L-function are demonstrated in . The network flow K-function detected obvious clustering patterns from r=1500m and the length-squared L-function detected clustering patterns at all scales for these two areas. According to the result of incremental network K-function, for the Wangjing Area, network flow K-function obtained the planar scale of clustering pattern at r=1950,2650,3150,3500,3600,3850m. According to the result of the first derivative of length-squared L-function, detected network scales are h=275,825,2100,3175,4000,4375m. For the Financial Street, the network flow K-function method obtained the planar scale of the clustering patterns at r=1900,2000,2200,2600,2700,2900m and length-squared L-function method detected the network scale at h=1025,1575,2075,2625,3025,3325m. The corresponding local function values at detected clustering scales are shown in and .

Figure 11. Results of the case study. Each row corresponds to the sub-areas in (see for the explanation for the legend).

Figure 11. Results of the case study. Each row corresponds to the sub-areas in Figure 10 (see Figure 7 for the explanation for the legend).

Figure 12. Local network flow K-function (NK) and local length-squared L-function (LL) for each flow at the detected clustering scales for Wangjing Area.

Figure 12. Local network flow K-function (NK) and local length-squared L-function (LL) for each flow at the detected clustering scales for Wangjing Area.

Figure 13. Local network flow K-function (NK) and local length-squared L-function (LL) for each flow at the detected clustering scales for Financial Street.

Figure 13. Local network flow K-function (NK) and local length-squared L-function (LL) for each flow at the detected clustering scales for Financial Street.

We further compare the difference between the result of the network flow K-function and length-squared L-function. For each detected clustering scale, we extracted flows that possess local function values in the top 25% of the entire value range, revealing the clustering patterns’ locations (see ). For Wangjing Area, the clustering patterns detected by network K-function are concentrated in dense road networks around Wangjing Central Park (see a). This is because the r2-neighborhood within this area covers a more complex road network (longer length of the roads) than other sparse road networks and has the opportunity to contain more flows. Thus, the corresponding local network K-function values for flows here are higher. However, this does not mean that there are more points per road network length, namely clustering patterns with higher intensity. The clustering patterns that our method detects mainly cross the Fourth Ring Road from the periphery to the city’s center (see b). For example, the clustering patterns from a residential area to Taiyanggong metro station and Shaoyaoju metro station, and the clustering pattern from a residential area to Liangmaqiao metro station. These are travel demands of residential-transit mode, which reflect a common mode that residents living in the suburbs and far away from the subway station usually take a taxi to the nearest subway station and then commute. In addition, our method also detected other clusters, such as the one from the Sanyuanqiao metro station to the 798 Art Block, a famous entertainment place in this research area, which shows transit-entertainment mode. Furthermore, it could be concluded that our method is more sensitive to clustering patterns as it is able to detect patterns at relatively small scales. For example, our method detects a clustering pattern at the network scale of 275 m, but the network flow K-function did not identify that there is a clustering pattern on the equivalent planar scale.

Figure 14. Flows with top local function values at detected clustering scales. Flows with (a) top local NK and (b) top local LL values at detected scales for Wangjing Area, (c) top local NK, and (d) top local LL values at detected scales for Financial Street.

Figure 14. Flows with top local function values at detected clustering scales. Flows with (a) top local NK and (b) top local LL values at detected scales for Wangjing Area, (c) top local NK, and (d) top local LL values at detected scales for Financial Street.

For Financial Street, the clustering patterns detected by the network K-function are also distributed at road networks with relatively dense and complex structures (see c). Similar to the case study of Wangjing Area, the results of the network K-function in Financial Street cannot reflect clustering patterns with high intensity either. In addition, we also found that the clustering scale detected in this area is very large, meaning that each detected clustering pattern at this scale occupies a large part of the study area, which is not conducive to analysing local anomalies. However, our method detected clustering patterns consisting of four main parts (see d), from west to east: the one from the residential place to Muxidi metro station, the one from the residential place to Beijing Children’s Hospital, the one from Parkson shopping center to China Life Insurance, Zhuozhou Center, Everbright Center, and other financial institutions, and the one from these financial institutions to Xidan Joy City shopping mall and Juntai Department Store. These clustering patterns reflect the residential-transit, residential-medical, entertainment-working, and working-entertainment modes.

The result of the length-squared L-function adds a new perspective and brings new discoveries into urban anomaly detection. In addition, our proposed method is based on the network scale; thus, the detected clustering pattern is the one with higher intensity, namely with more flows per unit of network flow space, revealing more concentrated interactions (Fang et al. Citation2023). Furthermore, our method is more sensitive to clustering patterns since it can amplify the difference between observed data and CSR data. The network flow K-function does not regard that there is a clustering pattern at relatively small planar scales, but at the equivalent network scale, our method could detect clustering patterns.

7. Conclusions

Based on the length-based network scale, this study proposes the length-squared L-function to identify clustering patterns of network-constrained flows. Experiments with synthetic and real data show that our method can accurately identify the location and extent of clustering patterns in road networks. Compared with the existing network flow K-function, which is based on the radius-based planar scale, our method can overcome the influence of road topology and detect the clustering patterns with high intensities. In the case study with taxi OD data in selected areas, our method can detect the clustering patterns of high intensities, which the existing network K-function ignored. The detected clustering patterns reveal the primary behavior mode (e.g. the residential-transit mode), which helps perceive the strong travel demand in a selected area and supports appropriate urban planning.

The contribution of our work is reflected in four aspects: First of all, with the novel h2-neighborhood based on the network scale, the proposed method would not be affected by the topological complexity of road networks and can thus identify the clustering pattern of high intensity. Moreover, the h2-neighborhood proposed in this study can also be used to extend other spatial statistics methods for more precise spatial quantification in the context of road networks. Second, our method can identify the clustering pattern without setting scale parameters (i.e. clustering scales). In other related methods (e.g. DBSCAN), the scales of the clusters should be set in advance. In our method, however, the clustering scales can be detected by the derivative of the L-function. Third, the L-function is able to amplify the difference between observed data and randomly distributed data and is thus more sensitive to the clustering scale. Forth, our work has corrected and improved the concepts in network flow space and laid a solid mathematical foundation for future statistics in this direction.

Our work still has shortcomings, which could be crucial increments of future work. First, though our method can be applied to networks of any size, it is usually necessary to select some study regions manually to get more precise results. In the L-function method, when the research area is large, there might be too many clustering patterns with different scales, and the detected scales are the comprehensive contribution of these multiple clustering patterns (Yamada and Thill Citation2007; Kan, Kwan, and Tang Citation2021). It will be challenging to determine a specific clustering pattern based on the detected scale in a large area. Thus, when selecting a research area, it is better to use prior knowledge (e.g. the distribution of flow data) to select areas with only a few potential clustering patterns. Furthermore, we use flow CSR as the benchmark of our method because it can be a base for deriving the L function. Such CSR assumption is too strong for most true-to-life ways. As a result, more research should be done to extend our method to real applications, such as using the Monte Carlo method or developing some other simulation methods (Liu, Tong, and Liu Citation2015; Kan, Kwan, and Tang Citation2021). Moreover, the time attribute of the flow is not considered in our method for flows. This is because this work focuses on the spatial pattern of flows. However, some approaches have been made for spatiotemporal flows (Yan et al. Citation2023a; Citation2023b; Hohl et al. Citation2022), which may provide an indication of extending our work to spatiotemporal flows.

8. Geolocation information

In this paper, we used two urban areas of Beijing as study areas, which mainly include the Wangjing Area in Chaoyang District (the latitude range is about from 116.42–116.52 degrees, the longitude range is from 39.95–40.02 degrees) and the Financial Street Area in Xicheng District (the latitude range is about from 116.33–116.37 degrees, the longitude range is from 39.90–39.92 degrees).

Acknowledgements

The authors would also like to thank anonymous reviewers for their valuable comments on the manuscript.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Additional information

Funding

This work was supported by the National Key R&D Program of China [Grant No. 2022YFC3800803], the National Natural Science Foundation of China [Grant No. 42071436] and the Innovation Project of LREIS [Grant No. KPI002].

References

  • Adrienko, N., and G. Adrienko. 2011. “Spatial Generalization and Aggregation of Massive Movement Data.” IEEE Transactions on Visualization and Computer Graphics 17 (2): 205–19. https://doi.org/10.1109/TVCG.2010.44.
  • Berglund, Svante, and Anders Karlström. 1999. “Identifying Local Spatial Association in Flow Data.” Journal of Geographical Systems 1 (3): 219–36. https://doi.org/10.1007/s101090050013.
  • Besag, Julian E. 1977. “Comment on'Modeling Spatial Patterns’ by BD Ripley.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (2): 192–212. https://doi.org/10.1111/j.2517-6161.1977.tb01616.x.
  • Chen, Zhenhua, Zhaoya Gong, Shan Yang, Qiwei Ma, and Changcheng Kan. 2020. “Impact of Extreme Weather Events on Urban Human Flow: A Perspective from Location-Based Service Data.” Computers, Environment and Urban Systems 83: 101520. https://doi.org/10.1016/j.compenvurbsys.2020.101520.
  • Chen, Chao, Daqing Zhang, Zhi-Hua Zhou, Nan Li, Tülin Atmaca, and Shijian Li. 2013. “B-Planner: Night bus Route Planning Using Large-Scale Taxi GPS Traces.” Paper Presented at the 2013 IEEE International Conference on Pervasive Computing and Communications (PerCom), 18-22 March 2013.
  • Dijkstra, Edsger W. 1959. “A Note on two Problems in Connexion with Graphs.” Numerische Mathematik 1 (1): 269–71. https://doi.org/10.1007/BF01386390.
  • Doantam, Phan, Xiao Ling, R. Yeh, and P. Hanrahan. 2005. “Flow map Layout”. Paper Presented at the IEEE Symposium on Information Visualization, 2005. INFOVIS 2005., 23-25 Oct. 2005.
  • Fang, Zidong, Ci Song, Hua Shu, Jie Chen, Tianyu Liu, Xi Wang, Xiao Chen, Xiaorui Yan, and Tao Pei. 2023. “Length L-Function for Network-Constrained Point Data.” Transactions in GIS 27 (2): 476–93. https://doi.org/10.1111/tgis.13035.
  • Gao, Yizhao, Ting Li, Shaowen Wang, Myeong-Hun Jeong, and Kiumars Soltani. 2018. “A Multidimensional Spatial Scan Statistics Approach to Movement Pattern Comparison.” International Journal of Geographical Information Science 32 (7): 1304–25. https://doi.org/10.1080/13658816.2018.1426859.
  • Guo, Diansheng, Xi Zhu, Hai Jin, Peng Gao, and Clio Andris. 2012. “Discovering Spatial Patterns in Origin-Destination Mobility Data.” Transactions in GIS 16 (3): 411–29. https://doi.org/10.1111/j.1467-9671.2012.01344.x.
  • Haase, Peter. 1995. “Spatial Pattern Analysis in Ecology Based on Ripley's K-Function: Introduction and Methods of Edge Correction.” Journal of Vegetation Science 6 (4): 575–82. https://doi.org/10.2307/3236356.
  • Hohl, Alexander, Wenwu Tang, Irene Casas, Xun Shi, and Eric Delmelle. 2022. “Detecting Space–Time Patterns of Disease Risk Under Dynamic Background Population.” Journal of Geographical Systems 24 (3): 389–417. https://doi.org/10.1007/s10109-022-00377-7.
  • Jiang, Jingyu, Xi Wang, Tianyu Liu, Zidong Fang, Tao Pei, Ci Song, and Peijun Du. 2022. “Topological Relationship Model for Geographical Flows.” Cartography and Geographic Information Science, 1–17. https://doi.org/10.1080/15230406.2022.2104377.
  • Kan, Zihan, Mei-Po Kwan, and Luliang Tang. 2021. “Ripley’s K-Function for Network-Constrained Flow Data.” Geographical Analysis 54 (4): 769–88. https://doi.org/10.1111/gean.12300.
  • Kiskowski, Maria A., John F. Hancock, and Anne K. Kenworthy. 2009. “On the Use of Ripley's K-Function and Its Derivatives to Analyze Domain Size.” Biophysical Journal 97 (4): 1095–103. https://doi.org/10.1016/j.bpj.2009.05.039.
  • Lee, Jae-Gil, Jiawei Han, and Kyu-Young Whang. 2007. “Trajectory Clustering: A Partition-and-Group Framework.” Proceedings of the ACM SIGMOD International Conference on Management of Data, https://doi.org/10.1145/1247480.1247546.
  • Liu, Yu, Daoqin Tong, and Xi Liu. 2015. “Measuring Spatial Autocorrelation of Vectors.” Geographical Analysis 47 (3): 300–19. https://doi.org/10.1111/gean.12069.
  • Lu, Yongmei, and Xuwei Chen. 2007. “On the False Alarm of Planar K-Function When Analyzing Urban Crime Distributed Along Streets.” Social Science Research 36 (2): 611–32. https://doi.org/10.1016/j.ssresearch.2006.05.003.
  • Nanni, Mirco, and Dino Pedreschi. 2006. “Time-focused Clustering of Trajectories of Moving Objects.” Journal of Intelligent Information Systems 27 (3): 267–89. https://doi.org/10.1007/s10844-006-9953-7.
  • Okabe, Atsuyuki, and Kokichi Sugihara. 2012. “Modeling Spatial Events on and Alongside Networks.” In Spatial Analysis Along Networks, 25–44. John Wiley & Sons.
  • Okabe, Atsuyuki, and Ikuho Yamada. 2001. “The K-Function Method on a Network and Its Computational Implementation.” Geographical Analysis 33 (3): 271–90. https://doi.org/10.1111/j.1538-4632.2001.tb00448.x.
  • Pei, Tao, Weiyi Wang, Hengcai Zhang, Ting Ma, Yunyan Du, and Chenghu Zhou. 2015. “Density-based Clustering for Data Containing two Types of Points.” International Journal of Geographical Information Science 29 (2): 175–93. https://doi.org/10.1080/13658816.2014.955027.
  • Pei, Tao, A. Xing Zhu, Chenghu Zhou, Baolin Li, and Chengzhi Qin. 2007. “Delineation of Support Domain of Feature in the Presence of Noise.” Computers & Geosciences 33 (7): 952–65. https://doi.org/10.1016/j.cageo.2006.11.010.
  • Ripley, B. D. 1976. “The Second-Order Analysis of Stationary Point Processes.” Journal of Applied Probability 13 (2): 255–66. https://doi.org/10.2307/3212829.
  • Shu, Hua, Tao Pei, Ci Song, Xiao Chen, Sihui Guo, Yaxi Liu, Jie Chen, Xi Wang, and Chenghu Zhou. 2021. “L-Function of Geographical Flows.” International Journal of Geographical Information Science 35 (4): 689–716. https://doi.org/10.1080/13658816.2020.1749277.
  • Song, Ci, Tao Pei, Ting Ma, Yunyan Du, Hua Shu, Sihui Guo, and Zide Fan. 2019. “Detecting Arbitrarily Shaped Clusters in Origin-Destination Flows Using ant Colony Optimization.” International Journal of Geographical Information Science 33 (1): 134–54. https://doi.org/10.1080/13658816.2018.1516287.
  • Sterner, Robert W., Christine A. Ribic, and George E. Schatz. 1986. “Testing for Life Historical Changes in Spatial Patterns of Four Tropical Tree Species.” The Journal of Ecology 74 (3): 621–33. https://doi.org/10.2307/2260386.
  • Tao, Ran, and Jean-Claude Thill. 2016. “Spatial Cluster Detection in Spatial Flow Data.” Geographical Analysis 48 (4): 355–72. https://doi.org/10.1111/gean.12100.
  • Tao, Ran, Jean-Claude Thill, and Ikuho Yamada. 2015. “Detecting Clustering Scales with the Incremental K-Function: Comparison Tests on Actual and Simulated Geospatial Datasets.” In Information Fusion and Geographic Information Systems (IF&GIS’ 2015): Deep Virtualization for Mobile GIS, edited by Vasily Popovich, Christophe Claramunt, Manfred Schrenk, Kyrill Korolenko, and Jérôme Gensel, 93–107. Cham: Springer International Publishing.
  • Wood, Jo, Jason Dykes, and Aidan Slingsby. 2010. “Visualisation of Origins, Destinations and Flows with OD Maps.” The Cartographic Journal 47 (2): 117–29. https://doi.org/10.1179/000870410X12658023467367.
  • Yamada, Ikuho, and Jean-Claude Thill. 2004. “Comparison of Planar and Network K-Functions in Traffic Accident Analysis.” Journal of Transport Geography 12 (2): 149–58. https://doi.org/10.1016/j.jtrangeo.2003.10.006.
  • Yamada, Ikuho, and Jean-Claude Thill. 2007. “Local Indicators of Network-Constrained Clusters in Spatial Point Patterns.” Geographical Analysis 39 (3): 268–92. https://doi.org/10.1111/j.1538-4632.2007.00704.x.
  • Yan, Xiaorui, Tao Pei, Hua Shu, Ci Song, Mingbo Wu, Zidong Fang, and Jie Chen. 2023a. “Spatiotemporal Flow L-Function: A new Method for Identifying Spatiotemporal Clusters in Geographical Flow Data.” International Journal of Geographical Information Science, https://doi.org/10.1080/13658816.2023.2204345. inpress.
  • Yan, Xiaorui, Tao Pei, Ci Song, and Xiaohan Liu. 2023b. “Estimating Spatiotemporal Aggregation Scales by Revisiting the Spatiotemporal L-Function.” Transactions in GIS 27 (2): 592–604. https://doi.org/10.1111/tgis.13034.
  • Zhao, Pengxiang, Xintao Liu, Wenzhong Shi, Tao Jia, Wengen Li, and Min Chen. 2020. “An Empirical Study on the Intra-Urban Goods Movement Patterns Using Logistics big Data.” International Journal of Geographical Information Science 34 (6): 1089–116. https://doi.org/10.1080/13658816.2018.1520236.
  • Zhu, Xi, and Diansheng Guo. 2014. “Mapping Large Spatial Flow Data with Hierarchical Clustering.” Transactions in GIS 3: 421–35. https://doi.org/10.1111/tgis.12100.
  • Zhu, Bing, Qixing Huang, Leonidas Guibas, and Lin Zhang. 2013. “Urban Population Migration Pattern Mining Based on Taxi Trajectories.” Paper Presented at the 3rd International Workshop on Mobile Sensing: The Future, Brought to you by big Sensor Data, Philadelphia, USA.