Keywords

1 Introduction

With the rapid development of internet, cyberspace has become an important field for entertainment, consumption, etc. If we associate the cyberspace with geospatial, i.e. linking the online with offline, a mass of user behaviors could be mined, which would be of great help for recommendation system, friendship prediction, urban planning, etc. [1]. The first step for mapping cyberspace to geospatial is to find the regions where users locate. Therefore, how to divide the world, especially urban areas, becomes an important and necessary intermediate link. Traditionally, three methods are widely used for area partition, i.e. address method, rigid method and cluster method. However, these methods all have limitations, which would be introduced in detail as follows.

Address method refers to partitioning areas with physical address, like Starbucks, KFC, etc. In fact, as addresses often reveal user behaviors, like shopping, having dinner, entertainment, etc., this method is commonly used to study either user behavior prediction or product recommendation [2,3,4]. Although achieving high precision, address method has limitation in the scale of regions. And it’s also too difficult for the method to find interactions between two users. For instance, the sparsity of user interactions in Foursquare and Yelp are greater than 99.99% [5]. As a result, it is inappropriate to do researches related to user interactions using this method.

Grid method refers to partitioning areas with gird. The metric for partitioning areas is usually kilometer or degree of latitude and longitude. For instance, Backstrom et al. divided the US into 0.01 \(\times \) 0.01 degree regions (about 0.4 square miles) [6], while Cho et al. discretized the world into 25 \(\times \) 25 km cells [7]. This partitioning method is simple and practicable, and obtained regions could cover an ideal range to find interactions between users. Therefore, it is often used to study user social relationships [8,9,10]. However, its disadvantages are also obvious. The method could not partition areas based on practical significance or independent meaning, and always divides a function unit, e.g. a shop or a store, into two or more separate girds, missing the interaction information of users. Besides, it is hard to set a proper size for girds, since little size may miss interactions, while big size may cover too many user traces and make the extraction of interactions too hard to fulfill.

Cluster method refers to partitioning areas by clustering locations in the form of GPS, and density-based clustering (DBSCAN) is a representative method. Zhou et al. designed a new approach based on DBSCAN, with which arbitrary shapes could be obtained, and areas could be partitioned according to practical significance or independent meaning as a result [11]. Besides, reasonable parameter value is easy to set for DBSCAN when partitioning areas. By this way, DBSCAN outperforms other cluster methods evidently, and is widely used later when dealing with personal locations [12, 13]. However, as DBSCAN works based on density, a cluster would “spread” infinitely when density meets the requirement. As a result, intensive locations in busy urban areas would be clustered into a same group with DBSCAN. Figure 1 shows the hot map of user check-ins in the busy area of Austin, US with the dataset extracted from Gowalla, a location-based social network. Intensive check-ins and locations as shown in the figure, make it difficult to partition the area with DBSCAN. In fact, this area is taken as a region when clustered with 0.001 degree. Therefore, DBSCAN doesn’t work well when dealing with a mass of locations, especially with those in busy urban areas.

Fig. 1.
figure 1

The hot map of users’ check-ins in the busy urban area of Austin, US, around the Texas government.

In this paper, we propose a method for partitioning busy urban areas with spatial-temporal information in LBSN (Location-Based Social Network). On one hand, arbitrary shapes would be obtained by designing method with density-based clustering, overcoming the stiff of grid method. On the other hand, a boundary is designed ingeniously for limiting the scale of clusters, overcoming the infinite “spread” of DBSCAN. Our approach is validated using real user data and show a good performance.

Our contributions are concluded as follows:

  • We propose a new method, RST-DBSCAN, for partitioning busy urban areas based on spatial-temporal information, with which arbitrary and separated shapes of regions would be obtained reasonably;

  • We propose a new time mapping method to connect temporal information with spatial information. The method makes it possible to partition areas with 3-dimensional spatial-temporal information;

  • We further propose to cluster from the locations which own greater density. In this way, we can get busier regions earlier, making the approach more reasonably;

  • We visually and quantitatively evaluate our approach with a location-based social network, Gowalla. Results show that the performance of RST-DBSCAN outperforms that of competitors.

The rest of this paper is organized as follows. Section 2 introduces the design of RST-DBSCAN. Section 3 describes experiments. Finally, Sect. 4 gives the conclusion.

2 Design of the Model

In this section, we introduce our method in the following steps. Definitions for designing RST-DBSCAN would be introduced at first. Then we introduce the ideas for designing RST-DBSCAN in the view of space and time respectively. Finally, we describe the method with schematic and pseudocode.

2.1 Definitions

DBSCAN is a well-known cluster method and works by clustering points which satisfy the density conditions into a group. Therefore, two important concepts, \(\varepsilon \), the radius of a circle, and MinPts, the minimum number of neighbor points within that circle, are needed [14]. In this section, points are employed to denote locations in 2-dimensional spatial space. Then the circle with radius \(\varepsilon \) (named neighbor region here) means the scale covered by a location, and MinPts means the minimum number of location’s neighbors. Besides, we make definitions for designing RST-DBSCAN as follows.

Definition 1: Candidate core location

If the neighbor region of a location, p, covers at least MinPts locations, p is a candidate core location.

Definition 2: Neighbor location

The neighbors of p, denoted by \(N_\varepsilon (p)\), is defined by

$$\begin{aligned} N_\varepsilon (p)=\{q\in S\mid dist(p,q)\le \varepsilon \} \end{aligned}$$
(1)

Here, S is the set of all locations, and q is any location in the set.

Fig. 2.
figure 2

The schematic of RST-DBSCAN

Definition 3: Core location

The location which starts a clustering process is taken as core location.

Definition 4: Location directly density reachable

For a core location p and a location q, we say that q is directly density reachable from p if q is the neighbor location of p.

Definition 5: Location density reachable

For a location p and a location q, we say q is density reachable from p if there is a chain of locations \(q_1,q_2,...,q_n\), such that \(q_1=q, q_n=p\), and \(q_{i+1}\) is directly density reachable from \(q_i\). Here \(1\le i \le n\).

Definition 6: Core-related location

Assume location n is density reachable from core location p and satisfies the following condition

$$\begin{aligned} \{n\in S\mid dist(p,n)\le \lambda \cdot \varepsilon \} \end{aligned}$$
(2)

Then location n is a core-related location for p. Besides, the circle region with radius \( \lambda \cdot \varepsilon \) is called as core-related region of p.

What needs to note is that once a candidate core location is covered by a core location, it becomes a core-related location.

2.2 Area Partition with Spatial Information

For the sake of understanding, we introduce how RST-DBSCAN works with spatial information at first in this section, and then the application of time information would be introduced in the next section.

The schematic of RST-DBSCAN with \( \lambda =2\) is shown in Fig. 2. Here black points denote locations, circles of dotted lines denote the neighbor regions, and the circles of solid line denote core-related regions. RST-DBSCAN decides the core locations based on the number of neighbor locations. As the density around location o is the greatest, we take it as the first core location. Location s, r and t are all density reachable from o, so these locations belong to the same group when clustering with DBSCAN. However, as location t beyond the range of o’s core-related region, it would not belong to the group if clustered by RST-DBSCAN. With core-related regions as the restricted condition, we would obtain separated regions in busy urban areas.

2.3 Area Partition with Temporal Information

In Sect. 2.2, we partition busy urban area by restricted DBSCAN with spatial information in the form of GPS. With the help of temporal information, we could continuously process urban area further. Although intensive check-in is an important feature in busy urban area, the intensity varies a lot over time. Figure 3 shows the hot map of users’ check-ins around the Texas government at different time periods. As can be seen, most of check-ins appear at commercial districts in the daytime, while most of them appear at resident area at night. In more detail, Fig. 4 shows the situation of check-ins in a few blocks away. As can be seen, check-ins are active in shopping mall in the afternoon and at dusk, while more users appear in bars at early morning. Therefore, conclusions can be drawn that users check-ins also reflect their living habits, and area partitioned with temporal information would contain more information.

Fig. 3.
figure 3

The hot map of users’ check-ins around the Texas government at different time periods

Fig. 4.
figure 4

The hot map of users’ check-ins in a few blocks away at different time periods.

However, how to use temporal information for partitioning area is still a challenging issue. Previous studies often partition time with hour periods, which is not reasonable enough. For instance, a restaurant is active from 5:10 PM to 20:45 PM, and it is hard to partition this time period with traditional method. Therefore, we bring out a new time-mapping method to connect temporal information with spatial information. The formula of time-mapping is

$$\begin{aligned} l_{Time}={\vartheta }\cdot \varepsilon \cdot \left( T_{hour}+\frac{T_{min}}{60}+\frac{T_{min}}{3600}\right) \end{aligned}$$
(3)

Here \(\vartheta \) is time-mapping parameter. When \(\vartheta =1\), 1 hour corresponds to \(\varepsilon \) degree in the spatial scale. Then we could deal with temporal information with RST-DBSCAN, with which arbitrary and reasonable time period would be obtained.

2.4 Description of the Model

In this section, we would introduce the design of our approach. The pseudocode of RST-DBSCAN is shown in Algorithm 1. We take the check-in dataset D, the radius of a circle \(\varepsilon \) and the minimum number of neighbors MinPts as input, and obtain location clusters with the algorithm. Here D includes user coordinate of latitude and longitude and check-in time. Specific implementation details of RST-DBSCAN algorithm are as follows.

figure a
  1. (1)

    Initialization

Set the number of clusters as 0. Compute the mapping result of time and obtain 3-dimensional dataset with unprocessed locations, \(\varGamma \). The dataset of candidate core locations \({\varOmega }'\) and a queue \({\varOmega }\) are initialized as empty, respectively. Here we say a location has been processed if it has been clustered into to a group, unprocessed otherwise.

  1. (2)

    Obtain the set of candidate core locations

Traverse all the locations. Take location x for example. Calculate the number of x’s neighbors, \(\left| N_\varepsilon (x) \right| \), and then confirm whether x belongs to \({\varOmega }'\) or not according to the numerical value of \(\left| N_\varepsilon (x) \right| \) and MinPts.

  1. (3)

    Sort locations with their number of neighbors

Sort locations in \({\varOmega }'\) in descending order, according to their number of neighbors. Then a queue of candidate core locations \({\varOmega }\) is formed. Location with more neighbors in \({\varOmega }\) would be in more forward position.

  1. (4)

    Obtain target clusters

Extract the first candidate core location q in \({\varOmega }\). Drop q if it has been processed, otherwise set it as the core location. Find all locations which are unprocessed and density reachable from q, and remove those that beyond the scope of q’ core-related region. Then we would obtain a cluster with q as the center. Mark these locations as processed. Repeat the process above until \({\varOmega }\) is empty, and all locations will be clustered.

With above steps, a region where locations belong to the same cluster is formed at last. Based on practical significance and independent meaning, RST-DBSCAN could partition the area into arbitrary and separated shapes of regions.

3 Experiments

In this section, we carry on two experiments to verify the effectiveness of our approach. One is to partition a busy urban area with RST-DBSCAN. The other is to mine social links with user interactions in the regions obtained by RST-DBSCAN, as area partition plays an important role on social link mining.

3.1 Evaluation Index

We adopt three performance metrices, precision (P), recall (R), F1-measure (\(F_1\)) to estimate the model, which can be calculated as follows. Let TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives respectively.

$$\begin{aligned} P= & {} \frac{TP}{TP+FP} \end{aligned}$$
(4)
$$\begin{aligned} R= & {} \frac{TP}{TP+FN} \end{aligned}$$
(5)
$$\begin{aligned} F_1= & {} 2\frac{P*R}{P+R} \end{aligned}$$
(6)

As can be seen, F1-measure is the comprehensive value of precision and recall. Besides, links in social network are typically sparse, so F1-measure plays a more important role than accuracy when evaluating model. As a result, F1-measure is taken as a main performance index here.

Table 1. Comparison of DBSCAN and RST-DBSCAN when the value of \(\varepsilon \) varies.

3.2 Experiment 1

Dataset

Experiments are performed on a publicly available dataset, which is extracted from a location-based social network namely Gowalla. It collects user check-ins and social links from 2009.2 to 2010.10 [7]. As most check-ins appear in Austin, US, we choose the urban area of Austin, around the government of Texas, as the target. The detailed scope is \(30.262^{\circ }\) N -\(30.270 ^{\circ }\) N, \(97.730 ^{\circ }\) W-\(97.747^{\circ }\) W, and 5,173 users check in 72,131 times at 1,450 different locations.

Comparative Approach

DBSCAN is taken as the comparative approach in this part.

Experimental Setup

We set \(MinPts=1\) for both DBSCAN and RST-DBSCAN. \(\lambda \) is set to 3 for RST-DBSCAN. \(\varepsilon \) is set in the range of 0.005, 0.001, 0.0005.

Evaluation Results

When the value of \(\varepsilon \) varies, the results are shown in Table 1. As can be seen, for \(\varepsilon =0.005\) and 0.001, DBSCAN couldn’t partition this area, while RST-DBSCAN divides the area into 4 and 14 regions respectively. For \(\varepsilon =0.005\), the area could be partitioned by DBSCAN, while the number is still far less than that by RST-DBSCAN. In fact, as 0.001 degree is roughly equal to 0.1 km, smaller size even couldn’t cover a unit like a store or a restaurant. Therefore, 0.001 is nearly the minimum size for rationality, and our approach outperforms DBSCAN substantially.

More visually, Fig. 5 shows 1,450 different locations in this area, represented by red points. When \(\varepsilon =0.001\), these locations belong to a same group clustered by DBSCAN. When clustered with RST-DBSCAN, the area is divided into 14 regions and results are shown in Fig. 6. Here we mark the locations with 14 different colors, and locations with the same color belong to the same group. We draw borders of these groups manually for viewing convenience.

3.3 Experiment 2

Dataset

The dataset extracted from Gowalla is also employed in this experiment. As inadequate information would make adverse effects on the experiment, we select users whose check-ins and friends are all at the top 2% ( i.e. the time of check-ins is above 186, and the number of friends is above 52). There are 761 users which are matched with the conditions and there exists 8,828 links among them.

Comparative Approaches

Fig. 5.
figure 5

Locations in the urban area of Austin.

Fig. 6.
figure 6

The visual result of area partition with RST-DBSCAN.

Address method, Grid method and DBSCAN are taken as comparative approaches in this part.

We obtain regions with these methods at the first step, and compute the interactions of users in these regions to mine unknown social links. Scellatopropose et al. have proved that social links could be mined by user interactions in spatial space and designed several features for mapping relations [15]. We choose CR as the feature in this experiment, which is represented as the number of common regions two persons both check in. We calculate CR and assume there exists social links among users whose CR are at the k top. Recall, precision and F1-measure are taken as the metrics to evaluate our approach.

Fig. 7.
figure 7

The recall of social link mining by top-k.

Fig. 8.
figure 8

The precision of social link mining by top-k.

Experimental Setup

We set \(MinPts=1\), \(\varepsilon =0.001\) and \(\lambda =3\) for RST-DBSCAN. The MinPts and \(\varepsilon \) of DBSCAN is the same for a fair comparison. We set gird method with 0.01\(\times \)0.01 degree, as this is the most effective value and is often used when studying relationships between user interactions in spatial and social link mining.

Evaluation Results

Results are shown in Figs. 7 and 8. As can be seen, recall@k of RST-DBSCAN outperforms the other methods, and the advantage of RST-DBSCAN is more obvious when k is smaller. At the same time, precision@k of RST-DBSCAN also outperform that of Grid and DBSCAN. An interesting phenomenon is that precision@k of Address is the greatest when k is small enough. In fact, it is normal as address is much more precise than other methods, and two persons meet at smaller regions frequently means it is more possible for them to be friends. However, as Address method commonly covers small regions, a mass of interactions between two persons are easily missed, leading a rapid drop of Precision@k when k increases. Therefore, compared to others, Address is not a stable method when mining social links.

Specially, we compute the F1-measure when k = 1, and the values of Address, Grid, DBSCAN, RST-DBSCAN are 40.14%, 44.72%, 42.71%, 46.95% respectively, and the F1-measure of RST-DBSCAN outperforms others’ by 6.81%, 2.24%, 4.24%. As RST-DBSCAN could partition busy urban areas more reasonably, especially for commercial areas, better results would be obtained when mining social links.

4 Conclusion

In this paper, we present a new area partition method, called RST-DBSCAN, for partitioning busy urban areas with spatial-temporal information in LBSN. Our approach is able to divide the areas into arbitrary and separated regions based on practical significance or independent meaning reasonably. Comprehensive experiments are conducted on a real-world dataset, and results show that our approach performs better than competitors.