An Experimental Study of the k-MXT Algorithm with Applications to Clustering Geo-Tagged Data

  • Conference paper
Algorithms and Models for the Web Graph (WAW 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10836)

Abstract

We consider a graph fragmentation process which can be described as follows. Each vertex v selects the k adjacent vertices with which it has the largest number of common neighbours. For each selected neighbour u, we retain the edge (vu) to form a subgraph S of the input graph. The objects of interest are the components of S, the k-Max-Triangle-Neighbour (k-MXT) subgraph, and the vertex clusters they produce in the original graph.

We study the application of this process to clustering in the planted partition model, and on the geometric disk graph formed from geo-tagged photographic data downloaded from Flickr.

In the planted partition model, there are \(\ell \) partitions, or subgraphs, which are densely connected internally but more sparsely connected to each other. The objective is to recover these hidden partitions. We study the case of the planted partition model based on the random graph \(G_{n,p}\) with additional edge probability q within the partitions. Theoretical and experimental results show that the 2-MXT algorithm can recover the partitions for any constant \(q/p>0\), provided the density of triangles is high enough.

We apply the k-MXT algorithm experimentally to the problem of clustering geographical data, using London as an example. Given a dataset consisting of geographical coordinates extracted from photographs, we construct a disk graph by connecting two points if and only if their distance is at most d. Our experimental results show that the k-MXT algorithm is able to produce clusters which are comparable to those of popular clustering algorithms such as DBSCAN (see e.g. Fig. 5).

References

  1. Bollobás, B.: Random Graphs. Cambridge Studies in Advanced Mathematics, 2nd edn. Cambridge University Press, Cambridge (2001)

  2. Cheng, Y.: Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. 17(8), 790–799 (1995)

  3. Condon, A., Karp, R.M.: Algorithms for graph partitioning on the planted partition model. Random Struct. Algorithms 18(2), 116–140 (2001)

  4. Crandall, D.J., Backstrom, L., Huttenlocher, D., Kleinberg, J.: Mapping the world’s photos. In: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, pp. 761–770. ACM, New York (2009)

  5. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise, pp. 226–231. AAAI Press (1996)

  6. Frieze, A., Karoński, M.: Introduction to Random Graphs. Cambridge University Press, Cambridge (2015)

  7. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 7821–7826 (2002)

  8. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)

  9. Ciollaro, M., Wang, D.: Package: MeanShift. https://cran.r-project.org/web/packages/MeanShift/MeanShift.pdf. Accessed 2017

  10. Hahsler, M., et al.: Package: dbscan. https://cran.r-project.org/web/packages/dbscan/dbscan.pdf. Accessed 2017

  11. Various: Boost.Geometry. http://www.boost.org/doc/libs/1.61.0/libs/geometry/doc/html/index.html. Accessed 2017

Author information

Corresponding authors

Correspondence to Colin Cooper or Ngoc Vu .

A Appendix

A.1 Complexity Analysis

The k-MXT procedure can be broken down into two main tasks: constructing the disk graph and selecting the k-Max-Triangle neighbours for each vertex.

Graph Construction. To construct the proximity graph, we need to find the points inside each vertex’s disc. The naive approach takes \(O(n^2)\) operations. A simple improvement is to separate the latitudes and longitudes into two lists and sort each. For each vertex \(v = (x_v, y_v)\) and each of its coordinates \(x_v\) and \(y_v\), we locate the values within a fixed distance of it using a binary range search. The result, for each coordinate, is a set of points whose coordinate lies within a fixed distance d of that of the queried point, i.e. \( S(x_v) = \{u: | x_u - x_v | \le d\} \text { and } S(y_v) = \{w: |y_w - y_v | \le d\}. \) The intersection of these two sets \(S(x_v) \cap S(y_v)\) can be computed by scanning the smaller set, and yields the set of points bounded by a square of width 2d centred at v. Transforming the bounding square into a bounding circle then requires an additional computation step. Overall, the improved naive method takes

$$ \underbrace{O(n \log n)}_{\text {sort}} + \underbrace{O(n \log n)}_{\text {range search}} + \underbrace{O( n \times \min _{v \in V} \{ |S(x_v)|, |S(y_v)| \}).}_{\text {connect edges}} $$
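The improved naive construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes planar coordinates with Euclidean distance, and the function name `disk_graph` and the adjacency-set representation are hypothetical choices made for the example.

```python
from bisect import bisect_left, bisect_right
from math import hypot


def disk_graph(points, d):
    """Return a list of adjacency sets for the disk graph of radius d.

    points: list of (x, y) pairs in planar coordinates (an assumption).
    """
    n = len(points)
    # Sort the vertex indices by each coordinate separately.
    by_x = sorted(range(n), key=lambda i: points[i][0])
    by_y = sorted(range(n), key=lambda i: points[i][1])
    xs = [points[i][0] for i in by_x]
    ys = [points[i][1] for i in by_y]

    adj = [set() for _ in range(n)]
    for v, (xv, yv) in enumerate(points):
        # Binary range search: S(x_v) and S(y_v) as slices of the sorted orders.
        x_lo, x_hi = bisect_left(xs, xv - d), bisect_right(xs, xv + d)
        y_lo, y_hi = bisect_left(ys, yv - d), bisect_right(ys, yv + d)
        # Scan only the smaller of the two candidate sets.
        if x_hi - x_lo <= y_hi - y_lo:
            candidates = by_x[x_lo:x_hi]
        else:
            candidates = by_y[y_lo:y_hi]
        for u in candidates:
            if u == v:
                continue
            dx, dy = points[u][0] - xv, points[u][1] - yv
            # Cheap bounding-square test first, then the exact circle test.
            if abs(dx) <= d and abs(dy) <= d and hypot(dx, dy) <= d:
                adj[v].add(u)
                adj[u].add(v)
    return adj
```

For geo-tagged data the coordinates would first have to be projected to a plane, or the circle test replaced by a spherical distance.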

Further improvement requires spatial data structures such as R-trees or kd-trees. The construction of such trees takes, on average, \(O(n \log n)\) time. A search query takes \(O(\log n)\) on average and O(n) in the worst case. Also, note that the search operations of both structures support only queries by rectangle. Thus, an additional step is required to locate the points within a vertex’s circle of radius d.

We give experimental running times of the graph construction methods in Table 3. We used the C++ Boost Geometry library [11] for an implementation of the R-tree and the Approximate Nearest Neighbour (ANN) library for the kd-tree.

Additionally, when clustering geo-tagged data at country or world scale, distances need to be calculated as great-circle distances. Note that only the R-tree implementation supports this operation. Our experiments show that the improved naive method outperforms R-trees (Table 3).
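As an illustration of a great-circle distance computation, the sketch below uses the haversine formula. The formula, the spherical Earth model, the mean radius constant and the function name `great_circle_m` are assumptions made for this example; the source does not state which formula the experiments used.

```python
from math import radians, sin, cos, asin, sqrt

# Mean Earth radius in metres (assumption: spherical Earth model).
EARTH_RADIUS_M = 6_371_000.0


def great_circle_m(lat1, lon1, lat2, lon2):
    """Haversine distance in metres between two (latitude, longitude) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_M * asin(min(1.0, sqrt(a)))
```

At city scale with d of a few tens of metres, the difference between this and a planar approximation is likely to be small, which is presumably why it matters mainly at country or world scale.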

Selecting Neighbours. Given the constructed graph, selecting the k-Max-Triangle edges for each vertex requires the following steps:

  1. For each edge: calculate the number of common neighbours;

  2. For each vertex: select the k highest scores;

  3. Find the connected components of the resulting graph fragments.
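The three steps above can be sketched end to end as follows. This is a compact illustration under stated assumptions, not the authors' code: the graph is given as a dictionary of neighbour sets, ties between equal scores are broken by vertex label (the source does not specify a tie-breaking rule), and the name `k_mxt_clusters` is hypothetical.

```python
from collections import deque


def k_mxt_clusters(adj, k):
    """Return the vertex sets of the components of the k-MXT subgraph.

    adj: dict mapping each vertex to the set of its neighbours (undirected graph).
    """
    # Steps 1 and 2: score each incident edge by its number of common
    # neighbours and keep the k best edges selected by each vertex.
    kept = {v: set() for v in adj}
    for v, nbrs in adj.items():
        # Tie-break by vertex label: an assumption made for determinism.
        ranked = sorted(nbrs, key=lambda u: (-len(adj[v] & adj[u]), u))
        for u in ranked[:k]:
            kept[v].add(u)
            kept[u].add(v)

    # Step 3: connected components of the retained edges by breadth-first search.
    seen, clusters = set(), []
    for start in kept:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            x = queue.popleft()
            for y in kept[x]:
                if y not in seen:
                    seen.add(y)
                    component.add(y)
                    queue.append(y)
        clusters.append(component)
    return clusters
```

For example, on two triangles joined by a single bridge edge, k = 2 drops the bridge (it lies in no triangle) and returns the two triangles as separate clusters, whereas k = 3 retains it.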

Table 3. Large dataset: \(n = 45,000\). Graph construction time in seconds, averaged over 10 executions. The best running times are highlighted. Interestingly, the improved naive method performs better than the R-tree when computing with spherical coordinates.
Table 4. Average density and density of the top polygons for the large dataset with \(d,\epsilon =25\) m. Results for the 2-MXT algorithm (\(w=40, 80\)) and DBSCAN (\(minPts = 40, 80\)) are also included.

If the adjacency lists are sorted (implying \(O(n \times d_{max} \log d_{max})\) pre-processing), the first task is equivalent to finding the intersection of two sorted sets. Thus for each edge (u, v) it takes at most \(d(u) + d(v)\) operations. Hence, the overall computation cost is

$$\begin{aligned} \sum _{v \in V} \sum _{u \in N(v)} \big ( d(v) + d(u) \big ) = \sum _{v \in V}d^2(v) + \sum _{v \in V} \sum _{u \in N(v)} d(u) \le 2 d_{max} \sum _{v \in V}d(v) = 4 |E| d_{max}. \end{aligned}$$

Thus the first task takes \(O(|E| \times d_{max})\).
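The bound of \(d(u) + d(v)\) per edge corresponds to a standard two-pointer merge over the two sorted adjacency lists. A minimal sketch, with the hypothetical name `common_neighbours`:

```python
def common_neighbours(a, b):
    """Count the elements common to two sorted lists.

    Each iteration advances at least one pointer, so the loop runs
    at most len(a) + len(b) times.
    """
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count
```

Here `a` and `b` would be the sorted adjacency lists of the two endpoints of an edge.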

The second task is done using a priority queue, i.e. a min-heap of size k, which takes at most \(\sum _{v \in V} d(v) \log k = \log {k}\times 2|E| = O(|E|)\) operations for fixed k.
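One way to realise the \(\log k\) cost per scored neighbour is a min-heap bounded to size k, as in the sketch below. It uses Python's standard `heapq` module; the function name `top_k_neighbours` and the (neighbour, score) input format are assumptions for illustration.

```python
import heapq


def top_k_neighbours(scores, k):
    """Return the k highest-scoring neighbours, best first.

    scores: iterable of (neighbour, score) pairs for one vertex.
    """
    if k <= 0:
        return []
    heap = []  # min-heap of (score, neighbour) holding at most k entries
    for u, s in scores:
        if len(heap) < k:
            heapq.heappush(heap, (s, u))
        elif (s, u) > heap[0]:
            # The new entry beats the current weakest one: swap it in, O(log k).
            heapq.heapreplace(heap, (s, u))
    return [u for s, u in sorted(heap, reverse=True)]
```

Because the heap never holds more than k entries, each of the d(v) neighbours of a vertex costs at most \(O(\log k)\).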

The final task is to compute the connected components of the k-MXT subgraph. This can be done using any classical algorithm in time linear in the number of edges. Since each vertex selects at most k edges, the subgraph has at most kn edges, giving O(kn) overall. For small values of k this is O(n).

Overall, the fragmenting process has a running time of

$$ \underbrace{O(n \times d_{\max } \log d_{\max })}_{\text {pre-processing}} + \underbrace{O(|E| d_{\max })}_{\text {set intersect}} + \underbrace{O(|E|)}_{k{\text {-max scores}}} + \underbrace{O(n)}_{\text {components}} = O( |E| d_{\max }). $$

In comparison, DBSCAN implemented with R-trees or kd-trees has an average complexity of \(O(n \log n)\) [5]. For mean shift, a loose theoretical running time is \(O(n \times T_{\max })\), where \(T_{\max }\) is the maximum number of iterations allowed for each query. In practice, we usually set the distance threshold that determines convergence to \(\lambda = 0.5\) m, and noticed that the mode search converged in fewer iterations than \(T_{\max }\).

Experimental Running Time. DBSCAN is executed using the R package dbscan [10], a fast re-implementation of the original algorithm in C++. Mean-shift is executed using the R package MeanShift [9].

Table 2 presents the experimental running times of the algorithms. The results of the small dataset experiments show that mean shift has the slowest running time; hence it was excluded from the large experiment. Furthermore, in both experiments, the optimized DBSCAN outperformed the current implementation of k-MXT, which is hardly surprising. For k-MXT, the running time of the clustering procedure appears to scale quadratically with the distance d, which determines the number of edges in the graph and hence \(d_{max}\). This is probably due to the \(O(|E| d_{max}) \approx O(n (d_{max})^2)\) set intersection.
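The approximation \(O(|E| d_{max}) \approx O(n (d_{max})^2)\) used here can be spelled out with the handshake lemma, which bounds the number of edges by the maximum degree:

$$ |E| = \frac{1}{2} \sum _{v \in V} d(v) \le \frac{1}{2}\, n\, d_{\max } \quad \Longrightarrow \quad |E|\, d_{\max } \le \frac{1}{2}\, n\, (d_{\max })^2 . $$

The bound is tight up to a constant when most vertices have degree close to \(d_{\max }\); whether that holds for a particular dataset is an assumption.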

Fig. 8. k-MXT algorithm on the small dataset with \(d=10\) m and \(k = 1,2,3\) and 4. For \(k \ge 2\), some regions are clearly defined. One can roughly identify some straight lines which are bridges, roads or footpaths. Notice that the polygons expand with k.

Fig. 9. k-MXT (\(d=10\) m). The results for each k have been overlaid on the same layer, demonstrating the expanding regions as k is increased. The region in the top right-hand corner is the small dataset, shown previously in Figs. 4 and 8.

Fig. 10. DBSCAN (\(\epsilon =10\) m). As minPts is increased, the number of clusters decreases and the area of the polygons also decreases. There is no visible green polygon (corresponding to \(minPts = 80\)) and there are few blue polygons (\(minPts = 40\)), while the areas of the red polygons (\(minPts = 20\)) are relatively small. (Color figure online)

Fig. 11. 2-MXT (\(d=10\) m), layered on London’s map. Some popular tourist attractions are identified and labelled. It can be seen that there are clusters/polygons corresponding to these places. The two separate pictures provide a closer look at two of the regions, each identified by the colour of its border. (Color figure online)

Fig. 12. k-MXT (\(d=25\) m). The green (smallest, \(k=1\)) polygons are very small and their orientation seems quite arbitrary. The blue (\(k=2\)) polygons have expanded considerably but still retain some interest, while the red (\(k=3\)) and black (\(k=4\)) regions cover very large areas of central London. (Color figure online)

Fig. 13. DBSCAN (\(\epsilon =25\) m). Overall, the picture shows a decent level of clarity. Compared to k-MXT in Fig. 12, the areas of the regions produced are much smaller. Notably, the green (small, \(minPts = 80\)) and red (medium, \(minPts=40\)) polygons cover some interesting regions. (Color figure online)

Fig. 14. Parameterized k-MXT and DBSCAN comparison. Truncating low-weight edges in k-MXT improves performance.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Cooper, C., Vu, N. (2018). An Experimental Study of the k-MXT Algorithm with Applications to Clustering Geo-Tagged Data. In: Bonato, A., Prałat, P., Raigorodskii, A. (eds) Algorithms and Models for the Web Graph. WAW 2018. Lecture Notes in Computer Science, vol 10836. Springer, Cham. https://doi.org/10.1007/978-3-319-92871-5_10

  • DOI: https://doi.org/10.1007/978-3-319-92871-5_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-92870-8

  • Online ISBN: 978-3-319-92871-5

  • eBook Packages: Computer Science (R0)
