An Experimental Study of the k-MXT Algorithm with Applications to Clustering Geo-Tagged Data

Cooper, Colin; Vu, Ngoc

doi:10.1007/978-3-319-92871-5_10

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10836)

Included in the following conference series:

International Workshop on Algorithms and Models for the Web-Graph

Abstract

We consider a graph fragmentation process which can be described as follows. Each vertex v selects the k adjacent vertices with which it has the largest number of common neighbours. For each selected neighbour u, we retain the edge (v, u) to form a subgraph S of the input graph. The objects of interest are the components of S, the k-Max-Triangle-Neighbour (k-MXT) subgraph, and the vertex clusters they produce in the original graph.

We study the application of this process to clustering in the planted partition model, and on the geometric disk graph formed from geo-tagged photographic data downloaded from Flickr.

In the planted partition model, there are $\ell$ partitions, or subgraphs, which are connected densely within each partition but sparsely between partitions. The objective is to recover these hidden partitions. We study the case of the planted partition model based on the random graph $G_{n,p}$ with additional edge probability q within the partitions. Theoretical and experimental results show that the 2-MXT algorithm can recover the partitions for any constant $q/p > 0$, provided the density of triangles is high enough.
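As an illustration of the model just described, the following minimal sketch samples such a graph. It is not the authors' generator: the function name `planted_partition`, the round-robin assignment of vertices to blocks, and the reading of "additional edge probability q" as an intra-partition probability of p + q are all assumptions made for this example.

```python
import random
from itertools import combinations

def planted_partition(n, ell, p, q, seed=0):
    """Sample a planted partition graph on vertices 0..n-1.

    Assumptions of this sketch: vertex v belongs to block v % ell, every
    pair is joined with probability p, and pairs inside the same block
    are joined with probability min(1, p + q) instead.
    """
    rng = random.Random(seed)
    block = [v % ell for v in range(n)]
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        prob = min(1.0, p + q) if block[u] == block[v] else p
        if rng.random() < prob:
            adj[u].add(v)
            adj[v].add(u)
    return adj, block

# Example: 3 hidden partitions of 20 vertices each.
adj, block = planted_partition(n=60, ell=3, p=0.05, q=0.5)
intra = sum(1 for u in adj for v in adj[u] if u < v and block[u] == block[v])
inter = sum(1 for u in adj for v in adj[u] if u < v and block[u] != block[v])
print("intra-partition edges:", intra, "inter-partition edges:", inter)
```

The returned `block` list plays the role of the hidden partition that a clustering algorithm is asked to recover.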

We apply the k-MXT algorithm experimentally to the problem of clustering geographical data, using London as an example. Given a dataset consisting of geographical coordinates extracted from photographs, we construct a disk graph by connecting two points if and only if their distance is at most d. Our experimental results show that the k-MXT algorithm is able to produce clusters which are comparable to those of popular clustering algorithms such as DBSCAN (see e.g. Fig. 5).

References

1. Bollobás, B.: Random Graphs. Cambridge Studies in Advanced Mathematics, 2nd edn. Cambridge University Press, Cambridge (2001)
2. Cheng, Y.: Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. 17(8), 790–799 (1995)
3. Condon, A., Karp, R.M.: Algorithms for graph partitioning on the planted partition model. Random Struct. Algorithms 18(2), 116–140 (2001)
4. Crandall, D.J., Backstrom, L., Huttenlocher, D., Kleinberg, J.: Mapping the world’s photos. In: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, pp. 761–770. ACM, New York (2009)
5. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise, pp. 226–231. AAAI Press (1996)
6. Frieze, A., Karoński, M.: Introduction to Random Graphs. Cambridge University Press, Cambridge (2015)
7. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 7821–7826 (2002)
8. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
9. Ciollaro, M., Wang, D.: Package: MeanShift. https://cran.r-project.org/web/packages/MeanShift/MeanShift.pdf. Accessed 2017
10. Hahsler, M., et al.: Package: dbscan. https://cran.r-project.org/web/packages/dbscan/dbscan.pdf. Accessed 2017
11. Various. Boost.Geometry. http://www.boost.org/doc/libs/1.61.0/libs/geometry/doc/html/index.html. Accessed 2017

Author information

Authors and Affiliations

Department of Informatics, King’s College London, London, UK
Colin Cooper & Ngoc Vu

Corresponding authors

Correspondence to Colin Cooper or Ngoc Vu.

Editor information

Editors and Affiliations

Department of Mathematics, Ryerson University, Toronto, Ontario, Canada
Anthony Bonato
Department of Mathematics, Ryerson University, Toronto, Ontario, Canada
Paweł Prałat
Department of Discrete Mathematics, Moscow Institute of Physics and Technology, Dolgoprudny, Russia
Andrei Raigorodskii

A Appendix

A.1 Complexity Analysis

The k-MXT procedure can be broken down into two main tasks: constructing the disk graph and selecting the k-Max-Triangle neighbours for each vertex.

Graph Construction. To construct the proximity graph, we need to find the points inside each vertex’s disc. The naive approach takes $O(n^2)$ operations. A simple improvement is to sort the latitudes and the longitudes separately. For each vertex $v = (x_v, y_v)$ and each of its coordinates $x_v$ and $y_v$, we locate the values within a fixed distance d of it using a binary range search. The result, for each coordinate, is the set of points which lie within distance d of the queried point in that coordinate, i.e. $S(x_v) = \{u: |x_u - x_v| \le d\} \text{ and } S(y_v) = \{w: |y_w - y_v| \le d\}.$ The intersection $S(x_v) \cap S(y_v)$ can be computed by scanning the smaller set, and yields the set of points bounded by a square of width 2d centred at v. Transforming the bounding square into a bounding circle then requires an additional distance check. Overall, the improved naive method takes

$$ \underbrace{O(n \log n)}_{\text{sort}} + \underbrace{O(n \log n)}_{\text{range search}} + \underbrace{O( n \times \min _{v \in V} \{ |S(x_v)|, |S(y_v)| \})}_{\text{connect edges}}. $$
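The improved naive construction can be sketched as follows. This is an illustrative reading of the description above rather than the authors' implementation; the function name `disk_graph` is hypothetical, and the points are assumed to be planar coordinates in the same unit as d.

```python
from bisect import bisect_left, bisect_right
from math import hypot

def disk_graph(points, d):
    """Connect every pair of points whose Euclidean distance is at most d."""
    n = len(points)
    # Sort the indices by x and by y separately.
    by_x = sorted(range(n), key=lambda i: points[i][0])
    by_y = sorted(range(n), key=lambda i: points[i][1])
    xs = [points[i][0] for i in by_x]
    ys = [points[i][1] for i in by_y]
    adj = {v: set() for v in range(n)}
    for v, (xv, yv) in enumerate(points):
        # Binary range search on each coordinate: S(x_v) and S(y_v).
        sx = by_x[bisect_left(xs, xv - d):bisect_right(xs, xv + d)]
        sy = by_y[bisect_left(ys, yv - d):bisect_right(ys, yv + d)]
        # Scan only the smaller candidate set. The distance test below
        # also rejects candidates that fall outside the other range,
        # so it covers both the square and the circle check.
        small = sx if len(sx) <= len(sy) else sy
        for u in small:
            xu, yu = points[u]
            if u != v and hypot(xu - xv, yu - yv) <= d:
                adj[v].add(u)
    return adj

points = [(0, 0), (3, 4), (6, 8), (0, 5), (10, 10)]
graph = disk_graph(points, 5)
```

Scanning the smaller of the two candidate sets is what gives the "connect edges" term in the bound above.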

Further improvement requires using spatial data structures such as R-trees or kd-trees. The construction of such trees takes $O(n \log n)$ on average. A search query takes $O(\log n)$ on average and O(n) in the worst case. Also, note that the search operations of both structures only support queries by rectangle. Thus, an additional step is required to locate points within a vertex’s circle of radius d.

We give the experimental running times of the graph construction methods in Table 3. We used the C++ Boost Geometry library [11] for the R-tree implementation and the Approximate Nearest Neighbour (ANN) library for the kd-tree.

A further consideration is the need to compute great-circle distances when clustering geo-tagged data at country or world scale. Note that only the R-tree implementation supports this operation. Our experiments show that the improved naive method outperforms R-trees (Table 3).
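For readers unfamiliar with great-circle distance, one common way to compute it is the haversine formula, sketched below. The choice of haversine, the function name `great_circle_m`, and the mean Earth radius constant are assumptions of this example; the text does not state which formula the implementations above use.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in metres

def great_circle_m(lat1, lon1, lat2, lon2):
    """Haversine distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    # Clamp to 1 to guard against rounding for near-antipodal points.
    return 2 * EARTH_RADIUS_M * asin(min(1.0, sqrt(a)))
```

In a disk graph built on spherical coordinates, a test of the form `great_circle_m(...) <= d` would replace the planar distance check.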

Selecting Neighbours. Given the constructed graph, selecting the k-Max-Triangle edges for each vertex requires the following steps:

1. For each edge: calculate the number of common neighbours;
2. For each vertex: select the k highest scores;
3. Find the connected components of the resulting graph fragments.
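The three steps above can be sketched compactly as follows. This is a minimal illustration, not the implementation that was timed in the experiments: the function name `k_mxt` is hypothetical, and ties between equal scores are broken by vertex label, which is an assumption of the sketch.

```python
from collections import deque

def k_mxt(adj, k):
    """Return the vertex clusters (components) of the k-MXT subgraph.

    adj maps each vertex to the set of its neighbours (undirected graph).
    Assumption: equal scores are ordered by vertex label.
    """
    sub = {v: set() for v in adj}
    for v, nbrs in adj.items():
        # Step 1: score each neighbour by the number of common neighbours.
        scored = sorted(nbrs, key=lambda u: (-len(adj[v] & adj[u]), u))
        # Step 2: retain the edges to the k best-scoring neighbours of v.
        for u in scored[:k]:
            sub[v].add(u)
            sub[u].add(v)
    # Step 3: connected components of the retained subgraph via BFS.
    seen, clusters = set(), []
    for s in sub:
        if s in seen:
            continue
        seen.add(s)
        comp, queue = [], deque([s])
        while queue:
            x = queue.popleft()
            comp.append(x)
            for y in sub[x] - seen:
                seen.add(y)
                queue.append(y)
        clusters.append(sorted(comp))
    return clusters

# Two triangles joined by the bridge edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
clusters = k_mxt(adj, 2)
```

In this small example the bridge lies in no triangle, so with k = 2 neither endpoint selects it and the two triangles separate into their own clusters.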

Table 3. Large dataset: $n = 45{,}000$. Graph construction time in seconds, averaged over 10 executions. The best running times are highlighted. Interestingly, the improved naive method performs better than the R-tree when computing with spherical coordinates.

Table 4. Average density and density of the top polygons for the large dataset with $d, \epsilon = 25$ m. Results for the 2-MXT algorithm ($w = 40, 80$) and DBSCAN ($minPts = 40, 80$) are also included.

If the adjacency lists are sorted (implying $O(n \times d_{max} \log d_{max})$ pre-processing), the first task is equivalent to finding the intersection of two sorted sets. Thus, for each edge $(u, v)$ it takes at most $d(u) + d(v)$ operations. Hence, the overall computation cost is

$$\begin{aligned} \sum _{v \in V} \sum _{u \in N(v)} \big( d(v) + d(u) \big) = \sum _{v \in V} d^2(v) + \sum _{v \in V} \sum _{u \in N(v)} d(u) \le 2 d_{max} \sum _{v \in V} d(v) = 4 |E| d_{max}. \end{aligned}$$

Thus the first task takes $O(|E| \times d_{max})$.
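The sorted-list intersection behind this bound can be illustrated with a linear merge that also counts its own steps. The helper names `common_neighbours` and `triangle_scores` are hypothetical, and the step counter exists only so that the total can be compared against $4|E| d_{max}$ on a toy graph.

```python
def common_neighbours(a, b):
    """Count elements shared by two sorted lists with a linear merge.

    Returns (count, steps); steps never exceeds len(a) + len(b).
    """
    i = j = count = steps = 0
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count, steps

def triangle_scores(adj):
    """Score every ordered pair (v, u) with u in N(v); total the merge steps."""
    order = {v: sorted(nbrs) for v, nbrs in adj.items()}  # pre-processing
    scores, total = {}, 0
    for v in order:
        for u in order[v]:
            scores[v, u], steps = common_neighbours(order[v], order[u])
            total += steps
    return scores, total

# Two triangles joined by the bridge edge (2, 3): |E| = 7, d_max = 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores, total = triangle_scores(adj)
```

As in the double sum above, each edge is visited once from each endpoint, which is where the factor of two in the bound comes from.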

The second task is done using a priority queue, i.e. a min-heap, which takes at most $\sum _{v \in V} d(v) \log k = \log {k}\times 2|E| = O(|E|)$ operations for fixed k.
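A minimal sketch of this selection step for a single vertex is given below, using Python's standard `heapq` module. The function name `top_k_neighbours` and the representation of scores as (score, neighbour) pairs are assumptions of the example; comparing pairs also means that ties are broken by neighbour label.

```python
import heapq

def top_k_neighbours(scored, k):
    """Select the k highest-scoring neighbours with a min-heap of size k.

    scored is an iterable of (score, neighbour) pairs for one vertex.
    Each push or replace costs O(log k).
    """
    if k <= 0:
        return []
    heap = []
    for item in scored:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # evict the current minimum
    return sorted(heap, reverse=True)

best = top_k_neighbours([(1, "a"), (5, "b"), (3, "c"), (4, "d")], 2)
```

Because the heap never holds more than k entries, processing the d(v) neighbours of a vertex costs at most $d(v) \log k$, in line with the sum above.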

The final task is to compute the connected components of the k-MXT subgraph. This can be done using any classical algorithm in time linear in the number of edges, which is O(kn) overall. For small values of k this is O(n).

Overall, the fragmenting process has a running time of

$$ \underbrace{O(n \times d_{\max } \log d_{\max })}_{\text {pre-processing}} + \underbrace{O(|E| d_{\max })}_{\text {set intersect}} + \underbrace{O(|E|)}_{k{\text {-max scores}}} + \underbrace{O(n)}_{\text {components}} = O( |E| d_{\max }). $$

In comparison, DBSCAN implemented with R-trees or kd-trees has an average complexity of $O(n \log n)$ [5]. For mean shift, a loose theoretical running time is $O(n \times T_{\max })$, where $T_{\max }$ is the maximum number of iterations allowed for each query. In practice, we usually set the distance that determines convergence to $\lambda = 0.5$ m, and observed that the mode converged in fewer iterations than $T_{\max }$.

Experimental Running Time. DBSCAN is executed using the R package dbscan [10], a fast re-implementation of the original algorithm in C++. Mean-shift is executed using the R package MeanShift [9].

Table 2 presents the experimental running times of the algorithms. The results of the small dataset experiments show that mean shift has the slowest running time; hence it was excluded from the large experiment. Furthermore, in both experiments, the optimized DBSCAN outperformed the current implementation of k-MXT, which is hardly surprising. For k-MXT, the running time of the clustering procedure appears to scale quadratically with the distance, which determines the number of edges in the graph and hence $d_{max}$. This is probably due to the $O(|E| d_{max}) \approx O(n (d_{max})^2)$ set intersection.

Cite this paper

Cooper, C., Vu, N. (2018). An Experimental Study of the k-MXT Algorithm with Applications to Clustering Geo-Tagged Data. In: Bonato, A., Prałat, P., Raigorodskii, A. (eds) Algorithms and Models for the Web Graph. WAW 2018. Lecture Notes in Computer Science, vol 10836. Springer, Cham. https://doi.org/10.1007/978-3-319-92871-5_10

DOI: https://doi.org/10.1007/978-3-319-92871-5_10
Published: 30 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92870-8
Online ISBN: 978-3-319-92871-5
eBook Packages: Computer Science, Computer Science (R0)
