
Mining latent patterns in geoMobile data via EPIC

Published in: World Wide Web

Abstract

We coin the term geoMobile data to emphasize datasets that exhibit geo-spatial features reflective of human behaviors. We propose and develop an EPIC framework to mine latent patterns from geoMobile data and provide meaningful interpretations: we first ‘E’xtract latent features from high dimensional geoMobile datasets via Laplacian Eigenmaps and perform clustering in this latent feature space; we then use a state-of-the-art visualization technique to ‘P’roject these latent features into 2D space; and finally we obtain meaningful ‘I’nterpretations by ‘C’ulling cluster-specific significant feature-sets. We show that the local space contraction property of our approach is superior to that of other major dimension reduction techniques. Using diverse real-world geoMobile datasets, we demonstrate the efficacy of our framework via three case studies.
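As a minimal sketch of the ‘E’ step, the Laplacian Eigenmaps embedding can be computed from a kNN graph with heat-kernel weights; the graph construction, parameters, and synthetic data below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, n_components=2, n_neighbors=5, sigma=1.0):
    # Pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # kNN adjacency with heat-kernel (Gaussian) weights
    W = np.zeros_like(d2)
    idx = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]  # skip self
    for i, nbrs in enumerate(idx):
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)          # symmetrize
    D = np.diag(W.sum(axis=1))
    L = D - W                       # unnormalized graph Laplacian
    # Generalized eigenproblem L v = lambda D v; drop the trivial eigenvector
    _, vecs = eigh(L, D)
    return vecs[:, 1:n_components + 1]

# Two well-separated synthetic clusters in 10-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 10)),
               rng.normal(3, 0.1, (20, 10))])
Z = laplacian_eigenmaps(X)
print(Z.shape)  # (40, 2)
```

Clustering (e.g. k-means) is then performed on the latent features Z rather than on the raw high-dimensional data.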




Notes

  1. Although calls involving two neighboring towers semantically qualify as local, we regard a call as local solely from the tower’s perspective: both the caller and the callee of the call are associated with the same tower.

  2. https://archive.ics.uci.edu/ml/datasets/Wine

  3. http://yann.lecun.com/exdb/mnist/

References

  1. Alsheikh, M.A., Niyato, D., Lin, S., Tan, H.P., Han, Z.: Mobile big data analytics using deep learning and Apache Spark. IEEE Netw. 30(3), 22–29 (2016). https://doi.org/10.1109/MNET.2016.7474340

  2. Baratchi, M., Meratnia, N., Havinga, P.J.M., et al.: A hierarchical hidden semi-Markov model for modeling mobility data. In: ACM UbiComp (2014)

  3. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15(6), 1373–1396 (2003)

  4. Bengio, Y., Paiement, J.F., Vincent, P., et al.: Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. In: NIPS (2004)

  5. Fan, Z., Song, X., Shibasaki, R.: CitySpectrum: a non-negative tensor factorization approach. In: ACM UbiComp (2014)

  6. Hristova, D., Williams, M.J., Musolesi, M., et al.: Measuring urban social diversity using interconnected geo-social networks. In: ACM WWW (2016)

  7. Ihler, A.T., Smyth, P.: Learning time-intensity profiles of human activity using non-parametric Bayesian models. In: NIPS (2006)

  8. Kling, F., Pozdnoukhov, A.: When a city tells a story: urban topic analysis. In: ACM SIGSPATIAL (2012)

  9. Krishnamurthy, A.: High-dimensional clustering with sparse Gaussian mixture models. Unpublished paper, pp. 191–192 (2011)

  10. Lakhina, A., Crovella, M., Diot, C.: Diagnosing network-wide traffic anomalies. In: ACM SIGCOMM Computer Communication Review (2004)


  11. Lv, Y., Duan, Y., Kang, W., Li, Z., Wang, F.Y.: Traffic flow prediction with big data: a deep learning approach. IEEE Trans. Intell. Transp. Syst. 16(2), 865–873 (2015)


  12. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: NIPS (2005)

  13. Shenzhen Metro: Subway construction plan. http://www.szpl.gov.cn/main/zsgg/200707090211041.shtml (2015)

  14. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: NIPS (2002)

  15. Ozakin, A., Vasiloglou, N. II, Gray, A.: Manifold Learning Theory and Applications. CRC Press, Boca Raton (2011)

  16. Pressley, A.: Elementary Differential Geometry. Springer, Berlin (2010)


  17. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: explicit invariance during feature extraction. In: ICML (2011)

  18. Robusto, C.: The cosine-haversine formula. Am. Math. Mon. 64(1), 38–40 (1957)


  19. Schiebinger, G., Wainwright, M.J., Yu, B., et al.: The geometry of kernelized spectral clustering. Ann. Stat. 43(2), 819–846 (2015)


  20. Städler, N., Mukherjee, S.: Penalized estimation in high-dimensional hidden Markov models with state-specific graphical models. Ann. Appl. Stat. (2013)

  21. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

  22. Wallach, H.M., Mimno, D.M., McCallum, A.: Rethinking LDA: why priors matter. In: NIPS (2009)

  23. Wang, Z., Hu, K., Xu, K., et al.: Structural analysis of network traffic matrix via relaxed principal component pursuit. Comput. Netw. (2012)

  24. Witayangkurn, A., Horanont, T., Sekimoto, Y., et al.: Anomalous event detection on large-scale GPS data from mobile phones using hidden Markov model and cloud platform. In: ACM UbiComp (2013)

  25. Yuan, J., Zheng, Y., Xie, X.: Discovering regions of different functions in a city using human mobility and POIs. In: ACM SIGKDD (2012)

  26. Zhang, Y., Ge, Z., Greenberg, A., Roughan, M.: Network anomography. In: ACM SIGCOMM IMC (2005)

  27. Zhang, D., Huang, J., Li, Y., et al.: Exploring human mobility with multi-source data at extremely large metropolitan scales. In: ACM MobiCom (2014)

  28. Zhang, F., Wilkie, D., Zheng, Y., Xie, X.: Sensing the pulse of urban refueling behavior. In: ACM UbiComp (2013)

Download references

Acknowledgments

This research was supported in part by DoD ARO MURI Award W911NF-12-1-0385, DTRA grant HDTRA1-14-1-0040, and NSF grants CNS-1411636, CNS-1618339 and CNS-1617729.

Author information

Corresponding author

Correspondence to Arvind Narayanan.

Additional information


Arvind Narayanan and Saurabh Verma contributed equally to this work.

This article belongs to the Topical Collection: Special Issue on Social Computing and Big Data Applications

Guest Editors: Xiaoming Fu, Hong Huang, Gareth Tyson, Lu Zheng, and Gang Wang

Appendix


1.1 Proof of Proposition 1

Proof

KDE is a non-parametric way to estimate a probability density function; it leverages the chosen kernel in the input space for smooth estimation. Given sub-manifold density estimates p(yi) for data points \(\mathbf {y}_{i}\in \mathbb {R}^{d}\), we want to find representations \(\mathbf {z}_{i}\in \mathbb {R}^{p}\), p < d, such that the new density estimates q(zi) agree with the original ones. Here KH, KL denote the kernels in the higher and lower dimensions, h is the kernel bandwidth, and N is the number of data points. The KDEs in the higher and lower dimensions (assuming the bandwidth h remains the same) are given by:

$$ {p}(\mathbf{y})=\frac{1}{N}\sum\limits_{j=1}^{N} \frac{1}{h^{d}} K_{H} \left( \frac{\| \mathbf{y}-\mathbf{y}_{j} \|_{d}}{h} \right), \quad {q}(\mathbf{z})=\frac{1}{N}\sum\limits_{j=1}^{N} \frac{1}{h^{p}} K_{L} \left( \frac{\| \mathbf{z}-\mathbf{z}_{j} \|_{p}}{h} \right) $$

\(\text {such that} \int K(u)du =1\). The objective function of KL divergence loss for KDE can be computed as follows:

$$ \begin{array}{@{}rcl@{}} \mathcal{L} &=& \underset{\mathbf{z}}{\min}\hspace{0.2em} KL(p\|q) = \underset{\mathbf{z}}{\min}\hspace{0.2em} \sum\limits_{i=1}^{N}p(\mathbf{y}_{i})\log \frac{p(\mathbf{y}_{i})}{q(\mathbf{z}_{i})} \\ &=& \underset{\mathbf{z}}{\min}\hspace{0.2em} \frac{1}{Nh^{d}}\sum\limits_{i=1}^{N}\sum\limits_{j} K_{H}(\mathbf{y}_{i},\mathbf{y}_{j}) \log \frac{{\sum}_{j} K_{H}(\mathbf{y}_{i},\mathbf{y}_{j})}{{\sum}_{j} K_{L}(\mathbf{z}_{i},\mathbf{z}_{j})} +c_{1} \end{array} $$

Using the log-sum inequality, we can show that

$$ \mathcal{L} \leq \frac{1}{Nh^{d}}\hspace{0.2em} \underbrace{ \underset{\mathbf{z}}{\min}\hspace{0.2em} \sum\limits_{i=1}^{N} \sum\limits_{j=1}^{N} K_{H}(\mathbf{y}_{i},\mathbf{y}_{j}) \log \frac{ K_{H}(\mathbf{y}_{i},\mathbf{y}_{j})}{ K_{L}(\mathbf{z}_{i},\mathbf{z}_{j})}}_{\mathcal{J}} +c_{1} \leq c_{2} \times \mathcal{J} +c_{1} $$
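For completeness, the log-sum inequality invoked here states that for nonnegative numbers \(a_{j}, b_{j}\),

$$ \left( \sum\limits_{j} a_{j}\right) \log \frac{{\sum}_{j} a_{j}}{{\sum}_{j} b_{j}} \leq \sum\limits_{j} a_{j} \log \frac{a_{j}}{b_{j}}, $$

applied row-wise with \(a_{j}=K_{H}(\mathbf{y}_{i},\mathbf{y}_{j})\) and \(b_{j}=K_{L}(\mathbf{z}_{i},\mathbf{z}_{j})\).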

\(\mathcal {J}\) is the objective function of t-SNE (with specific kernels), which upper bounds (up to a multiplicative scale and an additive constant) the kernel density estimation loss function. □
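As a concrete illustration of the density estimate \(p(\mathbf{y})\) defined above, here is a small numpy sketch with a Gaussian kernel; the data, bandwidth, and query points are illustrative assumptions:

```python
import numpy as np

def kde(points, data, h=0.5):
    """p(y) = (1/N) * sum_j h^{-d} K(||y - y_j|| / h), with Gaussian K."""
    N, d = data.shape
    diff = points[:, None, :] - data[None, :, :]       # (M, N, d)
    u2 = (diff ** 2).sum(-1) / h ** 2                  # ||y - y_j||^2 / h^2
    K = np.exp(-u2 / 2) / (2 * np.pi) ** (d / 2)       # kernel integrates to 1
    return K.sum(axis=1) / (N * h ** d)

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 1))                       # samples from N(0, 1)
grid = np.array([[0.0], [3.0]])
p = kde(grid, data)
print(p[0] > p[1])  # True: density is higher near the mode
```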

1.2 Proof of Proposition 2

Schiebinger et al. [19] studied the normalized Laplacian embedding of i.i.d. samples generated from a finite mixture of nonparametric distributions. They showed that when the overlap between the mixture components is small and the sample size is large, then with high probability the embedded samples form an orthogonal cone structure (OCS). Figure 18 shows that a (1 − α) fraction of each of two clusters is concentrated within a cone of angle 𝜃 around the orthogonal axes e1 and e2.

Figure 18: Visualizing (α,𝜃)-OCS [19]

Theorem 1 (Finite-sample angular structure)

There are numbers b, b0, b1, b2, δ, t satisfying certain conditions such that the embedded dataset \(\{\phi (X_{i}),Z_{i}\}^{n}_{i=1}\) has (α,𝜃)-OCS with

$$ |\cos\theta|\leq \frac{b_{0} \sqrt{\varphi_{n}(\delta)}}{w_{min}^{3} t - b_{0} \sqrt{\varphi_{n}(\delta)}}, \qquad \alpha \leq \frac{b_{1}}{w_{min}^{1.5}} \varphi_{n}(\delta)+ \psi({2t}) $$
(11)

and holds with probability at least \(1-8K^{2}\exp (\frac {b_{2} n \delta ^{4}}{\delta ^{2}+S_{max}(\overline {\mathbb {P}} )+ B(\overline {\mathbb {P}} )} )\).

Proof of Proposition 2:

Our strategy is to exploit the OCS structure of the input data. Let \(\mathbf {X}\in \mathbb {R}^{N\times D}\) be the normalized data with unit norm, with pij, qij the corresponding higher- and lower-dimensional kernel densities and Z1, Z2 the respective normalization constants. Let \(\mathbf {X}^{\prime }\in \mathbb {R}^{N\times d}\), d < D, be the normalized data obtained after LE dimension reduction, with corresponding variables \(p_{ij}^{\prime },q_{ij}^{\prime },Z_{1}^{\prime },Z_{2}^{\prime }\). Let \(\beta , \beta ^{\prime } \in (0,\frac {\pi }{2})\) be the angles between the input feature vectors \(\langle \mathbf {x}_{i},\mathbf {x}_{j}\rangle \) and \(\langle \mathbf {x}^{\prime }_{i},\mathbf {x}^{\prime }_{j}\rangle \) respectively. Constants are denoted by \(c_{1}, c_{1}^{\prime }, c_{2}, c_{3}, a_{1}, a_{2} \geq 0\). Also, σ and σ′ are the kernel bandwidths of the estimated kernel densities on the X and X′ data respectively. Let the i-th of the K clusters have Ni samples. For our analysis, we focus on this i-th cluster.

Since t-SNE preserves kernel density in lower dimensions, we will have pij = qij and \(p_{ij}^{\prime }=q_{ij}^{\prime }\). Some t-SNE related expressions that we will use for the proof are as follows,

$$ \begin{array}{@{}rcl@{}} && p_{ij} =\frac{\exp{\left( - \frac{\| \mathbf{x}_{i}-\mathbf{x}_{j} \|^{2}}{2\sigma^{2}} \right)}}{{\sum}_{k\neq l}\exp{\left( - \frac{\| \mathbf{x}_{k}-\mathbf{x}_{l} \|^{2}}{2\sigma^{2}} \right)}};q_{ij} =\frac{(1+\| \mathbf{y}_{i}-\mathbf{y}_{j} \|^{2})^{-1}}{{\sum}_{k\neq l}(1+\| \mathbf{y}_{k}-\mathbf{y}_{l} \|^{2})^{-1}}\\ && Z_{1}={\sum}_{k\neq l}\exp{\left( - \frac{\| \mathbf{x}_{k}-\mathbf{x}_{l} \|^{2}}{2\sigma^{2}} \right)};Z_{2}={\sum}_{k\neq l}(1+\| \mathbf{y}_{k}-\mathbf{y}_{l} \|^{2})^{-1}\\ \end{array} $$

Similar expressions can be obtained for \(p_{ij}^{\prime },q_{ij}^{\prime },Z_{1}^{\prime }\) and \(Z_{2}^{\prime }\). From these equations, we can show that,

$$ \frac{(1-\cos\beta)}{{\sigma_{i}^{2}}} =\log\left( \frac{1}{p_{ij}}-2\right) \frac{1}{\log \sum\limits_{{k\neq l; k,l \neq i,j}}\exp{\left( - \frac{\| \mathbf{x}_{k}-\mathbf{x}_{l} \|^{2}}{2\sigma^{2}} \right)}} $$
(12)
$$ \| \mathbf{y}_{i}-\mathbf{y}_{j} \|^{2}=\left( \frac{1}{q_{ij}}-2\right)\frac{1}{\sum\limits_{{k\neq l; k,l \neq i,j}}(1+\| \mathbf{y}_{k}-\mathbf{y}_{l} \|^{2})^{-1}}-1 $$
(13)

\( \text {Let } c_{1} =\sum \limits _{{k\neq l; k,l \neq i,j}}(1+\| \mathbf {y}_{k}-\mathbf {y}_{l} \|^{2})^{-1}\). Now according to Theorem 1, β′ is bounded in \((\frac {\pi }{2}-2\theta , \frac {\pi }{2}+2\theta )\) with high probability if (i,j) belong to different class labels. In general, for small 𝜃, we can assume that the different clusters form a separation angle (with respect to the origin) such that \(\frac {\pi }{2}-2\theta > \beta \), i.e., β′ > β for all such pairs (i,j). Then according to (12), \(p_{ij} \geq p_{ij}^{\prime }\) and therefore \(q_{ij} \geq q_{ij}^{\prime }\) if (i,j) belong to different class labels. Equation (13) further yields,

$$ \log \frac{c_{1}^{\prime}(1+\| \mathbf{y}^{\prime}_{i}-\mathbf{y}^{\prime}_{j} \|^{2})+2}{c_{1}(1+\| \mathbf{y}_{i}-\mathbf{y}_{j} \|^{2})+2} = \log \frac{q_{ij}}{q_{ij}^{\prime}} = \log \frac{p_{ij}}{p_{ij}^{\prime}} \geq 0 $$

This shows that Lt-SNE always provides a better mapping than t-SNE when \(c_{1}^{\prime } \leq c_{1}\), which is generally the case. For small 𝜃, we expect \(p_{kl} > p^{\prime }_{kl}\) \((\Rightarrow q_{kl} > q^{\prime }_{kl})\) for (k,l) belonging to different classes and \(p_{kl} \approx p^{\prime }_{kl}\) \((\Rightarrow q_{kl} \approx q^{\prime }_{kl})\) for (k,l) belonging to the same class. This leads to \(c_{1}^{\prime } \leq c_{1}\) since qkl ∝ (1 + ∥yk − yl∥2)− 1. Next, we establish a lower bound on this mapping ratio using the expression,

$$ \log \frac{c_{1}^{\prime}(1+\| \mathbf{y}^{\prime}_{i}-\mathbf{y}^{\prime}_{j} \|^{2})+2}{c_{1}(1+\| \mathbf{y}_{i}-\mathbf{y}_{j} \|^{2})+2} = \frac{1-\cos\beta^{\prime}}{\sigma^{\prime2}}+ \log \frac{Z_{1}^{\prime}}{Z_{1}}+ \frac{\cos \beta -1}{\sigma^{2}} $$
(14)

For fixed β, the \((\frac {\cos \beta -1} {\sigma ^{2}}- \log Z_{1})\) term is constant. Now the normalization constant \(Z_{1}^{\prime }\) is the sum of kernel densities between samples within a cluster itself and across other clusters. From Theorem 1, we know that a (1 − α) fraction of a cluster belongs to an orthogonal cone structure with angle \(\theta \in (0,\frac {\pi }{4})\) with high probability. Ignoring the α fraction of samples (which add positive values to \(Z_{1}^{\prime }\)), we can lower-bound \(Z_{1}^{\prime }\) with the same probability bound as given in Theorem 1 for 𝜃, α.

$$ \begin{array}{@{}rcl@{}} Z^{\prime}_{1} &\geq& \underbrace{ \sum\limits_{k = 1}^{K} (1-\alpha)^{2} N_{k} (N_{k}-1) e^{-\frac{(1-\cos 2\theta)}{\sigma^{\prime2}}}}_{\text{sum of densities within clusters}}+ \underbrace{ \sum\limits_{k \neq l} (1-\alpha)^{2} N_{k} N_{l} e^{-\frac{(1+\sin 2\theta)}{\sigma^{\prime2}}}}_{\text{sum of densities across clusters}} \\ Z^{\prime}_{1} &\geq& \frac{(1-\alpha)^{2}}{e^{\frac{(1-\cos 2\theta)}{\sigma^{\prime2}}}} \left( \sum\limits_{k = 1}^{K} N_{k} (N_{k}-1) + \sum\limits_{k \neq l} N_{k} N_{l} e^{- \frac{\sqrt{2}\cos (\frac{\pi}{4} -2\theta)}{\sigma^{\prime2}}} \right)\\ \end{array} $$
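The second inequality folds the cross-cluster exponent into the within-cluster one via the trigonometric identity

$$ \sin 2\theta + \cos 2\theta = \sqrt{2}\cos \left( \frac{\pi}{4} -2\theta \right), $$

since \((1+\sin 2\theta)-(1-\cos 2\theta) = \sin 2\theta + \cos 2\theta\).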

Finally, plugging \(Z^{\prime }_{1}\) into (14) and setting \(\beta ^{\prime }=\frac {\pi }{2}-\theta \) to obtain the lower bound, we arrive at our final expressions.

$$ c_{2} =\frac{\cos \beta -1} {\sigma^{2}}- \log Z_{1} + \log\left( \sum\limits_{k = 1}^{K} N_{k} (N_{k}-1) +\sum\limits_{k \neq l} (1-\alpha)^{2} N_{k} N_{l} e^{- \sqrt{2}}\right) $$
$$ \begin{array}{@{}rcl@{}} \log \frac{c_{1}^{\prime}(1+\| \mathbf{y}^{\prime}_{i}-\mathbf{y}^{\prime}_{j} \|^{2})+2}{c_{1}(1+\| \mathbf{y}_{i}-\mathbf{y}_{j} \|^{2})+2} & \geq& \frac{\sqrt{2}\sin (\frac{\pi}{4} -2\theta)} {\sigma^{\prime 2}}+ 2\log(1-\alpha)+c_{2} \\ \implies \| \mathbf{y}^{\prime}_{i}-\mathbf{y}^{\prime}_{j} \|^{2} & \geq& a_{1}\| \mathbf{y}_{i}-\mathbf{y}_{j} \|^{2} +a_{2} \end{array} $$

Here, \(c_{3}= \exp (\frac {\sqrt {2}\sin (\frac {\pi }{4} -2\theta )} {\sigma ^{\prime 2}}+ 2\log (1-\alpha )+c_{2}) \geq 1\), \(a_{2}=\frac {c_{3}+c_{3}c_{1}-c_{1}^{\prime }-1}{c_{1}^{\prime }}\geq 0\) and \(a_{1}=\frac {c_{3}c_{1}}{c_{1}^{\prime }} \geq 1\) if \(c_{1} \geq c_{1}^{\prime }\), which is the case for small 𝜃. This completes the proof. □
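The t-SNE similarities \(p_{ij}\) and \(q_{ij}\) used throughout the proof can be computed directly. Below is a self-contained numpy sketch on hypothetical data, using the symmetric form with a single bandwidth σ (the data shapes and bandwidth are illustrative assumptions):

```python
import numpy as np

def tsne_similarities(X, Y, sigma=1.0):
    """Gaussian input similarities p_ij and Student-t embedding
    similarities q_ij, normalized over all pairs k != l."""
    def pairwise_sq(A):
        return ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    P = np.exp(-pairwise_sq(X) / (2 * sigma ** 2))
    Q = 1.0 / (1.0 + pairwise_sq(Y))
    np.fill_diagonal(P, 0.0)   # sums in Z_1, Z_2 run over k != l
    np.fill_diagonal(Q, 0.0)
    return P / P.sum(), Q / Q.sum()

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 10))   # high-dimensional input
Y = rng.normal(size=(30, 2))    # 2-D embedding
P, Q = tsne_similarities(X, Y)
print(round(P.sum(), 6), round(Q.sum(), 6))  # 1.0 1.0
```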


About this article


Cite this article

Narayanan, A., Verma, S. & Zhang, ZL. Mining latent patterns in geoMobile data via EPIC. World Wide Web 22, 2771–2798 (2019). https://doi.org/10.1007/s11280-019-00702-z

