Elite: an elastic infrastructure for big spatiotemporal trajectories

Xie, Xike; Mei, Benjin; Chen, Jinchuan; Du, Xiaoyong; Jensen, Christian S.

doi:10.1007/s00778-016-0425-6

Regular Paper
Published: 17 February 2016

Volume 25, pages 473–493, (2016)
The VLDB Journal

Xike Xie¹,
Benjin Mei²,
Jinchuan Chen³,
Xiaoyong Du³ &
…
Christian S. Jensen¹

Abstract

As the volumes of spatiotemporal trajectory data continue to grow at a rapid pace, a new generation of data management techniques is needed in order to utilize these data to provide a range of data-driven services, including geographic-type services. Key challenges posed by spatiotemporal data include the massive data volumes, the high velocity with which the data are captured, the need for interactive response times, and the inherent inaccuracy of the data. We propose an infrastructure, Elite, that leverages peer-to-peer and parallel computing techniques to address these challenges. The infrastructure offers efficient, parallel update and query processing by organizing the data into a layered index structure that is logically centralized, but physically distributed among computing nodes. The infrastructure is elastic with respect to storage, meaning that it adapts to fluctuations in the storage volume, and with respect to computation, meaning that the degree of parallelism can be adapted to best match the computational requirements. Further, the infrastructure offers advanced functionality, including probabilistic simulations, for contending with the inaccuracy of the underlying data in query processing. Extensive empirical studies offer insight into properties of the infrastructure and indicate that it meets its design goals, thus enabling the effective management of big spatiotemporal data.

Notes

The minimum capacity is below half of the maximum capacity. Otherwise, the condensed node may exceed the maximum capacity.
Details on the *node are covered in Sect. 4.3.1.
From experiments, we obtained $\alpha =5e^{-4}$, $\beta =1e^{-6}$, and $\gamma =1e^{-6}$.
http://iapg.jade-hs.de/personen/brinkhoff/generator/.
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/.
http://en.wikipedia.org/wiki/DifferentialGPS.
http://www.mongodb.org.
https://github.com/couchbase/geocouch.
There exist different semantics for top-k queries over uncertain data, such as U-TopK, U-kRanks, Expected scores, and Expected ranks. Among them, Expected scores and Expected ranks might be the best ones in terms of properties such as Containment and Unique-Rank [39].

References

Ceikute, V., Jensen, C.S.: Vehicle routing with user-generated trajectory data. In: MDM (2015)
Yang, B., Guo, C., Ma, Y., Jensen, C.S.: Toward personalized, context-aware routing. VLDB J. 24(2), 297–318 (2015)
Dai, J., Yang, B., Guo, C., Jensen, C.S.: Efficient and accurate path cost estimation using trajectory data. In: CoRR abs/1510.02886 (2015)
Stougiannis, A., Pavlovic, M., Tauheed, F., et al.: Data-driven neuroscience: enabling breakthroughs via innovative data management. In: SIGMOD (2013)
Manyika, J., Chui, M.: Big data: the next frontier for innovation, competition, and productivity. In: McKinsey Global Institute (2011)
Cheng, R., Kalashnikov, D.V., Prabhakar, S.: Evaluating probabilistic queries over imprecise data. In: SIGMOD (2003)
Cheng, R., Kalashnikov, D.V., Prabhakar, S.: Querying imprecise data in moving object environments. TKDE 16(9), 1112–1127 (2004)
Trajcevski, G., Tamassia, R., Ding, H., et al.: Continuous probabilistic nearest-neighbor queries for uncertain trajectories. In: EDBT (2009)
Civilis, A., Jensen, C.S., Pakalnis, S.: Techniques for efficient road-network-based tracking of moving objects. TKDE 17(5), 698–712 (2005)
Jensen, C.S., Pakalnis, S.: Trax - real-world tracking of moving objects. In: VLDB (2007)
Eldawy, A., Li, Y., Mokbel, M.F., Janardan, R.: CG_Hadoop: computational geometry in MapReduce. In: GIS (2013)
Eldawy, A., Mokbel, M.F.: A demonstration of SpatialHadoop: an efficient MapReduce framework for spatial data. In: VLDB (2013)
Aji, A., Wang, F., Vo, H., et al.: Hadoop-GIS: a high performance spatial data warehousing system over mapreduce. In: VLDB (2013)
Nishimura, S., Das, S., Agrawal, D., El Abbadi, A.: MD-HBase: a scalable multi-dimensional data infrastructure for location aware services. In: MDM (2011)
Wang, J., Wu, S., Gao, H., et al.: Indexing multi-dimensional data in a cloud system. In: SIGMOD (2010)
Tsatsanifos, G., Sacharidis, D., Sellis, T.: Index-based query processing on distributed multidimensional data. GeoInformatica 17(3), 489–519 (2013)
Ratnasamy, S., Francis, P., Handley, M., et al.: A scalable content-addressable network. In: SIGCOMM (2001)
Wei, L.Y., Zheng, Y., Peng, W.C.: Constructing popular routes from uncertain trajectories. In: KDD (2012)
Pei, T., Zhou, C., Zhu, A.-X., et al.: Windowed nearest neighbour method for mining spatio-temporal clusters in the presence of noise. In: International Journal of Geographical Information Science (2010)
Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. In: VLDB (2004)
Pfoser, D., Jensen, C.S.: Capturing the uncertainty of moving-objects representations. In: SSDBM (1999)
Lian, X., Chen, L.: Monochromatic and bichromatic reverse skyline search over uncertain databases. In: SIGMOD (2008)
Kriegel, H.P., Kunath, P., Renz, M.: Probabilistic nearest-neighbor query on uncertain objects. In: DASFAA (2007)
Pugh, W.: Concurrent maintenance of lists. Technical report, Dept. of Computer Science, University of Maryland (1990)
Gargantini, I.: An effective way to represent octrees. Commun. ACM 25(12), 905–910 (1982)
Sidlauskas, D., Saltenis, S., Christiansen, C.W., et al.: Trees or grids?: indexing moving objects in main memory. In: GIS (2009)
Sidlauskas, D., Saltenis, S., Jensen, C.S.: Processing of extreme moving-object update and query workloads in main memory. VLDB J. 23(5), 817–841 (2014)
Cheng, R., Chen, J., Mokbel, M., Chow, C.Y.: Probabilistic verifiers: evaluating constrained nearest-neighbor queries over uncertain data. In: ICDE (2008)
You, S., Zhang, J., Gruenwald, L.: Large-scale spatial join query processing in cloud. In: ICDE Workshops (2015)
Trajcevski, G., Wolfson, O., Zhang, F., Chamberlain, S.: The geometry of uncertainty in moving object databases. In: EDBT (2002)
Zheng, K., Trajcevski, G., Zhou, X., Scheuermann, P.: Probabilistic range queries for uncertain trajectories on road networks. In: EDBT (2011)
Zheng, K., Fung, G.P.C., Zhou, X.: K-nearest neighbor search for fuzzy objects. In: SIGMOD (2010)
Xie, X., Yiu, M.L., Cheng, R., Lu, H.: Scalable evaluation of trajectory queries over imprecise location data. TKDE 26(8), 2029–2044 (2014)
Tao, Y., Papadias, D.: MV3R-Tree: a spatio-temporal access method for timestamp and interval queries. In: VLDB (2001)
Pfoser, D., Jensen, C.S., Theodoridis, Y.: Novel approaches to the indexing of moving object trajectories. In: VLDB (2000)
Chakka, V.P., Everspaugh, A.C., Patel, J.M., et al.: Indexing large trajectory data sets with SETI. In: The first biennial conference on innovative data systems research (CIDR) (2003). http://www.cidrdb.org/cidr2003/program/p15.pdf
Tsatsanifos, G., Sacharidis, D., Sellis, T.: RIPPLE: a scalable framework for distributed processing of rank queries. In: EDBT (2014)
The apache cassandra project. http://cassandra.apache.org/
Cormode, G., Li, F., Yi, K.: Semantics of ranking queries for probabilistic data and expected ranks. In: ICDE (2009)
Born, M.: On the stability of crystal lattices. IX. Covariant theory of lattice deformations and the stability of some hexagonal lattices. In: Proceedings of the Cambridge Philosophical Society 38 (1942)

Acknowledgments

This work was supported by the 973 program with No 2012CB316205, a grant from the Obel Family Foundation, and National Science Foundation of China under Grant No. 61432006. The work was done in part when some of the authors visited SA Center for Big Data Research at Renmin University of China. The center is partially funded by the Chinese National “111” Project “Attracting International Talents in Data Engineering and Knowledge Engineering Research”.

Author information

Authors and Affiliations

Department of Computer Science, Aalborg University, Aalborg, Denmark
Xike Xie & Christian S. Jensen
School of Information, Renmin University of China, Beijing, China
Benjin Mei
Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China, Beijing, China
Jinchuan Chen & Xiaoyong Du

Corresponding author

Correspondence to Jinchuan Chen.

Appendices

Appendix 1: Algorithms

Appendix 2: Routing cost estimation

Lemma 1

The maximum routing cost of a three-dimensional torus of N nodes is approximately equal to $0.91 N^{\frac{1}{3}}$.

Proof

For any torus node n, it takes one hop to reach its six neighbors (first-order neighbors) and two hops to reach its 18 second-order neighbors. The number of i-th order neighbors $a_i$ can be represented by [40]:

$$\begin{aligned} a_i = 2+ 4i^2 \end{aligned}$$

Suppose the furthest node on the torus is m hops from node n. The total number of nodes N then equals the sum of the numbers of first- to m-th-order neighbors. We have:

$$\begin{aligned} \sum _{i=1}^{m} a_i = N \Rightarrow \sum _{i=1}^m \left( 2+ 4i^2\right) = N \Rightarrow m \approx \left( \frac{3}{4}N\right) ^{\frac{1}{3}} \end{aligned}$$

Thus, the maximum routing cost equals the distance to n’s m-th order neighbors, i.e., $m \approx \left( \frac{3}{4}\right) ^{\frac{1}{3}} N^{\frac{1}{3}} \approx 0.91N^{\frac{1}{3}}$ hops. $\square $
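The bound in Lemma 1 can be checked numerically. The following is a minimal sketch, not part of the original proof: it accumulates the neighbor counts $a_i = 2 + 4i^2$ until N nodes are covered and compares the resulting hop count with $(\frac{3}{4}N)^{1/3}$. The function names `neighbors` and `max_hops` are illustrative assumptions.

```python
def neighbors(i: int) -> int:
    """Number of i-th order neighbors, a_i = 2 + 4 i^2 (formula used in the proof)."""
    return 2 + 4 * i * i


def max_hops(n_nodes: int) -> int:
    """Smallest m such that orders 1..m together cover at least n_nodes nodes."""
    covered = 0
    m = 0
    while covered < n_nodes:
        m += 1
        covered += neighbors(m)
    return m


if __name__ == "__main__":
    # Compare the exact hop count with the closed-form approximation.
    for n in (10**3, 10**6, 10**9):
        approx = (0.75 * n) ** (1.0 / 3.0)
        print(n, max_hops(n), round(approx, 2))
```

For large N the two values should agree to within one hop, because the cubic term of the sum dominates.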

Lemma 2

The average routing cost of a three-dimensional torus of N nodes is approximately equal to $0.69 N^{\frac{1}{3}}$ hops.

Proof

Based on Lemma 1, the furthest node from n requires m hops. Then, the average number of hops is:

$$\begin{aligned} { avg\_number\_of\_hops }= & {} \frac{1}{N} \sum _{i=1}^{m} a_i \cdot i = \frac{1}{N} \sum _{i=1}^{m} \left( 2i + 4 i^3\right) \nonumber \\= & {} \frac{1}{N} \left( m^4+2m^3+2m^2+m\right) \nonumber \\\approx & {} 0.69 N^{\frac{1}{3}} \end{aligned}$$

(6)

$\square $
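As a rough numerical check of Lemma 2, the sketch below evaluates the two sums directly, using $a_i = 2 + 4i^2$, and reports the ratio of the average hop count to $N^{1/3}$. It is an illustration under the lemma's own counting model; the function name `average_hops` is an assumption.

```python
def average_hops(m: int) -> tuple[float, int]:
    """Return (average hop count, number of nodes N) for a maximum distance of m hops.

    Uses a_i = 2 + 4 i^2 nodes at distance i, as in the proof of Lemma 1.
    """
    n_nodes = sum(2 + 4 * i * i for i in range(1, m + 1))
    total_hops = sum(i * (2 + 4 * i * i) for i in range(1, m + 1))
    return total_hops / n_nodes, n_nodes


if __name__ == "__main__":
    for m in (10, 100, 1000):
        avg, n = average_hops(m)
        # The ratio should settle close to the constant stated in Lemma 2.
        print(m, n, round(avg / n ** (1.0 / 3.0), 4))
```

The ratio is noticeably smaller for very small tori, where the lower-order terms of both sums still matter.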

Appendix 3: Estimation of h

Let us consider the cost estimation on torus node $T_i$. After the range search $Q_i \oplus d_{{ max }} \oplus U_{{ max }}$ (Step 6 in Algorithm 2), we get a set $C_i$ of candidate trajectories with the average length $\overline{{\mathcal {T}}_c\cdot \varDelta t}=\frac{1}{|C_i|} \sum _{{\mathcal {T}} \in C_i} {\mathcal {T}}\cdot \varDelta t$. According to Definition 5, the cost of STNNQ depends on the number of trajectories at each snapshot. To estimate that, we first assume the trajectories are uniformly distributed in the spatiotemporal region $Q_i \oplus d_{{ max }} \oplus U_{{ max }}$.

$$\begin{aligned} { \#~of~objects~per~snapshot } = \frac{ \overline{{\mathcal {T}}_c\cdot \varDelta t} \cdot |C_i| }{|Q_i\cdot \varDelta t|} \end{aligned}$$

(7)

We define the density $\rho $ as the number of objects per snapshot divided by the area of the filtering bound, $\pi (d_{{ max }}+U_{{ max }})^2$.

Lemma 3

Assume a two-dimensional region S in the spatial domain ${\mathfrak {S}}$, where the points are uniformly distributed, and let $N(S) = m$ represent the fact that there are m points inside region S. The probability of $N(S) = m$ is given by:

$$\begin{aligned} P(N(S)=m) = \frac{(\rho |S|)^m e^{-\rho |S|}}{m!} \end{aligned}$$

(8)

Proof

The probability that m points out of n objects are in S is:

$$\begin{aligned} P(N(S)=m)= \left( {\begin{array}{c}n\\ m\end{array}}\right) \left( \frac{|S|}{|{\mathfrak {S}}|}\right) ^m \left( 1-\frac{|S|}{|{\mathfrak {S}}|}\right) ^{n-m} \end{aligned}$$

The limiting form of the binomial distribution, for large n and small $\frac{|S|}{|{\mathfrak {S}}|}$, is a Poisson distribution. Let $\rho = \frac{n}{|{\mathfrak {S}}|}$. The above equation becomes:

$$\begin{aligned} P(N(S)=m) = \frac{(\rho |S|)^me^{-\rho |S|}}{m!} \end{aligned}$$

Then, the probability that there is at least one point in S is:

$$\begin{aligned} P(N \ge 1)= & {} \sum _{i=1}^{\infty } P(N = i) = \sum _{i=1}^{\infty } \frac{(\rho |S|)^ie^{-\rho |S|}}{i!}\\= & {} 1 - e^{-\rho |S|} \end{aligned}$$

$\square $
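The Poisson limit used in the proof can be illustrated numerically. The sketch below compares the binomial probability with its Poisson approximation for a large n and a small region, and checks that the Poisson tail sums to $1 - e^{-\rho |S|}$. All function names and the example values are assumptions chosen for illustration.

```python
import math


def binomial_pmf(m: int, n: int, p: float) -> float:
    """P(exactly m of n uniformly placed points fall in S), with p = |S| / |domain|."""
    return math.comb(n, m) * p**m * (1.0 - p) ** (n - m)


def poisson_pmf(m: int, lam: float) -> float:
    """Poisson probability with mean lam = rho * |S|."""
    return lam**m * math.exp(-lam) / math.factorial(m)


def prob_at_least_one(rho: float, area: float) -> float:
    """P(N >= 1) = 1 - exp(-rho * |S|)."""
    return 1.0 - math.exp(-rho * area)


if __name__ == "__main__":
    n, domain_area, region_area = 10_000, 10_000.0, 2.0  # example values only
    rho = n / domain_area
    for m in range(6):
        b = binomial_pmf(m, n, region_area / domain_area)
        q = poisson_pmf(m, rho * region_area)
        print(m, round(b, 6), round(q, 6))
    print(prob_at_least_one(rho, region_area))
```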

It follows that the circular region S around the query point contains at least one object, and hence the nearest neighbor, with probability at least $P^*$ when $1 - e^{-\rho |S|} \ge P^*$. In our implementation, we set $P^*$ to 0.9, which is reasonably large for S to contain the nearest neighbor. Solving for the smallest such area gives:

$$\begin{aligned} |S| = -\frac{\ln \left( 1-P^*\right) }{\rho } \end{aligned}$$

(9)

The number of candidate objects per snapshot is estimated as:

$$\begin{aligned} h = \rho (|S \oplus U_{{ max }}|). \end{aligned}$$

(10)
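Equations (7), (9), and (10) can be chained into a single estimate of h. The sketch below is one possible reading rather than the authors' implementation: it assumes that S is a disk and that $|S \oplus U_{max}|$ is the area of that disk with its radius enlarged by $U_{max}$. The function and parameter names are hypothetical.

```python
import math


def estimate_h(avg_traj_len: float, num_candidates: int, query_len: float,
               d_max: float, u_max: float, p_star: float = 0.9) -> float:
    """Estimate the number of candidate objects per snapshot, h.

    avg_traj_len   : average time span of the candidate trajectories
    num_candidates : |C_i|, number of candidate trajectories
    query_len      : length of the query time interval, |Q_i . Delta t|
    """
    if not 0.0 < p_star < 1.0:
        raise ValueError("p_star must lie strictly between 0 and 1")
    # Eq. (7): objects per snapshot under the uniformity assumption.
    per_snapshot = avg_traj_len * num_candidates / query_len
    if per_snapshot <= 0.0:
        raise ValueError("no candidates, density is undefined")
    # Density: objects per snapshot over the area of the filtering bound.
    rho = per_snapshot / (math.pi * (d_max + u_max) ** 2)
    # Eq. (9): area of S that holds a neighbor with probability p_star.
    area_s = -math.log(1.0 - p_star) / rho
    # Assumption: S is a disk, and expanding it by u_max grows its radius by u_max.
    radius_s = math.sqrt(area_s / math.pi)
    expanded_area = math.pi * (radius_s + u_max) ** 2
    # Eq. (10): expected number of objects in the expanded region.
    return rho * expanded_area


if __name__ == "__main__":
    print(estimate_h(avg_traj_len=10.0, num_candidates=50, query_len=100.0,
                     d_max=5.0, u_max=1.0))
```

With $U_{max} = 0$ the estimate reduces to $-\ln (1-P^*)$, which is independent of the density.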

Appendix 4: Obtaining $d_{\mathrm{max}}$

In our system, we issue a series of range queries $Q_i \oplus d \oplus U_{{ max }}$ to obtain $d_{{ max }}$ incrementally, where $d=5, 10, 20\,\%, \cdots $ of the spatial domain size of torus node $T_i$. After collecting the candidate trajectory set $C_i$ with the range search parameterized by d, we test whether the union of these trajectories’ time spans covers $Q_i$’s $\varDelta t$, i.e., whether $\cup _{UT \in C_i}UT\cdot \varDelta t \supseteq Q_i\cdot \varDelta t$ holds. If it does, at least one object exists for each timestamp in $Q_i\cdot \varDelta t$, so the current d is taken as $d_{{ max }}$, which is sufficiently large not to miss any possible candidate trajectory. Otherwise, we increase d and repeat the process.
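The incremental search for $d_{max}$ can be sketched as follows. This is an illustration under stated assumptions, not the system's code: `range_search` is a hypothetical callback that returns the time spans of the candidate trajectories found for a given d, time is treated as continuous, and the schedule of fractions is supplied by the caller because the text only lists its first three values.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

Interval = Tuple[float, float]  # closed time interval (start, end)


def covers(spans: Iterable[Interval], query: Interval) -> bool:
    """True if the union of the spans contains the whole query interval."""
    start, end = query
    reach = start  # everything in [start, reach] is covered so far
    for lo, hi in sorted(spans):
        if hi < reach:
            continue  # span ends before the covered prefix, adds nothing
        if lo > reach:
            return False  # gap: later spans start even further right
        reach = hi
        if reach >= end:
            return True
    return False


def find_d_max(range_search: Callable[[float], Iterable[Interval]],
               query_span: Interval, domain_size: float,
               fractions: Sequence[float]) -> Optional[float]:
    """Return the first d whose candidates cover the query time span, else None."""
    for fraction in fractions:
        d = fraction * domain_size
        if covers(range_search(d), query_span):
            return d
    return None


if __name__ == "__main__":
    # Toy stand-in for the range search: a larger d finds more trajectories.
    def toy_search(d: float) -> list:
        spans = [(0.0, 4.0)]
        if d >= 9.9:
            spans.append((3.0, 8.0))
        if d >= 19.9:
            spans.append((7.0, 12.0))
        return spans

    print(find_d_max(toy_search, (1.0, 10.0), 100.0, (0.05, 0.10, 0.20)))
```

Returning `None` when the schedule is exhausted leaves the fallback policy to the caller, since the text does not specify one.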

About this article

Cite this article

Xie, X., Mei, B., Chen, J. et al. Elite: an elastic infrastructure for big spatiotemporal trajectories. The VLDB Journal 25, 473–493 (2016). https://doi.org/10.1007/s00778-016-0425-6

Received: 02 July 2015
Revised: 25 January 2016
Accepted: 29 January 2016
Published: 17 February 2016
Issue Date: August 2016
DOI: https://doi.org/10.1007/s00778-016-0425-6
