Abstract
As the volumes of spatiotemporal trajectory data continue to grow at a rapid pace; a new generation of data management techniques is needed in order to be able to utilize these data to provide a range of data-driven services, including geographic-type services. Key challenges posed by spatiotemporal data include the massive data volumes, the high velocity with which the data are captured, the need for interactive response times, and the inherent inaccuracy of the data. We propose an infrastructure, Elite, that leverages peer-to-peer and parallel computing techniques to address these challenges. The infrastructure offers efficient, parallel update and query processing by organizing the data into a layered index structure that is logically centralized, but physically distributed among computing nodes. The infrastructure is elastic with respect to storage, meaning that it adapts to fluctuations in the storage volume, and with respect to computation, meaning that the degree of parallelism can be adapted to best match the computational requirements. Further, the infrastructure offers advanced functionality, including probabilistic simulations, for contending with the inaccuracy of the underlying data in query processing. Extensive empirical studies offer insight into properties of the infrastructure and indicate that it meets its design goals, thus enabling the effective management of big spatiotemporal data.
Similar content being viewed by others
Notes
The minimum capacity is below half of the maximum capacity. Otherwise, the condensed node may exceed the maximum capacity.
Details on the *node are covered in Sect. 4.3.1.
From experiments, we obtained \(\alpha =5e^{-4}\), \(\beta =1e^{-6}\), and \(\gamma =1e^{-6}\).
There exist different semantics for top-k queries over uncertain data, such as U-TopK, U-kRanks, Expected scores, and Expected ranks. Among them, the Expected score and Expected rank might be best ones in terms of properties such as Containment and Unique-Rank [39].
References
Ceikute, V., Jensen, C.S.: Vehicle routing with user-generated trajectory data. In: MDM (2015)
Yang, B., Guo, C., Ma, Y., Jensen, C.S.: Toward personalized, context-aware routing. VLDB J. 24(2), 297–318 (2015)
Dai, J., Yang, B., Guo, C., Jensen, C.S.: Efficient and accurate path cost estimation using trajectory data. In: CoRR abs/1510.02886 (2015)
Stougiannis, A., Pavlovic, M., Tauheed, F., et al.: Data-driven neuroscience: enabling breakthroughs via innovative data management. In: SIGMOD (2013)
Manyika, J., Chui, M.: Big data: the next frontier for innovation, competition, and productivity. In: McKinsey Global Institute (2011)
Cheng, R., Kalashnikov, D.V., Prabhakar, S.: Evaluating probabilistic queries over imprecise data. In: SIGMOD (2003)
Cheng, R., Kalashnikov, D.V., Prabhakar, S.: Querying imprecise data in moving object environments. TKDE 16(9), 1112–1127 (2004)
Trajcevski, G., Tamassia, R., Ding, H., et al.: Continuous probabilistic nearest-neighbor queries for uncertain trajectories. In: EDBT (2009)
Civilis, A., Jensen, C.S., Pakalnis, S.: Techniques for efficient road-network-based tracking of moving objects. TKDE 17(5), 698–712 (2005)
Jensen, C.S., Pakalnis, S.: Trax - real-world tracking of moving objects. In: VLDB (2007)
Eldawy, A., Li, Y., Mokbel, M.F., Janardan, R.: \(\text{ CG }\_\text{ Hadoop }\): computational geometry in mapreduce. In: GIS (2013)
Eldawy, A., Mokbel, M.F.: A demonstration of SpatialHadoop: an efficient MapReduce framework for spatial data. In: VLDB (2013)
Aji, A., Wang, F., Vo, H., et al.: Hadoop-GIS: a high performance spatial data warehousing system over mapreduce. In: VLDB (2013)
Nishimura, S., Das, S., Agrawal, D., El Abbadi, A.: MD-HBase: a scalable multi-dimensional data infrastructure for location aware services. In: MDM (2011)
Wang, J., Wu, S., Gao, H., et al.: Indexing multi-dimensional data in a cloud system. In: SIGMOD (2010)
Tsatsanifos, G., Sacharidis, D., Sellis, T.: Index-based query processing on distributed multidimensional data. GeoInformatica 17(3), 489–519 (2013)
Ratnasamy, S., Francis, P., Handley, M., et al.: A scalable content-addressable network. In: SIGCOMM (2001)
Wei, L.Y., Zheng, Y., Peng, W.C.: Constructing popular routes from uncertain trajectories. In: KDD (2012)
Pei, T., Zhou, C., Zhu, A.-X, et al.: Windowed nearest neighbour method for mining spatio-temporal clusters in the presence of noise. In: International Journal of Geographical Information Science (2010)
Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. In: VLDB (2004)
Pfoser, D., Jensen, C.S.: Capturing the uncertainty of moving-objects representations. In: SSDBM (1999)
Lian, X, Chen, L.: Monochromatic and bichromatic reverse skyline search over uncertain databases. In: SIGMOD (2008)
Kriegel, H.P., Kunath, P., Renz, M.: Probabilistic nearest-neighbor query on uncertain objects. In: DASFAA (2007)
Pugh, W.: Concurrent maintenance of lists. Technical report, Dept. of Computer Science, University of Maryland (1990)
Gargantini, I.: An effective way to represent octrees. Commun. ACM 25(12), 905–910 (1982)
Sidlauskas, D., Saltenis, S., Christiansen, C.W., et al.: Trees or grids?: indexing moving objects in main memory. In: GIS (2009)
Sidlauskas, D., Saltenis, S., Jensen, C.S.: Processing of extreme moving-object update and query workloads in main memory. VLDB J. 23(5), 817–841 (2014)
Cheng, R., Chen, J., Mokbel, M., Chow, C.Y.: Probabilistic verifiers: evaluating constrained nearest-neighbor queries over uncertain data. In: ICDE (2008)
You, S., Zhang, J., Gruenwald, L.: Large-scale spatial join query processing in cloud. In: ICDE Workshops (2015)
Trajcevski, G., Wolfson, O., Zhang, F., Chamberlain, S.: The geometry of uncertainty in moving object databases. In: EDBT (2002)
Zheng, K., Trajcevski, G., Zhou, X., Scheuermann, P.: Probabilistic range queries for uncertain trajectories on road networks. In: EDBT (2011)
Zheng, K., Fung, G.P.C., Zhou, X.: K-nearest neighbor search for fuzzy objects. In: SIGMOD (2010)
Xie, X., Yiu, M.L., Cheng, R., Lu, H.: Scalable evaluation of trajectory queries over imprecise location data. TKDE 26(8), 2029–2044 (2014)
Tao, Y., Papadias, D.: MV3R-Tree: a spatio-temporal access method for timestamp and interval queries. In: VLDB (2001)
Pfoster, D., Jensen, C.S., Theodoridis, Y.: Novel approaches to the indexing of moving object trajectories. In: VLDB (2000)
Chakka, V.P., Everspaugh, A.C., Patel, J.M., et al.: Indexing large trajectory data sets with SETI. In: The first biennial conference on innovative data systems research (CIDR) (2003). http://www.cidrdb.org/cidr2003/program/p15.pdf
Tsatsanifos, G., Sacharidis, D., Sellis, T.: RIPPLE: a scalable framework for distributed processing of rank queries. In: EDBT (2014)
The apache cassandra project. http://cassandra.apache.org/
Cormode, G., Li, F., Yi, K.: Semantics of ranking queries for probabilistic data and expected ranks. In: ICDE (2009)
Born, M.: On the stability of crystal lattices. IX. Covariant theory of lattice deformations and the stability of some hexagonal lattices. In: Proceedings of the Cambridge Philosophical Society 38 (1942)
Acknowledgments
This work was supported by the 973 program with No 2012CB316205, a grant from the Obel Family Foundation, and National Science Foundation of China under Grant No. 61432006. The work was done in part when some of the authors visited SA Center for Big Data Research at Renmin University of China. The center is partially funded by the Chinese National “111” Project “Attracting International Talents in Data Engineering and Knowledge Engineering Research”.
Author information
Authors and Affiliations
Corresponding author
Appendices
Appendix 1: Algorithms
Appendix 2: Routing cost estimation
Lemma 1
The maximum routing cost of a three-dimensional torus of N nodes is approximately equal to \(0.91 N^{\frac{1}{3}}\).
Proof
For any torus node n, it takes one hop to reach its six neighbors (first-order neighbors) and two hops to reach its 18 second order neighbors. The number of i-th order neighbors \(a^i\) can be represented by [40]:
Suppose the furthest node on the torus takes m hops from node n. The total number of nodes N visited equals the summation of the number of 1-th to m-th order neighbors. We have:
Thus, the maximum routing cost equals to the distance to n’s m-th order neighbor, \(0.91N^{\frac{1}{3}}\). \(\square \)
Lemma 2
The average routing cost of a three-dimensional torus of N nodes is approximately equal to \(0.69 N^{\frac{1}{3}}\) hops.
Proof
Based on Lemma 1, the furthest node from n requires m hops. Then, the average number of hops is:
\(\square \)
Appendix 3: Estimation of h
Let us consider the cost estimation on torus node \(T_i\). After the range search \(Q_i \oplus d_{{ max }} \oplus U_{{ max }}\) (Step 6 in Algorithm 2), we get a set \(C_i\) of candidate trajectories with the average length \(\overline{{\mathcal {T}}_c\cdot \varDelta t}=\frac{1}{|C_i|} \sum _{{\mathcal {T}} \in C_i} {\mathcal {T}}\cdot \varDelta t\). According to Definition 5, the cost of STNNQ depends on the number of trajectories at each snapshot. To estimate that, we first assume the trajectories are uniformly distributed in the spatiotemporal region \(Q_i \oplus d_{{ max }} \oplus U_{{ max }}\).
We define the density \(\rho \), as the number of objects per snapshot divided by the area of the filtering bound \(\pi (d_{{ max }}+U_{{ max }})^2\).
Lemma 3
Assume a two-dimensional region S in the spatial domain \({\mathfrak {S}}\), where the points are uniformly distributed, and let \(N(S) = m\) represent the fact that there are m points inside region S. The probability of \(N(S) = m\) is given by:
Proof
The probability that m points out of n objects are in S is:
The extreme form of the binomial distribution is a Poisson distribution. Let \(\rho = \frac{n}{|{\mathfrak {S}}|}\). The above equation becomes:
Then, the probability that there is at least one point in S is:
\(\square \)
Then, we can infer that there is a nearest neighbor within the circular region S to the query point with a probability higher than \(P^*\). In our implementation, we set \(P^*\) to 0.9, which is reasonably large for S to contain the nearest neighbor.
The number of candidate objects per snapshot is estimated as:
Appendix 4: Obtaining \(d_{\mathrm{max}}\)
In our system, we try a series of range queries \(Q_i \oplus d \oplus U_{{ max }}\) to incrementally obtain \(d_{{ max }}\), where \(d=5, 10, 20\,\%, \cdots \) of torus node \(T_i\)’s spatial domain size. Upon collecting the candidate trajectory set \(C_i\) by the range search parameterized with d, we test whether the union of these trajectories’ time spans can cover \(Q_i\)’s \(\varDelta t\), i.e., to decide whether \(\cup _{UT \in C}UT\cdot \varDelta t \supseteq Q_i\cdot \varDelta t\) is true. If true, it means that there always exists at least an object for each timestamp in \(Q_i\cdot \varDelta t\). Therefore, current d is taken as \(d_{{ max }}\), which is sufficiently large for not missing any possible candidate trajectories. Otherwise, we need to increase d incrementally and repeat the aforementioned process.
Rights and permissions
About this article
Cite this article
Xie, X., Mei, B., Chen, J. et al. Elite: an elastic infrastructure for big spatiotemporal trajectories. The VLDB Journal 25, 473–493 (2016). https://doi.org/10.1007/s00778-016-0425-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-016-0425-6