ABSTRACT
The rapid proliferation of GPS-enabled devices has led to immense amounts of trajectory data being collected and analyzed. To provide insight into these datasets, a number of spatio-temporal queries need to be executed efficiently and at scale. One important query is the Query by Path, which, given a sequence of road segments and a time interval, retrieves all trajectories that passed through those road segments within that interval. The Query by Path finds application in many areas, including traffic management, transportation planning, and fleet monitoring.
In this paper, we develop an approach to partition and distribute trajectories across a cluster and to execute queries by path at scale. At the center of our approach is partitioning the entire dataset and indexing each partition with a Trie. We develop a basic set of partitioning approaches and show that each can be rendered inefficient by skew in the dataset. We consequently propose a HYbrid PartitiOning algorithm (HYPO) that performs robustly in the face of skew. We also provide cost models to configure HYPO. Finally, we assess its performance extensively using both real and synthetic datasets to demonstrate that it scales well in the face of skew.
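To make the query and the per-partition Trie concrete, the following is a minimal single-machine sketch, not the HYPO algorithm itself. It assumes each trajectory is a list of (segment id, entry timestamp) pairs, that a match requires the queried segments to be traversed consecutively, and that "within the interval" means the entry times of the first and last queried segments both fall inside it. The names PathTrie, build_partitions, and query_by_path are hypothetical, and the modulo partitioning by trajectory id stands in for one naive partitioning scheme only.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Child nodes keyed by road-segment id.
    children: dict = field(default_factory=dict)
    # (trajectory id, entry time of first segment, entry time of this segment)
    visits: list = field(default_factory=list)


class PathTrie:
    """Trie over road-segment sequences (illustrative, not the paper's index)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, traj_id, points):
        """points: list of (segment_id, timestamp) in travel order."""
        # Index every suffix so a queried path may start at any segment.
        for start in range(len(points)):
            node = self.root
            start_time = points[start][1]
            for segment, ts in points[start:]:
                node = node.children.setdefault(segment, TrieNode())
                node.visits.append((traj_id, start_time, ts))

    def query(self, path, t_from, t_to):
        """Return ids of trajectories that follow `path` within [t_from, t_to]."""
        node = self.root
        for segment in path:
            node = node.children.get(segment)
            if node is None:
                return set()
        return {
            tid
            for tid, t_in, t_out in node.visits
            if t_from <= t_in and t_out <= t_to
        }


def build_partitions(trajectories, num_partitions):
    """Assumed naive scheme: assign each trajectory by id modulo partition count."""
    tries = [PathTrie() for _ in range(num_partitions)]
    for traj_id, points in trajectories.items():
        tries[traj_id % num_partitions].insert(traj_id, points)
    return tries


def query_by_path(tries, path, t_from, t_to):
    """Fan the query out to every partition and union the partial results."""
    result = set()
    for trie in tries:
        result |= trie.query(path, t_from, t_to)
    return result


# Made-up sample data for illustration only.
trajectories = {
    1: [("a", 10), ("b", 20), ("c", 30)],
    2: [("b", 15), ("c", 25), ("d", 35)],
    3: [("a", 50), ("b", 60), ("c", 70)],
}
tries = build_partitions(trajectories, num_partitions=2)
early_bc = query_by_path(tries, ["b", "c"], 0, 40)
all_abc = query_by_path(tries, ["a", "b", "c"], 0, 100)
reverse = query_by_path(tries, ["c", "b"], 0, 100)
```

In this sketch every query must visit every partition, and the partition sizes depend entirely on how trajectories happen to be distributed, which hints at why a simple scheme can become inefficient when the data is skewed.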
Index Terms
- HYPO: skew-resilient partitioning for trajectory datasets