Abstract
ID-associated data are sequences of entries, and each entry is semantically associated with a unique ID. Examples are user IDs in user behaviour logs of mobile applications and device IDs in sensor records of self-driving cars. Nowadays, many big data applications generate such types of ID-associated data at high speed, and most queries over them are ID-centric (on specific IDs and ranges of time). To generate valuable insights from such data timely, the system needs to ingest high volumes of them with low latency, and support real-time analysis over them efficiently. In this paper, we introduce a system prototype designed for this goal. The system designed a parallel ingestion pipeline and a lightweight indexing scheme for the fast ingestion and efficient analysis. Besides, a fiber partitioning method is utilized to achieve dynamic scalability. For better integration with Hadoop ecosystem, the prototype is implemented based on open source projects, including Kafka and Presto.
This work is supported by Science and Technology Planning Project of Guangdong under grant No.2015B010131015, 863 key project under grant No.2015AA015307, and the National Science Foundation of China under grants No.61472426, U1711261, 61432006.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
select sum(quantity), sum(totalprice), min(discount), max(discount), avg(extendedprice), count(*) from test where custkey = k and \(generation\_time > t1\) and \(generation\_time < t2\) order by linestatus.
- 2.
References
Apache Hive (2011). http://hive.apache.org
Spark SQL: relational data processing in spark (2015). http://spark.apache.org/sql/
Apache ORC (2018). https://orc.apache.org
Apache Parquet (2018). https://parquet.apache.org
Gormley, C., Tong, Z.: ElasticSearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine. O’Reilly Media Inc, Sebastopol (2015)
Karger, D., et al.: Web caching with consistent hashing. Comput. Netw. 31(11–16), 1203–1213 (1999)
Kreps, J., Narkhede, N., Rao, J., et al.: Kafka: a distributed messaging system for log processing. In: Proceedings of the NetDB, pp. 1–7 (2011)
Traverso, M.: Presto: interacting with petabytes of data at facebook (2013). Accessed 4 Feb 2014
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Jin, G., Wang, Y., Qin, X., Chen, Y., Du, X. (2018). Towards Real-Time Analysis of ID-Associated Data. In: Woo, C., Lu, J., Li, Z., Ling, T., Li, G., Lee, M. (eds) Advances in Conceptual Modeling. ER 2018. Lecture Notes in Computer Science(), vol 11158. Springer, Cham. https://doi.org/10.1007/978-3-030-01391-2_6
Download citation
DOI: https://doi.org/10.1007/978-3-030-01391-2_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01390-5
Online ISBN: 978-3-030-01391-2
eBook Packages: Computer ScienceComputer Science (R0)