ABSTRACT
In big data settings, the data warehouse stores the business data of the entire enterprise. Data collected in the warehouse gives rise to new data sets through operations such as union, splitting, and transformation. The conversion relationships that arise among data during this production process are called data lineage. Tracking data lineage is therefore an important part of building a data warehouse. However, existing open-source solutions handle this critical step with shortcomings such as high coupling, poor accuracy, and intrusiveness. This paper therefore designs a column-level data lineage processing system for the Hive data warehouse. The system analyzes the data lineage of Hive SQL locally and produces fine-grained, highly accurate lineage results while keeping the coupling between the data lineage function and the Hive data warehouse low.
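To make the idea of column-level lineage concrete, the following is a minimal sketch and not the system described in this paper. It assumes lineage has already been extracted into a mapping from each derived column to the columns it was computed from. All table and column names, and the helpers `upstream_columns` and `root_sources`, are hypothetical and chosen only for illustration.

```python
# Hypothetical column-level lineage edges: derived column -> source columns.
# Table and column names are invented for illustration only.
LINEAGE = {
    "dw.order_summary.total_amount": ["ods.orders.price", "ods.orders.quantity"],
    "dw.order_summary.customer_name": ["ods.customers.name"],
    "rpt.top_customers.customer_name": ["dw.order_summary.customer_name"],
    "rpt.top_customers.spend": ["dw.order_summary.total_amount"],
}


def upstream_columns(column, lineage):
    """Return every column that `column` depends on, directly or transitively."""
    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        for source in lineage.get(current, []):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


def root_sources(column, lineage):
    """Return only the upstream columns that are not themselves derived."""
    return {c for c in upstream_columns(column, lineage) if c not in lineage}


if __name__ == "__main__":
    print(sorted(upstream_columns("rpt.top_customers.spend", LINEAGE)))
    print(sorted(root_sources("rpt.top_customers.spend", LINEAGE)))
```

Tracing at the column level rather than the table level is what lets a query such as "which raw fields feed this report metric?" return only the relevant source columns instead of every column of every upstream table.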
Index Terms
- A Column-Level Data Lineage Processing System Based on Hive