
A Column-Level Data Lineage Processing System Based on Hive

Authors: Zehua Tan, Haihong E, Meina Song

ICBDT '20: Proceedings of the 3rd International Conference on Big Data Technologies

Pages 47 - 52

https://doi.org/10.1145/3422713.3422719

Published: 23 October 2020


Abstract

In big data settings, the data warehouse stores the business data of the entire enterprise. Data collected in the warehouse gives rise to new data sets through operations such as union, splitting, and transformation. The conversion relationships that arise during this data production process are called data lineage. Tracking data lineage is therefore an important part of data warehouse construction. However, existing open-source solutions handle this critical step with shortcomings such as high coupling, poor accuracy, and intrusiveness. This paper therefore designs a column-level data lineage processing system for the Hive data warehouse. The system analyzes the data lineage of Hive SQL locally and produces fine-grained, highly accurate lineage results while keeping the lineage function loosely coupled to the Hive data warehouse.
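To make the idea of column-level lineage concrete, the following is a minimal sketch and not the system described in the paper. It assumes lineage has already been extracted and is stored as a mapping from each target column to the source columns it was derived from. The table names, column names, and the helper `upstream_columns` are all hypothetical.

```python
# Hypothetical column-level lineage edges: target column -> source columns.
# All table and column names are invented for illustration only.
EDGES = {
    "dw.order_summary.total_amount": ["ods.orders.price", "ods.orders.quantity"],
    "dw.order_summary.customer_name": ["ods.customers.name"],
    "rpt.daily_revenue.revenue": ["dw.order_summary.total_amount"],
}


def upstream_columns(column, edges):
    """Return every column that `column` transitively derives from."""
    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        # Columns without an entry are treated as raw sources.
        for source in edges.get(current, []):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


if __name__ == "__main__":
    for name in sorted(upstream_columns("rpt.daily_revenue.revenue", EDGES)):
        print(name)
```

Tracing at column granularity, rather than table granularity, shows that the revenue column depends on `price` and `quantity` but not on the customer name column, even though both intermediate columns live in the same table.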

Index Terms

1. Software and its engineering
  1. Software notations and tools


Published In

ICBDT '20: Proceedings of the 3rd International Conference on Big Data Technologies

September 2020

250 pages

ISBN: 9781450387859

DOI: 10.1145/3422713


Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICBDT 2020

ICBDT 2020: 2020 3rd International Conference on Big Data Technologies

September 18 - 20, 2020

Qingdao, China
