DOI: 10.1145/3422713.3422719
research-article

A Column-Level Data Lineage Processing System Based on Hive

Published: 23 October 2020

ABSTRACT

In big data settings, the data warehouse stores the business data of the entire enterprise. Data collected in the warehouse gives rise to new data sets through operations such as union, splitting, and transformation. The conversion relationships that arise among data during this production process are called data lineage. Tracking data lineage is therefore an important part of data warehouse construction. However, existing open-source solutions handle this critical step with shortcomings such as high coupling, poor accuracy, and intrusiveness. This paper therefore designs a column-level data lineage processing system based on the Hive data warehouse. The system can analyze the data lineage of Hive SQL data locally, and it produces fine-grained, highly accurate lineage results while keeping the coupling between the data lineage function and the Hive data warehouse low.
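
To make the notion of column-level lineage concrete, the sketch below records, for each derived column, the columns it was computed from, and then walks those links to find every upstream column. It is a minimal illustration only and not the system described in the paper: the class `ColumnLineage`, its methods, and the table and column names (`ods.orders`, `dw.orders_clean`, `dm.daily_sales`) are all hypothetical, and the lineage edges are entered by hand rather than extracted from Hive SQL.

```python
from collections import defaultdict


class ColumnLineage:
    """Toy column-level lineage graph (hypothetical, for illustration).

    Each edge points from a derived column to the columns it was
    directly computed from.
    """

    def __init__(self):
        self._sources = defaultdict(set)

    def add(self, target, sources):
        """Record that `target` is derived directly from `sources`."""
        self._sources[target].update(sources)

    def upstream(self, column):
        """Return every column that `column` depends on, directly or transitively."""
        seen = set()
        stack = [column]
        while stack:
            current = stack.pop()
            for src in self._sources.get(current, ()):
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


lineage = ColumnLineage()

# Assumed statement:
#   INSERT INTO dw.orders_clean
#   SELECT o.id, o.price * o.qty AS amount FROM ods.orders o
lineage.add("dw.orders_clean.id", ["ods.orders.id"])
lineage.add("dw.orders_clean.amount", ["ods.orders.price", "ods.orders.qty"])

# Assumed statement:
#   INSERT INTO dm.daily_sales
#   SELECT SUM(amount) AS total FROM dw.orders_clean
lineage.add("dm.daily_sales.total", ["dw.orders_clean.amount"])

print(sorted(lineage.upstream("dm.daily_sales.total")))
# ['dw.orders_clean.amount', 'ods.orders.price', 'ods.orders.qty']
```

Tracking at the column level rather than the table level is what lets the query above report that `dm.daily_sales.total` depends on `price` and `qty` but not on `ods.orders.id`.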

    • Published in

      ICBDT '20: Proceedings of the 3rd International Conference on Big Data Technologies
      September 2020
      250 pages
      ISBN: 9781450387859
      DOI: 10.1145/3422713

      Copyright © 2020 ACM

      Publisher: Association for Computing Machinery, New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited