Augmented Lineage: Traceability of Data Analysis Including Complex UDFs

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12923)

Abstract

Data lineage allows information to be traced to its origin in data analysis by showing how the results were derived. Although many methods have been proposed to identify the source data from which analysis results are derived, analysis is becoming increasingly complex with regard to both the target (e.g., images, videos, and texts) and the technology (e.g., AI and machine learning). In such complex data analysis, simply showing the source data may not ensure traceability: analysts often need to know which parts of an image are relevant to the output and why a classifier made a decision. Recent studies have intensively investigated interpretability and explainability in the machine learning (ML) domain. Integrating these techniques into the lineage framework will greatly enhance the traceability of complex data analysis, including the basis for decisions. In this paper, we propose the concept of augmented lineage, an extended form of lineage, and an efficient method to derive the augmented lineage for complex data analysis. We express complex data analysis flows using relational operators combined with user-defined functions (UDFs), which can represent invocations of AI/ML models within the data analysis. We then present an algorithm to derive the augmented lineage for arbitrarily chosen tuples among the analysis results. We also experimentally demonstrate the efficiency of the proposed method.
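The idea of pairing source tuples with the reason behind a UDF's output can be illustrated with a minimal Python sketch. This is not the algorithm from the paper; it is a toy under stated assumptions. The names `Row`, `Prediction`, `classify`, `select_by_label`, and `augmented_lineage` are hypothetical, and the "classifier" is a trivial brightness rule standing in for an AI/ML model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Row:
    """A source tuple; `pixels` is a stand-in for real image data."""
    image_id: str
    pixels: Tuple[int, ...]


@dataclass(frozen=True)
class Prediction:
    """UDF output: a value plus the reason for it (e.g., a salient region)."""
    label: str
    reason: str


def classify(row: Row) -> Prediction:
    """Hypothetical UDF standing in for an ML classifier with an explainer."""
    brightest = max(range(len(row.pixels)), key=lambda i: row.pixels[i])
    value = row.pixels[brightest]
    label = "bright" if value > 200 else "dark"
    return Prediction(label, f"pixel {brightest} has value {value}")


def select_by_label(table: List[Row],
                    udf: Callable[[Row], Prediction],
                    wanted: str) -> List[Tuple[str, str]]:
    """Selection over the UDF result; emits (image_id, label) output tuples."""
    return [(r.image_id, udf(r).label) for r in table if udf(r).label == wanted]


def augmented_lineage(table: List[Row],
                      udf: Callable[[Row], Prediction],
                      output_tuple: Tuple[str, str]) -> List[Tuple[Row, str]]:
    """Source rows that produce `output_tuple`, each paired with the UDF's reason."""
    image_id, label = output_tuple
    traced = []
    for r in table:
        if r.image_id != image_id:
            continue
        prediction = udf(r)
        if prediction.label == label:
            traced.append((r, prediction.reason))
    return traced


images = [Row("a", (10, 250, 30)), Row("b", (5, 6, 7))]
results = select_by_label(images, classify, "bright")
lineage = augmented_lineage(images, classify, results[0])
```

Here plain lineage would return only the source row for image `a`; the sketch additionally returns the reason the UDF reported for that row, which is the kind of extra information the abstract describes.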

Notes

  1. If a table is referred to more than once, each reference is regarded as access to a different table.

  2. A UDF can also be used to model simple target-specific computation (e.g., computing an area). For simplicity of discussion, we focus on the use of UDFs for complex data analysis.

  3. If a reason is not needed (only the output value is needed), we can perform \(f\) in the non-reasoning mode, which produces no reason.
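The distinction in Note 3 between running \(f\) with and without reasoning might look like the following sketch. The `reasoning` flag, the keyword rule, and the return shape are illustrative assumptions, not the interface defined in the paper.

```python
from typing import Optional, Tuple


def f(text: str, reasoning: bool = True) -> Tuple[str, Optional[str]]:
    """Hypothetical UDF: a keyword-based sentiment rule standing in for a model.

    In reasoning mode it returns (value, reason); in non-reasoning mode it
    returns (value, None) and skips building the reason.
    """
    negative_words = {"bad", "poor", "broken"}
    hits = [w for w in text.lower().split() if w in negative_words]
    label = "negative" if hits else "positive"
    if not reasoning:
        return label, None
    reason = f"matched words: {hits}" if hits else "no negative words matched"
    return label, reason


value_only = f("The screen is broken", reasoning=False)
with_reason = f("The screen is broken")
```

With a real model the reasoning step is typically the expensive part, so skipping it when only the output value is needed avoids unnecessary work.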

Acknowledgment

This work was partly supported by JSPS KAKENHI Grant Number JP19H04114 and the Project Commissioned by New Energy and Industrial Technology Development Organization (JPNP20006).

Author information

Corresponding authors

Correspondence to Masaya Yamada or Hiroyuki Kitagawa.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Yamada, M., Kitagawa, H., Amagasa, T., Matono, A. (2021). Augmented Lineage: Traceability of Data Analysis Including Complex UDFs. In: Strauss, C., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2021. Lecture Notes in Computer Science, vol 12923. Springer, Cham. https://doi.org/10.1007/978-3-030-86472-9_6

  • DOI: https://doi.org/10.1007/978-3-030-86472-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86471-2

  • Online ISBN: 978-3-030-86472-9

  • eBook Packages: Computer Science, Computer Science (R0)
