Augmented Lineage: Traceability of Data Analysis Including Complex UDFs

Yamada, Masaya; Kitagawa, Hiroyuki; Amagasa, Toshiyuki; Matono, Akiyoshi

doi:10.1007/978-3-030-86472-9_6

Conference paper
First Online: 31 August 2021

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12923)

Abstract

Data lineage allows information to be traced to its origin in data analysis by showing how the results were derived. Although many methods have been proposed to identify the source data from which the analysis results are derived, analysis is becoming increasingly complex with regard to both the target (e.g., images, videos, and texts) and the technology (e.g., AI and machine learning). In such complex data analysis, simply showing the source data may not ensure traceability. Analysts often need to know which parts of images are relevant to the output and why the classifier made a decision. Recent studies have intensively investigated interpretability and explainability in the machine learning (ML) domain. Integrating these techniques into the lineage framework will greatly enhance the traceability of complex data analysis, including the basis for decisions. In this paper, we propose the concept of augmented lineage, which is an extended lineage, and an efficient method to derive the augmented lineage for complex data analysis. We express complex data analysis flows using relational operators combined with user-defined functions (UDFs). UDFs can represent invocations of AI/ML models within the data analysis. We then present an algorithm to derive the augmented lineage for arbitrarily chosen tuples among the analysis results. We also experimentally demonstrate the efficiency of the proposed method.
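The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the table `IMAGES`, the stand-in model `classify`, the helper `explain`, and the functions `run_analysis` and `augmented_lineage` are all hypothetical names introduced here for illustration. The sketch assumes a flow that applies one UDF per tuple and then selects on the UDF output, and it treats the augmented lineage of a chosen result tuple as the source tuple together with a reason for the UDF's decision.

```python
# Hypothetical toy "image" table: (image_id, pixel intensities).
IMAGES = [
    ("img1", [0.1, 0.9, 0.2]),
    ("img2", [0.3, 0.2, 0.1]),
    ("img3", [0.8, 0.1, 0.7]),
]

THRESHOLD = 0.5  # assumed decision threshold of the stand-in model


def classify(pixels):
    """Stand-in for an AI/ML model invoked as a UDF: returns a label only."""
    return "bright" if any(p > THRESHOLD for p in pixels) else "dark"


def explain(pixels):
    """Stand-in for an explanation technique: indices of the pixels
    that triggered the 'bright' decision (a crude saliency list)."""
    return [i for i, p in enumerate(pixels) if p > THRESHOLD]


def run_analysis(images):
    """Analysis flow: apply the UDF to each tuple, then select 'bright'."""
    return [(image_id, label)
            for image_id, pixels in images
            for label in [classify(pixels)]
            if label == "bright"]


def augmented_lineage(result_tuple, images):
    """For one chosen result tuple, return its source tuple plus a reason.
    The reason is computed only here, for the tuple the analyst picked."""
    image_id, _label = result_tuple
    for src_id, pixels in images:
        if src_id == image_id:
            return {"source": (src_id, pixels), "reason": explain(pixels)}
    raise KeyError(image_id)


results = run_analysis(IMAGES)
lineage = augmented_lineage(results[1], IMAGES)
```

In this sketch the explanation is computed only for the tuple the analyst selects, which mirrors the abstract's emphasis on deriving augmented lineage for arbitrarily chosen result tuples rather than for every result.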

Notes

1. If a table is referred to more than once, each reference is regarded as access to a different table.
2. UDFs can also be used to model simple target-specific computations (e.g., computing an area). For simplicity of discussion, we focus on the use of UDFs for complex data analysis.
3. If a reason is not needed (only the output value is needed), we can perform \(f\) in the non-reasoning mode, which produces no reason.
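Note 3 can be pictured with a small sketch of a UDF that runs in two modes. Everything here is an assumption made for illustration: the function name `f` follows the note, but the `reasoning` flag, the keyword set `POSITIVE`, and the keyword-matching logic are hypothetical stand-ins for a real model and its explanation method.

```python
from typing import List, Optional, Tuple

# Hypothetical keyword list standing in for a trained text classifier.
POSITIVE = {"good", "great", "excellent"}


def f(text: str, reasoning: bool = True) -> Tuple[str, Optional[List[str]]]:
    """Return (output value, reason).

    In reasoning mode the reason is the list of matched keywords.
    In non-reasoning mode only the output value is produced and the
    reason is None.
    """
    hits = [w for w in text.lower().split() if w in POSITIVE]
    value = "positive" if hits else "other"
    return value, (hits if reasoning else None)


value_only = f("A great and good day", reasoning=False)
with_reason = f("A great and good day")
```

In this toy the reason is nearly free to compute. With a real model, the explanation step is typically much more expensive than the prediction itself, which is when skipping it in the non-reasoning mode would pay off.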

Acknowledgment

This work was partly supported by JSPS KAKENHI Grant Number JP19H04114 and the Project Commissioned by New Energy and Industrial Technology Development Organization (JPNP20006).

Author information

Authors and Affiliations

University of Tsukuba, Tsukuba, Ibaraki, Japan
Masaya Yamada, Hiroyuki Kitagawa & Toshiyuki Amagasa
National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo, Japan
Masaya Yamada, Hiroyuki Kitagawa & Akiyoshi Matono

Corresponding authors

Correspondence to Masaya Yamada or Hiroyuki Kitagawa.

Editor information

Editors and Affiliations

University of Vienna, Vienna, Austria
Christine Strauss
Johannes Kepler University of Linz, Linz, Oberösterreich, Austria
Gabriele Kotsis
Vienna University of Technology, Vienna, Austria
A Min Tjoa
Johannes Kepler University of Linz, Linz, Austria
Ismail Khalil

About this paper

Cite this paper

Yamada, M., Kitagawa, H., Amagasa, T., Matono, A. (2021). Augmented Lineage: Traceability of Data Analysis Including Complex UDFs. In: Strauss, C., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2021. Lecture Notes in Computer Science, vol 12923. Springer, Cham. https://doi.org/10.1007/978-3-030-86472-9_6

DOI: https://doi.org/10.1007/978-3-030-86472-9_6
Published: 31 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86471-2
Online ISBN: 978-3-030-86472-9
eBook Packages: Computer Science, Computer Science (R0)