Computing Data Lineage and Business Semantics for Data Warehouse

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 914)

Abstract

We present and validate a method, together with the underlying technologies, data structures and algorithms, to calculate, categorize and visualize component dependencies, data lineage and business semantics from database structures and queries, independently of the actual data in the data warehouse. The chosen approach is based on semantic techniques, probabilistic weight calculation and estimation of the impact of data in queries; an implemented rule system supports the calculation of the dependency graph from these estimates. We demonstrate a method for business semantics integration and ontology learning from data structures and schemas, combined with the query semantics captured by the dependency graph. Annotating technical assets with a business ontology provides meaning and a governance view for human and machine agents addressing various planning, automation and decision support problems. Data processing performance and business ontology integration are evaluated and analyzed over several real-life datasets.
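To make the idea of a weighted dependency graph concrete, the following is a minimal sketch and not the authors' implementation. It assumes column-level mappings have already been extracted from queries as (source, target, weight) triples, where the weight is a hypothetical estimate in (0, 1] of how strongly the source influences the target. The column names, the weights, the `build_graph` and `downstream_impact` helpers, and the choice to multiply weights along a path and keep the strongest path are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical column-level mappings extracted from queries:
# (source column, target column, estimated impact weight in (0, 1]).
EDGES = [
    ("stg.orders.amount", "dw.sales.revenue", 1.0),
    ("stg.orders.discount", "dw.sales.revenue", 0.5),
    ("dw.sales.revenue", "mart.kpi.total_revenue", 1.0),
    ("stg.customers.id", "dw.sales.customer_id", 1.0),
]


def build_graph(edges):
    """Build an adjacency map {source: {target: weight}}."""
    graph = defaultdict(dict)
    for source, target, weight in edges:
        # If several queries map the same pair, keep the strongest weight
        # (an assumption made for this sketch).
        graph[source][target] = max(weight, graph[source].get(target, 0.0))
    return graph


def downstream_impact(graph, start):
    """Return {node: strongest path weight} for nodes reachable from start.

    Path weight is the product of edge weights along the path. Because
    weights never exceed 1, a cycle cannot improve a path, so the
    traversal terminates.
    """
    best = {}
    stack = [(start, 1.0)]
    while stack:
        node, weight = stack.pop()
        for target, edge_weight in graph.get(node, {}).items():
            path_weight = weight * edge_weight
            if path_weight > best.get(target, 0.0):
                best[target] = path_weight
                stack.append((target, path_weight))
    return best


graph = build_graph(EDGES)
impact = downstream_impact(graph, "stg.orders.discount")
```

Running the same traversal on the reversed edges would give upstream lineage (where a column's data comes from) instead of downstream impact.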

Notes

  1. http://www.goldparser.org/

  2. http://www.dlineage.com/

Acknowledgements

This research has been supported by the EU through the European Regional Development Fund.

Author information

Corresponding author

Correspondence to Kalle Tomingas.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Tomingas, K., Järv, P., Tammet, T. (2019). Computing Data Lineage and Business Semantics for Data Warehouse. In: Fred, A., Dietz, J., Aveiro, D., Liu, K., Bernardino, J., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2016. Communications in Computer and Information Science, vol 914. Springer, Cham. https://doi.org/10.1007/978-3-319-99701-8_5

  • DOI: https://doi.org/10.1007/978-3-319-99701-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99700-1

  • Online ISBN: 978-3-319-99701-8

  • eBook Packages: Computer Science (R0)
