Simplifying Extract-Transform-Load for Ranked Hierarchical Trees via Mapping Specifications

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 346)

Abstract

A popular approach to integrating heterogeneous data sources is to Extract, Transform and Load (ETL) data from disparate sources into a consolidated data store while addressing integration challenges that include, but are not limited to, structural differences between the source and target schemas, semantic differences in their vocabularies, and data encoding. This work focuses on the integration of tree-like hierarchical data, that is, information that, when modeled as a relational schema, can take the shape of a flat schema, a self-referential schema, or a hybrid schema. Examples include evolutionary taxonomies, geological time scales, and organizational charts. Given the observed complexity of developing ETL processes for this particular but common type of data, our work aims to reduce the time and effort required to map and transform it. Our research automates and simplifies all possible transformations involving ranked self-referential and flat representations by (a) proposing MSL+, an extension to IBM’s Mapping Specification Language (MSL), to succinctly express the mapping between schemas while hiding the complexity of the actual transformation implementation from the user, and (b) implementing a transformation component, called Tree Transformer (TT), for the Talend open-source ETL platform. We evaluated MSL+ and TT in the context of biodiversity data integration, where this class of transformations is a recurring pattern. We demonstrate the effectiveness of MSL+ in terms of development time savings, as well as a 2- to 25-fold improvement in transformation time achieved by TT compared to existing implementations and to Talend built-in components.
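To make the two representations named in the abstract concrete, the following is a minimal Python sketch of converting a ranked self-referential table into a flat one and back. It is not the authors' Tree Transformer and does not use MSL+ syntax. The row layout `(id, parent_id, rank, name)`, the function names `flatten` and `unflatten`, and the sample taxonomy are all assumptions introduced for illustration. The sketch also assumes the input is a proper tree (no cycles) and that each rank appears at most once on any root-to-leaf path.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical self-referential row: (id, parent_id, rank, name).
Node = Tuple[int, Optional[int], str, str]
FlatRow = Dict[str, Optional[str]]


def flatten(nodes: List[Node], ranks: List[str]) -> List[FlatRow]:
    """Self-referential -> flat: one row per leaf, one column per rank."""
    by_id = {node[0]: node for node in nodes}
    parent_ids = {node[1] for node in nodes if node[1] is not None}
    rows: List[FlatRow] = []
    for node in nodes:
        if node[0] in parent_ids:
            continue  # only leaves start a flat row
        row: FlatRow = {rank: None for rank in ranks}
        current: Optional[Node] = node
        while current is not None:  # walk up to the root (assumes no cycles)
            row[current[2]] = current[3]
            current = by_id.get(current[1]) if current[1] is not None else None
        rows.append(row)
    return rows


def unflatten(rows: List[FlatRow], ranks: List[str]) -> List[Node]:
    """Flat -> self-referential: ancestors shared by several rows are emitted once."""
    ids: Dict[Tuple[Optional[int], str, str], int] = {}
    nodes: List[Node] = []
    for row in rows:
        parent_id: Optional[int] = None
        for rank in ranks:  # ranks are ordered from root to leaf
            name = row.get(rank)
            if name is None:
                continue  # rank missing in this row; link to the next filled rank
            key = (parent_id, rank, name)
            if key not in ids:
                ids[key] = len(ids) + 1
                nodes.append((ids[key], parent_id, rank, name))
            parent_id = ids[key]
    return nodes


# Illustrative data only, not taken from the paper.
RANKS = ["kingdom", "family", "genus"]
TREE: List[Node] = [
    (1, None, "kingdom", "Animalia"),
    (2, 1, "family", "Felidae"),
    (3, 2, "genus", "Felis"),
    (4, 2, "genus", "Panthera"),
]

flat = flatten(TREE, RANKS)
rebuilt = unflatten(flat, RANKS)
```

With this sample, `flat` holds one row per genus with the kingdom and family repeated in each, and `unflatten` recovers a single shared `Felidae` node. A hybrid schema, which the abstract also mentions, would mix both layouts and is not covered by this sketch.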

Author information

Corresponding author

Correspondence to Sarfaraz Soomro.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Soomro, S., Matsunaga, A., Fortes, J.A.B. (2015). Simplifying Extract-Transform-Load for Ranked Hierarchical Trees via Mapping Specifications. In: Bouabana-Tebibel, T., Rubin, S. (eds) Formalisms for Reuse and Systems Integration. FMI 2014. Advances in Intelligent Systems and Computing, vol 346. Springer, Cham. https://doi.org/10.1007/978-3-319-16577-6_9

  • DOI: https://doi.org/10.1007/978-3-319-16577-6_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16576-9

  • Online ISBN: 978-3-319-16577-6

  • eBook Packages: Engineering, Engineering (R0)
