Abstract
A popular approach to integrating heterogeneous data sources is to Extract, Transform and Load (ETL) data from disparate sources into a consolidated data store while addressing integration challenges that include, but are not limited to, structural differences between the source and target schemas, semantic differences in their vocabularies, and differences in data encoding. This work focuses on the integration of tree-like hierarchical data, that is, information that, when modeled as a relational schema, can take the shape of a flat schema, a self-referential schema, or a hybrid schema. Examples include evolutionary taxonomies, geological time scales, and organizational charts. Because developing ETL processes for this particular but common type of data is complex, our work aims to reduce the time and effort required to map and transform it. Our research automates and simplifies all possible transformations involving ranked self-referential and flat representations by (a) proposing MSL+, an extension to IBM’s Mapping Specification Language (MSL), to succinctly express the mapping between schemas while hiding the complexity of the transformation implementation from the user, and (b) implementing a transformation component for the Talend open-source ETL platform, called Tree Transformer (TT). We evaluated MSL+ and TT in the context of biodiversity data integration, where this class of transformations is a recurring pattern. We demonstrate the effectiveness of MSL+ in terms of development time savings, as well as a 2- to 25-fold improvement in transformation time achieved by TT compared with existing implementations and Talend built-in components.
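To make the two representations named above concrete, the following is a minimal Python sketch of converting a flat schema (one column per rank) into a ranked self-referential schema (each row points to its parent) and back. It is an illustration only, not MSL+ or the Tree Transformer component: the rank list, the field names (`id`, `name`, `rank`, `parent_id`), and both function names are assumptions introduced here.

```python
from typing import Dict, List, Optional

# Assumed rank order, root first (hypothetical; a real taxonomy has more ranks).
RANKS = ["kingdom", "phylum", "class"]

Row = Dict[str, Optional[str]]
Node = Dict[str, object]


def flat_to_self_referential(rows: List[Row], ranks: List[str] = RANKS) -> List[Node]:
    """Turn flat rows (one column per rank) into nodes with a parent pointer."""
    nodes: List[Node] = []
    ids: Dict[tuple, int] = {}  # (parent_id, rank, name) -> node id
    for row in rows:
        parent_id: Optional[int] = None
        for rank in ranks:
            name = row.get(rank)
            if name is None:
                continue  # missing rank: the next value attaches to the nearest known ancestor
            key = (parent_id, rank, name)
            if key not in ids:
                ids[key] = len(nodes) + 1
                nodes.append(
                    {"id": ids[key], "name": name, "rank": rank, "parent_id": parent_id}
                )
            parent_id = ids[key]
    return nodes


def self_referential_to_flat(nodes: List[Node], ranks: List[str] = RANKS) -> List[Row]:
    """Emit one flat row per leaf by walking parent pointers up to the root."""
    by_id = {n["id"]: n for n in nodes}
    parent_ids = {n["parent_id"] for n in nodes}
    rows: List[Row] = []
    for node in nodes:
        if node["id"] in parent_ids:
            continue  # only leaves produce a row
        row: Row = {rank: None for rank in ranks}
        current = node
        while current is not None:
            row[current["rank"]] = current["name"]
            current = by_id.get(current["parent_id"])
        rows.append(row)
    return rows


flat_rows = [
    {"kingdom": "Animalia", "phylum": "Chordata", "class": "Mammalia"},
    {"kingdom": "Animalia", "phylum": "Chordata", "class": "Aves"},
]
tree_nodes = flat_to_self_referential(flat_rows)
round_trip = self_referential_to_flat(tree_nodes)
```

In this sketch the shared ancestors (`Animalia`, `Chordata`) are stored once in the self-referential form, while the flat form repeats them in every row; a hybrid schema would mix the two conventions.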
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Soomro, S., Matsunaga, A., Fortes, J.A.B. (2015). Simplifying Extract-Transform-Load for Ranked Hierarchical Trees via Mapping Specifications. In: Bouabana-Tebibel, T., Rubin, S. (eds) Formalisms for Reuse and Systems Integration. FMI 2014. Advances in Intelligent Systems and Computing, vol 346. Springer, Cham. https://doi.org/10.1007/978-3-319-16577-6_9
DOI: https://doi.org/10.1007/978-3-319-16577-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16576-9
Online ISBN: 978-3-319-16577-6
eBook Packages: Engineering, Engineering (R0)