ABSTRACT
Mapping and translating data across different representations is a crucial problem in information systems. Many formalisms and tools are currently used for this purpose, to the point that developers typically face a difficult question: "what is the right tool for my translation task?" In this paper, we introduce several techniques that contribute to answering this question. Among these are a fairly general definition of a data transformation system, a new and very efficient similarity measure to evaluate the outputs produced by such a system, and a metric to estimate user effort. Based on these techniques, we are able to compare a wide range of systems on many translation tasks, to gain interesting insights about their effectiveness and, ultimately, about their "intelligence".
Index Terms
- What is the IQ of your data transformation system?