Abstract
The problem of decentralized data sharing, which is relevant to a wide range of applications, is still a source of major theoretical and practical challenges, in spite of many years of sustained research. In this paper we focus on the efficiency of query evaluation in information integration systems that use the global-as-view approach, with the objective of developing query-processing strategies that are widely applicable and easy to implement in real-life applications. Our algorithms take into account important features of today’s data sharing applications: XML as a likely interface or representation for data sources; the potential for information overlap across data sources; and the need for inter-source processing, as in joins of data across sources. The focus of this paper is on the performance-related characteristics of several alternative approaches that we propose for efficient query processing in information integration, including an approach that uses materialized restructured views. We use synthetic and real-life datasets in our implementation of an information integration system shell to provide experimental results demonstrating that our algorithms are efficient and competitive in the information integration setting. In addition, our experimental results allow us to make context-specific recommendations on selecting query-processing approaches from our proposed alternatives. As such, our approaches could form a basis for scalable query processing in information integration and interoperability in many practical settings.
Notes
Our proposed semantic-optimization techniques are also more widely applicable to general XQuery optimization.
We omit declarations of elements of type #PCDATA.
If source \(i\) has no data for relation \(r\) (or \(s\)) then all subqueries involving fragment \(r_i\) (or \(s_i\)) are empty and need not be executed.
References
Abiteboul, S., Duschka, O.M.: Complexity of answering queries using materialized views. In: Proceedings of ACM symposium on principles of database systems, pp. 254–263 (1998)
Academic Department Ontology: http://www.daml.org/ontologies/65
Arenas, M., Kantere, V., Kementsietsidis, A., Kiringa, I., Miller, R., Mylopoulos, J.: The hyperion project: from data integration to data coordination. SIGMOD Record 32(3), 53–58 (2003) (Special issue on Peer to Peer Data Management)
Arenas, M., Libkin, L.: XML data exchange: consistency and query answering. In: Proceedings of ACM symposium on principles of database systems, pp. 13–24 (2005)
Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. 41(1), 1–41 (2008)
Calvanese, D., De Giacomo, G., Lenzerini, M.: Answering queries using views over description logics knowledge bases. In: Proceedings of AAAI, pp. 386–391 (2000)
Calvanese, D., De Giacomo, G., Lenzerini, M., Vardi, M.Y.: Answering regular path queries using views. In: Proceedings of IEEE international conference on data engineering, pp. 389–398 (2000)
Calvanese, D., De Giacomo, G., Lenzerini, M., Vardi, M.Y.: View-based query processing for regular path queries with inverse. In: Proceedings of ACM symposium on principles of database systems, pp. 58–66 (2000)
Calvanese, D., De Giacomo, G., Lenzerini, M., Vardi, M.Y.: View-based query answering and query containment over semistructured data. In: Proceedings of DBPL, pp. 40–61 (2001)
Chen, D., Chirkova, R., Kormilitsin, M., Sadri, F., Salo, T.J.: Query optimization in XML-based information integration. In: Proceedings of international conference on information and knowledge management, pp. 1405–1406 (2008)
Chen, D., Chirkova, R., Sadri, F.: Designing an information integration and interoperability system— first steps. Technical Report NCSU CSC TR-2006-29. Available at http://www.csc.ncsu.edu/research/tech/reports.php, October (2006)
Chen, D., Chirkova, R., Sadri, F.: Query optimization using restructured views: theory and experiments. Inf. Syst. 34(3), 353–370 (2009)
Chirkova, R., Sadri, F.: Query optimization using restructured views. In: Proceedings of international conference on information and knowledge management, pp. 642–651 (2006)
Christophides, V., Karvounarakis, G., Magkanaraki, A., Plexousakis, D., Tannen, V.: The ICS-FORTH Semantic Web integration middleware (SWIM). In: IEEE data engineering bulletin, pp. 11–18 (2003)
CiteSeer: http://citeseer.ist.psu.edu/
Copeland, G.P., Khoshafian, S.: A decomposition storage model. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 268–279 (1985)
Cunningham, C., Graefe, G., Galindo-Legaria, C.A.: PIVOT and UNPIVOT: optimization and execution strategies in an RDBMS. In: Proceedings of international conference on very large databases, pp. 998–1009 (2004)
DAML Ontology Library: http://www.daml.org/ontologies/
Dar, S., Franklin, M.J., Thór Jónsson, B., Srivastava, D., Tan, M.: Semantic data caching and replacement. In: Proceedings of international conference on very large databases, pp. 330–341 (1996)
Davidson, S., Fan, W., Hara, C., Qin, J.: Propagating XML constraints to relations. In: Proceedings of IEEE international conference on data engineering (2003)
Dong, X., Halevy, A.: Indexing dataspaces. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 43–54 (2007)
Duschka, O.M., Genesereth, M.R., Levy, A.Y.: Recursive query plans for data integration. J. Log. Program. 43(1), 49–73 (2000)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. In: Proceedings of international conference on database theory, pp. 207–224 (2003)
Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. Theor. Comput. Sci. 336(1), 89–124 (2005)
Fensel, D.: Information Integration with Ontologies: Ontology Based Information Integration in an Industrial Setting. Wiley, New York (2005)
Franklin, M.J., Thór Jónsson, B., Kossmann, D.: Performance tradeoffs for client–server query processing. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 149–160 (1996)
Grahne, G., Mendelzon, A.O.: Tableau techniques for querying information sources through global schemas. In: Proceedings of international conference on database theory, pp. 332–347 (1999)
Gyssens, M., Lakshmanan, L.V.S., Subramanian, I.N.: Tables as a paradigm for querying and restructuring. In: Proceedings of ACM symposium on principles of database systems, pp. 93–103 (1996)
Halevy, A.Y.: Answering queries using views. VLDB J. 10(4), 270–294 (2001)
Halevy, A.Y.: Data integration: a status report. In: Proceedings of German database conference (Datenbanksysteme für Business, Technologie und Web, BTW), pp. 24–29 (2003)
Halevy, A.Y., Etzioni, O., Doan, A., Ives, Z.G., Madhavan, J., McDowell, L., Tatarinov, I.: Crossing the structure chasm. In: Proceedings of the biennial conference on innovative data systems research (CIDR) (2003)
Halevy, A.Y., Ives, Z.G., Madhavan, J., Mork, P., Suciu, D., Tatarinov, I.: The Piazza peer data management system. IEEE Trans. Knowl. Data Eng. 16(7), 787–798 (2004)
Hernández, M.A., Papotti, P., Tan, W.C.: Data exchange with data-metadata translations. Proc. VLDB Endow. 1(1), 260–273 (2008)
Kossmann, D.: The state of the art in distributed query processing. ACM Comput. Surv. 32(4), 422–469 (2000)
Krishnamurthy, R., Litwin, W., Kent, W.: Language features for interoperability of databases with schematic discrepancies. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 40–49 (1991)
Lacroix, Z., Raschid, L., Vidal, M.-E.: Semantic model to integrate biological resources. In: ICDE workshops (2006)
Lakshmanan, L.V.S., Sadri, F.: Interoperability on XML data. In: Proceedings of the international semantic web conference (ISWC), pp. 146–163 (2003)
Lakshmanan, L.V.S., Sadri, F., Subramanian, I.N.: On the logical foundations of schema integration and evolution in heterogeneous database systems. In: Proceedings of international conference on deductive and object-oriented databases, pp. 81–100 (1993)
Lakshmanan, L.V.S., Sadri, F., Subramanian, I.N.: SchemaSQL: a language for interoperability in relational multi-database systems. In: Proceedings of international conference on very large databases, pp. 239–250 (1996)
Lakshmanan, L.V.S., Sadri, F., Subramanian, S.N.: On efficiently implementing SchemaSQL on a SQL database system. In: Proceedings of international conference on very large databases, pp. 471–482 (1999)
Lakshmanan, L.V.S., Sadri, F., Subramanian, S.N.: SchemaSQL—an extension to SQL for multi-database interoperability. ACM Trans. Database Syst. 26(4), 476–519 (2001)
Lenzerini, M.: Data integration: a theoretical perspective. In: Proceedings of ACM symposium on principles of database systems, pp. 233–246 (2002)
Levy, A.Y., Mendelzon, A.O., Sagiv, Y., Srivastava, D.: Answering queries using views. In: Proceedings of ACM symposium on principles of database systems, pp. 95–104 (1995)
Levy, A.Y., Rajaraman, A., Ordille, J.J.: Querying heterogeneous information sources using source descriptions. In: Proceedings of international conference on very large databases, pp. 251–262 (1996)
Liu, L., Özsu, M. (eds.): Encyclopedia of Database Systems. Springer, Berlin (2009)
Madhavan, J., Cohen, S., Luna Dong, X., Halevy, A.Y., Jeffery, S.R., Ko, D., Yu, C.: Web-scale data integration: you can afford to pay as you go. In: Proceedings of the biennial conference on innovative data systems research (CIDR), pp. 342–350 (2007)
Madhavan, J., Halevy, A.Y.: Composing mappings among data sources. In: Proceedings of international conference on very large databases, pp. 572–583 (2003)
Marnette, B., Mecca, G., Papotti, P.: Scalable data exchange with functional dependencies. Proc. VLDB Endow. 3(1), 105–116 (2010)
Miller, R.J., Hernández, M.A., Haas, L.M., Yan, L.-L., Howard Ho, C.T., Fagin, R., Popa, L.: The Clio project: managing heterogeneity. SIGMOD Record 30(1), 78–83 (2001)
NCSU-UNCG Information-Integration Project: http://dbgroup.ncsu.edu/?page_id=205
Noy, N.F.: Semantic integration: a survey of ontology-based approaches. SIGMOD Record 33(4), 65–70 (2004)
Noy, N.F., McGuinness, D.L.: Ontology development 101: a guide to creating your first ontology. http://ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinness.pdf (2001)
Poess, M., Othayoth, N.R.: Large scale data warehouses on grid: Oracle database 10g and HP ProLiant systems. In: Proceedings of international conference on very large databases, pp. 1055–1066 (2005)
PostgreSQL: http://www.postgresql.org/
Pottinger, R., Halevy, A.Y.: Minicon: a scalable algorithm for answering queries using views. VLDB J. 10(2–3), 182–198 (2001)
Pottinger, R., Levy, A.Y.: A scalable algorithm for answering queries using views. In: Proceedings of international conference on very large databases, pp. 484–495 (2000)
Sanket Sahoo, S., Thomas, C., Sheth, A.P., York, W.S., Tartir, S.: Knowledge modeling and its application in life sciences: a tale of two ontologies. In: Proceedings of the international WWW conference, pp. 317–326 (2006)
Saxonica, XSLT and XQuery Processing: http://www.saxonica.com/
Semantic Web: http://www.w3c.org/2001/sw/
SIGMOD: http://www.sigmod.org/
Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E.J., O’Neil, P.E., Rasin, A., Tran, N., Zdonik, S.B.: C-store: a column-oriented dbms. In: Proceedings of international conference on very large databases, pp. 553–564 (2005)
Tatarinov, I., Halevy, A.: Efficient query reformulation in peer data management systems. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 539–550 (2004)
Ullman, J.D.: Principles of Database and Knowledge-Base Systems, vol. I. Computer Science Press, Rockville (1988)
Ullman, J.D.: Information integration using logical views. In: Proceedings of international conference on database theory, pp. 19–40 (1997)
Wiederhold, G.: Mediators in the architecture of future information systems. IEEE Comput. 25(3), 38–49 (1992)
Wilkinson, K., Sayers, C., Kuno, H.A., Reynolds, D.: Efficient RDF storage and retrieval in Jena2. In: Proceedings of VLDB workshop on semantic web and databases, pp. 131–150 (2003)
Wyss, C.M., Robertson, E.L.: Relational languages for metadata integration. ACM Trans. Database Syst. 30(2), 624–660 (2005)
XML Path Language (XPath): http://www.w3c.org/TR/xpath
Yu, C., Popa, L.: Constraint-based XML query rewriting for data integration. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 371–382 (2004)
Acknowledgments
We are grateful to Igor Tatarinov and Alon Halevy for providing us with the complete experimental setup for [64], and to Natasha Noy for the helpful discussions on ontology-based information mediation. Special thanks go to anonymous referees for their extensive and insightful observations and comments. This work has been supported by NSF grants Career 0447742 and IIS 0307072 and by NCSU CACC grant 07-01.
Cite this article
Chen, D., Chirkova, R., Sadri, F. et al. Query optimization in information integration. Acta Informatica 50, 257–287 (2013). https://doi.org/10.1007/s00236-013-0179-1
DOI: https://doi.org/10.1007/s00236-013-0179-1