Query optimization in information integration

  • Original Article
  • Published in Acta Informatica

Abstract

The problem of decentralized data sharing, which is relevant to a wide range of applications, is still a source of major theoretical and practical challenges, in spite of many years of sustained research. In this paper we focus on the challenge of efficiency of query evaluation in information integration systems that use the global-as-view approach, with the objective of developing query-processing strategies that would be widely applicable and easy to implement in real-life applications. Our algorithms take into account important features of today’s data sharing applications: XML as likely interface or representation for data sources; the potential for information overlap across data sources; and the need for inter-source processing, as in joins of data across sources. The focus of this paper is on performance-related characteristics of several alternative approaches that we propose for efficient query processing in information integration, including an approach that uses materialized restructured views. We use synthetic and real-life datasets in our implementation of an information integration system shell to provide experimental results that demonstrate that our algorithms are efficient and competitive in the information integration setting. In addition, our experimental results allow us to make context-specific recommendations on selecting query-processing approaches from our proposed alternatives. As such, our approaches could form a basis for scalable query processing in information integration and interoperability in many practical settings.

Notes

  1. Our proposed semantic-optimization techniques are also more widely applicable to general XQuery optimization.

  2. We omit declarations of elements of type #PCDATA.

  3. If source \(i\) has no data for relation \(r\) (or \(s\)) then all subqueries involving fragment \(r_i\) (or \(s_i\)) are empty and need not be executed.

  4. [12, 13] focus on rewriting queries using restructured views in centralized databases. Our emphasis in this paper is on using materialized restructured views to improve query-processing efficiency in data-integration applications.
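The pruning rule in note 3 can be illustrated with a minimal sketch. It assumes that a global relation is the union of per-source fragments, so that a join of two global relations decomposes into pairwise subqueries over fragments; this decomposition, the dictionary layout, and the names `join`, `plan_subqueries`, and `distributed_join` are assumptions made for illustration and are not taken from the paper. Duplicate elimination across overlapping sources is deliberately left out.

```python
from itertools import product


def join(r_frag, s_frag, key):
    """Naive equi-join of two lists of dict rows on a shared key."""
    return [{**a, **b} for a in r_frag for b in s_frag if a[key] == b[key]]


def plan_subqueries(r_frags, s_frags):
    """Return the (i, j) source pairs whose subquery r_i JOIN s_j may be non-empty.

    A pair is skipped when either fragment is empty, because the join of an
    empty fragment with anything is empty.
    """
    return [
        (i, j)
        for i, j in product(r_frags, s_frags)
        if r_frags[i] and s_frags[j]
    ]


def distributed_join(r_frags, s_frags, key):
    """Union of the results of all subqueries that survive pruning."""
    result = []
    for i, j in plan_subqueries(r_frags, s_frags):
        result.extend(join(r_frags[i], s_frags[j], key))
    return result


# Hypothetical fragments: source 2 has no data for r, source 3 has none for s.
r_frags = {1: [{"k": 1, "a": "x"}], 2: [], 3: [{"k": 2, "a": "y"}]}
s_frags = {1: [{"k": 2, "b": "p"}], 2: [{"k": 1, "b": "q"}], 3: []}

pairs = plan_subqueries(r_frags, s_frags)
answer = distributed_join(r_frags, s_frags, "k")
```

With three sources there are nine candidate subqueries; in this toy input only four survive pruning, and the answer is the same as if all nine had been executed.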

References

  • Abiteboul, S., Duschka, O. M.: Complexity of answering queries using materialized views. In: Proceedings of ACM symposium on principles of database systems, pp. 254–263 (1998)

  • Academic Department Ontology: http://www.daml.org/ontologies/65

  • Arenas, M., Kantere, V., Kementsietsidis, A., Kiringa, I., Miller, R., Mylopoulos, J.: The hyperion project: from data integration to data coordination. SIGMOD Record 32(3), 53–58 (2003) (Special issue on Peer to Peer Data Management)

  • Arenas, M., Libkin, L.: XML data exchange: consistency and query answering. In: Proceedings of ACM symposium on principles of database systems, pp. 13–24 (2005)

  • Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. 41(1), 1–41 (2008)

  • Calvanese, D., De Giacomo, G., Lenzerini, M.: Answering queries using views over description logics knowledge bases. In: Proceedings of AAAI, pp. 386–391 (2000)

  • Calvanese, D., De Giacomo, G., Lenzerini, M., Vardi, M.Y.: Answering regular path queries using views. In: Proceedings of IEEE international conference on data engineering, pp. 389–398 (2000)

  • Calvanese, D., De Giacomo, G., Lenzerini, M., Vardi, M.Y.: View-based query processing for regular path queries with inverse. In: Proceedings of ACM symposium on principles of database systems, pp. 58–66 (2000)

  • Calvanese, D., De Giacomo, G., Lenzerini, M., Vardi, M.Y.: View-based query answering and query containment over semistructured data. In: Proceedings of DBPL, pp. 40–61 (2001)

  • Chen, D., Chirkova, R., Kormilitsin, M., Sadri, F., Salo, T.J.: Query optimization in XML-based information integration. In: Proceedings of international conference on information and knowledge management, pp. 1405–1406 (2008)

  • Chen, D., Chirkova, R., Sadri, F.: Designing an information integration and interoperability system— first steps. Technical Report NCSU CSC TR-2006-29. Available at http://www.csc.ncsu.edu/research/tech/reports.php, October (2006)

  • Chen, D., Chirkova, R., Sadri, F.: Query optimization using restructured views: theory and experiments. Inf. Syst. 34(3), 353–370 (2009)

  • Chirkova, R., Sadri, F.: Query optimization using restructured views. In: Proceedings of international conference on information and knowledge management, pp. 642–651 (2006)

  • Christophides, V., Karvounarakis, G., Magkanaraki, A., Plexousakis, D., Tannen, V.: The ICS-FORTH Semantic Web integration middleware (SWIM). In: IEEE data engineering bulletin, pp. 11–18 (2003)

  • CiteSeer: http://citeseer.ist.psu.edu/

  • Copeland, G.P., Khoshafian, S.: A decomposition storage model. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 268–279 (1985)

  • Cunningham, C., Graefe, G., Galindo-Legaria, C.A.: PIVOT and UNPIVOT: optimization and execution strategies in an RDBMS. In: Proceedings of international conference on very large databases, pp. 998–1009 (2004)

  • DAML Ontology Library: http://www.daml.org/ontologies/

  • Dar, S., Franklin, M.J., Thór Jónsson, B., Srivastava, D., Tan, M.: Semantic data caching and replacement. In: Proceedings of international conference on very large databases, pp. 330–341 (1996)

  • Davidson, S., Fan, W., Hara, C., Qin, J.: Propagating XML constraints to relations. In: Proceedings of IEEE international conference on data engineering (2003)

  • dblp: http://www.informatik.uni-trier.de/ley/db/index.html

  • Dong, X., Halevy, A.: Indexing dataspaces. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 43–54 (2007)

  • Duschka, O.M., Genesereth, M.R., Levy, A.Y.: Recursive query plans for data integration. J. Log. Program. 43(1), 49–73 (2000)

  • Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

  • Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. In: Proceedings of international conference on database theory, pp. 207–224 (2003)

  • Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. Theor. Comput. Sci. 336(1), 89–124 (2005)

  • Fensel, D.: Information Integration with Ontologies: Ontology Based Information Integration in an Industrial Setting. Wiley, New York (2005)

  • Franklin, M.J., Thór Jónsson, B., Kossmann, D.: Performance tradeoffs for client–server query processing. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 149–160 (1996)

  • Grahne, G., Mendelzon, A.O.: Tableau techniques for querying information sources through global schemas. In: Proceedings of international conference on database theory, pp. 332–347 (1999)

  • Gyssens, M., Lakshmanan, L.V.S., Subramanian, I.N.: Tables as a paradigm for querying and restructuring. In: Proceedings of ACM symposium on principles of database systems, pp. 93–103 (1996)

  • Halevy, A.Y.: Answering queries using views. VLDB J. 10(4), 270–294 (2001)

  • Halevy, A.Y.: Data integration: a status report. In: Proceedings of German database conference (Datenbanksysteme für Business, Technologie und Web, BTW), pp. 24–29 (2003)

  • Halevy, A.Y., Etzioni, O., Doan, A., Ives, Z.G., Madhavan, J., McDowell, L., Tatarinov, I.: Crossing the structure chasm. In: Proceedings of the biennial conference on innovative data systems research (CIDR) (2003)

  • Halevy, A.Y., Ives, Z.G., Madhavan, J., Mork, P., Suciu, D., Tatarinov, I.: The Piazza peer data management system. IEEE Trans. Knowl. Data Eng. 16(7), 787–798 (2004)

  • Hernández, M.A., Papotti, P., Tan, W.C.: Data exchange with data-metadata translations. Proc. VLDB Endow. 1(1), 260–273 (2008)

  • Kossmann, D.: The state of the art in distributed query processing. ACM Comput. Surv. 32(4), 422–469 (2000)

  • Krishnamurthy, R., Litwin, W., Kent, W.: Language features for interoperability of databases with schematic discrepancies. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 40–49 (1991)

  • Lacroix, Z., Raschid, L., Vidal, M.-E.: Semantic model to integrate biological resources. In: ICDE workshops (2006)

  • Lakshmanan, L.V.S., Sadri, F.: Interoperability on XML data. In: Proceedings of the international semantic web conference (ISWC), pp. 146–163 (2003)

  • Lakshmanan, L.V.S., Sadri, F., Subramanian, I.N.: On the logical foundations of schema integration and evolution in heterogeneous database systems. In: Proceedings of international conference on deductive and object-oriented databases, pp. 81–100 (1993)

  • Lakshmanan, L.V.S., Sadri, F., Subramanian, I.N.: SchemaSQL: a language for interoperability in relational multi-database systems. In: Proceedings of international conference on very large databases, pp. 239–250 (1996)

  • Lakshmanan, L.V.S., Sadri, F., Subramanian, S.N.: On efficiently implementing SchemaSQL on a SQL database system. In: Proceedings of international conference on very large databases, pp. 471–482 (1999)

  • Lakshmanan, L.V.S., Sadri, F., Subramanian, S.N.: SchemaSQL—an extension to SQL for multi-database interoperability. ACM Trans. Database Syst. 26(4), 476–519 (2001)

  • Lenzerini, M.: Data integration: a theoretical perspective. In: Proceedings of ACM symposium on principles of database systems, pp. 233–246 (2002)

  • Levy, A.Y., Mendelzon, A.O., Sagiv, Y., Srivastava, D.: Answering queries using views. In: Proceedings of ACM symposium on principles of database systems, pp. 95–104 (1995)

  • Levy, A.Y., Rajaraman, A., Ordille, J.J.: Querying heterogeneous information sources using source descriptions. In: Proceedings of international conference on very large databases, pp. 251–262 (1996)

  • Liu, L., Özsu, M. (eds.): Encyclopedia of Database Systems. Springer, Berlin (2009)

  • Madhavan, J., Cohen, S., Luna Dong, X., Halevy, A.Y., Jeffery, S.R., Ko, D., Yu, C.: Web-scale data integration: you can afford to pay as you go. In: Proceedings of the biennial conference on innovative data systems research (CIDR), pp. 342–350 (2007)

  • Madhavan, J., Halevy, A.Y.: Composing mappings among data sources. In: Proceedings of international conference on very large databases, pp. 572–583 (2003)

  • Marnette, B., Mecca, G., Papotti, P.: Scalable data exchange with functional dependencies. Proc. VLDB Endow. 3(1), 105–116 (2010)

  • Miller, R.J., Hernández, M.A., Haas, L.M., Yan, L.-L., Howard Ho, C.T., Fagin, R., Popa, L.: The Clio project: managing heterogeneity. SIGMOD Record 30(1), 78–83 (2001)

  • NCSU-UNCG Information-Integration Project: http://dbgroup.ncsu.edu/?page_id=205

  • Noy, N.F.: Semantic integration: a survey of ontology-based approaches. SIGMOD Record 33(4), 65–70 (2004)

  • Noy, N.F., McGuinness, D.L.: Ontology development 101: a guide to creating your first ontology. http://ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinness.pdf (2001)

  • Poess, M., Othayoth, N.R.: Large scale data warehouses on grid: Oracle database 10g and HP ProLiant systems. In: Proceedings of international conference on very large databases, pp. 1055–1066 (2005)

  • PostgreSQL: http://www.postgresql.org/

  • Pottinger, R., Halevy, A.Y.: Minicon: a scalable algorithm for answering queries using views. VLDB J. 10(2–3), 182–198 (2001)

  • Pottinger, R., Levy, A.Y.: A scalable algorithm for answering queries using views. In: Proceedings of international conference on very large databases, pp. 484–495 (2000)

  • Sanket Sahoo, S., Thomas, C., Sheth, A.P., York, W.S., Tartir, S.: Knowledge modeling and its application in life sciences: a tale of two ontologies. In: Proceedings of the international WWW conference, pp. 317–326 (2006)

  • Saxonica, XSLT and XQuery Processing: http://www.saxonica.com/

  • Semantic Web: http://www.w3c.org/2001/sw/

  • SIGMOD: http://www.sigmod.org/

  • Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E.J., O’Neil, P.E., Rasin, A., Tran, N., Zdonik, S.B.: C-store: a column-oriented dbms. In: Proceedings of international conference on very large databases, pp. 553–564 (2005)

  • Tatarinov, I., Halevy, A.: Efficient query reformulation in peer data management systems. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 539–550 (2004)

  • Ullman, J.D.: Principles of Database and Knowledge-Base Systems, vol. I. Computer Science Press, Rockville (1988)

  • Ullman, J.D.: Information integration using logical views. In: Proceedings of international conference on database theory, pp. 19–40 (1997)

  • Wiederhold, G.: Mediators in the architecture of future information systems. IEEE Comput. 25(3), 38–49 (1992)

  • Wilkinson, K., Sayers, C., Kuno, H.A., Reynolds, D.: Efficient RDF storage and retrieval in Jena2. In: Proceedings of VLDB workshop on semantic web and databases, pp. 131–150 (2003)

  • Wyss, C.M., Robertson, E.L.: Relational languages for metadata integration. ACM Trans. Database Syst. 30(2), 624–660 (2005)

  • XML Path Language (XPath): http://www.w3c.org/TR/xpath

  • Yu, C., Popa, L.: Constraint-based XML query rewriting for data integration. In: Proceedings of ACM SIGMOD international conference on management of data, pp. 371–382 (2004)

Acknowledgments

We are grateful to Igor Tatarinov and Alon Halevy for providing us with the complete experimental setup for [64], and to Natasha Noy for helpful discussions on ontology-based information mediation. Special thanks go to the anonymous referees for their extensive and insightful observations and comments. This work has been supported by NSF grants Career 0447742 and IIS 0307072 and by NCSU CACC grant 07-01.

Author information

Corresponding author

Correspondence to Fereidoon Sadri.

About this article

Cite this article

Chen, D., Chirkova, R., Sadri, F. et al. Query optimization in information integration. Acta Informatica 50, 257–287 (2013). https://doi.org/10.1007/s00236-013-0179-1

  • DOI: https://doi.org/10.1007/s00236-013-0179-1
