Abstract
The world of data has been developed from two main points of view: the structured relational data model and the unstructured text model. The two distinct cultures of databases and information retrieval now have a natural meeting place in the Web with its semi-structured XML model. As web-style searching becomes a ubiquitous tool, the need to integrate these two viewpoints becomes even more important.
This tutorial will provide an overview of the different issues and approaches put forward by the Information Retrieval and the Database communities and survey the DB-IR integration efforts with a focus on techniques applicable to XML retrieval. A variety of application scenarios for DB-IR integration will be covered, including examples of current industrial tools.
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Consens, M.P., Baeza-Yates, R. (2005). Database and Information Retrieval Techniques for XML. In: Grumbach, S., Sui, L., Vianu, V. (eds) Advances in Computer Science – ASIAN 2005. Data Management on the Web. ASIAN 2005. Lecture Notes in Computer Science, vol 3818. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11596370_4
DOI: https://doi.org/10.1007/11596370_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30767-9
Online ISBN: 978-3-540-32249-8
eBook Packages: Computer Science, Computer Science (R0)