Abstract
We develop a framework for representing XML documents and queries in vector spaces, and build indexes for processing text-centric semi-structured queries that support a proximity measure between XML documents. The idea of using vector spaces for XML retrieval is not new. In this paper we (i) unify prior approaches into a single framework; (ii) develop techniques that eliminate the special-purpose auxiliary computations (outside the vector space) used previously; (iii) give experimental evidence on benchmark queries that our approach is competitive in retrieval quality; and (iv) show that, as an immediate consequence of the framework, XML documents can be classified and clustered.
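To make the general idea concrete, the following is a minimal sketch of one way an XML document could be placed in a vector space and compared with another by cosine similarity. It is an illustration only, not the encoding developed in the paper: the choice of one axis per (element path, term) pair, the raw term-count weights, and the function names `encode` and `cosine` are all assumptions introduced here.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def encode(xml_text):
    """Map an XML document to a sparse vector.

    Assumption (not from the paper): each axis is a pair
    (root-to-element path, term), weighted by raw term count.
    Only text directly inside an element (before its first child)
    is counted; tail text after child elements is ignored.
    """
    root = ET.fromstring(xml_text)
    vec = Counter()

    def walk(node, path):
        here = path + "/" + node.tag
        for term in (node.text or "").lower().split():
            vec[(here, term)] += 1
        for child in node:
            walk(child, here)

    walk(root, "")
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents for illustration.
doc_a = "<book><title>vector spaces</title><author>salton</author></book>"
doc_b = "<book><title>vector retrieval</title><author>salton</author></book>"
doc_c = "<book><title>salton</title><author>vector</author></book>"

print(cosine(encode(doc_a), encode(doc_b)))  # shares two of three axes with doc_a
print(cosine(encode(doc_a), encode(doc_c)))  # same terms, different paths
```

Under this assumed encoding, a term contributes to similarity only when it occurs under the same element path in both documents, so `doc_a` and `doc_c` share no axes even though they share terms. Because documents become ordinary vectors, standard vector-space classifiers and clustering methods can in principle be applied to them directly.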
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kakade, V., Raghavan, P. (2005). Encoding XML in Vector Spaces. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_8
DOI: https://doi.org/10.1007/978-3-540-31865-1_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25295-5
Online ISBN: 978-3-540-31865-1
eBook Packages: Computer Science (R0)