SIRIUS: A Lightweight XML Indexing and Approximate Search System at INEX 2005

  • Conference paper
Advances in XML Information Retrieval and Evaluation (INEX 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3977)

Abstract

This paper reports on SIRIUS, a lightweight indexing and search engine for XML documents. The retrieval approach is document-oriented and relies on approximate matching of both structure and textual content. Instead of matching whole DOM trees, SIRIUS splits the document object model into a set of paths. In this view, a request is a path-like expression with conditions on the attribute values. We first present the main functionalities and characteristics of this XML IR system, and then report on our experience adapting and using it for the INEX 2005 ad-hoc retrieval task. Finally, we present and analyze the retrieval performance SIRIUS obtained during the INEX 2005 evaluation campaign and show that, despite its lightweight design, the system retrieved highly relevant non-overlapping XML elements and achieved quite good precision at low recall values.
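
The path-based view described in the abstract can be illustrated with a small sketch. This is not the SIRIUS implementation: the abstract does not specify the matching scheme, so the sketch assumes a plain edit distance over tag sequences, and it replaces the conditions on attribute values with a simple substring test on element text. The names `root_to_node_paths`, `edit_distance`, `search`, and the sample `DOC` are hypothetical.

```python
import xml.etree.ElementTree as ET


def root_to_node_paths(xml_text):
    """Split a document into root-to-element paths: a list of (tags, element)."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(elem, prefix):
        path = prefix + [elem.tag]
        paths.append((path, elem))
        for child in elem:
            walk(child, path)

    walk(root, [])
    return paths


def edit_distance(a, b):
    """Unit-cost edit distance between two tag sequences (assumed cost model)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def search(xml_text, query_path, term, max_cost=1):
    """Return (cost, path) for elements whose path is within max_cost edits
    of query_path and whose text contains term (a simplified condition)."""
    hits = []
    for path, elem in root_to_node_paths(xml_text):
        cost = edit_distance(query_path, path)
        text = "".join(elem.itertext()).lower()
        if cost <= max_cost and term.lower() in text:
            hits.append((cost, "/".join(path)))
    return sorted(hits)


# Hypothetical sample document, not taken from the INEX collection.
DOC = "<article><bdy><sec><p>XML retrieval with paths</p></sec></bdy></article>"

if __name__ == "__main__":
    # The query omits the <bdy> level, so only an approximate match succeeds.
    print(search(DOC, ["article", "sec", "p"], "retrieval"))
```

Comparing paths rather than whole trees keeps each comparison a sequence problem, which is the kind of simplification the abstract attributes to the lightweight design; the unit costs and the threshold `max_cost` used here are illustrative choices only.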

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Popovici, E., Ménier, G., Marteau, PF. (2006). SIRIUS: A Lightweight XML Indexing and Approximate Search System at INEX 2005. In: Fuhr, N., Lalmas, M., Malik, S., Kazai, G. (eds) Advances in XML Information Retrieval and Evaluation. INEX 2005. Lecture Notes in Computer Science, vol 3977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-34963-1_24

  • DOI: https://doi.org/10.1007/978-3-540-34963-1_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-34962-4

  • Online ISBN: 978-3-540-34963-1

  • eBook Packages: Computer Science, Computer Science (R0)
