Hierarchical Indexing and Flexible Element Retrieval for Structured Document

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2633))

Abstract

As more and more structured documents, such as SGML or XML documents, become available on the Web, there is a growing demand for effective structured document retrieval that exploits both the content and the hierarchical structure of documents and returns document elements of appropriate granularity. Previous work on partial retrieval of structured documents has limited applicability because it requires structured queries and does not allow the document structure to be traversed according to the query. In this paper, we put forward a method for flexible element retrieval that can retrieve relevant document elements of arbitrary granularity in response to natural language queries. The proposed techniques comprise a novel hierarchical index propagation and pruning mechanism and an algorithm for ranking document elements based on the hierarchical index. The experimental results show that our method significantly outperforms other existing methods. Our method is also robust to the long-standing problems of text length normalization and threshold setting in structured document retrieval.

This work was performed when the author was a visiting student at Microsoft Research Asia.
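
The abstract names two components: a hierarchical index built by propagation and pruning, and a ranking of document elements over that index. The paper's actual formulas are not reproduced on this page, so the following is only a minimal sketch of the general idea under stated assumptions. The `Element` class, the `decay` factor, the `min_weight` pruning threshold, and the sum-of-weights score are all hypothetical choices made for illustration and should not be read as the authors' method.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Element:
    """A node in the document tree (e.g. article, section, paragraph)."""
    name: str
    text: str = ""
    children: list = field(default_factory=list)
    index: dict = field(default_factory=dict)  # term -> weight


def build_index(node, decay=0.5, min_weight=0.3):
    """Index a node's own terms, then propagate child weights upward.

    Assumption: a child's term weight is multiplied by `decay` when it
    moves to the parent, and is pruned if the result is below `min_weight`.
    """
    index = {t: float(c) for t, c in Counter(node.text.lower().split()).items()}
    for child in node.children:
        build_index(child, decay, min_weight)
        for term, weight in child.index.items():
            propagated = weight * decay
            if propagated >= min_weight:  # pruning step
                index[term] = index.get(term, 0.0) + propagated
    node.index = index


def rank(root, query):
    """Score every element by the summed weights of the query terms."""
    terms = query.lower().split()
    scored, stack = [], [root]
    while stack:
        node = stack.pop()
        score = sum(node.index.get(t, 0.0) for t in terms)
        if score > 0:
            scored.append((score, node.name))
        stack.extend(node.children)
    # Highest score first; ties broken by element name.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Toy document: an article with two sections, one of which has paragraphs.
doc = Element("article", children=[
    Element("sec1", children=[
        Element("p1", "xml retrieval xml"),
        Element("p2", "index pruning"),
    ]),
    Element("sec2", "query ranking"),
])
build_index(doc)
results = rank(doc, "xml retrieval")
# results == [(3.0, "p1"), (1.5, "sec1"), (0.5, "article")]
```

In this toy run a plain keyword query returns elements at three levels of granularity, and the term "retrieval" is pruned before it reaches the article-level index because its propagated weight drops below the threshold.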

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cui, H., Wen, JR., Chua, TS. (2003). Hierarchical Indexing and Flexible Element Retrieval for Structured Document. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_6

  • DOI: https://doi.org/10.1007/3-540-36618-0_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-01274-0

  • Online ISBN: 978-3-540-36618-8

  • eBook Packages: Springer Book Archive
