Content-Aware DataGuides: Interleaving IR and DB Indexing Techniques for Efficient Retrieval of Textual XML Data

  • Conference paper
Advances in Information Retrieval (ECIR 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2997)

Abstract

Not only since the advent of XML, many applications call for efficient structured document retrieval, challenging both Information Retrieval (IR) and database (DB) research. Most approaches combining indexing techniques from both fields still separate path and content matching, merging the hits in an expensive join. This paper shows that retrieval is significantly accelerated by processing text and structure simultaneously. The Content-Aware DataGuide (CADG) interleaves IR and DB indexing techniques to minimize path matching and suppress joins at query time, also saving needless I/O operations during retrieval. Extensive experiments prove the CADG to outperform the DataGuide [11,14] by a factor of 5 to 200 on average. For structurally unselective queries, it is over 400 times faster than the DataGuide. The best results were achieved on large collections of heterogeneously structured textual documents.
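
The contrast drawn in the abstract, separate path and content matching merged by a join versus matching both at once, can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions, not the CADG as defined in the paper: the corpus, the function names, the exact per-path term sets, and the postings keyed by (term, path) are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Toy corpus of (node_id, label_path, text) triples; purely illustrative data.
NODES = [
    (1, "/book/title", "xml retrieval"),
    (2, "/book/chapter/title", "indexing xml"),
    (3, "/book/chapter/para", "signature files"),
    (4, "/article/title", "inverted files"),
]


def build_separate_indexes(nodes):
    """Baseline: a path index and an inverted file kept apart."""
    path_index = defaultdict(set)   # label path -> node ids
    inverted = defaultdict(set)     # term -> node ids
    for nid, path, text in nodes:
        path_index[path].add(nid)
        for term in text.split():
            inverted[term].add(nid)
    return path_index, inverted


def query_with_join(path_index, inverted, label, term):
    """Match structure and content separately, then intersect (the join)."""
    structural = set()
    for path, ids in path_index.items():
        if path.endswith("/" + label):
            structural |= ids
    return sorted(structural & inverted.get(term, set()))


def build_content_aware_index(nodes):
    """Assumed variant: each path records which terms occur under it,
    and postings are keyed by (term, path)."""
    terms_by_path = defaultdict(set)  # label path -> terms seen under it
    postings = defaultdict(set)       # (term, path) -> node ids
    for nid, path, text in nodes:
        for term in text.split():
            terms_by_path[path].add(term)
            postings[(term, path)].add(nid)
    return terms_by_path, postings


def query_content_aware(terms_by_path, postings, label, term):
    """Skip paths that cannot contain the term; read postings with no join."""
    hits = set()
    for path, terms in terms_by_path.items():
        if path.endswith("/" + label) and term in terms:
            hits |= postings[(term, path)]
    return sorted(hits)
```

In this sketch both query functions return the same node ids, but the second one never touches postings for paths that lack the keyword and never intersects two hit lists, which is the kind of saving the abstract attributes to processing text and structure simultaneously.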

References

  1. Amer-Yahia, S., Case, P.: XQuery and XPath Full-Text Use Cases. W3C Working Draft (2003), See http://www.w3.org/TR/xmlquery-full-text-use-cases

  2. Baeza-Yates, R., Navarro, G.: Integrating Contents and Structure in Text Retrieval. SIGMOD Record 25(1), 67–79 (1996)

  3. Barg, M., Wong, R.K.: A Fast and Versatile Path Index for Querying Semi-Structured Data. In: Proc. 8th Int. Conf. on DBS for Advanced Applications (2003)

  4. Buxton, S., Rys, M.: XQuery and XPath Full-Text Requirements. W3C Working Draft (2003), See http://www.w3.org/TR/xquery-full-text-requirements

  5. Chen, Y., Aberer, K.: Combining Pat-Trees and Signature Files for Query Eval. in Document DBs. In: Proc. 10th Int. Conf. on DB & Expert Systems Applic. (1999)

  6. Cooper, B., Sample, N., Franklin, M.J., Hjaltason, G.R., Shadmon, M.: A Fast Index for Semistructured Data. In: Proc. 27th Int. Conf. on Very Large DB (2001)

  7. Cui, H., Wen, J.-R., Chua, T.-S.: Hier. Indexing and Flexible Element Retrieval for Struct. Document. In: Proc. 25th Europ. Conf. on IR Research, pp. 73–87 (2003)

  8. Faloutsos, C.: Signature Files: Design and Performance Comparison of Some Signature Extraction Methods. In: Proc. ACM-SIGIR Int. Conf. on Research and Development in IR, pp. 63–82 (1985)

  9. Frakes, W.B. (ed.): IR. Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs (1992)

  10. Fuhr, N., Großjohann, K.: XIRQL: A Query Language for IR in XML Documents. Research and Development in IR, pp. 172–180 (2001)

  11. Goldman, R., Widom, J.: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In: Proc. 23rd Int. Conf. on Very Large DB (1997)

  12. Kaushik, R., Krishnamurthy, R., Naughton, J.F., Ramakrishnan, R.: On the Integration of Structure Indexes and Inverted Lists. In: Proc. 20th Int. Conf. on Data Engineering (2004) (to appear)

  13. Li, Q., Moon, B.: Indexing and Querying XML Data for Regular Path Expressions. In: Proc. 27th Int. Conf. on Very Large DB, pp. 361–370 (2001)

  14. McHugh, J., Abiteboul, S., Goldman, R., Quass, D., Widom, J.: Lore: A DB Management System for Semistructured Data. SIGMOD Rec. 26(3), 54–66 (1997)

  15. Meuss, H., Schulz, K., Bry, F.: Visual Querying and Explor. of Large Answers in XML DBs with X2. In: Proc. 19th Int. Conf. on DB Engin., pp. 777–779 (2003)

  16. Meuss, H., Strohmaier, C.: Improving Index Structures for Structured Document Retrieval. In: Proc. 21st Ann. Colloquium on IR Research (1999)

  17. Oesterle, J., Maier-Meyer, P.: The GNoP (German Noun Phrase) Treebank. In: Proc. 1st Int. Conf. on Language Resources and Evaluation (1998)

  18. Schlieder, T., Meuss, H.: Querying and Ranking XML Documents. JASIS Spec. Top. XML/IR 53(6), 489–503 (2002)

  19. Shin, D., Jang, H., Jin, H.: BUS: An Effective Indexing and Retrieval Scheme in Structured Documents. In: Proc. 3rd ACM Int. Conf. on Digital Libraries (1998)

  20. Weigel, F.: A Survey of Indexing Techniques for Semistructured Documents. Technical report, Dept. of Computer Science, University of Munich, Germany (2002)

  21. Weigel, F.: Content-Aware DataGuides for Indexing Semi-Structured Data. Master’s thesis, Dept. of Computer Science, University of Munich, Germany (2003)

  22. Wolff, J.E., Flörke, H., Cremers, A.B.: Searching and Browsing Collections of Structural Information. In: Advances in Digital Libraries, pp. 141–150 (2000)

  23. XML Benchmark Project. A benchmark suite for evaluating XML repositories, See http://monetdb.cwi.nl/xml

  24. Zobel, J., Moffat, A., Ramamohanarao, K.: Inverted Files Versus Signature Files for Text Indexing. ACM Transactions on DB Systems 23(4), 453–490 (1998)

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Weigel, F., Meuss, H., Bry, F., Schulz, K.U. (2004). Content-Aware DataGuides: Interleaving IR and DB Indexing Techniques for Efficient Retrieval of Textual XML Data. In: McDonald, S., Tait, J. (eds) Advances in Information Retrieval. ECIR 2004. Lecture Notes in Computer Science, vol 2997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24752-4_28

  • DOI: https://doi.org/10.1007/978-3-540-24752-4_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21382-6

  • Online ISBN: 978-3-540-24752-4

  • eBook Packages: Springer Book Archive
