ABSTRACT
Query languages for retrieving XML documents allow conditions that refer to both the content and the structure of documents. In this paper, we investigate two different approaches for reducing the index space of inverted files for XML documents. First, we consider methods for compressing index entries. Second, we develop the new XS tree data structure, which holds the structural description of a document in a form compact enough to be kept in main memory. Experimental results on two large XML document collections show that very high compression rates for indexes can be achieved, but any compression increases retrieval time. On the other hand, highly compressed indexes may be feasible for applications where storage is limited, such as PDAs or e-book devices.
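The abstract does not say which codes are used to compress index entries. As a minimal sketch of the general idea, the Python below assumes a postings list of strictly increasing positive document identifiers, stores the gaps between them, and writes each gap as an Elias gamma code. The function names, the bit-string representation, and the choice of gamma coding are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def gamma_encode(n: int) -> str:
    """Return the Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]  # binary form, always starts with '1'
    # Prefix with (length - 1) zeros so the decoder knows how many bits follow.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits: str) -> List[int]:
    """Decode a concatenation of gamma codes (assumes well-formed input)."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def compress_postings(doc_ids: List[int]) -> str:
    """Gap-encode a strictly increasing list of positive ids (hypothetical helper)."""
    if not doc_ids:
        return ""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(gamma_encode(g) for g in gaps)


def decompress_postings(bits: str) -> List[int]:
    """Rebuild the original ids by summing the decoded gaps."""
    doc_ids = []
    total = 0
    for gap in gamma_decode_all(bits):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [3, 7, 8, 20]
encoded = compress_postings(postings)
print(encoded, len(encoded))  # 16 bits for four identifiers
print(decompress_postings(encoded))
```

The sketch also shows where the trade-off named in the abstract comes from: reading a postings list now requires bit-level decoding and a running sum over the gaps, which is extra work compared with reading fixed-width integers.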