Abstract
Inverted index compression, block addressing, and sequential search on compressed text are three techniques that have been developed separately for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its original size and allow it to be searched directly, faster than the uncompressed text. Inverted index compression significantly reduces the size of the index while keeping the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions, paying for the reduction in space with some sequential text scanning.
In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against the three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% extra space overhead the index has to scan less than 12% of the text for exact searches, and about 20% when one error is allowed in the matches.
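To make the interplay between block addressing and index compression concrete, the following is a minimal sketch rather than the authors' implementation. It assumes fixed-size blocks measured in words, uses Elias gamma codes on the gaps between block numbers as one well-known way to compress an inverted list (the abstract does not say which codes are evaluated), and scans plain word lists where the paper scans compressed text. All function names and the sample text are hypothetical.

```python
import re
from collections import defaultdict

def split_blocks(text, block_size):
    """Cut the text into consecutive blocks of `block_size` words."""
    words = re.findall(r"\w+", text.lower())
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]

def build_index(blocks):
    """Map each word to the increasing list of block numbers containing it."""
    index = defaultdict(list)
    for block_no, block in enumerate(blocks):
        for word in set(block):
            index[word].append(block_no)
    return dict(index)

def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of bits."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def compress_list(block_numbers):
    """Gamma-encode the gaps of an increasing block list (gaps are >= 1)."""
    bits, previous = [], -1
    for b in block_numbers:
        bits.append(gamma_encode(b - previous))
        previous = b
    return "".join(bits)

def decompress_list(bits):
    """Invert compress_list: decode gamma-coded gaps back to block numbers."""
    result, previous, i = [], -1, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # unary prefix gives the length of the gap
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        previous += gap
        result.append(previous)
    return result

def search(word, compressed_index, blocks):
    """Find (block, offset) of each occurrence by scanning candidate blocks only."""
    hits = []
    for block_no in decompress_list(compressed_index.get(word, "")):
        for offset, w in enumerate(blocks[block_no]):  # sequential scan
            if w == word:
                hits.append((block_no, offset))
    return hits

# Hypothetical sample text, split into blocks of 4 words.
text = "the cat sat on the mat the dog sat on the log"
blocks = split_blocks(text, 4)
index = build_index(blocks)
compressed_index = {w: compress_list(lst) for w, lst in index.items()}
print(search("sat", compressed_index, blocks))
```

Larger blocks make the lists shorter and the gaps smaller, so the index shrinks, but each candidate block costs more to scan. This is the space and time trade-off that the block size controls.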
Cite this article
Navarro, G., de Moura, E.S., Neubert, M. et al. Adding Compression to Block Addressing Inverted Indexes. Information Retrieval 3, 49–77 (2000). https://doi.org/10.1023/A:1009934302807
DOI: https://doi.org/10.1023/A:1009934302807