Fast and Flexible Compression for Web Search Engines

https://doi.org/10.1016/j.entcs.2004.09.043
Under a Creative Commons license
open access

Abstract

In this paper we present the adaptation of a compression technique, specially designed to compress large textual databases, to the peculiarities of web search engines.

The (s,c)-Dense Code belongs to a new category of compression techniques [Silva de Moura, E., G. Navarro, N. Ziviani and R. Baeza-Yates, Fast and flexible word searching on compressed text, ACM Transactions on Information Systems 18 (2000), pp. 113–139; Brisaboa, N., A. Fariña, G. Navarro and M. Esteller, (s,c)-dense coding: An optimized compression code for natural language text databases, in: Proc. 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003), LNCS 2857, 2003, pp. 122–136] that allow fast and flexible searching directly on compressed files. However, these methods are suitable only for large natural-language texts of at least 1 megabyte; on smaller texts they do not achieve an attractive compression ratio.

To take advantage of the search capabilities of these techniques, which allow searches on compressed files up to eight times faster than on the plain versions [Silva de Moura, E., G. Navarro, N. Ziviani and R. Baeza-Yates, Fast and flexible word searching on compressed text, ACM Transactions on Information Systems 18 (2000), pp. 113–139], we present a modification of the basic (s,c)-Dense Code that achieves reasonable compression ratios on small files, a requirement when working with search engines.
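For readers unfamiliar with the base technique, the following is a minimal sketch of how a basic (s,c)-Dense Code might assign byte codewords to words ranked by decreasing frequency. It is an illustration, not the modified scheme this paper proposes. It assumes the usual byte-oriented convention in which s + c = 256, byte values below s are "stoppers" that end a codeword, and the remaining c values are "continuers"; the function names `sc_encode` and `sc_decode` are hypothetical.

```python
def sc_encode(rank: int, s: int, c: int) -> bytes:
    """Return the codeword for the word with 0-based frequency rank `rank`.

    Assumed convention: bytes 0..s-1 are stoppers, bytes s..255 are continuers.
    A codeword is zero or more continuers followed by exactly one stopper.
    """
    if rank < 0 or s < 1 or c < 1 or s + c != 256:
        raise ValueError("need rank >= 0, s >= 1, c >= 1 and s + c == 256")
    out = [rank % s]            # last byte: a stopper in [0, s)
    x = rank // s
    while x > 0:
        x -= 1
        out.append(s + x % c)   # earlier byte: a continuer in [s, 256)
        x //= c
    return bytes(reversed(out))  # bytes were produced last-to-first


def sc_decode(code: bytes, s: int, c: int) -> int:
    """Inverse of sc_encode: recover the rank from a single codeword."""
    k = 0
    for b in code[:-1]:          # every byte except the last is a continuer
        k = k * c + (b - s) + 1
    return k * s + code[-1]      # the last byte is the stopper


if __name__ == "__main__":
    # With s = c = 128, the 128 most frequent words get one-byte codewords.
    for rank in (0, 127, 128, 16511, 16512):
        print(rank, list(sc_encode(rank, 128, 128)))
```

Because every codeword ends at the first byte smaller than s, codeword boundaries can be recognized in the compressed stream without decoding it, which is the property that makes direct searching on the compressed file feasible. Choosing s and c trades the number of one-byte codewords (s) against the number of two-byte codewords (s·c).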

Keywords

Compression
document repositories

Partially supported by Ministerio de Ciencia y Tecnología grants (#TIC2003-06593) and (#FIT-150500-2003-588); and by Xunta de Galicia grant (#PGIDIT02SIN10501PR).