DOI: 10.1145/1008992.1009046
Article

Assigning identifiers to documents to enhance the clustering property of fulltext indexes

Published: 25 July 2004

ABSTRACT

Web search engines provide a large-scale text document retrieval service by processing huge inverted file indexes. Inverted file indexes allow fast query resolution and good memory utilization, since their d-gap representation can be compressed effectively and efficiently with variable-length encoding methods. This paper proposes and evaluates several algorithms that seek an assignment of document identifiers minimizing the average d-gap value, thereby enhancing the effectiveness of traditional compression methods. We ran several tests over the Google contest collection to validate the proposed techniques. The experiments demonstrated the scalability and effectiveness of our algorithms. Using them, we improved the compression ratios of several encoding schemes appreciably (by up to 20.81%).
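
To make the d-gap idea concrete, here is a minimal sketch of why renumbering documents can shrink a compressed index. It is an illustration only, not one of the paper's algorithms: the toy index, the hand-picked `mapping`, and the helper names (`d_gaps`, `gamma_bits`, `index_bits`, `reassign`) are all assumptions, and Elias gamma is used merely as a familiar variable-length code, since the abstract does not name the encoding schemes that were evaluated.

```python
from typing import Dict, List


def d_gaps(postings: List[int]) -> List[int]:
    """Return the first document ID followed by the differences between neighbours."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def gamma_bits(n: int) -> int:
    """Length in bits of the Elias gamma code for an integer n >= 1."""
    return 2 * (n.bit_length() - 1) + 1


def index_bits(index: Dict[str, List[int]]) -> int:
    """Total gamma-coded size of every d-gap in a toy inverted index."""
    return sum(gamma_bits(g) for plist in index.values() for g in d_gaps(sorted(plist)))


def reassign(index: Dict[str, List[int]], mapping: Dict[int, int]) -> Dict[str, List[int]]:
    """Apply a permutation of document IDs to every posting list."""
    return {term: sorted(mapping[d] for d in plist) for term, plist in index.items()}


# Hypothetical collection of 8 documents; the two terms occur in interleaved documents.
index = {"apple": [1, 3, 5, 7], "pear": [2, 4, 6, 8]}

# Hand-picked permutation that gives documents sharing a term consecutive IDs.
mapping = {1: 1, 3: 2, 5: 3, 7: 4, 2: 5, 4: 6, 6: 7, 8: 8}

print(index_bits(index))                     # 22 bits with the original IDs
print(index_bits(reassign(index, mapping)))  # 12 bits after renumbering
```

In this toy case the posting lists themselves are unchanged; only the identifiers differ. Grouping documents that share terms turns most gaps into 1, which is the cheapest value for a variable-length code to store.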


      • Published in

        SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
        July 2004
        624 pages
        ISBN: 1581138814
        DOI: 10.1145/1008992

        Copyright © 2004 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 792 of 3,983 submissions, 20%
