Entry Pairing in Inverted File

  • Conference paper
Web Information Systems Engineering - WISE 2009 (WISE 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5802)

Abstract

This paper proposes exploiting content and usage information to rearrange the inverted index of a full-text IR system. The idea is to merge the entries of two terms that frequently co-occur, either in the collection or in the answered queries, into a single paired entry. Since postings common to the paired terms are not replicated, the resulting index is more compact. In addition, queries containing paired terms are answered faster, since the pre-computed posting intersection can be exploited. To choose which terms to pair, we formulate the term pairing problem as a maximum-weight matching problem on a graph, and we evaluate the efficiency and efficacy of both an exact and a heuristic solution in our scenario. We apply our technique (i) to compact a compressed inverted file built on an actual Web collection of documents, and (ii) to increase the capacity of an in-memory posting list. Experiments showed that in the first case our approach can improve the compression ratio by up to 7.7%, while in the second we measured a saving of 12% to 18% in the size of the posting cache.
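
The pairing idea described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy index, the function names `pair_entry` and `greedy_matching`, the use of the posting-intersection size as edge weight, and the simple greedy matching heuristic, which is not necessarily the heuristic the authors evaluate.

```python
from itertools import combinations

def pair_entry(postings_a, postings_b):
    """Merge two posting lists into one paired entry.

    Returns (only_a, only_b, both): postings exclusive to each term plus
    the shared postings, which are stored once instead of twice.
    """
    a, b = set(postings_a), set(postings_b)
    return sorted(a - b), sorted(b - a), sorted(a & b)

def greedy_matching(weights):
    """Greedy heuristic for maximum-weight matching (illustrative only).

    weights maps a term pair (u, v) to a non-negative weight. Edges are
    scanned from heaviest to lightest, and an edge is kept only if both
    of its terms are still unpaired.
    """
    matched, pairs = set(), []
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    for (u, v), w in ranked:
        if w > 0 and u not in matched and v not in matched:
            pairs.append((u, v))
            matched.update((u, v))
    return pairs

# Hypothetical toy index: term -> sorted list of document identifiers.
index = {
    "new": [1, 2, 3, 5, 8],
    "york": [2, 3, 5, 9],
    "pizza": [3, 4, 9],
}

# Assumed edge weight: number of postings the two terms share, i.e. the
# number of postings saved if the two entries were paired.
weights = {
    (s, t): len(set(index[s]) & set(index[t]))
    for s, t in combinations(sorted(index), 2)
}

pairs = greedy_matching(weights)
paired_index = {(s, t): pair_entry(index[s], index[t]) for s, t in pairs}
```

In this toy setting a conjunctive query on both terms of a pair can read the shared list directly from the paired entry, while a query on just one of the terms has to combine that term's exclusive list with the shared one.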

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lam, H.T., Perego, R., Quan, N.T.M., Silvestri, F. (2009). Entry Pairing in Inverted File. In: Vossen, G., Long, D.D.E., Yu, J.X. (eds) Web Information Systems Engineering - WISE 2009. WISE 2009. Lecture Notes in Computer Science, vol 5802. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04409-0_50

  • DOI: https://doi.org/10.1007/978-3-642-04409-0_50

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04408-3

  • Online ISBN: 978-3-642-04409-0

  • eBook Packages: Computer Science, Computer Science (R0)
