Entry Pairing in Inverted File

Lam, Hoang Thanh; Perego, Raffaele; Quan, Nguyen Thoi Minh; Silvestri, Fabrizio

doi:10.1007/978-3-642-04409-0_50

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5802)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

This paper proposes exploiting content and usage information to rearrange the inverted index of a full-text IR system. The idea is to merge the entries of two terms that frequently co-occur, either in the collection or in the answered queries, into a single paired entry. Since postings common to the paired terms are not replicated, the resulting index is more compact. In addition, queries containing paired terms are answered faster, since the pre-computed posting intersection can be exploited. To choose which terms to pair, we formulate the term pairing problem as a Maximum-Weight Matching problem on a graph, and we evaluate the efficiency and efficacy of both an exact and a heuristic solution in our scenario. We apply our technique (i) to compact a compressed inverted file built on an actual Web collection of documents, and (ii) to increase the capacity of an in-memory posting list. Experiments showed that in the first case our approach improves the compression ratio by up to 7.7%, while in the second we measured a saving of 12% to 18% in the size of the posting cache.
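The pairing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy index, the function names `pair_entry` and `greedy_matching`, and the choice of "number of shared postings" as edge weight are all assumptions made for illustration. The matching step uses a simple sort-and-pick greedy heuristic as a stand-in; the paper evaluates its own exact and heuristic solutions, which are not reproduced here.

```python
from itertools import combinations

# Hypothetical toy index: term -> sorted list of document identifiers.
index = {
    "new": [1, 2, 3, 5, 8],
    "york": [2, 3, 5, 9],
    "pisa": [4, 7, 9],
    "tower": [4, 7, 10],
}


def pair_entry(postings_a, postings_b):
    """Split two posting lists into (only_a, only_b, common).

    The common part is stored once, so a paired entry holds
    len(only_a) + len(only_b) + len(common) postings instead of
    len(postings_a) + len(postings_b).
    """
    set_a, set_b = set(postings_a), set(postings_b)
    common = sorted(set_a & set_b)
    only_a = sorted(set_a - set_b)
    only_b = sorted(set_b - set_a)
    return only_a, only_b, common


def greedy_matching(idx):
    """Pick disjoint term pairs, heaviest edge first.

    Edge weight (an assumption of this sketch) is the number of
    postings the two terms share, i.e. the postings saved by pairing.
    """
    edges = []
    for t1, t2 in combinations(sorted(idx), 2):
        weight = len(set(idx[t1]) & set(idx[t2]))
        if weight > 0:
            edges.append((weight, t1, t2))
    # Sort by descending weight; break ties by term name for determinism.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))

    matched, pairs = set(), []
    for _, t1, t2 in edges:
        if t1 not in matched and t2 not in matched:
            pairs.append((t1, t2))
            matched.update((t1, t2))
    return pairs


pairs = greedy_matching(index)
only_new, only_york, both = pair_entry(index["new"], index["york"])
```

On this toy index, `pairs` contains `("new", "york")` and `("pisa", "tower")`. For the first pair, `both` is the pre-computed intersection that answers the conjunctive query "new york" directly, while the full list for "new" can be recovered by merging `only_new` with `both`. The paired entry stores 6 postings where the two separate entries stored 9.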

Author information

Authors and Affiliations

Dip. di Informatica, Università di Pisa, Italy
Hoang Thanh Lam
ISTI-CNR, Pisa, Italy
Raffaele Perego & Fabrizio Silvestri
Lomonosov Moscow State University, Russia
Nguyen Thoi Minh Quan

Editor information

Editors and Affiliations

European Research Center for Information Systems, University of Münster, Leonardo Campus 3, 48149, Münster, Germany
Gottfried Vossen
Dept. of Computer Science, School of Engineering, University of California, 95064, Santa Cruz, CA, USA
Darrell D. E. Long
Dept. of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, William M. W. Mong Engineering Building, Shatin, N. T., Hong Kong, China
Jeffrey Xu Yu

About this paper

Cite this paper

Lam, H.T., Perego, R., Quan, N.T.M., Silvestri, F. (2009). Entry Pairing in Inverted File. In: Vossen, G., Long, D.D.E., Yu, J.X. (eds) Web Information Systems Engineering - WISE 2009. WISE 2009. Lecture Notes in Computer Science, vol 5802. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04409-0_50

DOI: https://doi.org/10.1007/978-3-642-04409-0_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04408-3
Online ISBN: 978-3-642-04409-0
eBook Packages: Computer Science, Computer Science (R0)
