Efficient Indexing of Versioned Document Sequences

  • Conference paper
Advances in Information Retrieval (ECIR 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4425)

Abstract

Many information systems keep multiple versions of documents. Examples include content management systems, version control systems (e.g., ClearCase, CVS), wikis, and backup and archiving solutions. Users often want free-text search over such repositories, i.e., the ability to submit queries that may match any version of any document. We propose an indexing method that exploits the inherent redundancy of versioned documents by solving a variant of the multiple sequence alignment problem. The scheme produces an index that is much more compact than a standard index that treats each version independently. In experiments over publicly available versioned data, our method achieved compaction ratios of 81% compared with standard indexing, while supporting the same retrieval capabilities.
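The redundancy the abstract refers to can be illustrated with a small sketch. This is not the paper's algorithm: instead of a multiple sequence alignment variant, it aligns only consecutive versions pairwise using Python's standard `difflib.SequenceMatcher`, and the function names (`naive_posting_count`, `aligned_postings`, `versions_containing`) and the toy data are hypothetical. It only shows why merging token occurrences that survive across versions can shrink an index while still answering "which versions contain this term".

```python
from difflib import SequenceMatcher


def naive_posting_count(versions):
    """Standard indexing: one posting per token occurrence in every version."""
    return sum(len(tokens) for tokens in versions)


def aligned_postings(versions):
    """Merge token occurrences that survive unchanged across consecutive versions.

    Returns [token, first_version, last_version] entries. Assumption: a simple
    pairwise alignment of neighbouring versions stands in for the alignment
    step described in the abstract.
    """
    postings = []
    live = []  # live[i] = index in `postings` for token i of the previous version
    for vnum, tokens in enumerate(versions):
        new_live = [None] * len(tokens)
        if vnum > 0:
            matcher = SequenceMatcher(a=versions[vnum - 1], b=tokens, autojunk=False)
            for block in matcher.get_matching_blocks():
                for k in range(block.size):
                    entry = live[block.a + k]
                    postings[entry][2] = vnum  # extend the version interval
                    new_live[block.b + k] = entry
        for i, token in enumerate(tokens):
            if new_live[i] is None:  # token occurrence is new in this version
                postings.append([token, vnum, vnum])
                new_live[i] = len(postings) - 1
        live = new_live
    return postings


def versions_containing(postings, term):
    """Expand version intervals back into the set of matching version numbers."""
    hits = set()
    for token, first, last in postings:
        if token == term:
            hits.update(range(first, last + 1))
    return sorted(hits)


versions = [
    "the quick brown fox".split(),
    "the quick red fox".split(),
    "the quick red fox jumps".split(),
]
postings = aligned_postings(versions)
print(naive_posting_count(versions), len(postings))
print(versions_containing(postings, "red"))
```

On these three toy versions, the naive count is 13 postings while the interval representation needs 6, and a lookup for "red" still resolves to versions 1 and 2. The numbers say nothing about the compaction reported in the paper; they only make the source of the savings concrete.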

Editor information

Giambattista Amati, Claudio Carpineto, Giovanni Romano

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Herscovici, M., Lempel, R., Yogev, S. (2007). Efficient Indexing of Versioned Document Sequences. In: Amati, G., Carpineto, C., Romano, G. (eds) Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_10

  • DOI: https://doi.org/10.1007/978-3-540-71496-5_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71494-1

  • Online ISBN: 978-3-540-71496-5

  • eBook Packages: Computer Science, Computer Science (R0)
