DOI: 10.1145/775152.775167

Dynamic maintenance of web indexes using landmarks

Published: 20 May 2003

ABSTRACT

Recent work on incremental crawling has enabled the indexed document collection of a search engine to stay more closely synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web, and rebuilding the complete index at high frequency is expensive. Previous work on incremental inverted index updates has been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed.

In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.
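
The abstract does not spell out the data layout, so the following is only a minimal sketch of one plausible reading, not the paper's implementation. It assumes that each posting stores a word position as a (landmark, offset) pair instead of an absolute position, that landmarks are placed every `block` words, and that a separate landmark table (not modeled here) maps each landmark to its absolute position. The function names, the fixed block size, and the use of Python's `difflib` as the diff step are all assumptions made for illustration. The sketch counts how many postings must be deleted or inserted after an edit under both addressing schemes.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def absolute_postings(words):
    """Postings addressed by absolute word position: (word, position)."""
    return {(w, pos) for pos, w in enumerate(words)}


def landmark_postings(words, block):
    """Postings addressed as (word, landmark id, offset from that landmark).

    Assumption: a landmark is placed every `block` words of the old document.
    """
    return {(w, pos // block, pos % block) for pos, w in enumerate(words)}


def landmark_postings_after_edit(old, new, block):
    """Re-address the new document while keeping the old landmarks.

    Words that the diff reports as unchanged stay attached to their old
    landmark; inserted or replaced words join the landmark at the edit point.
    Offsets are recounted per landmark in document order.
    """
    last_landmark = max(len(old) - 1, 0) // block
    next_offset = defaultdict(int)
    postings = set()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for k, j in enumerate(range(j1, j2)):  # empty for "delete"
            if tag == "equal":
                landmark = (i1 + k) // block
            else:  # "insert" or "replace"
                landmark = min(i1 // block, last_landmark)
            postings.add((new[j], landmark, next_offset[landmark]))
            next_offset[landmark] += 1
    return postings


def count_updates(old, new, block):
    """Return (absolute-scheme updates, landmark-scheme updates).

    An update is one posting deleted from or inserted into the index.
    Changes to the landmark table itself are not counted.
    """
    absolute = len(absolute_postings(old) ^ absolute_postings(new))
    relative = len(
        landmark_postings(old, block)
        ^ landmark_postings_after_edit(old, new, block)
    )
    return absolute, relative


if __name__ == "__main__":
    old_doc = "a b c d e f g h i j k l".split()
    new_doc = ["x"] + old_doc  # one word inserted at the front
    print(count_updates(old_doc, new_doc, block=4))  # prints (25, 9)
```

In this toy case, inserting one word at the front shifts every absolute position, so all 12 old postings are deleted and 13 new ones inserted. With landmark-relative addressing, only the postings in the first block change; the later blocks keep their offsets, and only their landmark positions in the (unmodeled) landmark table would need adjusting.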


        Published in

        WWW '03: Proceedings of the 12th international conference on World Wide Web
        May 2003
        772 pages
        ISBN: 1581136803
        DOI: 10.1145/775152

        Copyright © 2003 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Qualifiers

        • Article

        Acceptance Rates

        Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%

