ABSTRACT
Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web, and rebuilding the complete index at high frequency is expensive. Previous work on incremental inverted index updates has been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed.

In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.
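To make the landmark idea concrete, the following is a minimal Python sketch, not the implementation evaluated in the paper. It assumes that a landmark is placed every `block` words, that each posting stores a (landmark, offset) pair instead of an absolute word position, and that Python's `difflib.SequenceMatcher` stands in for the diff algorithm. The function names, the fixed block size, and the rule that inserted words join the block at the edit point are all illustrative assumptions. The sketch counts how many postings differ between two versions of a document under each addressing scheme.

```python
from difflib import SequenceMatcher


def absolute_postings(words):
    """Postings addressed by absolute word position: (word, position)."""
    return {(w, i) for i, w in enumerate(words)}


def landmark_postings_old(words, block):
    """Postings addressed relative to a landmark placed every `block` words
    (an assumed placement policy): (word, landmark_id, offset)."""
    return {(w, i // block, i % block) for i, w in enumerate(words)}


def landmark_postings_new(old, new, block):
    """Landmark-relative postings for the new version.

    The diff maps unchanged words back to the landmark they belonged to in
    the old version. Inserted or replaced words are assigned (by assumption)
    to the landmark covering the old position where the edit occurred.
    """
    owner = [0] * len(new)
    last_old = max(len(old) - 1, 0)
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for j in range(j1, j2):  # empty range for 'delete' opcodes
            if tag == "equal":
                owner[j] = (i1 + (j - j1)) // block
            else:
                owner[j] = min(i1, last_old) // block

    # A landmark's new absolute position is the first new word it owns.
    # Only this small landmark directory changes for untouched blocks.
    start = {}
    postings = set()
    for j, w in enumerate(new):
        lm = owner[j]
        start.setdefault(lm, j)
        postings.add((w, lm, j - start[lm]))
    return postings


def count_absolute_updates(old, new):
    """Postings to delete plus postings to insert with absolute positions."""
    return len(absolute_postings(old) ^ absolute_postings(new))


def count_landmark_updates(old, new, block=4):
    """Postings to delete plus postings to insert with landmark offsets."""
    return len(landmark_postings_old(old, block)
               ^ landmark_postings_new(old, new, block))


if __name__ == "__main__":
    old_doc = "the quick brown fox jumps over the lazy dog again and again".split()
    new_doc = ["very"] + old_doc  # one word inserted at the front
    print("absolute:", count_absolute_updates(old_doc, new_doc))
    print("landmark:", count_landmark_updates(old_doc, new_doc))
```

In this sketch, a single insertion at the front of the document shifts every absolute position, so every posting changes, whereas with landmark-relative offsets only the postings in the block containing the edit change. The blocks after the edit keep their offsets and only their landmark positions move.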
Index Terms
- Dynamic maintenance of web indexes using landmarks