Article

Dynamic maintenance of web indexes using landmarks

Authors:
Lipyeow Lim (Duke University, Durham, NC)
Min Wang (IBM T. J. Watson Research Ctr., Hawthorne, NY)
Sriram Padmanabhan (IBM T. J. Watson Research Ctr., Hawthorne, NY)
Jeffrey Scott Vitter (Purdue University, West Lafayette, IN)
Ramesh Agarwal (IBM Almaden Research Ctr., San Jose, CA)

WWW '03: Proceedings of the 12th international conference on World Wide Web, May 2003, Pages 102–111
https://doi.org/10.1145/775152.775167

Published: 20 May 2003

ABSTRACT

Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web. A complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates has been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed.

In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.
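The abstract does not spell out the data structures, so the following is only a rough sketch of why landmarks combined with a diff might reduce posting updates. It assumes, hypothetically, that a landmark is placed every `block` words, that each posting stores a word's offset from its landmark rather than its absolute position, and that Python's `difflib.SequenceMatcher` stands in for the diff algorithm. The function names `naive_updates` and `landmark_updates` are illustrative and are not taken from the paper.

```python
from bisect import bisect_right
from difflib import SequenceMatcher


def naive_updates(old, new):
    """Postings deleted plus inserted when postings store absolute positions."""
    old_postings = {(w, i) for i, w in enumerate(old)}
    new_postings = {(w, i) for i, w in enumerate(new)}
    return len(old_postings ^ new_postings)


def landmark_updates(old, new, block=4):
    """Postings deleted plus inserted when postings store (landmark, offset).

    Assumption: one landmark per `block` words of the old version. Landmarks
    move with the text, so a posting changes only if its offset from its
    landmark changes.
    """
    old_postings = {(w, i // block, i % block) for i, w in enumerate(old)}

    # Use the diff to map each old landmark boundary to a position in `new`.
    new_pos = {}
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for b in range(i1, i2):
            if b % block == 0:
                new_pos[b // block] = j1 + (b - i1) if tag == "equal" else j1
    new_pos[0] = 0  # the first landmark always marks the document start
    positions = [new_pos[lm] for lm in sorted(new_pos)]

    # Re-express every word of the new version relative to its landmark.
    new_postings = set()
    for j, w in enumerate(new):
        lm = bisect_right(positions, j) - 1
        new_postings.add((w, lm, j - positions[lm]))
    return len(old_postings ^ new_postings)


old = "a b c d e f g h i j k l".split()
new = ["x"] + old  # one word inserted at the front shifts every position
print(naive_updates(old, new))              # 25
print(landmark_updates(old, new, block=4))  # 9
```

In this toy case only the block containing the insertion is rewritten. The two later landmarks change position, which would be recorded in a separate landmark directory; that directory cost is not counted here.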

Index Terms

Dynamic maintenance of web indexes using landmarks
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval

Published in
WWW '03: Proceedings of the 12th international conference on World Wide Web
May 2003
772 pages
ISBN: 1581136803
DOI: 10.1145/775152
Conference Chairs:
Gusztáv Hencsey (MTA SZTAKI, Hungary)
Bebo White (Stanford Linear Accelerator Center, USA)
Program Chairs:
Yih-Farn Robin Chen (AT&T Labs -- Research, USA)
László Kovács (MTA SZTAKI, Hungary)
Steve Lawrence (Google Inc., USA)
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
inverted files
update processing
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%