research-article

CRAYSE: design and implementation of efficient text search algorithm in a web crawler

Published: 20 July 2010

Abstract

CRAYSE is a SEarch WHIle CRAwl application intended to perform fast text searches in web pages. A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner; this process is also called spidering. Search engines use spidering as a means of providing up-to-date data. Most existing web crawlers archive the contents of the web starting from an input URL. Search engines index the results of web crawlers and then search that index when queried. As such, searching is not performed while crawling, so such software cannot be put to general use by web browsers. Also, the existing search mechanism in web browsers searches only the current page and not recursively through all the links present on that page. To overcome these disadvantages, we propose in this paper a web crawler that searches for a pattern efficiently and recursively through all the links, including PDF links, while crawling. CRAYSE can be used as general-purpose open source software by web browsers. It can also be used for offline searching. Further, applications that require selective archival of web pages (based on the presence of a keyword) can deploy CRAYSE for efficient search operations. This paper focuses on the design and implementation of CRAYSE and its demonstration through web applications.
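
The abstract does not give implementation details, but the author tags name KMP and prefix computation, so the search-while-crawl idea can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the CRAYSE code: the function names and the `PAGES` dictionary are hypothetical, the "web" is an in-memory map from a URL to its text and outgoing links, and no HTTP fetching or PDF extraction is attempted.

```python
from collections import deque


def prefix_table(pattern: str) -> list:
    """Return the KMP prefix (failure) table for `pattern`."""
    table = [0] * len(pattern)
    k = 0  # length of the longest proper prefix that is also a suffix so far
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first match of `pattern` in `text`, or -1."""
    if not pattern:
        return 0
    table = prefix_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


def search_while_crawl(pages: dict, start: str, pattern: str, max_pages: int = 100) -> list:
    """Breadth-first crawl from `start`; return URLs whose text contains `pattern`.

    `pages` is a hypothetical stand-in for the web: URL -> (text, outgoing links).
    Each page is searched as soon as it is visited, rather than indexed for later.
    """
    hits = []
    seen = {start}
    queue = deque([start])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        text, links = pages.get(url, ("", []))
        if kmp_find(text, pattern) != -1:
            hits.append(url)
        for link in links:
            if link not in seen:  # avoid revisiting pages and looping on cycles
                seen.add(link)
                queue.append(link)
    return hits


# Assumed toy data for illustration only.
PAGES = {
    "a.html": ("welcome to the index", ["b.html", "c.html"]),
    "b.html": ("notes on pattern matching", ["a.html", "d.html"]),
    "c.html": ("nothing relevant here", []),
    "d.html": ("a pattern appears again", []),
}

print(search_while_crawl(PAGES, "a.html", "pattern"))
```

In this sketch the crawl order is breadth-first and the `seen` set guards against link cycles; whether CRAYSE uses the same traversal order is not stated in the abstract.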

Published In

ACM SIGSOFT Software Engineering Notes, Volume 35, Issue 4
July 2010
102 pages
ISSN:0163-5948
DOI:10.1145/1811226

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. KMP
  2. prefix computation
  3. search while crawl
  4. web crawler

Qualifiers

  • Research-article
