Abstract
A large amount of information on the web cannot be accessed by conventional crawlers. This portion of the web is usually known as the Hidden Web. Dealing with this problem requires solving two tasks: crawling the client-side hidden web and crawling the server-side hidden web. In this paper we present an architecture and a set of related techniques for accessing information in web pages that rely on client-side dynamism, dealing with aspects such as JavaScript, non-standard session maintenance mechanisms, client redirections, and pop-up menus. Our approach leverages current browser APIs and implements novel crawling models and algorithms.
This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Álvarez, M., Pan, A., Raposo, J., Hidalgo, J. (2006). Crawling Web Pages with Support for Client-Side Dynamism. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds) Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, vol 4016. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11775300_22
DOI: https://doi.org/10.1007/11775300_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35225-9
Online ISBN: 978-3-540-35226-6
eBook Packages: Computer Science (R0)