Abstract
An increasing amount of Web data is accessible only by filling out HTML forms to query an underlying data source. While this is most welcome from a user perspective (queries are easy and precise) and from a data management perspective (static pages need not be maintained; databases can be accessed directly), automated agents have greater difficulty accessing data behind forms. In this paper, we present a method for automatically filling in forms to retrieve the associated dynamically generated pages. Using our approach, automated agents can begin to systematically access portions of the “hidden Web.”
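As a rough illustration of what "automatically filling in a form" can involve, the following minimal sketch parses a form and builds the request that would submit its default field values. It is not the method of the paper, whose details the abstract does not give; the class `FormParser`, the function `build_default_query`, the sample form, and the `example.com` URL are all hypothetical, and the choice to submit defaults is an assumption made only for this example. No network request is sent.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collects the first form's action, method, and default field values."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = {}        # field name -> default value
        self._in_form = False
        self._done = False      # True once the first form has been closed
        self._select = None     # name of the <select> being parsed, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = a.get("action") or ""
            self.method = (a.get("method") or "get").lower()
        elif not self._in_form:
            return
        elif tag == "input" and a.get("name"):
            kind = (a.get("type") or "text").lower()
            if kind in ("text", "hidden"):
                self.fields[a["name"]] = a.get("value") or ""
            elif kind in ("checkbox", "radio") and "checked" in a:
                self.fields[a["name"]] = a.get("value") or "on"
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
        elif tag == "option" and self._select:
            # Keep the first option unless a later one is marked selected.
            if self._select not in self.fields or "selected" in a:
                self.fields[self._select] = a.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_default_query(page_url, html):
    """Return (method, url, body) for submitting the form's default values."""
    parser = FormParser()
    parser.feed(html)
    target = urljoin(page_url, parser.action)
    data = urlencode(parser.fields)
    if parser.method == "get":
        return parser.method, f"{target}?{data}", None
    return parser.method, target, data


# Hypothetical form, used only to exercise the sketch.
SAMPLE = """
<form action="/search" method="get">
  <input type="text" name="make" value="">
  <select name="year">
    <option value="">Any</option>
    <option value="2001" selected>2001</option>
  </select>
  <input type="hidden" name="page" value="1">
  <input type="submit" value="Go">
</form>
"""

method, url, body = build_default_query("http://example.com/cars/", SAMPLE)
print(method, url, body)
```

An agent built along these lines would still have to choose non-default values for text fields, handle forms that use POST or scripting, and interpret the pages that come back; the sketch covers none of that.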
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liddle, S.W., Yau, S.H., Embley, D.W. (2002). On the Automatic Extraction of Data from the Hidden Web. In: Arisawa, H., Kambayashi, Y., Kumar, V., Mayr, H.C., Hunt, I. (eds) Conceptual Modeling for New Information Systems Technologies. ER 2001. Lecture Notes in Computer Science, vol 2465. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46140-X_17
DOI: https://doi.org/10.1007/3-540-46140-X_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44122-9
Online ISBN: 978-3-540-46140-1
eBook Packages: Springer Book Archive