DOI: 10.1145/2187836.2187948
research-article

OPAL: automated form understanding for the deep web

Published: 16 April 2012

ABSTRACT

Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving the usability and accessibility of the web. Form understanding has received surprisingly little attention other than as a component in specific applications such as crawlers. No comprehensive approach to form understanding exists, and previous works disagree even on the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art. For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms, OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. Combined, these two parts of OPAL yield form understanding with near-perfect accuracy (> 98%).
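
To make the form labeling task concrete, the following is a minimal sketch that assigns a label to each form field. It is not OPAL's algorithm, which the abstract does not detail. It assumes only two simple signals: a structural one (`<label for="...">`) and a textual one (the closest text preceding the field in document order). The visual signal mentioned above is omitted because it would require a rendering engine. The names `FormLabeler` and `label_form` are hypothetical.

```python
from html.parser import HTMLParser

FIELD_TAGS = {"input", "select", "textarea"}


class FormLabeler(HTMLParser):
    """Illustrative labeler: explicit <label for> first, preceding text second."""

    def __init__(self):
        super().__init__()
        self.fields = []        # field keys (id or name) in document order
        self.explicit = {}      # field id -> text of its <label for=id>
        self.preceding = {}     # field key -> text seen just before the field
        self._label_for = None  # target id of the currently open <label>
        self._in_select = False # option text is a value, not a label
        self._last_text = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._label_for = attrs.get("for")
        if tag in FIELD_TAGS and attrs.get("type") not in ("hidden", "submit"):
            key = attrs.get("id") or attrs.get("name")
            if key:
                self.fields.append(key)
                self.preceding[key] = self._last_text
                self._last_text = ""  # each text run labels at most one field
        if tag == "select":
            self._in_select = True

    def handle_endtag(self, tag):
        if tag == "label":
            self._label_for = None
        elif tag == "select":
            self._in_select = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._in_select:
            return
        self._last_text = text
        if self._label_for:
            self.explicit[self._label_for] = text

    def labels(self):
        # The structural signal wins; textual proximity is the fallback.
        return {f: self.explicit.get(f) or self.preceding.get(f, "")
                for f in self.fields}


def label_form(html):
    parser = FormLabeler()
    parser.feed(html)
    parser.close()
    return parser.labels()


if __name__ == "__main__":
    page = """
    <form>
      <label for="from">Departure city</label> <input id="from" type="text">
      Max price: <input name="price" type="text">
      <input type="submit" value="Search">
    </form>
    """
    print(label_form(page))
    # {'from': 'Departure city', 'price': 'Max price:'}
```

A heuristic this simple fails on forms that place labels after or above their fields in a table layout, which is the kind of case where combining several signals would be expected to help.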


Published in

WWW '12: Proceedings of the 21st international conference on World Wide Web
April 2012
1078 pages
ISBN: 9781450312295
DOI: 10.1145/2187836

Copyright © 2012 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 1,899 of 8,196 submissions, 23%
