Towards Deeper Understanding of the Search Interfaces of the Deep Web

World Wide Web

Abstract

Many databases have become Web-accessible through form-based search interfaces (i.e., HTML forms) that allow users to specify complex and precise queries against the underlying databases. In general, such a Web search interface can be regarded as containing an interface schema with multiple attributes and rich semantic/meta-information; however, the schema is not formally defined in HTML. Many Web applications, such as Web database integration and deep Web crawling, require these schemas to be constructed. In this paper, we first propose a schema model for representing complex search interfaces, and then present a layout-expression-based approach to automatically extract the logical attributes from search interfaces. We also recast the identification of different types of semantic information as a classification problem, and design several Bayesian classifiers to help derive semantic information from the extracted attributes. A system, WISE-iExtractor, has been implemented to automatically construct the schema from any Web search interface. Our experimental results on real search interfaces indicate that this system is highly effective.
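To make the notion of an interface schema concrete, the following is a minimal sketch of pulling attributes out of an HTML form. It is not the layout-expression approach or WISE-iExtractor described above: it assumes a much simpler heuristic (the visible text immediately preceding a form element is taken as its label), and the class name `FormSchemaSketch`, the dictionary keys, and the sample form are all hypothetical.

```python
from html.parser import HTMLParser

# Input types that carry no queryable attribute in this simplified sketch.
SKIPPED_INPUT_TYPES = {"hidden", "submit", "reset", "button", "image"}


class FormSchemaSketch(HTMLParser):
    """Collects one schema entry per form element, labelled by the preceding text.

    Assumption: a label is the last visible text seen before the element.
    Real interfaces often violate this, which is why the paper uses layout.
    """

    def __init__(self):
        super().__init__()
        self.attributes = []          # extracted schema entries
        self._pending_text = ""       # most recent visible text (label guess)
        self._current_select = None   # entry of the <select> being parsed
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            kind = attrs.get("type", "text").lower()
            if kind not in SKIPPED_INPUT_TYPES:
                self._add(attrs.get("name"), kind)
        elif tag == "select":
            self._current_select = self._add(attrs.get("name"), "select")
        elif tag == "option" and self._current_select is not None:
            self._in_option = True

    def handle_endtag(self, tag):
        if tag == "select":
            self._current_select = None
            self._in_option = False
        elif tag == "option":
            self._in_option = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_option and self._current_select is not None:
            # Option text becomes a predefined value of the attribute.
            self._current_select["options"].append(text)
        else:
            self._pending_text = text

    def _add(self, name, kind):
        label = self._pending_text.rstrip(":").strip()
        entry = {"label": label, "name": name, "type": kind, "options": []}
        self.attributes.append(entry)
        self._pending_text = ""
        return entry


# Hypothetical book-search form used only for illustration.
html = """
<form action="/search">
  Title: <input type="text" name="ti">
  Format: <select name="fmt">
    <option>Hardcover</option>
    <option>Paperback</option>
  </select>
  <input type="submit" value="Search">
</form>
"""

parser = FormSchemaSketch()
parser.feed(html)
for attribute in parser.attributes:
    print(attribute)
```

Even on this small form the sketch recovers a label, a field name, an element type, and a value list for each attribute, which is the kind of information an interface schema would record. The preceding-text heuristic breaks down quickly on real pages (labels to the right of or above a field, several elements sharing one label), which is the gap a layout-based extraction method is meant to close.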

Author information

Correspondence to Weiyi Meng.

About this article

Cite this article

He, H., Meng, W., Lu, Y. et al. Towards Deeper Understanding of the Search Interfaces of the Deep Web. World Wide Web 10, 133–155 (2007). https://doi.org/10.1007/s11280-006-0010-9

  • DOI: https://doi.org/10.1007/s11280-006-0010-9
