Abstract
Many databases have become Web-accessible through form-based search interfaces (i.e., HTML forms) that allow users to specify complex and precise queries to access the underlying databases. In general, such a Web search interface can be viewed as containing an interface schema with multiple attributes and rich semantic/meta-information; however, the schema is not formally defined in HTML. Many Web applications, such as Web database integration and deep Web crawling, require these schemas to be constructed. In this paper, we first propose a schema model for representing complex search interfaces, and then present a layout-expression-based approach to automatically extract the logical attributes from search interfaces. We also recast the identification of different types of semantic information as a classification problem, and design several Bayesian classifiers to help derive semantic information from the extracted attributes. A system, WISE-iExtractor, has been implemented to automatically construct the schema from any Web search interface. Our experimental results on real search interfaces indicate that this system is highly effective.
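To make the classification framing concrete, the following is a minimal sketch of a naive Bayes classifier that guesses a value type for an extracted attribute from its label words and its form element type. It is an illustration only and not the classifiers used in WISE-iExtractor: the feature function, the class names ("date", "currency", "char"), and the training examples are all hypothetical assumptions introduced here.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over feature tokens, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # class -> token -> count
        self.vocabulary = set()

    def fit(self, samples):
        # samples: iterable of (list_of_tokens, class_label) pairs
        for tokens, label in samples:
            self.class_counts[label] += 1
            for token in tokens:
                self.feature_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            for token in tokens:
                score += math.log((self.feature_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def features(label, element_type):
    """Hypothetical features: lowercased label words plus the element type."""
    return [word.lower() for word in label.split()] + ["elem=" + element_type]


# Hypothetical training data: (attribute label, form element type) -> value type
training = [
    (features("Departure date", "select"), "date"),
    (features("Return date", "select"), "date"),
    (features("Publication date", "text"), "date"),
    (features("Max price", "text"), "currency"),
    (features("Price range", "select"), "currency"),
    (features("Title", "text"), "char"),
    (features("Author name", "text"), "char"),
]

clf = NaiveBayes().fit(training)
print(clf.predict(features("Arrival date", "select")))
print(clf.predict(features("Lowest price", "text")))
```

With this toy training set, an unseen label such as "Arrival date" is assigned to the class whose known tokens it shares. A real system would need far more training data and richer features than this sketch assumes.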
Cite this article
He, H., Meng, W., Lu, Y. et al. Towards Deeper Understanding of the Search Interfaces of the Deep Web. World Wide Web 10, 133–155 (2007). https://doi.org/10.1007/s11280-006-0010-9
DOI: https://doi.org/10.1007/s11280-006-0010-9