Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules

Jou, Chichang

doi:10.1007/s10796-018-9863-6

Published: 07 June 2018

Volume 21, pages 163–174, (2019)

Information Systems Frontiers

Chichang Jou ORCID: orcid.org/0000-0002-5698-5350¹

318 Accesses
7 Citations

A Correction to this article was published on 25 June 2019

This article has been updated

Abstract

As the World Wide Web has grown in popularity, the volume of data held in web databases has increased tremendously. This deep web content, hidden behind query interfaces, is of much higher quality than content on the surface web. Internet users need to fill in query conditions in the HTML query interface and click the submit button to obtain deep web data. Many applications that rely on deep web content, such as named entity attribute collection, topic-focused crawling, and heterogeneous data integration, depend on understanding the schema of these query interfaces. The schema needs to cover the mappings between input elements and labels, the data types of valid input values, and the range constraints on those values. In addition, to extract the hidden data, the schema needs to include form submission information, such as cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, texts surrounding elements are collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse candidate labels. In HSE, elements, candidate labels, and new lines in the query interface are streamlined to produce its Interface Expression (IEXP). By combining the user's view and the designer's view, with the aid of semantic information, we build heuristic rules to extract the schema from the IEXP of query interfaces in the ICQ dataset. These rules are constructed by using (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships of labels and elements. Supplemented with form submission information, the extracted schemas are then stored in XML format, so that they can be used in further applications, such as schema matching and merging for federated query interface integration. The experimental results on the TEL-8 dataset illustrate that HSE performs effectively.
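To make the idea of an interface expression more concrete, the following is a minimal sketch of how a query form might be flattened into a sequence of text, element, and line-break tokens. It is not the HSE implementation: the token letters (`t`, `e`, `|`), the class name `InterfaceLinearizer`, the choice of break tags, and the sample form are all assumptions made for illustration, and the article's actual IEXP notation may differ.

```python
from html.parser import HTMLParser


class InterfaceLinearizer(HTMLParser):
    """Flatten an HTML form into tokens (hypothetical notation, not the paper's):
    ('t', text) for visible text, ('e', name) for an input element,
    ('|', '') for a line break."""

    INPUT_TAGS = {"input", "select", "textarea"}
    BREAK_TAGS = {"br", "tr", "p", "div"}

    def __init__(self):
        super().__init__()
        self.tokens = []
        # Greater than 0 while inside <select>/<textarea>; their inner text
        # (option values, default text) is not treated as a candidate label.
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.INPUT_TAGS:
            # Hidden and submit inputs carry no user-visible query condition.
            if tag == "input" and attrs.get("type", "text") in ("hidden", "submit"):
                return
            self.tokens.append(("e", attrs.get("name", "")))
            if tag in ("select", "textarea"):
                self._skip_depth += 1
        elif tag in self.BREAK_TAGS:
            self._add_break()

    def handle_endtag(self, tag):
        if tag in ("select", "textarea") and self._skip_depth:
            self._skip_depth -= 1
        elif tag in ("tr", "p", "div"):
            self._add_break()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self._skip_depth:
            self.tokens.append(("t", text))

    def _add_break(self):
        # Avoid leading or consecutive break tokens.
        if self.tokens and self.tokens[-1] != ("|", ""):
            self.tokens.append(("|", ""))


def linearize(html):
    """Return the token list for an HTML fragment."""
    parser = InterfaceLinearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def expression(tokens):
    """Compact string form of the token kinds, e.g. 'te|tete|e'."""
    return "".join(kind for kind, _ in tokens)


# Hypothetical sample form, invented for this sketch.
SAMPLE_FORM = """
<form action="/search" method="get">
Title: <input type="text" name="title"><br>
Price from <input name="lo"> to <input name="hi"><br>
<select name="fmt"><option>Any</option><option>Hardcover</option></select>
<input type="submit" value="Go">
</form>
"""

if __name__ == "__main__":
    toks = linearize(SAMPLE_FORM)
    print(expression(toks))
    for tok in toks:
        print(tok)
```

In a sequence like this, heuristic rules of the kind the abstract describes could then be stated over token patterns, for example treating a text token immediately before an element on the same line as its candidate label, or treating "text, element, text, element" on one line as a possible range pair. Those two rules are illustrative guesses, not rules taken from the article.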

Change history

25 June 2019
The original copy of this article included incorrect data for “authors and affiliations”.

Acknowledgements

The authors would like to thank the reviewers for their thoughtful comments, which greatly helped improve our work. We would also like to thank the Ministry of Science and Technology, Taiwan (R.O.C.) for financially supporting this research under Grant MOST 105-2221-E-032-062. Our special thanks go to Yucheng Cheng, Tzu-Chun Hsiao, and Shang Huang for participating in the design and implementation of the HSE system.

Author information

Authors and Affiliations

Department of Information Management, Tamkang University, 151 Ying-zhuan Road, Tamsui, Taiwan, 25137, People’s Republic of China
Chichang Jou

Corresponding author

Correspondence to Chichang Jou.

About this article

Cite this article

Jou, C. Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules. Inf Syst Front 21, 163–174 (2019). https://doi.org/10.1007/s10796-018-9863-6

Issue Date: 15 February 2019
DOI: https://doi.org/10.1007/s10796-018-9863-6
