Abstract
Web Query Interfaces (WQIs) play a central role in retrieving Deep Web content. WQIs allow users to query domain-specific databases to obtain information of interest from diverse domains such as car rentals, hotels, and airfare. As the number of WQIs on the web grows rapidly, some research efforts focus on building a single (unified) WQI that allows users to query and integrate the information available in different web databases related to a specific domain. A key task in this integration process is the extraction, modeling, and understanding of the semantic content of WQIs. However, this task is challenging because of the great heterogeneity in the design of WQIs. This paper presents a novel tree-based approach for the modeling and understanding of WQIs. A tree schema called the Visual Reduced Tree (VR-Tree) is built from the tree produced by a web browser’s render engine by applying a set of well-defined functions, guided by a set of heuristic rules, to identify the WQI’s main components and their relationships. The proposed strategy was evaluated through a collection of experiments over the Tel-8 and ICQ datasets from the UIUC repository. The results show that WQIs can be modeled automatically with a high degree of precision relative to previous approaches, and that the VR-Tree schema proposed in this work simplifies the modeling task by considering only the visual and spatial properties of WQI components.
Cite this article
Marin-Castro, H.M., Sosa Sosa, V.J. VR-Tree: A novel tree-based approach for modeling Web Query Interfaces. J Intell Inf Syst 49, 367–390 (2017). https://doi.org/10.1007/s10844-017-0449-4
DOI: https://doi.org/10.1007/s10844-017-0449-4