Abstract
Currently, most back-end web databases cannot be indexed by traditional hyperlink-based search engines, because their content is reachable only through interactive queries submitted via page forms. To make hidden-Web information more easily accessible, this paper proposes a hierarchical classifier that locates domain-specific hidden-Web entry points at large scale. The classifier is trained on appropriately selected page form features to filter out non-relevant domains and non-searchable forms. Experiments on eight different topics demonstrate that the technique discovers deep web interfaces accurately and efficiently.
L. Wang: Supported in part by the National Science Foundation under grants 61472382, 61272472, and 61232018.
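The two filtering steps named in the abstract (discarding non-searchable forms and discarding non-relevant domains) can be pictured as a small cascade. The sketch below is a minimal Python illustration under stated assumptions, not the authors' classifier: the feature set (counts of text, password, and select fields plus the words inside the form), the order of the two stages, and the hand-written rules that stand in for trained models are all hypothetical, as are the names `FormFeatureParser`, `is_searchable`, `is_in_domain`, `find_entry_forms`, `PAGE`, and `BOOK_TERMS`.

```python
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collects simple per-form counts; the feature set is illustrative only."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one feature dict per <form> element
        self._current = None   # features of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"text": 0, "password": 0, "select": 0, "words": []}
            self.forms.append(self._current)
        elif self._current is not None:
            if tag == "input":
                kind = (attrs.get("type") or "text").lower()
                if kind in ("text", "search"):
                    self._current["text"] += 1
                elif kind == "password":
                    self._current["password"] += 1
            elif tag == "select":
                self._current["select"] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None

    def handle_data(self, data):
        # Keep the visible words inside a form (labels, option text).
        if self._current is not None:
            self._current["words"].extend(data.lower().split())


def is_searchable(form):
    # Stage 1 (hypothetical rule, standing in for a trained model):
    # reject forms with a password field, require a text box or drop-down.
    return form["password"] == 0 and (form["text"] + form["select"]) > 0


def is_in_domain(form, domain_terms, min_hits=1):
    # Stage 2 (hypothetical rule): count form words that match domain terms.
    hits = sum(1 for w in form["words"] if w.strip(":,.") in domain_terms)
    return hits >= min_hits


def find_entry_forms(html, domain_terms):
    """Return feature dicts of forms that pass both stages."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    return [f for f in parser.forms
            if is_searchable(f) and is_in_domain(f, domain_terms)]


# Made-up page with a login form, a book search form, and a newsletter form.
PAGE = """
<form action="/login"><input type="text" name="user"><input type="password" name="pw">Login</form>
<form action="/search">Title: <input type="text" name="title">
Author: <input type="text" name="author">
<select name="format"><option>Hardcover</option></select></form>
<form action="/subscribe">Email <input type="text" name="email"></form>
"""
BOOK_TERMS = {"title", "author", "isbn", "publisher"}

entry_forms = find_entry_forms(PAGE, BOOK_TERMS)
```

On this made-up page, the login form fails the first stage because of its password field, and the newsletter form passes the first stage but contains no book-related words, so only the book search form remains. In a trained setting each rule would be replaced by a model fitted on labeled forms.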
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, L., Hawbani, A., Wang, X. (2015). Focused Deep Web Entrance Crawling by Form Feature Classification. In: Wang, Y., Xiong, H., Argamon, S., Li, X., Li, J. (eds) Big Data Computing and Communications. BigCom 2015. Lecture Notes in Computer Science, vol 9196. Springer, Cham. https://doi.org/10.1007/978-3-319-22047-5_7
DOI: https://doi.org/10.1007/978-3-319-22047-5_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22046-8
Online ISBN: 978-3-319-22047-5
eBook Packages: Computer Science, Computer Science (R0)