Abstract
In the era of big data, the vast majority of data do not come from the surface Web, the Web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, much of the valuable data resides in the deep Web, the Web that is hidden behind query interfaces. Since numerous applications, such as data integration and vertical portals, require deep Web data, various crawling methods have been developed for exhaustively harvesting a deep Web data source at minimal (or near-minimal) cost. Most existing crawling methods assume that all the documents matched by a query are returned. In practice, data sources often return only the top k matches. This makes exhaustive data harvesting difficult: highly ranked documents are returned multiple times, while low-ranked documents have little chance of being returned. In this paper, we decompose this problem into two orthogonal sub-problems, namely the query bias and ranking bias problems, and propose a document frequency based crawling method to overcome the ranking bias problem. The rationale of our method is to use queries whose document frequencies fall within a specified range, which avoids the combined effect of search ranking and the return limit and significantly reduces the difficulty of crawling a ranked data source. The method is extensively tested on a variety of datasets and compared with two existing methods. The experimental results demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
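The idea of restricting queries to a document frequency range can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name `select_queries`, the parameters `sample_ratio` and `low_fraction`, the use of the return limit k as the upper bound, and the scaling of sample frequencies to estimate frequencies in the full source are all assumptions made for illustration.

```python
from collections import Counter


def select_queries(sample_docs, sample_ratio, k, low_fraction=0.1):
    """Return terms whose estimated document frequency lies in [low_fraction * k, k].

    Assumption: a term matching at most k documents is returned in full by a
    top-k source, so ranking cannot hide any of its matches.
    """
    # Count, for each term, how many sample documents contain it.
    df = Counter()
    for doc in sample_docs:
        df.update(set(doc.lower().split()))

    low = low_fraction * k
    selected = []
    for term, sample_df in df.items():
        # Assumption: scale the sample frequency up to the full source.
        estimated_df = sample_df / sample_ratio
        if low <= estimated_df <= k:
            selected.append((term, estimated_df))

    # Prefer terms that retrieve more documents per query; break ties by name.
    selected.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in selected]


# Hypothetical sample covering half of a source that returns at most 4 matches.
sample = ["deep web crawling", "deep web data", "ranked data source", "web query"]
queries = select_queries(sample, sample_ratio=0.5, k=4, low_fraction=0.5)
```

In this toy sample the term "web" is estimated to match 6 documents, more than the return limit of 4, so it is excluded; terms such as "data" and "deep" sit at the upper edge of the range and are ranked first.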
Notes
In this paper, we use the words ‘term’ and ‘query’ interchangeably; the only difference is that a query is a term that has been issued.
Additional information
This work has been partially supported by the National Key Research Program of China (2016YFB1001101), an NSERC Discovery grant (RGPIN-2014-04463), NSFC (No. 61440020, No. 61272398 and No. 61309030), and the Programs for Innovation Research and 121 Project in CUFE.
About this article
Cite this article
Wang, Y., Lu, J., Chen, J. et al. Crawling ranked deep Web data sources. World Wide Web 20, 89–110 (2017). https://doi.org/10.1007/s11280-016-0410-4
DOI: https://doi.org/10.1007/s11280-016-0410-4