Crawling ranked deep Web data sources

Wang, Yan; Lu, Jianguo; Chen, Jessica; Li, Yaxin

doi:10.1007/s11280-016-0410-4

Published: 03 September 2016

World Wide Web, Volume 20, pages 89–110 (2017)

Yan Wang ORCID: orcid.org/0000-0002-9876-5823¹,
Jianguo Lu²,
Jessica Chen² &
Yaxin Li¹

Abstract

In the era of big data, the vast majority of data do not come from the surface Web, the part of the Web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, the trove of valuable data often resides in the deep Web, the part of the Web that is hidden behind query interfaces. Since numerous applications, such as data integration and vertical portals, require deep Web data, various crawling methods have been developed for exhaustively harvesting a deep Web data source at minimal (or near-minimal) cost. Most existing crawling methods assume that all the documents matched by a query are returned. In practice, data sources often return only the top k matches. This makes exhaustive data harvesting difficult: highly ranked documents are returned multiple times, while low-ranked documents have little chance of being returned. In this paper, we decompose this problem into two orthogonal sub-problems, the query bias problem and the ranking bias problem, and propose a document-frequency-based crawling method to overcome the ranking bias problem. The rationale of our method is to use queries whose document frequencies fall within a specified range, which avoids the combined effect of search ranking and the return limit and significantly reduces the difficulty of crawling a ranked data source. The method is extensively tested on a variety of datasets and compared with two existing methods. The experimental results demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
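The effect of a return limit can be illustrated with a toy simulation. The sketch below is not the authors' algorithm: the corpus, the function names, and the choice of the return limit k as the upper bound of the frequency range are all assumptions made for illustration. It also reads document frequencies directly from a local index, whereas a real crawler would have to estimate them, for example from a sample of the data source.

```python
from collections import defaultdict

# Hypothetical toy corpus; a lower document id stands in for a higher static rank.
DOCS = {
    1: "deep web crawling",
    2: "deep web ranking",
    3: "deep web query",
    4: "deep hidden sampling",
}
K = 2  # assumed return limit: the source returns at most K matches per query


def build_index(docs):
    """Map each term to the rank-ordered list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def top_k_search(index, term, k):
    """Toy ranked source: return only the k highest-ranked matches."""
    return index.get(term, [])[:k]


def select_queries(index, low, high):
    """Keep terms whose document frequency lies in [low, high]."""
    return sorted(t for t, ids in index.items() if low <= len(ids) <= high)


def crawl(index, queries, k):
    """Issue each query and collect the distinct documents returned."""
    harvested = set()
    for query in queries:
        harvested.update(top_k_search(index, query, k))
    return harvested


INDEX = build_index(DOCS)

# Popular terms overflow the limit and keep returning the same top-ranked documents.
popular = crawl(INDEX, ["deep", "web"], K)

# Terms with document frequency <= K are never truncated, so ranking has no effect.
in_range = crawl(INDEX, select_queries(INDEX, 1, K), K)

print(sorted(popular))   # [1, 2]
print(sorted(in_range))  # [1, 2, 3, 4]
```

In this toy setting the two popular queries reach only the two top-ranked documents, while the low-frequency queries reach all four. The abstract states only that the document frequencies must fall within a specified range; how that range is chosen and how frequencies are estimated are left to the full paper.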

Notes

In this paper, we use the words ‘term’ and ‘query’ interchangeably; the only difference is that a query is a term that has been issued.

Author information

Authors and Affiliations

School of Information, Central University of Finance and Economics, Beijing, China
Yan Wang & Yaxin Li
School of Computer Science, University of Windsor, Windsor, Canada
Jianguo Lu & Jessica Chen

Corresponding author

Correspondence to Yan Wang.

Additional information

This work has been partially supported by the National Key Research Program of China (2016YFB1001101), an NSERC Discovery grant (RGPIN-2014-04463), NSFC (No. 61440020, No. 61272398 and No. 61309030), and the Programs for Innovation Research and 121 Project in CUFE.

About this article

Cite this article

Wang, Y., Lu, J., Chen, J. et al. Crawling ranked deep Web data sources. World Wide Web 20, 89–110 (2017). https://doi.org/10.1007/s11280-016-0410-4

Received: 29 February 2016
Revised: 05 July 2016
Accepted: 18 August 2016
Published: 03 September 2016
Issue Date: January 2017
DOI: https://doi.org/10.1007/s11280-016-0410-4
