An effective approach to enhancing a focused crawler using Google

Lee, Jae-Gil; Bae, Donghwan; Kim, Sansung; Kim, Jungeun; Yi, Mun Yong

doi:10.1007/s11227-019-02787-9

Published: 20 February 2019

Volume 76, pages 8175–8192 (2020)

The Journal of Supercomputing

Abstract

In this paper, we share our experience in augmenting the focused crawler of our vertical search engine, which is designed to work with academic slides. The goal of the focused crawler was to collect Microsoft PowerPoint files from academic institutions. A previous approach based on a general web crawler can fail to collect a sufficient number of files, mainly because of the robots exclusion protocol and missing hyperlinks. As a remedy to these problems, we propose a combinatory approach in which the indexing information maintained by a general web search engine such as Google is used to generate the target URL list through our query generator, which is then complemented by our URL extractor and file downloader. Because Google has already crawled billions of web pages, it is more cost-efficient and potentially more effective to systematically retrieve the desired information from Google than to redo the crawling from scratch ourselves. Our focused crawler, which we call SlideCrawler, has been used for our vertical search engine CourseShare since the fall of 2011. The capability of SlideCrawler was verified on the top 500 universities worldwide, from which it collected about one million files. Further, the study results show that SlideCrawler outperforms Nutch, collecting 3.7 times more slide files.
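
The first two stages of the pipeline described above (query generation and URL extraction) might look roughly like the following. This is a minimal sketch and not the authors' implementation: the abstract does not give the query format, so the use of Google's `site:` and `filetype:` operators, the function names `build_queries` and `extract_slide_urls`, and the sample domain are all assumptions made for illustration. Fetching result pages and downloading files are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed target file types; the abstract only says "Microsoft PowerPoint files".
SLIDE_EXTENSIONS = (".ppt", ".pptx")


def build_queries(domains, extensions=("ppt", "pptx")):
    """Build one search query string per (domain, file type) pair.

    The 'site:' and 'filetype:' operators are an assumed query format.
    """
    return [f"site:{d} filetype:{e}" for d in domains for e in extensions]


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_slide_urls(result_html, domain):
    """Return slide-file URLs on `domain` (or a subdomain), in page order."""
    parser = _LinkCollector()
    parser.feed(result_html)
    urls = []
    for href in parser.hrefs:
        parts = urlparse(href)
        host = parts.hostname or ""
        on_domain = host == domain or host.endswith("." + domain)
        is_slide = parts.path.lower().endswith(SLIDE_EXTENSIONS)
        if on_domain and is_slide and href not in urls:
            urls.append(href)
    return urls
```

Under these assumptions, `build_queries(["example.edu"])` yields one query per file type, and `extract_slide_urls` keeps only links that point at slide files hosted on the queried institution, which is the list a downloader would then consume.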

Notes

Apache Nutch is an open source web-search software project, and its project homepage is http://nutch.apache.org/. Its crawler has been written from scratch specifically for this project.
http://www.gnu.org/software/wget/.
We understand that the same slide file can be located at many different URLs, and this type of duplication will have to be removed during indexing time after crawling.
http://www.topuniversities.com/university-rankings/world-university-rankings/.
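
The note above on duplicate slide files does not say how duplicates are detected at indexing time. One simple possibility, shown here purely as an assumption, is to group downloaded files by a hash of their bytes; `content_digest` and `group_by_content` are hypothetical names. This catches only byte-identical copies, not slides that were re-saved or lightly edited.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the file content."""
    return hashlib.sha256(data).hexdigest()


def group_by_content(files):
    """Group URLs whose downloaded bytes are identical.

    `files` is an iterable of (url, data) pairs; the result maps each
    content digest to the list of URLs that served that content.
    """
    groups = {}
    for url, data in files:
        groups.setdefault(content_digest(data), []).append(url)
    return groups
```

An indexer could then keep one representative URL per digest and record the others as alternate locations.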

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (No. 2017R1E1A1A01075927).

Author information

Authors and Affiliations

Graduate School of Knowledge Service Engineering, KAIST, Daejeon, Republic of Korea
Jae-Gil Lee, Donghwan Bae, Sansung Kim, Jungeun Kim & Mun Yong Yi

Corresponding author

Correspondence to Jae-Gil Lee.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lee, JG., Bae, D., Kim, S. et al. An effective approach to enhancing a focused crawler using Google. J Supercomput 76, 8175–8192 (2020). https://doi.org/10.1007/s11227-019-02787-9

Issue Date: October 2020