Abstract
A web crawler is an important tool for obtaining information from the Internet in a timely manner. In a typical web crawler system with limited bandwidth, many websites are crawled under different time constraints. Existing studies of web crawler systems do not consider bandwidth allocation in such a complex environment; hence, the time constraints may not be satisfied. In this study, we investigate bandwidth allocation approaches for such a web crawler system. The approaches are designed for two scenarios, i.e., when the number of websites exceeds or does not exceed the maximum number of web crawlers that the system can execute simultaneously. For the latter scenario, we propose approaches that control the bandwidth of the web crawlers to minimize either the maximum completion time or the sum of the execution times of all web crawlers, under both sufficient and insufficient bandwidth. For the former scenario, we propose a round-based reallocation approach that schedules both the execution sequence and the bandwidth allocation of the web crawlers. Extensive simulations are conducted to validate the proposed approaches, and the results show that our approaches satisfy the time constraints well and achieve desirable execution performance in various scenarios.
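To make the allocation problem concrete, the following is a minimal sketch under a simplified model and is not the method proposed in the paper. It assumes each crawler has a known workload (data to download), receives a constant share of a fixed total bandwidth, and finishes after workload divided by share. Under those assumptions, splitting the bandwidth in proportion to workload makes all crawlers finish at the same time, which minimizes the maximum completion time. The function names, the units, and the per-crawler deadline check are hypothetical and introduced only for illustration.

```python
from typing import List


def proportional_allocation(sizes: List[float], total_bandwidth: float) -> List[float]:
    """Split total_bandwidth in proportion to each crawler's workload.

    Assumed model (not from the paper): crawler i downloads sizes[i] units
    at a constant rate, so its completion time is sizes[i] / rate[i].
    """
    if total_bandwidth <= 0:
        raise ValueError("total_bandwidth must be positive")
    if any(s < 0 for s in sizes):
        raise ValueError("workloads must be non-negative")
    total = sum(sizes)
    if total == 0:
        return [0.0 for _ in sizes]
    return [total_bandwidth * s / total for s in sizes]


def min_required_bandwidth(sizes: List[float], deadlines: List[float]) -> float:
    """Smallest total bandwidth that lets every crawler meet its own deadline.

    Crawler i needs a rate of at least sizes[i] / deadlines[i]; summing these
    rates gives the threshold between sufficient and insufficient bandwidth
    in this simplified model.
    """
    if len(sizes) != len(deadlines):
        raise ValueError("sizes and deadlines must have the same length")
    if any(d <= 0 for d in deadlines):
        raise ValueError("deadlines must be positive")
    return sum(s / d for s, d in zip(sizes, deadlines))


if __name__ == "__main__":
    sizes = [100.0, 300.0, 600.0]      # hypothetical workloads, e.g. in MB
    deadlines = [50.0, 100.0, 200.0]   # hypothetical time constraints, in s
    rates = proportional_allocation(sizes, total_bandwidth=10.0)
    print("rates:", rates)
    print("completion times:", [s / r for s, r in zip(sizes, rates)])
    print("bandwidth needed for deadlines:", min_required_bandwidth(sizes, deadlines))
```

In this toy model, comparing the available bandwidth with the value returned by `min_required_bandwidth` distinguishes the sufficient case (every deadline can be met) from the insufficient case (some trade-off is unavoidable). A real system would also have to account for per-site rate limits and varying throughput.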
Acknowledgements
This research is supported in part by the National Key R&D Program of China (no. 2018YFC1604000).
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Cite this article
Zhu, W., Li, Y., Li, S. et al. Optimal bandwidth allocation for web crawler systems with time constraints. J Ambient Intell Human Comput 14, 5279–5292 (2023). https://doi.org/10.1007/s12652-020-02377-1
DOI: https://doi.org/10.1007/s12652-020-02377-1