Optimal bandwidth allocation for web crawler systems with time constraints

  • Original Research
  • Journal of Ambient Intelligence and Humanized Computing

Abstract

Web crawlers are an important tool for obtaining information from the Internet in a timely manner. In a typical web crawler system with limited bandwidth, many websites are crawled under different time constraints. Existing studies of web crawler systems do not consider bandwidth allocation in such a complex environment; hence, the time constraints may not be satisfied. In this study, we investigate bandwidth allocation approaches for such a system. The approaches are designed for two scenarios: the number of websites either exceeds or does not exceed the maximum number of web crawlers that the system can execute simultaneously. For the latter scenario, we propose approaches that control the bandwidth of the web crawlers to minimize either the maximum completion time or the sum of the execution times of all web crawlers, under assumptions of both sufficient and insufficient bandwidth. For the former scenario, we propose a round-based reallocation approach that schedules both the execution sequence and the bandwidth allocation of the web crawlers. Extensive simulations validate the proposed approaches; the results show that they satisfy the time constraints well and achieve desirable execution performance in various scenarios.
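To make the first objective concrete, the following is a minimal sketch of one way to minimize the maximum completion time when all crawlers run simultaneously. It is not the authors' algorithm: the abstract does not state the underlying model, so the sketch assumes a simple fluid model in which crawler i must download `sizes[i]` units of data, finishes after `sizes[i] / bandwidth[i]` time units, and must finish within `deadlines[i]`. The function name and all parameters are hypothetical, and sizes, deadlines, and total bandwidth are assumed to be positive.

```python
from typing import List, Optional


def allocate_bandwidth(sizes: List[float], deadlines: List[float],
                       total_bandwidth: float) -> Optional[List[float]]:
    """Split total_bandwidth so the latest finish time is as early as possible.

    Assumed fluid model (not taken from the paper): crawler i finishes at
    sizes[i] / bandwidth[i] and must finish no later than deadlines[i].
    Returns None when the deadlines cannot all be met.
    """
    # Smallest bandwidth each crawler needs to meet its own deadline.
    need = [s / d for s, d in zip(sizes, deadlines)]
    if sum(need) > total_bandwidth:
        return None  # insufficient bandwidth for the time constraints

    pinned = set()  # crawlers held at exactly their minimum bandwidth
    while True:
        free = [i for i in range(len(sizes)) if i not in pinned]
        spare = total_bandwidth - sum(need[i] for i in pinned)
        # Sharing the spare bandwidth in proportion to size makes every
        # free crawler finish at the same time, `finish`.
        finish = sum(sizes[i] for i in free) / spare
        # A crawler whose deadline is earlier than `finish` must be pinned.
        late = [i for i in free if deadlines[i] < finish]
        if not late:
            break
        pinned.update(late)

    return [need[i] if i in pinned else sizes[i] / finish
            for i in range(len(sizes))]


if __name__ == "__main__":
    # Three crawlers sharing 30 units of bandwidth; the first has a tight deadline.
    print(allocate_bandwidth([100, 200, 300], [10, 100, 100], 30))
```

In this sketch a crawler with a tight deadline receives exactly the bandwidth it needs, and the remainder is shared in proportion to data volume so that the other crawlers finish together. The case where more websites exist than crawlers can run at once is not covered here.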

Acknowledgements

This research is supported in part by National Key R&D Program of China no. 2018YFC1604000.

Author information

Corresponding author

Correspondence to Xiaohui Cui.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhu, W., Li, Y., Li, S. et al. Optimal bandwidth allocation for web crawler systems with time constraints. J Ambient Intell Human Comput 14, 5279–5292 (2023). https://doi.org/10.1007/s12652-020-02377-1

  • DOI: https://doi.org/10.1007/s12652-020-02377-1
