MalCrawler: A Crawler for Seeking and Crawling Malicious Websites

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10109)

Abstract

Over the years, the internet has become the major source of security threats to computer systems. With the number of people browsing the internet increasing exponentially in the last couple of years, browser-based attacks have become the preferred means of infecting a computer system. These browser-based attacks, known as "drive-by download" attacks, inject malicious JavaScript into the browser from the server hosting the malicious web application. Since the number of malicious websites launching such attacks has increased in the past few years, it has become critical to detect them. Typically, the search for malicious web pages involves three steps: crawling URLs on the internet, using fast analysis filters to reject benign pages, and then running complex but slow detailed analysis (using honeyclients) on the filtered list. While effective, these techniques consume substantial time and computing resources. This limitation can be overcome by designing a crawler that seeks out more malicious sites than benign sites, thus increasing the "toxicity" of the URLs collected in the first step. In this paper, we propose a focused web crawler, named "MalCrawler", which has been designed to crawl and search for malicious websites efficiently. Compared to a generic crawler, this crawler not only seeks out more malicious sites than benign sites, but also handles cloaking, entanglement and AJAX content in malicious sites. MalCrawler, which was designed, developed and tested as part of this work, proved to be more efficient than generic crawlers.
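The idea of a focused crawler that raises the toxicity of the collected URLs can be illustrated with a minimal Python sketch. This is not MalCrawler's implementation, which the abstract does not describe: it assumes a best-first frontier ordered by a hypothetical maliciousness score, and it assumes that "toxicity" means the fraction of collected URLs that are malicious. The names `focused_crawl`, `toxicity`, `TOY_WEB` and `toy_score` are invented for illustration, and the toy link graph stands in for real page fetching.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Set, Tuple


def focused_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],
    score: Callable[[str], float],
    budget: int,
) -> List[str]:
    """Visit up to `budget` URLs, always expanding the highest-scoring one next."""
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    # heapq is a min-heap, so scores are negated to pop the largest score first.
    frontier: List[Tuple[float, str]] = [(-score(u), u) for u in unique_seeds]
    heapq.heapify(frontier)
    seen: Set[str] = set(unique_seeds)
    visited: List[str] = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


def toxicity(urls: List[str], is_malicious: Callable[[str], bool]) -> float:
    """Fraction of collected URLs flagged as malicious (assumed definition)."""
    if not urls:
        return 0.0
    return sum(1 for u in urls if is_malicious(u)) / len(urls)


# Hypothetical toy link graph; a real crawler would fetch and parse pages.
TOY_WEB: Dict[str, List[str]] = {
    "seed": ["good1", "bad1"],
    "bad1": ["bad2", "good2"],
    "good1": ["good3"],
    "bad2": [],
    "good2": [],
    "good3": [],
}


def toy_score(url: str) -> float:
    """Stand-in scorer: pretend URLs starting with 'bad' look suspicious."""
    return 1.0 if url.startswith("bad") else 0.1


visited = focused_crawl(["seed"], lambda u: TOY_WEB.get(u, []), toy_score, budget=3)
collected_toxicity = toxicity(visited, lambda u: u.startswith("bad"))
```

With a budget of three pages, the best-first frontier in this toy graph reaches both suspicious URLs before any benign one, which is the effect the abstract attributes to a focused crawler. A generic breadth-first crawler given the same budget would spend a visit on a benign page instead.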


Notes

  1. Seed refers to the initial set of URLs from which the crawl starts.



Corresponding author

Correspondence to A. K. Singh.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Singh, A.K., Goyal, N. (2017). MalCrawler: A Crawler for Seeking and Crawling Malicious Websites. In: Krishnan, P., Radha Krishna, P., Parida, L. (eds) Distributed Computing and Internet Technology. ICDCIT 2017. Lecture Notes in Computer Science, vol 10109. Springer, Cham. https://doi.org/10.1007/978-3-319-50472-8_17

  • DOI: https://doi.org/10.1007/978-3-319-50472-8_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-50471-1

  • Online ISBN: 978-3-319-50472-8

  • eBook Packages: Computer Science, Computer Science (R0)
