Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1267))

Abstract

In a broad set of applications, from protecting people from harmful content to segmenting online customers, we need to know the kind of content hosted by a web domain. But there are nearly 2 billion unique hostnames today, and curated domain label lists cover at best a few million domains. We bridge the gap by exploiting labeled data from multiple large, curated lists (Shallalist, PhishTank, Malware Domains, and Squidguard) to build models that predict the kind of content hosted by a domain from the sequence of characters in the domain name. Because identifying domains that carry harmful material or adult content is particularly important, we focus primarily on those categories. The models do very well at predicting domains that host pornographic content, with F1 scores of about 0.9 or higher. We are less successful at predicting domains that carry harmful content: the F1 scores of two of our best models are around 0.8. To illustrate the utility of our models, we use them to answer two questions: (1) Do poor people, racial or ethnic minorities, and the less well-educated visit malware sites more often than their respective complementary groups? (2) Does the consumption of pornography vary by age and education?
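As a rough illustration of what "using the sequence of characters in the domain name" can mean in practice, the sketch below turns a domain name into a fixed-length sequence of integer character ids, the kind of input a sequence model would consume, and computes an F1 score for binary predictions. The vocabulary, the `max_len` default, and the function names `encode_domain` and `f1_score` are assumptions made for this example; they are not taken from the paper or from the pydomains package.

```python
import string

# Hypothetical vocabulary: lowercase letters, digits, hyphen, and dot.
# Id 0 is reserved for padding and id 1 for any character outside the vocabulary.
VOCAB = string.ascii_lowercase + string.digits + "-."
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(VOCAB)}
PAD_ID, UNK_ID = 0, 1


def encode_domain(domain, max_len=32):
    """Map a domain name to a fixed-length list of integer character ids."""
    # Lowercase, truncate to max_len, and look up each character.
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in domain.lower()[:max_len]]
    # Right-pad short names so every sequence has the same length.
    return ids + [PAD_ID] * (max_len - len(ids))


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Calling `encode_domain("example.com")` yields a list of 32 ids that could be fed to an embedding layer, and `f1_score` can be applied to the true and predicted labels of any binary category, such as adult versus non-adult content.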

Replication materials and Supplementary Information are posted at https://github.com/themains/pydomains and https://github.com/themains/domain_knowledge. The Python package that implements the method discussed in the paper is available at https://github.com/themains/pydomains.

Author information

Corresponding authors

Correspondence to Suriyan Laohaprapanon or Gaurav Sood.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Laohaprapanon, S., Sood, G. (2021). Domain Knowledge: Predicting the Kind of Content Hosted by a Domain. In: Herrero, Á., Cambra, C., Urda, D., Sedano, J., Quintián, H., Corchado, E. (eds) 13th International Conference on Computational Intelligence in Security for Information Systems (CISIS 2020). CISIS 2019. Advances in Intelligent Systems and Computing, vol 1267. Springer, Cham. https://doi.org/10.1007/978-3-030-57805-3_15
