Abstract
In a broad set of applications, from protecting people from harmful content to segmenting online customers, we need to know the kind of content hosted by a web domain. But there are nearly 2 billion unique hostnames today, and curated domain label lists cover at best a few million domains. We bridge the gap by exploiting labeled data from multiple large, curated lists (Shallalist, PhishTank, Malware Domains, and Squidguard) to build models that predict the kind of content hosted by a domain from the sequence of characters in the domain name. Because identifying domains that carry harmful material or adult content is particularly important, we focus primarily on those categories. The models do very well at predicting domains that host pornographic content, with F1 scores of about 0.9 or higher. We are less successful at predicting domains that carry harmful content: two of our best models have F1 scores of around 0.8. To illustrate the utility of our models, we use them to answer two questions: (1) Do poor people, racial or ethnic minorities, and the less well-educated visit malware sites more often than their respective complementary groups? (2) Does the consumption of pornography vary by age and education?
Replication materials and Supplementary Information are posted at https://github.com/themains/pydomains and https://github.com/themains/domain_knowledge. The Python package that implements the method discussed in the paper is available at https://github.com/themains/pydomains.
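As a rough illustration of what "using the sequence of characters in the domain name" can mean in practice, the sketch below maps a domain name to a fixed-length sequence of integer character IDs, the kind of input a character-level sequence model typically consumes. The vocabulary, the padding and unknown-character IDs, the maximum length, and the function name `encode_domain` are all assumptions made for this example; they are not taken from the paper or from the pydomains package.

```python
import string

# Assumed vocabulary (not from the paper): lowercase letters, digits, hyphen, dot.
VOCAB = string.ascii_lowercase + string.digits + "-."
PAD_ID = 0  # hypothetical ID used to pad short names
UNK_ID = 1  # hypothetical ID for characters outside the vocabulary
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(VOCAB)}
MAX_LEN = 32  # hypothetical maximum sequence length


def encode_domain(domain, max_len=MAX_LEN):
    """Return a list of max_len integer IDs for the characters of `domain`.

    The name is lowercased, truncated to max_len characters, and
    right-padded with PAD_ID so every output has the same length.
    """
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in domain.lower()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Example: an 11-character name encoded to length 12 (one padding slot).
encoded = encode_domain("example.com", max_len=12)
```

Sequences of this form could then be passed to an embedding layer followed by a recurrent or convolutional classifier; the choice of architecture is left open here.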
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: a system for large-scale machine learning. OSDI 16, 265–283 (2016)
Aldwairi, M., Alsalman, R.: Malurls: a lightweight malicious website classification based on url features. J. Emerg. Technol. Web Intell. 4(2), 128–133 (2012)
Amazon: Alexa top 1m domains (2017)
Chollet, F., et al.: Keras (2015)
Cor, K., Sood, G.: Pwned: how often are Americans’ online accounts breached? arXiv preprint arXiv:1808.01883 (2018)
Deri, L., Martinelli, M., Sartiano, D., Serrecchia, M., Sideri, L., Prignoli, S.: Implementing web classification for TLDs. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, pp. 85–88. IEEE (2015)
Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM (1999)
Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)
Hald, G.M., Malamuth, N.M., Yuen, C.: Pornography and attitudes supporting violence against women: revisiting the relationship in nonexperimental studies. Aggress. Behav.: Off. J. Int. Soc. Res. Aggress. 36(1), 14–20 (2010)
Jain, A.K., Gupta, B.: Phish-safe: url features-based phishing detection system using machine learning. In: Cyber Security, pp. 467–474. Springer (2018)
Shalla Secure Services KG: Shalla’s blacklists (2017)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Netcraft: January 2018 web server survey (2018). Accessed 11 Nov 2018
OpenDNS LLC: PhishTank: an anti-phishing site (2017)
Prigent, F.: Toulouse/squidguard blacklist (2017)
RiskAnalytics: Malware domains (2017)
Shen, D., Chen, Z., Yang, Q., Zeng, H.J., Zhang, B., Lu, Y., Ma, W.Y.: Web-page classification through summarization. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 242–249 (2004)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Wang, H.H., Yu, L., Tian, S.W., Peng, Y.F., Pei, X.J.: Bidirectional LSTM malicious webpages detection algorithm based on convolutional neural network and independent recurrent neural network. Appl. Intell. 49(8), 3016–3026 (2019)
Westcott, K., Loucks, J., Littmann, D., Wilson, P., Srivastava, S., Ciampa, D.: Connectivity and mobile trends survey (2019). https://www2.deloitte.com/us/en/pages/technology-media-and-telecommunications/articles/global-mobileconsumer-survey-us-edition.html
Zhang, J.B., Xu, Z.M., Xiu, K.I., Pan, Q.S.: A web site classification approach based on its topological structure. Int. J. Asian Lang. Proc. 20(2), 75–86 (2010)
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Laohaprapanon, S., Sood, G. (2021). Domain Knowledge: Predicting the Kind of Content Hosted by a Domain. In: Herrero, Á., Cambra, C., Urda, D., Sedano, J., Quintián, H., Corchado, E. (eds) 13th International Conference on Computational Intelligence in Security for Information Systems (CISIS 2020). CISIS 2019. Advances in Intelligent Systems and Computing, vol 1267. Springer, Cham. https://doi.org/10.1007/978-3-030-57805-3_15
DOI: https://doi.org/10.1007/978-3-030-57805-3_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57804-6
Online ISBN: 978-3-030-57805-3
eBook Packages: Intelligent Technologies and Robotics (R0)