Domain Knowledge: Predicting the Kind of Content Hosted by a Domain

Laohaprapanon, Suriyan; Sood, Gaurav

doi:10.1007/978-3-030-57805-3_15

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1267)

Included in the following conference series:

Computational Intelligence in Security for Information Systems Conference

Abstract

In a broad set of settings, from protecting people from harmful content to segmenting online customers, we need to know the kind of content hosted by a web domain. But there are nearly 2 billion unique hostnames today, and curated domain label lists cover at best a few million domains. We bridge the gap by exploiting labeled data from multiple large, curated lists (Shallalist, PhishTank, Malware Domains, and Squidguard) to build models that predict the kind of content hosted by a domain using the sequence of characters in the domain name. Because identifying domains that carry harmful material or adult content is particularly important, we focus primarily on those categories. The models do very well at predicting domains that host pornographic content, with F1 scores of about 0.9 or higher. We are less successful at predicting domains that carry harmful content: the F1 scores of two of our best models are around 0.8. To illustrate the utility of our models, we use them to answer two questions: (1) Do poor people, racial or ethnic minorities, and the less well-educated visit malware sites more often than their respective complementary groups? (2) Does the consumption of pornography vary by age and education?

Replication materials and Supplementary Information are posted at https://github.com/themains/pydomains and https://github.com/themains/domain_knowledge. The Python package that implements the method discussed in the paper is available at https://github.com/themains/pydomains.
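As a rough illustration of the idea in the abstract, the sketch below shows one way a domain name could be turned into a fixed-length sequence of character ids for a sequence model, and how an F1 score is computed from binary predictions. It is not the authors' pipeline or the pydomains API: the vocabulary, the padding scheme, the `max_len` default, and the function names `encode_domain` and `f1_score` are all assumptions made for this example.

```python
import string

# Hypothetical vocabulary (an assumption, not taken from the paper):
# lowercase letters, digits, hyphen, and dot.
# Id 0 is reserved for padding, id 1 for characters outside the vocabulary.
VOCAB = string.ascii_lowercase + string.digits + "-."
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(VOCAB)}
PAD_ID, UNK_ID = 0, 1


def encode_domain(domain, max_len=16):
    """Map a domain name to a fixed-length list of integer character ids."""
    # Lowercase, truncate to max_len, and look up each character.
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in domain.lower()[:max_len]]
    # Right-pad short names so every encoded domain has the same length.
    return ids + [PAD_ID] * (max_len - len(ids))


def f1_score(y_true, y_pred):
    """F1 for binary labels: the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(encode_domain("example.com"))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

The encoded sequences would typically be fed to a model that learns from character order; which model to use, and how to choose `max_len`, is left open here.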

Author information

Authors and Affiliations

Appeler, Seattle, Washington, USA
Suriyan Laohaprapanon & Gaurav Sood

Corresponding authors

Correspondence to Suriyan Laohaprapanon or Gaurav Sood.

Editor information

Editors and Affiliations

Grupo de Inteligencia Computacional Aplicada (GICAP), Departamento de Ingeniería Informática, Escuela Politécnica Superior, Universidad de Burgos, Burgos, Spain
Álvaro Herrero
Grupo de Inteligencia Computacional Aplicada (GICAP), Departamento de Ingeniería Informática, Escuela Politécnica Superior, Universidad de Burgos, Burgos, Spain
Carlos Cambra
Grupo de Inteligencia Computacional Aplicada (GICAP), Departamento de Ingeniería Informática, Escuela Politécnica Superior, Universidad de Burgos, Burgos, Spain
Daniel Urda
Technological Institute of Castilla y León, Burgos, Spain
Javier Sedano
Department of Industrial Engineering, University of A Coruña, La Coruña, Spain
Héctor Quintián
University of Salamanca, Salamanca, Spain
Emilio Corchado

About this paper

Cite this paper

Laohaprapanon, S., Sood, G. (2021). Domain Knowledge: Predicting the Kind of Content Hosted by a Domain. In: Herrero, Á., Cambra, C., Urda, D., Sedano, J., Quintián, H., Corchado, E. (eds) 13th International Conference on Computational Intelligence in Security for Information Systems (CISIS 2020). CISIS 2019. Advances in Intelligent Systems and Computing, vol 1267. Springer, Cham. https://doi.org/10.1007/978-3-030-57805-3_15

DOI: https://doi.org/10.1007/978-3-030-57805-3_15
Published: 28 August 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57804-6
Online ISBN: 978-3-030-57805-3
eBook Packages: Intelligent Technologies and Robotics (R0)
