
A Bag of Tokens Neural Network to Predict Webpage Age

  • Conference paper
Cyber Security, Cryptology, and Machine Learning (CSCML 2023)

Abstract

Outdated technologies pose a significant security threat to websites, and hackers often home in on the oldest pages of a site to discover vulnerabilities. To improve the efficiency of (automated) penetration testers, we introduce a machine learning method that predicts the age of a webpage. An HTML-specific tokenizer is trained and used to tokenize HTML bodies, which are then transformed into binary vector encodings (a “bag of tokens”). We train a Multi-Layer Perceptron neural network on these encodings, using historical snapshots of webpages as training data. Our method achieves a mean absolute error of 1.58 years on validation data held out from training.
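
As a rough illustration of this pipeline, the following Python sketch combines the Hugging Face tokenizers library and TensorFlow (both referenced in the notes below). It is a minimal sketch under assumptions: the subword model (BPE), vocabulary size, layer widths, dropout rate, and optimizer are illustrative choices rather than the authors' reported configuration, and the two toy HTML bodies stand in for the Wayback Machine training corpus.

```python
# Illustrative sketch only: BPE model, vocab size, layer widths, dropout rate,
# and optimizer are assumptions, not the configuration reported in the paper.
import numpy as np
import tensorflow as tf
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus of HTML bodies with known snapshot ages (in years).
html_bodies = [
    "<html><body><center><font size='2'>Welcome</font></center></body></html>",
    "<html><body><div class='container'><nav role='navigation'>Home</nav></div></body></html>",
]
ages = np.array([14.0, 2.0], dtype=np.float32)

# 1. Train an HTML-specific subword tokenizer on raw HTML bodies.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(
    html_bodies, trainer=BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
)
vocab_size = tokenizer.get_vocab_size()

# 2. Encode each page as a binary "bag of tokens" vector:
#    1 if the token occurs anywhere in the page, 0 otherwise.
def bag_of_tokens(html: str) -> np.ndarray:
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[tokenizer.encode(html).ids] = 1.0
    return vec

X = np.stack([bag_of_tokens(h) for h in html_bodies])

# 3. Regress page age with a small multi-layer perceptron.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(vocab_size,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),  # predicted age in years
])
model.compile(optimizer="adam", loss="mae")  # MAE matches the reported metric
model.fit(X, ages, epochs=5, verbose=0)
print(model.predict(X, verbose=0).ravel())
```

Encoding token presence rather than counts matches the abstract's description of binary vector encodings and keeps the network input a fixed-length vector the size of the tokenizer vocabulary.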


Notes

  1. https://hackerone.com/smiegles?type=user.
  2. https://web.archive.org/.
  3. https://downloads.majestic.com/majestic_million.csv.
  4. https://archive.org/help/wayback_api.php (see the sketch after these notes).
  5. https://pypi.org/project/pyppeteer/.
  6. https://pypi.org/project/pyppeteer-stealth/.
  7. https://huggingface.co/docs/transformers/fast_tokenizers.
  8. https://www.tensorflow.org.
  9. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html.
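
Note 4 above points to the Wayback Machine API used to obtain historical webpage snapshots. As a rough illustration of how such snapshots can be located, here is a minimal sketch against the public availability endpoint documented at that URL; the helper name `closest_snapshot` and the error handling are assumptions for illustration, not the authors' crawling code (which also relies on pyppeteer, notes 5 and 6).

```python
# Hedged sketch: queries the Wayback Machine availability API (note 4) for the
# archived snapshot closest to a given date. Endpoint and JSON field names
# follow the public API documentation; everything else is illustrative.
from typing import Optional

import requests


def closest_snapshot(url: str, timestamp: str) -> Optional[dict]:
    """Return the archived snapshot closest to `timestamp` (YYYYMMDD), if any."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest if closest and closest.get("available") else None


# Example: the snapshot of example.com closest to 1 January 2015.
snap = closest_snapshot("example.com", "20150101")
if snap:
    print(snap["timestamp"], snap["url"])
```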


Author information


Corresponding author

Correspondence to Klaas Meinke.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Meinke, K., van der Laan, T.A., Iancu, T., Cakir, C. (2023). A Bag of Tokens Neural Network to Predict Webpage Age. In: Dolev, S., Gudes, E., Paillier, P. (eds) Cyber Security, Cryptology, and Machine Learning. CSCML 2023. Lecture Notes in Computer Science, vol 13914. Springer, Cham. https://doi.org/10.1007/978-3-031-34671-2_12


  • DOI: https://doi.org/10.1007/978-3-031-34671-2_12


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34670-5

  • Online ISBN: 978-3-031-34671-2

  • eBook Packages: Computer Science, Computer Science (R0)
