Abstract
Outdated technologies pose a significant security threat to websites, and hackers often home in on the oldest pages of a site to discover vulnerabilities. To improve the efficiency of (automated) penetration testers, we propose a machine learning method that predicts the age of a webpage. We train an HTML-specific tokenizer and use it to tokenize HTML bodies, which are then transformed into binary vector encodings (a “bag of tokens”). We train a Multi-Layer Perceptron neural network on these encodings, using historical snapshots of webpages as training data. Our method achieves a mean absolute error of 1.58 years on validation data held out from training.
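The encoding step and the reported metric can be illustrated with a minimal sketch. It is not the authors' implementation: the regex tokenizer below is a hypothetical stand-in for the trained HTML-specific tokenizer, and the names `tokenize`, `build_vocab`, `encode`, and `mean_absolute_error`, as well as the two sample pages, are assumptions made for illustration. The sketch only shows how an HTML body might become a binary "bag of tokens" vector and how a mean absolute error in years is computed; the Multi-Layer Perceptron itself is omitted.

```python
import re

# Hypothetical stand-in for the trained HTML-specific tokenizer (assumption):
# a regex that picks out tag openers, tag closers, and word-like tokens.
TOKEN_RE = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*|[A-Za-z_][A-Za-z0-9_-]*")


def tokenize(html):
    """Split an HTML body into lowercase tokens."""
    return [tok.lower() for tok in TOKEN_RE.findall(html)]


def build_vocab(documents):
    """Map every distinct token in the corpus to a fixed vector index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def encode(html, vocab):
    """Binary bag of tokens: 1 if the token occurs at all, 0 otherwise."""
    vec = [0] * len(vocab)
    for tok in tokenize(html):
        idx = vocab.get(tok)
        if idx is not None:  # tokens outside the vocabulary are ignored
            vec[idx] = 1
    return vec


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Two made-up pages: one using older presentational tags, one using newer ones.
pages = [
    '<font color="red"><center>Hello</center></font>',
    '<main><section class="hero">Hello</section></main>',
]
vocab = build_vocab(pages)
vectors = [encode(page, vocab) for page in pages]
```

Because the encoding is binary, a token that appears many times in a page contributes the same as one that appears once; only presence matters. Vectors like these would be the inputs to a regression model that predicts a page's age.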
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Meinke, K., van der Laan, T.A., Iancu, T., Cakir, C. (2023). A Bag of Tokens Neural Network to Predict Webpage Age. In: Dolev, S., Gudes, E., Paillier, P. (eds) Cyber Security, Cryptology, and Machine Learning. CSCML 2023. Lecture Notes in Computer Science, vol 13914. Springer, Cham. https://doi.org/10.1007/978-3-031-34671-2_12
DOI: https://doi.org/10.1007/978-3-031-34671-2_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34670-5
Online ISBN: 978-3-031-34671-2
eBook Packages: Computer Science, Computer Science (R0)