
A Bag of Tokens Neural Network to Predict Webpage Age

  • Conference paper
Cyber Security, Cryptology, and Machine Learning (CSCML 2023)

Abstract

Outdated technologies pose a significant security threat to websites, and hackers often home in on the oldest pages of a site to discover vulnerabilities. To improve the efficiency of (automated) penetration testers, we introduce a machine learning method that predicts the age of a webpage. An HTML-specific tokenizer is trained and used to tokenize HTML bodies, which are then transformed into binary vector encodings (a “bag of tokens”). We train a Multi-Layer Perceptron neural network on these encodings, using historical snapshots of webpages as training data. Our method achieves a mean absolute error of 1.58 years on validation data held out from training.
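
As a rough illustration of this pipeline, the following Python sketch combines the Hugging Face tokenizers library and TensorFlow (both referenced in the notes below). It is a minimal sketch under assumptions: the subword model (BPE), vocabulary size, layer widths, dropout rate, and optimizer are illustrative choices rather than the authors' reported configuration, and the two toy HTML bodies stand in for the Wayback Machine training corpus.

```python
# Illustrative sketch only: BPE model, vocab size, layer widths, dropout rate,
# and optimizer are assumptions, not the configuration reported in the paper.
import numpy as np
import tensorflow as tf
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus of HTML bodies with known snapshot ages (in years).
html_bodies = [
    "<html><body><center><font size='2'>Welcome</font></center></body></html>",
    "<html><body><div class='container'><nav role='navigation'>Home</nav></div></body></html>",
]
ages = np.array([14.0, 2.0], dtype=np.float32)

# 1. Train an HTML-specific subword tokenizer on raw HTML bodies.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(
    html_bodies, trainer=BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
)
vocab_size = tokenizer.get_vocab_size()

# 2. Encode each page as a binary "bag of tokens" vector:
#    1 if the token occurs anywhere in the page, 0 otherwise.
def bag_of_tokens(html: str) -> np.ndarray:
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[tokenizer.encode(html).ids] = 1.0
    return vec

X = np.stack([bag_of_tokens(h) for h in html_bodies])

# 3. Regress page age with a small multi-layer perceptron.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(vocab_size,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),  # predicted age in years
])
model.compile(optimizer="adam", loss="mae")  # MAE matches the reported metric
model.fit(X, ages, epochs=5, verbose=0)
print(model.predict(X, verbose=0).ravel())
```

Encoding token presence rather than counts matches the abstract's description of binary vector encodings and keeps the network input a fixed-length vector the size of the tokenizer vocabulary.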


Notes

  1. https://hackerone.com/smiegles?type=user.
  2. https://web.archive.org/.
  3. https://downloads.majestic.com/majestic_million.csv.
  4. https://archive.org/help/wayback_api.php (see the sketch after these notes).
  5. https://pypi.org/project/pyppeteer/.
  6. https://pypi.org/project/pyppeteer-stealth/.
  7. https://huggingface.co/docs/transformers/fast_tokenizers.
  8. https://www.tensorflow.org.
  9. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html.
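
Note 4 above points to the Wayback Machine API used to obtain historical webpage snapshots. As a rough illustration of how such snapshots can be located, here is a minimal sketch against the public availability endpoint documented at that URL; the helper name `closest_snapshot` and the error handling are assumptions for illustration, not the authors' crawling code (which also relies on pyppeteer, notes 5 and 6).

```python
# Hedged sketch: queries the Wayback Machine availability API (note 4) for the
# archived snapshot closest to a given date. Endpoint and JSON field names
# follow the public API documentation; everything else is illustrative.
from typing import Optional

import requests


def closest_snapshot(url: str, timestamp: str) -> Optional[dict]:
    """Return the archived snapshot closest to `timestamp` (YYYYMMDD), if any."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest if closest and closest.get("available") else None


# Example: the snapshot of example.com closest to 1 January 2015.
snap = closest_snapshot("example.com", "20150101")
if snap:
    print(snap["timestamp"], snap["url"])
```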


Author information


Corresponding author

Correspondence to Klaas Meinke.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Meinke, K., van der Laan, T.A., Iancu, T., Cakir, C. (2023). A Bag of Tokens Neural Network to Predict Webpage Age. In: Dolev, S., Gudes, E., Paillier, P. (eds) Cyber Security, Cryptology, and Machine Learning. CSCML 2023. Lecture Notes in Computer Science, vol 13914. Springer, Cham. https://doi.org/10.1007/978-3-031-34671-2_12


  • DOI: https://doi.org/10.1007/978-3-031-34671-2_12


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34670-5

  • Online ISBN: 978-3-031-34671-2

  • eBook Packages: Computer Science, Computer Science (R0)
