
Improving Deep Learning Based Password Guessing Models Using Pre-processing

  • Conference paper
Information and Communications Security (ICICS 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13407)


Abstract

Passwords are the most widely used authentication method and play an important role in users’ digital lives. Password guessing models are generally used to understand password security, yet statistics-based password models (such as Markov models and probabilistic context-free grammars (PCFG)) are subject to the inherent limitations of overfitting and sparsity. With the improvement of computing power, deep-learning based models with higher crack rates are emerging. Since neural networks are generally used as black boxes for learning password features, a key challenge for deep-learning based password guessing models is choosing appropriate preprocessing methods to learn more effective features.

To fill the gap, this paper explores three new preprocessing methods and applies them to two promising deep-learning networks, i.e., Long Short-Term Memory (LSTM) neural networks and Generative Adversarial Networks (GAN). First, we propose a character-feature based encoding method to replace the canonical one-hot encoding. Second, we add the most comprehensive recognition rules to date (words, keyboard patterns, years, and website names) into the basic PCFG, and find that the frequency distribution of the extracted segments follows Zipf’s law. Third, we adopt Xu et al.’s PCFG improvement with chunk segmentation at CCS’21, and study the performance of the Chunk+PCFG preprocessing method when applied to LSTM and GAN.
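The first preprocessing method, replacing one-hot encoding with a character-feature encoding, can be illustrated with a minimal sketch. The concrete feature set below (four character-class flags plus a normalized within-class index) is a hypothetical stand-in, not the paper's exact design:

```python
import string

# The 94 printable ASCII characters typically allowed in passwords.
PRINTABLE = (string.ascii_lowercase + string.ascii_uppercase
             + string.digits + string.punctuation)

def one_hot(ch):
    """Canonical one-hot encoding: a sparse 94-dim vector with a single 1."""
    vec = [0] * len(PRINTABLE)
    vec[PRINTABLE.index(ch)] = 1
    return vec

def char_features(ch):
    """Hypothetical dense character-feature encoding:
    [is_lower, is_upper, is_digit, is_symbol, normalized index within its class]."""
    for flag_idx, cls in enumerate((string.ascii_lowercase,
                                    string.ascii_uppercase,
                                    string.digits,
                                    string.punctuation)):
        if ch in cls:
            flags = [0, 0, 0, 0]
            flags[flag_idx] = 1
            return flags + [cls.index(ch) / (len(cls) - 1)]
    raise ValueError(f"unsupported character: {ch!r}")

# A password becomes a sequence of 5-dim vectors instead of 94-dim one-hots.
encoded = [char_features(c) for c in "a1!"]
```

The intuition is that a dense encoding lets the network share what it learns across characters of the same class (e.g., all digits), whereas one-hot vectors treat every character as unrelated to every other.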

Extensive experiments on six large real-world password datasets show the effectiveness of our preprocessing methods. Results show that within 50 million guesses: 1) When we apply the PCFG preprocessing method to PassGAN (a GAN-based password model proposed by Hitaj et al. at ACNS’19), 13.83%–38.81% (26.79% on average) more passwords can be cracked; 2) Our LSTM based model using PCFG for preprocessing (PL for short) outperforms Wang et al.’s original PL model by 0.35%–3.94% (1.36% on average). Overall, our preprocessing methods improve the attacking rates in four of the seven tested cases. We believe this work provides new feasible directions for guessing optimization, and contributes to a better understanding of deep-learning based models.


References

  1. Blocki, J., Harsha, B., Zhou, S.: On the economics of offline password cracking. In: Proceedings of IEEE S&P 2018, pp. 853–871 (2018)

  2. Bonneau, J., Herley, C., Van Oorschot, P.C., Stajano, F.: The quest to replace passwords: a framework for comparative evaluation of web authentication schemes. In: Proceedings of IEEE S&P 2012, pp. 553–567 (2012)

  3. Bonneau, J., Herley, C., Van Oorschot, P.C., Stajano, F.: Passwords and the evolution of imperfect authentication. Commun. ACM 58(7), 78–87 (2015)

  4. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Proceedings of NIPS 2017, pp. 5769–5779 (2017)

  5. Hitaj, B., Gasti, P., Ateniese, G., Perez-Cruz, F.: PassGAN: a deep learning approach for password guessing. In: Proceedings of ACNS 2019 (2019)

  6. Houshmand, S., Aggarwal, S., Flood, R.: Next gen PCFG password cracking. IEEE Trans. Inf. Forensics Secur. 10(8), 1776–1791 (2015)

  7. Li, Z., Han, W., Xu, W.: A large-scale empirical analysis of Chinese web passwords. In: Proceedings of USENIX Security 2014, pp. 559–574 (2014)

  8. Lipton, Z.C., Berkowitz, J., Elkan, C.: A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019 (2015)

  9. Liu, Y., et al.: GENPass: a general deep learning model for password guessing with PCFG rules and adversarial generation. In: Proceedings of ICC 2018, pp. 1–6 (2018)

  10. Ma, J., Yang, W., Luo, M., Li, N.: A study of probabilistic password models. In: Proceedings of IEEE S&P 2014, pp. 689–704 (2014)

  11. Melicher, W., Ur, B., Komanduri, S., Bauer, L., Christin, N., Cranor, L.F.: Fast, lean and accurate: modeling password guessability using neural networks. In: Proceedings of USENIX Security 2017, pp. 1–17 (2017)

  12. Narayanan, A., Shmatikov, V.: Fast dictionary attacks on passwords using time-space tradeoff. In: Proceedings of ACM CCS 2005, pp. 364–372 (2005)

  13. Rodríguez, P., Bautista, M.A., Gonzàlez, J., Escalera, S.: Beyond one-hot encoding: lower dimensional target embedding. Image Vis. Comput. 75, 21–31 (2018)

  14. Wang, D., Cheng, H., Wang, P., Huang, X., Jian, G.: Zipf’s law in passwords. IEEE Trans. Inf. Forensics Secur. 12(11), 2776–2791 (2017)

  15. Wang, D., Wang, P., He, D., Tian, Y.: Birthday, name and bifacial-security: understanding passwords of Chinese web users. In: Proceedings of USENIX Security 2019 (2019)

  16. Wang, D., Zhang, Z., Wang, P., Yan, J., Huang, X.: Targeted online password guessing: an underestimated threat. In: Proceedings of ACM CCS 2016, pp. 1242–1254 (2016)

  17. Wang, D., Zou, Y., Tao, Y., Wang, B.: Password guessing based on recurrent neural networks and generative adversarial networks. Chin. J. Comput. 1519–1534 (2021)

  18. Weir, M., Aggarwal, S., de Medeiros, B., Glodek, B.: Password cracking using probabilistic context-free grammars. In: Proceedings of IEEE S&P 2009, pp. 391–405 (2009)

  19. Xie, Z., Zhang, M., Yin, A., Li, Z.: A new targeted password guessing model. In: Liu, J.K., Cui, H. (eds.) ACISP 2020. LNCS, vol. 12248, pp. 350–368. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-55304-3_18

  20. Xu, M., Wang, C., Yu, J., Zhang, J., Zhang, K., Han, W.: Chunk-level password guessing: towards modeling refined password composition representations. In: Proceedings of ACM CCS 2021, pp. 5–20 (2021)

  21. Yang, K., Hu, X., Zhang, Q., Wei, J., Liu, W.: Studies of keyboard patterns in passwords: recognition, characteristics and strength evolution. In: Gao, D., Li, Q., Guan, X., Liao, X. (eds.) ICICS 2021. LNCS, vol. 12918, pp. 153–168. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86890-1_9


Acknowledgment

The authors are grateful to the anonymous reviewers for their invaluable comments. Ding Wang is the corresponding author. This research was in part supported by the National Natural Science Foundation of China under Grant No. 62172240, and by the Natural Science Foundation of Tianjin, China under Grant No. 21JCZDJC00190. There are no competing interests.


Corresponding author

Correspondence to Ding Wang.


Appendices

Appendix 1 Some Statistics About User-Chosen Passwords

The length distribution of each dataset is shown in Table 7. Most passwords are between six and nine characters long (73.81% on average). The length distribution is affected by the password policy. For example, the CSDN dataset has far fewer passwords shorter than eight characters than the other datasets, likely because the CSDN website changed its password policy to a stricter one. The character composition information is summarized in Table 8. Chinese users prefer digits in passwords, while English users prefer letters. This may be caused by cultural differences: most Chinese users use digits more than English words in their daily lives. In addition, English users prefer lowercase letters to uppercase letters. The top-10 passwords are shown in Table 9. The password “123456” is the most common password in every dataset except CSDN (due to its password policy). It is also interesting that the top-10 passwords in the Chinese datasets are almost all pure digits.

Table 7. Length distribution information of each web service.
Table 8. Character composition information of each web service\(*\).
Table 9. Top-10 password information of each web service.
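The statistics summarized in Tables 7–9 can be reproduced from a raw password list with a few lines of code. The sketch below uses a toy five-password sample purely for illustration; the measurements above are over leaked datasets with millions of entries:

```python
from collections import Counter

def password_stats(passwords):
    """Summarize a password list along the lines of Tables 7-9:
    length distribution, digit-only and letter-only shares, and top-10 passwords."""
    n = len(passwords)
    lengths = Counter(len(pw) for pw in passwords)          # Table 7 style
    digits_only = sum(pw.isdigit() for pw in passwords) / n  # Table 8 style
    letters_only = sum(pw.isalpha() for pw in passwords) / n
    top10 = Counter(passwords).most_common(10)               # Table 9 style
    return lengths, digits_only, letters_only, top10

# Toy sample; real studies run this over full leaked datasets.
sample = ["123456", "123456", "password", "wang1987", "qwerty"]
lengths, digit_rate, letter_rate, top10 = password_stats(sample)
```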

Appendix 2 Exploratory Experiments

As described in Sect. 4.3, probabilistic context-free grammars (PCFG) [10, 18] can be used for data preprocessing when integrated with neural networks. Our refined PCFG is based on the basic PCFG with four additional recognition rules: keyboard patterns, words, websites, and years. The experimental results in Sect. 5.2 have already shown that our refined PCFG improves performance by 1.36% on average over the basic PCFG when integrated with Long Short-Term Memory (LSTM) neural networks [17]. To explore the impact of each recognition rule on the results, we evaluate the performance of LSTM based models using PCFG for preprocessing, adding only one recognition rule to the basic PCFG at a time.
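As a rough illustration of this preprocessing step, the sketch below segments a password into basic PCFG classes (L for letters, D for digits, S for symbols) and adds one toy recognition rule that relabels plausible years. The refined PCFG described above also recognizes words, keyboard patterns, and website names; those rules are omitted here for brevity:

```python
import re

# Maximal runs of letters, digits, or symbols (basic PCFG segmentation).
SEGMENT_RE = re.compile(r"[a-zA-Z]+|[0-9]+|[^a-zA-Z0-9]+")
YEAR_RE = re.compile(r"(19|20)\d{2}")  # toy year rule: 1900-2099

def pcfg_structure(password):
    """Segment a password into (tag, segment) pairs, e.g. L4 = 4 letters.
    Four-digit segments that look like years are relabeled Y instead of D."""
    segments = []
    for m in SEGMENT_RE.finditer(password):
        seg = m.group()
        if seg.isalpha():
            tag = "L"
        elif seg.isdigit():
            tag = "Y" if len(seg) == 4 and YEAR_RE.fullmatch(seg) else "D"
        else:
            tag = "S"
        segments.append((f"{tag}{len(seg)}", seg))
    return segments

# "wang1987!" yields structure L4 Y4 S1 under this toy grammar.
```

Each extracted segment class then becomes a training unit for the downstream neural network instead of raw characters.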

The results in Table 10 show that, compared to the LSTM based model with basic PCFG for preprocessing: (1) using PCFG with additional word recognition for preprocessing yields a 0.26% improvement on average; (2) using PCFG with additional keyboard recognition yields a 0.06% improvement on average; (3) the remaining recognition rules (i.e., website and year) improve the results little (less than 0.01% on average). In general, adding one recognition rule to the basic PCFG [10] alone is not as effective as adding all the rules (i.e., our refined PCFG) when integrated with LSTM. The poor performance of the year recognition rule can be attributed to two reasons. First, years are part of birthdays, and birthdays vary widely among users, so year recognition has little effect on trawling password guessing attacks. Second, individual year segments can be replaced by digit segments. Moreover, the extent to which each recognition rule improves cracking reflects the patterns users tend to follow when creating passwords.

Table 10. Cracking results of LSTM based models using PCFG based preprocessing methods (Guess number = \(5*10^7\))\(\dagger \)


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Wu, Y., Wang, D., Zou, Y., Huang, Z. (2022). Improving Deep Learning Based Password Guessing Models Using Pre-processing. In: Alcaraz, C., Chen, L., Li, S., Samarati, P. (eds) Information and Communications Security. ICICS 2022. Lecture Notes in Computer Science, vol 13407. Springer, Cham. https://doi.org/10.1007/978-3-031-15777-6_10


  • DOI: https://doi.org/10.1007/978-3-031-15777-6_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15776-9

  • Online ISBN: 978-3-031-15777-6

