Elsevier

Computers & Security

Volume 121, October 2022, 102848

#Segments: A Dominant Factor of Password Security to Resist against Data-driven Guessing

https://doi.org/10.1016/j.cose.2022.102848

Abstract

Understanding which factors dominate password security is vital for users to create secure passwords. Prior works generally consider the password length and the number of character classes as the dominant factors. However, creating secure passwords based on these two factors has become much more challenging due to the emergence of powerful data-driven guessing methods, e.g., the Probabilistic Context-free Grammars (PCFG) and its variations, Markov-based methods, and neural-network-based methods. In this paper, inspired by the segments used in PCFG, where a segment is a continuous string whose characters have a strong correlation, we conduct a comprehensive empirical analysis and find that the number of segments (# Segments for short) is a dominant factor of password security to resist against data-driven guessing. That is, the increase of # Segments generally leads to a significant improvement of password security. The observation helps us explore an optimised identification method for segments, referred to as re-segment, which reduces # Segments as much as possible to obtain accurate # Segments by leveraging five popular patterns (i.e., keyboard, abbreviation, leet, mixture, and component), to evaluate password security more accurately from an adversary’s viewpoint. Then we propose an efficient data-driven guessing method, referred to as ReSeg-PCFG, by leveraging re-segment based on the latest version of PCFG. Our study shows that ReSeg-PCFG outperforms the state-of-the-art data-driven guessing methods in almost all scenarios; e.g., it outperforms the latest version of PCFG by up to 79.34% at 10^14 guesses, a commonly used threshold of off-line attacks.

Introduction

Textual passwords are still active in protecting system security due to their two remarkable advantages: low cost and ease of use (Oesch and Ruoti, 2020). Besides, it is still very popular for users to create passwords manually rather than through password managers (Pearman, Zhang, Bauer, Christin, Cranor, 2019, Ray, Wolf, Kuber, Aviv, 2021). As a result, the study of password security, especially manually created password security, is still a hot topic in system security.

To help users create securer passwords, researchers proposed many password composition policies, such as the guidelines that require longer password length and more character classes from the National Institute of Standards and Technology (NIST) (Burr, Dodson, Newton, Perlner, Polk, Gupta, Nabbus, 2006, Grassi, Garcia, Fenton, 2017). Meanwhile, to evaluate password security accurately, researchers proposed a series of data-driven guessing methods, including the Probabilistic Context-free Grammars (PCFG) proposed in 2009 (Weir et al., 2009) (referred to as PCFG-2009) and its variations (Han, Xu, Zhang, Wang, Zhang, Wang, 2020, Kelley, Komanduri, Mazurek, Shay, Vidas, Bauer, Christin, Cranor, Lopez, 2012, Li, Han, Xu, 2014, Matt, 2019, Veras, Collins, Thorpe, 2014, Wang, Wang, He, Tian, 2019), Markov-based methods (Ma, Yang, Luo, Li, 2014, Narayanan, Shmatikov, 2005), and neural-network-based methods (e.g., FLA (Melicher et al., 2017)). These guessing methods could leverage password characteristics (Das, Bonneau, Caesar, Borisov, Wang, 2014, Li, Han, Xu, 2014, Matt, 2019, Narayanan, Shmatikov, 2005, Veras, Collins, Thorpe, 2014, Wang, Wang, He, Tian, 2019, Weir, Aggarwal, Medeiros, Glodek, 2009), such as natural language words, to improve their accuracy in evaluating password security by improving their guessing efficiency. Therefore, passwords with these characteristics could still be vulnerable, even though these passwords meet the minimal password length requirement and the minimal number of character class requirement (Kelley et al., 2012). Han et al. (2020) analysed the security of long passwords and recommended that users create long passwords with four or more segments. The above findings indicate that the password length and the number of character classes would be inadequate to determine password security. As a result, we should consider more factors, such as the number of segments (# Segments for short), and then find out the dominant factors of password security to help users create securer passwords.

In this paper, we argue that # Segments could be a dominant factor of password security when inspired by the segments used in PCFG-2009 and its variations. We consider a segment as a continuous string whose characters have a strong correlation. The strong correlation indicates that the segment could be meaningful or follow a specific pattern, e.g., the segment of hello is a meaningful English word, and 1q2w3e follows the keyboard pattern (note that inserting any character into or deleting any character from a segment could damage the strong correlation in the segment). When creating passwords, users have several ways to add a segment (e.g., 1234) into the existing segment(s) (e.g., pass), such as appending (pass1234) and interleaving (p1a2s3s4).
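The difference between appending and interleaving can be made concrete with a minimal sketch. It assumes the class-run segmentation commonly associated with PCFG-2009, where a segment is a maximal run of letters, digits, or symbols; the function names `char_class` and `class_run_segments` are hypothetical and not taken from the paper.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Map a character to an assumed PCFG-style class: L(etter), D(igit), S(ymbol)."""
    if ch.isalpha():
        return "L"
    if ch.isdigit():
        return "D"
    return "S"


def class_run_segments(password: str) -> list[str]:
    """Split a password into maximal runs of characters of the same class."""
    return ["".join(run) for _, run in groupby(password, key=char_class)]


for pw in ("pass1234", "p1a2s3s4"):
    segments = class_run_segments(pw)
    print(pw, segments, len(segments))
```

Under this assumed segmentation, the appended password splits into two segments, while the interleaved one splits into eight single-character segments, even though both have the same length and the same character classes.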

Prior works show that larger # Segments could lead to securer passwords. Passwords with larger # Segments are usually more resistant against PCFG-2009 and its variations, as is discussed by Wang et al. (2019). Besides, long passwords with larger # Segments are usually more resistant against Markov-based methods and FLA, as is concluded by Han et al. (2020). However, the deep analysis of the impacts of # Segments on password security (i.e., the importance of # Segments based on the comparison with the password length and the number of character classes, and the method of exploiting these segments to improve data-driven guessing and evaluate password security accurately) is still absent.

To analyse the impacts of # Segments on password security, we conduct a comprehensive empirical study on four leaked real-world password datasets with over 110 million passwords in total. First, we qualitatively analyse the impacts of the three factors, i.e., # Segments, the password length, and the number of character classes, on password security. Then we use the Random Forest regression (Breiman, 2001) to quantitatively analyse the impacts of these factors on password security to resist against two categories of data-driven guessing methods, i.e., the template-based methods (PCFG-2009 (Weir et al., 2009), Semantic Guesser (Veras et al., 2014) and PCFGv4.1*) and the whole-string methods (n-Gram Markov, where n=4,5,6, Backoff Markov (Ma, Yang, Luo, Li, 2014, Narayanan, Shmatikov, 2005), and FLA (Melicher et al., 2017)). The results show that # Segments is a dominant factor of password security to resist against data-driven guessing, i.e., the increase of # Segments generally leads to a significant improvement of password security.
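To illustrate what a whole-string method does, the following is a minimal sketch of an n-gram Markov password model. It is not the implementation evaluated in the paper: the padding symbols, the add-one smoothing, and the alphabet size of 96 (95 printable ASCII characters plus an end symbol) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

START, END = "\x02", "\x03"  # hypothetical padding symbols, assumed absent from passwords


def train_markov(passwords, order):
    """Count next-character frequencies for every context of (order - 1) characters."""
    counts = defaultdict(Counter)
    for pw in passwords:
        padded = START * (order - 1) + pw + END
        for i in range(order - 1, len(padded)):
            context = padded[i - order + 1:i]
            counts[context][padded[i]] += 1
    return counts


def log10_prob(counts, password, order, alphabet_size=96):
    """Score a password character by character with add-one smoothing (assumed)."""
    padded = START * (order - 1) + password + END
    total = 0.0
    for i in range(order - 1, len(padded)):
        seen = counts.get(padded[i - order + 1:i], Counter())
        numerator = seen[padded[i]] + 1
        denominator = sum(seen.values()) + alphabet_size
        total += math.log10(numerator / denominator)
    return total


model = train_markov(["password1", "password2", "pass1234"], order=4)
print(log10_prob(model, "password1", order=4))
print(log10_prob(model, "zq!x7$kw", order=4))
```

The model assigns a higher log-probability to strings whose character transitions were seen in training, so such passwords would be tried earlier by a guesser that enumerates candidates in decreasing probability.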

Furthermore, to evaluate password security more accurately from an adversary’s viewpoint, we propose an optimised identification method for segments, referred to as re-segment, which regards a password as the composition of as few segments as possible to obtain accurate # Segments by leveraging the strong correlations in segments. The five popular patterns in re-segment, i.e., keyboard, abbreviation, leet, mixture, and component, are often used to create so-called secure passwords (Das et al., 2014). Our statistical analysis shows that re-segment could be much more efficient than other identification methods. Then we propose an efficient data-driven guessing method, referred to as ReSeg-PCFG, by leveraging re-segment based on PCFGv4.1*. The results show that ReSeg-PCFG achieves a better guessing efficiency than the state-of-the-art data-driven guessing methods in almost all scenarios.
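The idea of regarding a password as the composition of as few segments as possible can be sketched with a small dynamic program. This is an illustrative assumption, not the re-segment algorithm itself: it covers only two of the five patterns (keyboard and leet), and the `LEET` table, the `WORDS` dictionary, and the `KEYBOARD` set are hypothetical toy data.

```python
# Hypothetical toy resources; a real analysis would learn these from data.
LEET = str.maketrans({"@": "a", "4": "a", "0": "o", "3": "e",
                      "1": "i", "$": "s", "5": "s", "7": "t"})
WORDS = {"password", "hello", "dragon"}
KEYBOARD = {"1q2w3e", "qwerty", "1qaz2wsx"}


def char_class(ch):
    """Assumed classes: L(etter), D(igit), S(ymbol)."""
    return "L" if ch.isalpha() else "D" if ch.isdigit() else "S"


def is_segment(s):
    """A candidate counts as one segment if it matches a pattern or uses one class."""
    lowered = s.lower()
    if lowered in KEYBOARD or lowered.translate(LEET) in WORDS:
        return True
    return len({char_class(c) for c in s}) == 1


def fewest_segments(password):
    """Return a segmentation of the password with the minimum number of segments."""
    n = len(password)
    best = [None] * (n + 1)  # best[i] holds a fewest-segment split of password[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and is_segment(password[j:i]):
                candidate = best[j] + [password[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]


for pw in ("p@ssw0rd", "1q2w3e!", "pass1234"):
    print(pw, fewest_segments(pw))
```

With these toy resources, a leet-transformed word or a keyboard walk collapses into a single segment instead of many short class runs, which is the kind of reduction of # Segments the paragraph above describes.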

The contributions in this paper are as follows.

  • We find that # Segments is a dominant factor of password security to resist against data-driven guessing by qualitatively and quantitatively analysing the impacts of # Segments, the password length, and the number of character classes. The increase of # Segments generally leads to a significant improvement of password security.

  • We propose an efficient data-driven guessing method, i.e., ReSeg-PCFG, to evaluate password security more accurately by leveraging re-segment. Our empirical results show that ReSeg-PCFG outperforms the state-of-the-art data-driven guessing methods in almost all scenarios. For example, ReSeg-PCFG outperforms PCFG-2009, Semantic Guesser, PCFGv4.1*, Backoff Markov and FLA by up to 95.41%, 75.19%, 79.34%, 10.11% and 51.06%, respectively, at 10^14 guesses.

Roadmap. Section 2 presents the background knowledge; Section 3 empirically studies the impacts of # Segments; Section 4 explores the reduction of # Segments to obtain accurate # Segments; Section 5 proposes a novel attack; Section 6 discusses the limitations and future work; Section 7 reviews the related work; finally, Section 8 summarises our work.

Section snippets

Data-driven guessing methods

Users are often encouraged to create longer passwords with more character classes to improve password security (Ur et al., 2012). However, users generally have their patterns to create so-called secure passwords, such as adding “!” at the end (Das, Bonneau, Caesar, Borisov, Wang, 2014, Ur, Noma, Bees, Segreti, Shay, Bauer, Christin, Cranor, 2015).

Motivated to analyse password security more accurately, researchers proposed several data-driven guessing methods by considering the patterns. We

Distributions of # segments

Fig. 1 shows the distributions of # Segments under three identification methods, i.e., non-terminal, word and pattern, for four password datasets. We find that most of the passwords contain three or fewer segments. Then we compare the distributions and show the three comparisons as follows.

Comparison-1: non-terminal vs. word As is shown in Fig. 1(a) and (b), the distributions of # Segments under non-terminal show that most passwords contain one or two segments, while the percentage of these

Reduction of # segments

The results of our empirical analysis show that # Segments is a dominant factor of password security to resist against data-driven guessing for each of the three identification methods. Therefore, we argue that the reduction of # Segments, i.e., regarding a password as the composition of as few segments as possible to obtain accurate # Segments by leveraging the strong correlations in segments, would help evaluate password security more accurately from an adversary’s viewpoint.

Design

ReSeg-PCFG has three phases: training, enumerating, and simulating. The training phase leverages re-segment to obtain the grammars, i.e., segments and structures, from a training dataset. It is vital because the grammars, which reveal how to regard a password as the composition of several segments, directly impact the guessing efficiency.
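How grammars made of structures and segments lead to password probabilities can be sketched as follows. This is a simplified PCFG-style trainer, not ReSeg-PCFG: it uses class-run segmentation as a stand-in for re-segment, learns all segment classes from the training data, and the function names `parse`, `train`, and `probability` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import groupby


def char_class(ch):
    """Assumed classes: L(etter), D(igit), S(ymbol)."""
    return "L" if ch.isalpha() else "D" if ch.isdigit() else "S"


def parse(password):
    """Return (structure, segments), e.g. ('L4D4', [('L4', 'pass'), ('D4', '1234')])."""
    segments = []
    for cls, run in groupby(password, key=char_class):
        text = "".join(run)
        segments.append((f"{cls}{len(text)}", text))
    structure = "".join(tag for tag, _ in segments)
    return structure, segments


def train(passwords):
    """Count how often each structure and each segment (per tag) occurs."""
    structures = Counter()
    terminals = defaultdict(Counter)
    for pw in passwords:
        structure, segments = parse(pw)
        structures[structure] += 1
        for tag, text in segments:
            terminals[tag][text] += 1
    return structures, terminals


def probability(password, structures, terminals):
    """P(password) = P(structure) times the product of P(segment | tag)."""
    structure, segments = parse(password)
    p = structures[structure] / sum(structures.values())
    for tag, text in segments:
        seen = terminals.get(tag)
        if not seen:
            return 0.0
        p *= seen[text] / sum(seen.values())
    return p


structures, terminals = train(["pass1234", "pass5678", "love1234", "hello!"])
print(probability("love5678", structures, terminals))
```

In this toy run, love5678 never appears in the training set but still receives a non-zero probability, because its structure and both of its segments were observed separately. This recombination of learned segments is what makes the choice of segmentation matter for guessing efficiency.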

In the training phase of ReSeg-PCFG, we first obtain two dictionaries: (1) segment candidates which might follow the component pattern and (2) segment

Why does # segments have significant impacts on password security?

Our empirical study shows that # Segments is a dominant factor of password security because: (1) segments, which are like words and are contained in manually created passwords for easy creation, memory and input, significantly reduce the randomness of the passwords; (2) the proposed ReSeg-PCFG captures the segments, although they could be broken by patterns, like interleaving two words into lots of pseudo segments. Furthermore, although the emergence of password managers can help users create

Impacts of factors on password security

Password security has long been discussed. Researchers conducted a number of studies on the creation of secure passwords and factors impacting password security. In 2006, NIST proposed that password security depends on the password length and the number of character classes (Burr et al., 2006). In 2017, NIST suggested that a password should be long enough (Grassi et al., 2017). Ur et al. (2012) concluded that stringent meters, which usually require longer length and more character classes, lead

Conclusion

Our study shows that # Segments is a dominant factor of password security to resist against data-driven guessing compared with the password length and the number of character classes. That is, the increase of # Segments generally leads to a significant improvement of password security. Furthermore, to evaluate password security more accurately, we propose re-segment, which reduces # Segments as much as possible to obtain accurate # Segments by leveraging the strong correlations in segments from

CRediT authorship contribution statement

Chuanwang Wang: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing. Junjie Zhang: Conceptualization, Validation, Writing – original draft, Writing – review & editing. Ming Xu: Conceptualization, Writing – original draft. Haodong Zhang: Conceptualization, Writing – original draft. Weili Han: Conceptualization, Methodology, Validation, Supervision, Project administration, Funding acquisition, Writing – original draft, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This paper is supported by NSFC (Grant no. U1836207) and STCSM (Grant no. 21511101600). We thank all anonymous reviewers for their insightful comments.

Chuanwang Wang is a graduate student at Fudan University. He received his B.S. degree from Fudan University in 2019. He is currently a member of the Laboratory of Data Analysis and Security. His research interests mainly include password security and system security.

References (43)

  • M. Akinwande et al. Variance inflation factor: as a condition for the inclusion of suppressor variable(s) in regression analysis. Open J. Stat. (2015)
  • A. Baddeley. Human Memory: Theory and Practice, Revised Edition. (1997)
  • J. Bland et al. The logrank test. BMJ (2004)
  • J. Bonneau et al. Towards reliable storage of 56-bit secrets in human memory. USENIX (2014)
  • L. Breiman. Random forests. Mach. Learn. (2001)
  • Burnett, M., 2015. Today i am releasing ten million passwords....
  • Burr, W., Dodson, D., Newton, E., Perlner, R., Polk, W., Gupta, S., Nabbus, E., 2006. NIST special publication...
  • X. de Carné de Carnavalet et al. From very weak to very strong: analyzing password-strength meters. NDSS (2014)
  • A. Das et al. The tangled web of password reuse. NDSS (2014)
  • M. Dell’Amico et al. Monte carlo strength evaluation: fast and reliable password checking. CCS (2015)
  • D. Florêncio et al. An administrator’s guide to internet password research. Large Installation System Administration Conference (2014)
  • Grassi, P., Garcia, M., Fenton, J., 2017. NIST special publication 800-63, revision 3....
  • guidetogrammar, 2020. Abbreviations....
  • W. Han et al. TransPCFG: transferring the grammars from short passwords to guess long passwords effectively. IEEE Trans. Inf. Forensics Secur. (2020)
  • R. Hranický et al. Distributed PCFG password cracking. European Symposium on Research in Computer Security (2020)
  • P. Kelley et al. Guess again (and again and again): measuring password strength by simulating password-cracking algorithms. IEEE Security & Privacy (2012)
  • W. Li et al. Leet usage and its effect on password security. IEEE Trans. Inf. Forensics Secur. (2021)
  • Z. Li et al. A Large-Scale empirical analysis of Chinese web passwords. USENIX (2014)
  • E. Liu et al. Reasoning analytically about password-cracking software. IEEE Security & Privacy (2019)
  • J. Ma et al. A study of probabilistic password models. IEEE Security & Privacy (2014)
  • Matt, W., 2019. Pretty cool fuzzy guesser....
    Junjie Zhang is a graduate student at Fudan University. He received his B.S. degree from Fudan University in 2019. He is currently a member of the Laboratory of Data Analysis and Security. His research interests mainly include password security and system security.

    Ming Xu is a Ph.D. student at Fudan University. She received her B.S. degree from Yunnan University in 2018. She is currently a member of the Laboratory of Data Analytics and Security. Her research interests mainly include password security and system security.

    Haodong Zhang is a graduate student at Fudan University. He received his B.S. degree from Fudan University in 2020. He is currently a member of the Laboratory of Data Analysis and Security. His research interests mainly include password security and system security.

    Weili Han is a full Professor at Software School, Fudan University. He received his Ph.D. at Zhejiang University in 2003. Then, he joined the faculty of Software School at Fudan University. From 2008 to 2009, he visited Purdue University as a visiting professor funded by China Scholarship Council and Purdue University. His research interests are mainly in the fields of Data Systems Security, Access Control, and Password Security. He is a distinguished member of CCF and a member of the IEEE, ACM, and SIGSAC. He serves several leading conferences and journals as a PC member, reviewer, and associate editor.
