A Bad Day to Die Hard: Correcting the Dieharder Battery


Abstract

We analyze the Dieharder statistical randomness tests with respect to the accuracy and correct interpretation of their results. We used all tests, processed 8 TB of quantum-generated data, and obtained null distributions of first-level and second-level p-values. We inspected whether the p-values are uniformly distributed. The analysis showed that more than half (out of 110) of the Dieharder atomic tests (tests with a particular setting) produce null distributions of p-values that are biased from the expected uniform one. Additional analysis of the Kolmogorov–Smirnov (KS) test showed that the key KS test is also biased. This increases the probability of false positives (in the right tail) for all Dieharder tests, as KS is used to post-process their results. Moreover, 12 tests (22 atomic tests) produce results significantly biased from the null distribution of the KS test, which may suggest problems with the implementation of these tests.


Notes

  1. The rest of the tests can be found in Table 4 in the Appendix.

References

  1. M. Sýs, P. Švenda, M. Ukrop, V. Matyáš, Constructing empirical tests of randomness, in Proceedings of the 11th International Conference on Security and Cryptography (SECRYPT), pp. 1–9, IEEE (2014).

  2. National Institute of Standards and Technology, FIPS 140-2: Security Requirements for Cryptographic Modules, Federal Information Processing Standards Publications (2001).

  3. ISO/IEC 15408-1:2009, Information technology - Security techniques - Evaluation criteria for IT security, Part 1: Introduction and general model (2009). https://www.iso.org/standard/50341.html

  4. L. E. Bassham, A. L. Rukhin, J. Soto, J. R. Nechvatal, M. E. Smid, E. B. Barker, S. D. Leigh, M. Levenson, M. Vangel, D. L. Banks, N. A. Heckert, J. F. Dray, S. Vo, SP 800-22 Rev. 1a. A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications, National Institute of Standards and Technology, Gaithersburg, MD, USA, Tech. Rep. (2010).

  5. R. G. Brown, Dieharder: A random number test suite, version 3.31.1, Duke University Physics Department (2014). http://www.phy.duke.edu/~rgb/General/dieharder.php.

  6. G. Marsaglia, Diehard Battery of Tests of Randomness, Florida State University (1995). https://web.archive.org/web/20120102192622/www.stat.fsu.edu/pub/diehard.

  7. P. L’Ecuyer, R. Simard, TestU01: A C library for empirical testing of random number generators, ACM Transactions on Mathematical Software, vol. 33, no. 4, pp. 1–40, ACM (2007).

  8. C. Kao, H. C. Tang, Several extensively tested multiple recursive random number generators, Computers and Mathematics with Applications, vol. 36, no. 6, pp. 129–136 (1998). http://www.sciencedirect.com/science/article/pii/S0898122198001667.

  9. P. Leopardi, Testing the Tests: Using Random Number Generators to Improve Empirical Tests. Monte Carlo and Quasi-Monte Carlo Methods, pp. 501–512, Springer (2008).

  10. Nano-Optics group and PicoQuant GmbH, High bit rate quantum random number generator service. Humboldt University of Berlin (2010). http://qrng.physik.hu-berlin.de/.

  11. M. Wahl, M. Leifgen, M. Berlin, T. Röhlicke, H. J. Rahn, O. Benson, An ultrafast quantum random number generator with provably bounded output bias based on photon arrival time measurements, Applied Physics Letters, vol. 98, no. 17, p. 171105, American Institute of Physics (2011).

  12. Czech National Grid Organization, Metacentrum, online. https://metavo.metacentrum.cz/.

  13. F. Yates, Contingency tables involving small numbers and the \(\chi ^2\) test, Supplement to the Journal of the Royal Statistical Society, vol. 1, no. 2, pp. 217–235, JSTOR (1934).

  14. Dieharder - Linux man page, online (2020). https://linux.die.net/man/1/dieharder.

  15. S. J. Kim, K. Umeno, A. Hasegawa, Corrections of the NIST statistical test suite for randomness, Incorporated Administrative Agency, Tokyo, Japan (2004).

  16. L. Obratil, The automated testing of randomness with multiple statistical batteries, Master’s thesis, Masaryk University, Brno, Czechia (2017). https://is.muni.cz/th/uepbs/.

  17. D. J. Sheskin, Handbook of parametric and nonparametric statistical procedures, 3rd ed. Chapman and Hall/CRC (2003).


Acknowledgements

This project has been made possible in part by a grant from the Cisco University Research Program Fund, an advised fund of Silicon Valley Community Foundation. Marek Sýs, Vashek Matyáš, and Dušan Klinec were supported by the Czech Science Foundation project GA20-03426S. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA CZ ID:90140) supported by the Ministry of Education, Youth and Sports of the Czech Republic.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Marek Sýs.

Additional information

Communicated by François-Xavier Standaert

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Testing Procedure

1.1 A.1 Single Test

Empirical tests of randomness are based on hypothesis testing. Tests examine the null hypothesis \({\mathcal {H}}_0\) that the output sequence of the RNG consists of (or imitates) independent uniform random variables, usually over the interval [0, 1] or the binary set \(\{0,1\}.\) Each statistical test T is defined by a specific test statistic Y, a real-valued function of the values generated by the RNG. The test computes the statistic \(y= Y(s_0,\cdots ,s_n)\) for the analyzed sequence \(s_0,\cdots ,s_n\) and evaluates how extreme the observed test statistic y would be for a good RNG. The result of an empirical test of randomness is typically a p-value, defined as:

$$\begin{aligned} p = P[Y \ge y | {\mathcal {H}}_0]. \end{aligned}$$

The exact p-value can be computed using the distribution F of the test statistic Y under \({\mathcal {H}}_0.\) However, F is typically a complex function, and computing the p-value directly from it is quite inefficient. Hence, the p-value is computed using a continuous function \({\tilde{F}}\) that approximates the (typically discrete) distribution F.
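As a minimal sketch of a first-level test computing such a p-value, consider the classic monobit (frequency) test, where the exact discrete distribution F of the statistic is replaced by its continuous half-normal approximation \({\tilde{F}}\). The function name monobit_p_value is illustrative and not part of Dieharder:

```python
import math
import random

def monobit_p_value(bits):
    """p = P[Y >= y | H0], with the exact F replaced by a half-normal tail."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)  # difference between ones and zeros
    y = abs(s) / math.sqrt(n)              # observed test statistic
    return math.erfc(y / math.sqrt(2))     # tail probability under F~

# For a good RNG the resulting p-values are close to uniform on (0, 1].
random.seed(1)
bits = [random.getrandbits(1) for _ in range(100_000)]
print(monobit_p_value(bits))
```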

The p-value represents the probability that a good RNG would generate a sequence more extreme (with a bigger y) with respect to the analyzed pattern (e.g., with a bigger difference between the numbers of zeros and ones) than the sequence being tested. An extremely small p-value means that the analyzed sequence is too extreme to have been generated by a good RNG; in such cases the hypothesis \({\mathcal {H}}_0\) is rejected, and we say the RNG fails the test. If the p-value is small but does not clearly indicate that the RNG fails the test, new sequences should be generated by the RNG and tested.

1.2 A.2 Two-Level Testing with a Battery

To increase confidence in the result of a single test, batteries replicate each “first-level” test on disjoint sequences generated by the RNG being analyzed. The idea of two-level testing is that, for a continuous distribution F of a test statistic Y, the p-values are uniformly distributed over the interval (0, 1] under \({\mathcal {H}}_0\). This, in fact, forms a second hypothesis \({\mathcal {H}}_1\): “p-values computed by the test are uniformly distributed over the interval (0, 1]”. The hypothesis \({\mathcal {H}}_1\) is tested statistically by the “second-level” test in the batteries.

Two-level testing is performed as follows:

  1. First-level: a test of the battery is applied to N disjoint sequences, computing a set \(P=\{p_1, \ldots , p_N\}\) of N independent p-values.

  2. Second-level: the uniformity of the first-level p-values is tested. Batteries use two ways to test the uniformity of the p-values in P: Dieharder and TestU01 use goodness-of-fit tests such as Kolmogorov–Smirnov, Anderson–Darling, Cramér–von Mises, etc., which compare the empirical distribution of the first-level p-values to the expected uniform distribution. NIST STS uses the \(\chi ^2\) test for categorical data, which tests whether the observed frequencies fit the expected frequencies in one or more categories. In the context of uniformity testing, the frequencies of p-values falling into one or more subintervals of (0, 1] are computed and compared with the expected frequencies. The result of a uniformity test is again a p-value (a p-value of p-values), the second-level p-value; see the sketch after Fig. 4.

Fig. 4: Illustration of two-level testing. The second-level p-value is computed here using the Kolmogorov–Smirnov (KS) test.
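A minimal sketch of the two-level procedure of Fig. 4, re-using the monobit test from the previous sketch as the first-level test (redefined here so the example is self-contained) and SciPy’s Kolmogorov–Smirnov test for the second level; the parameters (N = 100, sequence length) are illustrative:

```python
import math
import random
from scipy.stats import kstest

def monobit_p_value(bits):
    # First-level p-value via the normal approximation (see A.1).
    s = sum(1 if b else -1 for b in bits)
    return math.erfc(abs(s) / math.sqrt(2 * len(bits)))

# First level: N independent p-values from disjoint sequences.
random.seed(2)
N, seq_len = 100, 100_000
first_level = [monobit_p_value([random.getrandbits(1) for _ in range(seq_len)])
               for _ in range(N)]

# Second level: KS goodness-of-fit of the p-values against the uniform distribution.
result = kstest(first_level, 'uniform')
print(result.pvalue)  # the second-level p-value
```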

1.3 A.3 Interpretation of Battery Results

Assessment of the RNG is based on a set of second-level p-values computed by all tests of the battery for multiple sequences generated by the RNG. Two scenarios can be considered when testing an RNG:

  1. The RNG is evaluated based only on second-level p-values computed in a single run of the battery.

  2. New sequences are generated by the RNG and tested in order to confirm suspicious results.

Both scenarios are equivalent with respect to the interpretation of results in the situations when the RNG clearly fails some test (the computed p-value is extremely small, e.g., less than \(10^{-10}\)) or when the RNG passes all tests (e.g., all p-values are larger than 0.01). The scenarios differ when a suspect (“weak fail”) p-value (e.g., 0.001) is computed. In the second scenario, a failed test is replicated with new sequences from the same generator until either the failure is confirmed or the suspicion disappears. We will focus on the first scenario, where we have to evaluate the RNG based on a number of “weak failures” and/or the structure of small p-values. The correct interpretation of this common situation is, in general, a hard task. Two things need to be taken into account when interpreting the results of multiple tests:

  • uniformity tests often compute an inaccurate second-level p-value (smaller than the correct one),

  • results (first-level and second-level p-values) of the tests of the battery can be correlated.

Except for the case when a “weak” sequence is generated by chance, the following situations, or combinations of them, can cause a small (but not extremely small) second-level p-value to be computed incorrectly:

  1. First-level p-values are not accurate: this happens when the function \({\tilde{F}}\) is not a good approximation of the exact distribution F of the test statistic,

  2. the uniformity test computes an inaccurate second-level p-value,

  3. the hypothesis \({\mathcal {H}}_1\) (first-level p-values are uniform for random data) is not true. In fact, \({\mathcal {H}}_1\) is never strictly true: the test statistic Y is a discrete function (for a fixed sequence size), which implies that F is also a discrete function on (0, 1] and hence differs from the continuous uniform distribution. This failure is revealed only when the number of possible outcomes of the test statistic Y (for a given sequence size) is relatively small compared to the number N of first-level p-values used to compute the second-level p-value, as the sketch after this list illustrates.
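The following simulation sketch (an illustration under stated assumptions, not an experiment from the paper) demonstrates point 3: a monobit-style statistic on short sequences of n = 100 bits has only n + 1 possible outcomes, so its exactly computed p-values are discrete, and the KS test reveals the resulting non-uniformity once N grows large relative to the number of outcomes, even for a perfect generator:

```python
import random
from math import comb
from scipy.stats import kstest

n = 100  # short sequences: the count of ones has only n + 1 possible outcomes
# Exact right-tail p-values P[Y >= k] for the Binomial(n, 1/2) count of ones.
tail = [sum(comb(n, j) for j in range(k, n + 1)) / 2**n for k in range(n + 1)]

random.seed(3)
for N in (100, 100_000):
    pvals = [tail[bin(random.getrandbits(n)).count("1")] for _ in range(N)]
    print(N, kstest(pvals, 'uniform').pvalue)
# With N = 100 the discreteness typically goes unnoticed; with N = 100000 the
# KS test firmly rejects uniformity although the generator itself is fine.
```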

The interpretation of a single test’s result (or of its results for disjoint sequences) is easy and statistically clear. The interpretation of multiple tests may be problematic when the results of the tests are correlated. This happens when tests analyze the same data and look for similar patterns. In order to interpret multiple tests correctly, the correlation between the test results for a good RNG needs to be known.
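As a toy illustration of such correlation (a hypothetical sketch, not an experiment from the paper), consider two “tests” that share data: the monobit statistic on a full sequence and on its first half. Their p-values are clearly correlated even for a good RNG, so treating the two results as independent would be misleading:

```python
import math
import random
from scipy.stats import pearsonr

def monobit_p_value(bits):
    s = sum(1 if b else -1 for b in bits)
    return math.erfc(abs(s) / math.sqrt(2 * len(bits)))

random.seed(4)
full_p, half_p = [], []
for _ in range(2000):
    bits = [random.getrandbits(1) for _ in range(4096)]
    full_p.append(monobit_p_value(bits))         # test on the whole sequence
    half_p.append(monobit_p_value(bits[:2048]))  # test on overlapping data

r, _ = pearsonr(full_p, half_p)
print(r)  # clearly positive: the two results are not independent
```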

B Dieharder

Dieharder is a tool developed by Robert G. Brown that contains a set of statistical tests designed to evaluate the randomness of a source based on its output. While its name may lead people to believe it is just a reimplementation of the older Diehard [6], this is not entirely true; it is more accurate to say that Diehard is a subset of the more complete and faster Dieharder. In total, Dieharder contains 31 statistical tests: 18 of them are reimplemented tests from Diehard, three are tests from the NIST STS battery, and the remaining ten are original tests designed by the authors. The tests OPSO, OQSO, DNA, and Sums are flagged as suspect and should not be used in the analysis. Some tests are executed with different parameters, so 110 “atomic” and correct tests can be used from the Dieharder battery.

The full execution of Dieharder consists of running all tests in their default settings on the given data. For each test, one or more p-values are provided. If more p-values are provided for a single test, the test has several variants. Test variants differ only in certain parameters, but they measure a similar property. An example of such a test is the STS Serial Test. This test measures the uniformity of n-bit patterns in the sequence, and when executed in the default settings, it is run multiple times, each time with a different n.

Along with each p-value, Dieharder provides an evaluation of this p-value. If the p-value is smaller than \(10^{-6}\), Dieharder returns FAILED and the data are interpreted as non-random. For p-values below 0.005, Dieharder returns WEAK, which suggests a potential problem with the data/RNG. For the remaining p-values (above 0.005), Dieharder returns PASSED, which is interpreted as random data. A sketch of this verdict logic follows below.
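A minimal sketch of the verdict logic described above (the thresholds come from the text; the function itself is illustrative, not Dieharder’s source):

```python
def dieharder_verdict(p):
    """Map a p-value to Dieharder's assessment, per the thresholds above."""
    if p < 1e-6:
        return "FAILED"   # data interpreted as non-random
    if p < 0.005:
        return "WEAK"     # potential problem with the data/RNG
    return "PASSED"       # data interpreted as random

for p in (1e-8, 0.001, 0.3):
    print(p, dieharder_verdict(p))
```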

Each Dieharder test is executed multiple times in sequence, and multiple first-level p-values are computed. The first-level p-values are then post-processed by the Kolmogorov–Smirnov test for uniformity [17], from which the resulting second-level p-value is obtained (see Fig. 4). Tests are mostly repeated 100 times (i.e., \(N=100\)) in the default run, but this number can vary by test.

C Other Results

Table 4 Dieharder tests with uniform distribution of first-level p-values
Table 5 Uniformity of selected Dieharder tests (those with non-uniform results for the quantum generator [10], listed in Table 1) applied to random data produced by AES in counter mode
Table 6 Uniformity of NIST tests applied to random data produced by the quantum generator [10]
Table 7 Uniformity of NIST tests applied to data produced by AES in counter mode



Cite this article

Sýs, M., Obrátil, L., Matyáš, V. et al. A Bad Day to Die Hard: Correcting the Dieharder Battery. J Cryptol 35, 3 (2022). https://doi.org/10.1007/s00145-021-09414-y
