Privacy-preserving Kruskal–Wallis test
Introduction
Statistical hypothesis tests are widely used in data analysis. Popular tests include the t-test [1], ANOVA [2], the Kruskal–Wallis test [3], and the Wilcoxon rank sum test [4]. Although these four tests differ, they serve the same goal: determining whether the samples come from the same population. The t-test and ANOVA are parametric tests that assume normally distributed data. Their non-parametric equivalents are the Wilcoxon rank sum test, also known as the Mann–Whitney U test [5], and the Kruskal–Wallis test, respectively; neither assumes that the data are normally distributed. The t-test can only compare two samples, and ANOVA extends it to multiple samples. Similarly, the Kruskal–Wallis test generalizes the Wilcoxon rank sum test from two samples to multiple samples.
As stated above, the four tests do similar things under different assumptions. The non-parametric tests perform better when the data are not normally distributed, and they are especially suitable when the sample size is small (<25 per sample group) [6]. Although the Kruskal–Wallis test is a helpful tool in many areas, its use is sometimes impeded by privacy concerns over confidential information in the data, especially in clinical and biomedical research.
For example, suppose several hospitals conduct a study and measure the INR (International Normalized Ratio) values of their patients, so that each hospital holds a set of INR values. The hospitals want to perform the Kruskal–Wallis test to check whether their values follow the same trend. In this case, the set of INR values held by each hospital is treated as a sample. Conducting the Kruskal–Wallis test requires all samples to be known, which means the hospitals have to share their data with each other. The problem is that sharing the samples may be improper because the data contain private patient information. Currently, no method enables the Kruskal–Wallis test to be conducted on such distributed data under privacy constraints.
To solve this problem, we propose a privacy-preserving algorithm that allows the Kruskal–Wallis test to be applied to samples distributed among different parties without revealing each party's private information to the others. Because non-parametric tests are similar to one another, our method can also help in designing privacy-preserving solutions for other non-parametric tests. For example, the Wilcoxon rank sum test applies to two samples and the Kruskal–Wallis test applies to two or more samples, and the two tests are essentially the same in the two-sample case [3]. Our algorithm therefore also addresses the privacy issue of the Wilcoxon rank sum test to some extent.
The rest of this paper is organized as follows. In Section 2, we present the related work. Section 3 provides the technical preliminaries, including background on the Kruskal–Wallis test and the cryptographic tools we need. We propose the basic algorithm and the complete algorithm in Sections 4 and 5, respectively. The basic algorithm shows how to conduct the Kruskal–Wallis test securely when there are no ties in the data. The complete algorithm builds on the basic algorithm and takes the existence of ties into consideration. In Section 6, we present the experimental results, and Section 7 concludes the paper.
Related work
In recent years, growing awareness of privacy problems has led to privacy-preserving versions of many data analysis methods, including many popular data mining and machine learning algorithms. Most of these approaches fall into two categories. Approaches in the first category protect data privacy with data perturbation techniques, such as randomization [7], [8], rotation [9], and resampling [10]. Since the original data is changed, these approaches usually lose
The Kruskal–Wallis test
We first review the Kruskal–Wallis test in this section. The test as proposed by Kruskal and Wallis [3] evaluates whether two or more samples are from the same distribution. The null hypothesis is that all the samples come from the same distribution.
Suppose we have k samples, each containing a set of values. To perform the Kruskal–Wallis test, we first rank all the values together without considering which sample each value belongs to, then compute the sum of all the ranks of values within
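The ranking step described above can be sketched in plain (non-private) Python. This is a minimal reference computation and not the paper's secure protocol. It assumes the standard tie-free form of the statistic, H = 12 / (N(N+1)) · Σ R_i² / n_i − 3(N+1), where N is the total number of values, n_i is the size of sample i, and R_i is its rank sum. The function name `kruskal_wallis_h` is hypothetical.

```python
def kruskal_wallis_h(samples):
    """Kruskal-Wallis H statistic for samples with no tied values.

    `samples` is a list of lists of numbers, one inner list per sample.
    """
    # Pool all values, remembering which sample each one came from.
    pooled = sorted((value, idx) for idx, s in enumerate(samples) for value in s)
    values = [value for value, _ in pooled]
    if len(set(values)) != len(values):
        raise ValueError("tied values present; a tie correction is required")

    # Rank the pooled values 1..N and sum the ranks per sample.
    rank_sums = [0] * len(samples)
    for rank, (_, idx) in enumerate(pooled, start=1):
        rank_sums[idx] += rank

    n_total = len(pooled)
    weighted = sum(r * r / len(s) for r, s in zip(rank_sums, samples))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


if __name__ == "__main__":
    # Three fully separated samples: rank sums are 6, 15 and 24.
    print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Under the null hypothesis, H is usually compared against a chi-squared distribution with k − 1 degrees of freedom.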
The basic algorithm of privacy-preserving Kruskal–Wallis test
In this part, we present the basic algorithm for securely computing the H statistic of the Kruskal–Wallis test without considering the existence of ties. The complete algorithm, which also deals with ties, is discussed in the next section. To make the presentation clear, we first give the algorithm for performing the test between two parties, then extend it to the multiparty case.
Suppose there are two parties, A and B. Party A has sample S1 which contains n1 values, and party B has sample S2
The complete algorithm of privacy-preserving Kruskal–Wallis test
In this section, we present the privacy-preserving Kruskal–Wallis test that takes ties into account.
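For reference, the effect of ties on the statistic can be sketched in plain (non-private) Python. This is the textbook tie handling, not the paper's secure procedure: tied values receive the average of the ranks they occupy (midranks), and H is divided by C = 1 − Σ (t³ − t) / (N³ − N), where t runs over the sizes of the groups of tied values. The function name `kruskal_wallis_h_ties` is hypothetical.

```python
from collections import Counter


def kruskal_wallis_h_ties(samples):
    """Tie-corrected Kruskal-Wallis H statistic (midranks plus correction)."""
    pooled = [value for s in samples for value in s]
    n_total = len(pooled)
    counts = Counter(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    midrank = {}
    next_rank = 1
    for value in sorted(counts):
        t = counts[value]
        midrank[value] = next_rank + (t - 1) / 2.0
        next_rank += t

    # Uncorrected H computed from the midrank sums of each sample.
    weighted = sum(sum(midrank[v] for v in s) ** 2 / len(s) for s in samples)
    h = 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

    # Correction factor; it equals 1 when there are no ties.
    tie_term = sum(t ** 3 - t for t in counts.values())
    correction = 1.0 - tie_term / float(n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


if __name__ == "__main__":
    print(kruskal_wallis_h_ties([[1, 2, 2], [3, 4, 4]]))
```

When no values are tied, the correction factor is 1 and the result reduces to the uncorrected statistic.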
Experiments
The experimental results are presented in this section. All the algorithms are implemented in C++ with the Crypto++ library, and the communication between parties is implemented with the socket API. The experiments are conducted on a Red Hat server with 16 × 2.27 GHz CPUs and 24 GB of memory.
We use the two datasets from [34] to test the accuracy of our algorithms. The first dataset, as shown in Table 2, contains 3 samples with equal sizes. The “sample” in the context of this paper is
Conclusion
In this work, we proposed several algorithms that enable parties to conduct the Kruskal–Wallis test securely without revealing their data to others. We showed the procedure of the algorithms for data both with and without ties. We also presented an algorithm for multiplying two encrypted integers under the additive homomorphic cryptosystem. Our algorithms can be extended to make other non-parametric rank-based statistical tests secure, such as the Friedman test. This is our future
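To illustrate what "additive homomorphic" means here, the following is a toy sketch of the Paillier cryptosystem in Python. Paillier is an assumption chosen for illustration; the text above does not name the cryptosystem used. The primes are tiny and insecure, and all function names are hypothetical. The sketch shows only the two operations such a scheme offers directly: adding two encrypted integers and multiplying an encrypted integer by a plaintext constant. Multiplying two encrypted integers is not directly available, which is why a separate protocol is needed; that protocol is not reproduced here.

```python
import math
import secrets

# Toy parameters for illustration only; real keys use much larger primes.
P, Q = 293, 433
N = P * Q
N_SQ = N * N
G = N + 1                                              # common generator choice
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # lcm(P-1, Q-1)
MU = pow(LAMBDA, -1, N)                                # inverse of LAMBDA mod N


def encrypt(m):
    """Encrypt an integer 0 <= m < N with fresh randomness."""
    while True:
        r = secrets.randbelow(N - 1) + 1               # r in 1..N-1
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(c):
    """Recover the plaintext from a ciphertext."""
    u = pow(c, LAMBDA, N_SQ)
    return ((u - 1) // N * MU) % N


def add_encrypted(c1, c2):
    """Ciphertext of m1 + m2 (mod N), computed without decrypting."""
    return (c1 * c2) % N_SQ


def mul_by_plain(c, k):
    """Ciphertext of k * m (mod N) for a plaintext constant k."""
    return pow(c, k, N_SQ)


if __name__ == "__main__":
    print(decrypt(add_encrypted(encrypt(17), encrypt(25))))
    print(decrypt(mul_by_plain(encrypt(7), 6)))
```

Because encryption is randomized, two encryptions of the same value generally differ, yet both decrypt to the same plaintext.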
Conflict of interest
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
References (34)
Nonparametric versus parametric tests of location in biomedical research. American Journal of Ophthalmology (2009)
Privacy-preserving models for comparing survival curves using the logrank test. Computer Methods and Programs in Biomedicine (2011)
Privacy-preserving algorithms for distributed mining of frequent itemsets. Information Sciences (2007)
A SAS® macro implementation of a multiple comparison post hoc test for a Kruskal–Wallis analysis. Computer Methods and Programs in Biomedicine (2011)
Numerical recipes in C: the art of scientific computing. Transform (1992)
Non-normality and tests on variances. Biometrika (1953)
Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association (1952)
Individual comparisons by ranking methods. Biometrics Bulletin (1945)
On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics (1947)
Privacy-Preserving Data Mining (2000)
Deriving private information from randomized data
Privacy preserving data classification with rotation perturbation
A bootstrap procedure to preserve statistical confidentiality in contingency tables
Privacy preserving data mining. Journal of Cryptology
Building decision tree classifier on private data. Reproduction
Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter
Unconditionally Secure Constant-Rounds Multi-party Computation for Equality, Comparison, Bits and Exponentiation, vol. 3876