Privacy-preserving Kruskal–Wallis test

doi:10.1016/j.cmpb.2013.05.023

Computer Methods and Programs in Biomedicine

Volume 112, Issue 1, October 2013, Pages 135-145

https://doi.org/10.1016/j.cmpb.2013.05.023

Abstract

Statistical tests are powerful tools for data analysis. The Kruskal–Wallis test is a non-parametric statistical test that evaluates whether two or more samples are drawn from the same distribution, and it is commonly used in many areas. However, its use is sometimes impeded by privacy issues in fields such as biomedical research and clinical data analysis, where the data contain confidential information. In this work, we give a privacy-preserving solution for the Kruskal–Wallis test that enables two or more parties to jointly perform the test on the union of their data without compromising their data privacy. To the best of our knowledge, this is the first work that solves the privacy issues in the use of the Kruskal–Wallis test on distributed data.

Introduction

Statistical hypothesis tests are widely used for data analysis. Popular statistical tests include the t-test [1], ANOVA [2], the Kruskal–Wallis test [3], and the Wilcoxon rank sum test [4]. Although these are four different tests, they serve the same goal, which is to find out whether the samples come from the same population. The t-test and ANOVA are parametric tests and assume that the data are normally distributed. The non-parametric equivalents of these two tests are the Wilcoxon rank sum test, which is also known as the Mann-Whitney U test [5], and the Kruskal–Wallis test, respectively. They do not assume the data to be normally distributed. The t-test can only compare two samples, and ANOVA extends it to multiple samples. Similarly, the Kruskal–Wallis test is a generalization of the Wilcoxon rank sum test from two samples to multiple samples.

As stated above, the four tests serve similar purposes under different assumptions. The non-parametric tests perform better when the data are not normally distributed, and are especially suitable when the sample size is small (<25 per sample group) [6]. Although the Kruskal–Wallis test is a helpful tool in many areas, its use is sometimes impeded by privacy concerns due to the confidential information in the data, especially in clinical and biomedical research.

For example, suppose several hospitals conduct a study and measure the INR (International Normalized Ratio) values of their patients, so that each hospital holds a set of INR values. The hospitals want to perform the Kruskal–Wallis test to check whether their values follow the same trend. In this case, the set of INR values of each hospital is treated as a sample. To conduct the Kruskal–Wallis test, all samples must be known, which means that the hospitals have to share their data with each other. The problem is that it might be improper for the hospitals to share their samples because the data contain private information about patients. Currently there is no method for conducting the Kruskal–Wallis test on such distributed data with privacy concerns.

To solve this problem, we propose a privacy-preserving algorithm that allows the Kruskal–Wallis test to be applied to samples distributed across different parties without revealing each party's private information to the others. Due to the similarity among non-parametric tests, our method can also help the design of privacy-preserving solutions for other non-parametric tests. For example, the Wilcoxon rank sum test applies to two samples and the Kruskal–Wallis test applies to two or more samples, and the two tests are essentially the same in the two-sample case [3]. So our algorithm also solves the privacy issue of the Wilcoxon rank sum test to some extent.

The rest of this paper is organized as follows. In Section 2, we present the related work. Section 3 provides the technical preliminaries, including background on the Kruskal–Wallis test and the cryptographic tools we need. We propose the basic algorithm and the complete algorithm in Sections 4 and 5, respectively. The basic algorithm shows the procedure for conducting the Kruskal–Wallis test securely when there are no ties in the data. The complete algorithm builds on the basic algorithm and takes the existence of ties into consideration. In Section 6, we present the experimental results, and finally, Section 7 concludes the paper.

Related work

In recent years, due to the increasing awareness of privacy problems, many data analysis methods have been enhanced to be privacy-preserving, including many popular data mining and machine learning algorithms. Most of these approaches can be divided into two categories. Approaches in the first category protect data privacy with data perturbation techniques, such as randomization [7], [8], rotation [9] and resampling [10]. Since the original data is changed, these approaches usually lose

The Kruskal–Wallis test

We first review the Kruskal–Wallis test in this section. The test as proposed by Kruskal and Wallis [3] evaluates whether two or more samples are from the same distribution. The null hypothesis is that all the samples come from the same distribution.

Suppose we have k samples, each containing a set of values. To perform the Kruskal–Wallis test, we first need to rank all the values together without considering which sample the values belong to, then compute the sum of all the ranks of values within
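The ranking and rank-sum steps described above can be sketched in plaintext as follows. This is only an illustrative sketch of the ordinary (non-private) computation, not the secure protocol. The function name `kruskal_wallis_h` and the sample values are hypothetical, and the sketch assumes the standard tie-free form of the statistic, H = 12/(N(N+1)) · Σ Rᵢ²/nᵢ − 3(N+1), where Rᵢ is the rank sum of sample i, nᵢ its size, and N the total number of values.

```python
def kruskal_wallis_h(samples):
    """Plaintext H statistic for k samples, assuming no tied values."""
    # Pool all values, remembering which sample each one came from.
    pooled = sorted(
        (value, group)
        for group, sample in enumerate(samples)
        for value in sample
    )
    total = len(pooled)

    # This sketch covers only the tie-free case.
    if len({value for value, _ in pooled}) != total:
        raise ValueError("tied values are not handled by this sketch")

    # Rank the pooled values 1..N and add up the ranks per sample.
    rank_sums = [0] * len(samples)
    for rank, (_, group) in enumerate(pooled, start=1):
        rank_sums[group] += rank

    # H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
    weighted = sum(r * r / len(s) for r, s in zip(rank_sums, samples))
    return 12.0 / (total * (total + 1)) * weighted - 3 * (total + 1)


# Hypothetical data: three samples of three values each.
example = [[1.1, 2.3, 3.8], [2.9, 4.5, 6.0], [5.2, 7.7, 8.4]]
h_value = kruskal_wallis_h(example)
```

For the hypothetical data above, the rank sums are 7, 15 and 23, so H = 256/45, roughly 5.69. Under the null hypothesis, H is usually compared against a chi-squared distribution with k − 1 degrees of freedom.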

The basic algorithm of privacy-preserving Kruskal–Wallis test

In this part, we present the basic algorithm for computing the H statistic of the Kruskal–Wallis test securely without considering the existence of ties. The complete algorithm that also deals with ties will be discussed in the next section. To make the presentation clear, we first give the algorithm for performing the test within two parties, then extend it to the multiparty case.

Suppose there are two parties, A and B. Party A has sample S₁ which contains n₁ values, and party B has sample S₂

The complete algorithm of privacy-preserving Kruskal–Wallis test

In this section, we present the privacy-preserving Kruskal–Wallis test that takes ties into account.
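For orientation, the plaintext handling of ties can be sketched as follows. This is not the secure algorithm itself. It assumes the standard treatment of ties in the Kruskal–Wallis test: tied values share the average of the ranks they occupy (midranks), and H is divided by the correction factor 1 − Σ(t³ − t)/(N³ − N), where t runs over the sizes of the groups of tied values. The function name `kruskal_wallis_h_ties` and the data are hypothetical.

```python
from collections import Counter


def kruskal_wallis_h_ties(samples):
    """Plaintext H statistic with midranks and the standard tie correction."""
    pooled = sorted(value for sample in samples for value in sample)
    total = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    counts = Counter(pooled)
    midrank = {}
    next_rank = 1
    for value in sorted(counts):
        t = counts[value]
        midrank[value] = next_rank + (t - 1) / 2.0
        next_rank += t

    # Uncorrected H computed from the midrank sums.
    rank_sums = [sum(midrank[v] for v in sample) for sample in samples]
    weighted = sum(r * r / len(s) for r, s in zip(rank_sums, samples))
    h = 12.0 / (total * (total + 1)) * weighted - 3 * (total + 1)

    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N).
    correction = 1.0 - sum(t ** 3 - t for t in counts.values()) / float(
        total ** 3 - total
    )
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Hypothetical data with one group of three ties and one group of two ties.
tied_example = [[1, 2, 2], [2, 3, 4], [4, 5, 6]]
h_tied = kruskal_wallis_h_ties(tied_example)
```

For the hypothetical data above, the uncorrected statistic is 91/15 and the correction factor is 23/24, giving a corrected H of roughly 6.33. When there are no ties, the correction factor is 1 and the result matches the tie-free formula.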

Experiments

The experimental results are presented in this section. All the algorithms are implemented with the Crypto++ library in the C++ language, and the communications between parties are implemented with the socket API. The experiments are conducted on a Red Hat server with 16 × 2.27 GHz CPUs and 24 GB of memory.

We use the two datasets from [34] to test the accuracy of our algorithms. The first dataset, as shown in Table 2, contains 3 samples with equal sizes. The “sample” in the context of this paper is

Conclusion

In this work, we proposed several algorithms that enable parties to conduct the Kruskal–Wallis test securely without revealing their data to others. We showed the procedure of the algorithms for data both with and without ties. We also presented an algorithm for multiplying two encrypted integers under the additive homomorphic cryptosystem. Our algorithms can be extended to make other non-parametric rank-based statistical tests secure, such as the Friedman test. This is our future
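To illustrate what an additive homomorphic cryptosystem provides, the following toy sketch uses a Paillier-style scheme. The choice of Paillier is an assumption made here for illustration; the text above does not name the cryptosystem. All names and parameters are hypothetical, and the primes are far too small to be secure. The sketch shows that ciphertexts can be added, and multiplied by a known plaintext constant, without decryption. Multiplying two encrypted integers is not available as a local operation in such a scheme, which is presumably why a dedicated algorithm is needed for it.

```python
import math
import random

# Toy key material: tiny primes, for illustration only (insecure).
P, Q = 293, 433
N = P * Q
N_SQ = N * N
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAMBDA, -1, N)  # modular inverse; needs Python 3.8+


def encrypt(m):
    """Encrypt an integer 0 <= m < N using the generator g = N + 1."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (1 + m * N) % N_SQ * pow(r, N, N_SQ) % N_SQ


def decrypt(c):
    """Recover the plaintext from a ciphertext."""
    u = pow(c, LAMBDA, N_SQ)
    return (u - 1) // N * MU % N


def add_encrypted(c1, c2):
    """Ciphertext of (m1 + m2) mod N, computed without decrypting."""
    return c1 * c2 % N_SQ


def scale_encrypted(c, k):
    """Ciphertext of (k * m) mod N for a known plaintext constant k."""
    return pow(c, k, N_SQ)


sum_plain = decrypt(add_encrypted(encrypt(20), encrypt(22)))
scaled_plain = decrypt(scale_encrypted(encrypt(7), 6))
```

In this sketch both `sum_plain` and `scaled_plain` decrypt to 42. Encryption is randomized, so two encryptions of the same value generally produce different ciphertexts.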

Conflict of interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

References (34)

C.M.R. Kitchen. Nonparametric versus parametric tests of location in biomedical research. American Journal of Ophthalmology (2009)
T. Chen et al. Privacy-preserving models for comparing survival curves using the logrank test. Computer Methods and Programs in Biomedicine (2011)
S. Zhong. Privacy-preserving algorithms for distributed mining of frequent itemsets. Information Sciences (2007)
A.C. Elliott et al. A SAS® macro implementation of a multiple comparison post hoc test for a Kruskal–Wallis analysis. Computer Methods and Programs in Biomedicine (2011)
W.H. Press et al. Numerical recipes in C: the art of scientific computing. Transform (1992)
G.E.P. Box. Non-normality and tests on variances. Biometrika (1953)
W.H. Kruskal et al. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association (1952)
F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin (1945)
H.B. Mann et al. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics (1947)
R. Agrawal et al. Privacy-Preserving Data Mining (2000)
Z. Huang et al. Deriving private information from randomized data
K. Chen et al. Privacy preserving data classification with rotation perturbation
G.R. Heer. A bootstrap procedure to preserve statistical confidentiality in contingency tables
Y. Lindell et al. Privacy preserving data mining. Journal of Cryptology (2002)
W. Du et al. Building decision tree classifier on private data. Reproduction (2002)
C. Clifton et al. Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter (2002)
I. Damgard et al. Unconditionally Secure Constant-Rounds Multi-party Computation for Equality, Comparison, Bits and Exponentiation, vol. 3876 (2006)
