Information Sciences, Volume 560, June 2021, Pages 347-369

Correlated tuple data release via differential privacy

https://doi.org/10.1016/j.ins.2021.01.058

Abstract

Privacy-preserving methods for tuple data release have attracted the attention of researchers in multidisciplinary fields. Among the advanced methods, differential privacy (DP), which introduces independent Laplace noise, has become an influential privacy mechanism owing to its provable and rigorous privacy guarantee. Nonetheless, in practice, the tuple data to be protected are often correlated, and independent noise may cause more information disclosure than expected. Recent studies attempt to optimize the sensitivity function of DP in light of the correlation strength between data, but suffer from a substantial growth in noise level. Therefore, for correlated tuple data release, how to decrease the noise level incurred by correlation strength remains to be explored. To remedy this problem, this paper examines the degradation of the expected privacy level of DP on correlated tuple data and proposes a solution to mitigate it. We first demonstrate a filtering attack, showing that the differing dependence structures of the original outputs and the perturbations can be exploited to remove part of the noise and extract an individual’s sensitive information. Secondly, we introduce the notion of correlated tuple differential privacy (CTDP) to preserve the expected privacy for correlated tuple data and further propose a generalized Laplace mechanism (GLM) to achieve the privacy guarantee of CTDP. Then we design a practical iteration mechanism, including an update function, to conduct GLM when facing large-scale queries. Finally, experimental evaluation on real-world datasets over multiple fields shows that our solution consistently outperforms state-of-the-art mechanisms in data utility while providing the same privacy guarantee as other approaches for correlated tuple data.

Introduction

In data-driven applications such as location-based services (LBSs), disease surveillance and social networks, information sharing is necessary for data owners to obtain better services. For example, in location-based applications, uploading one’s precise position to the service provider can yield better shopping recommendations and route-planning services. In disease surveillance, sharing one’s personal physical condition can help prevent the outbreak of some diseases.

As the above examples suggest, information sharing has outstanding benefits for knowledge discovery and acquisition. However, the shared data may contain individuals’ sensitive information (e.g., home address, health condition). Publishing such data without treatment may disclose individuals’ privacy, and data owners may therefore be unwilling to publish their true data values. Privacy-preserving data release (PPDR) has consequently become a substantial issue in data sharing and mining [1], [2].

Early privacy-preserving schemes inherently rely on the security guarantee of their designed algorithms, whose security is difficult to prove and analyze theoretically. To remedy this problem, DP, proposed by Dwork [3], [4], rests on a solid mathematical foundation and places no restrictions on the adversary’s background knowledge. It is a privacy protection method that strictly quantifies both the strength of protection and the data utility. Because DP provides a complete theoretical privacy guarantee together with good data availability, it has become a significant PPDR framework in recent years.
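For reference, the ε-differential privacy guarantee and the Laplace mechanism discussed throughout the paper take the following standard form (a textbook statement; the symbols M, f and Δf are our shorthand, not necessarily the paper’s notation):

```latex
% \epsilon-DP: for all neighboring datasets D, D' and every output set S,
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon}\,\Pr[\mathcal{M}(D') \in S].

% Laplace mechanism: with global sensitivity
% \Delta f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1,
% the randomized answer
\mathcal{M}(D) \;=\; f(D) + \mathrm{Lap}\!\left(\tfrac{\Delta f}{\epsilon}\right)
% satisfies \epsilon-DP when the noise coordinates are drawn independently.
```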

The DP mechanism assumes that the records in the dataset are independent of each other and adds independent and identically distributed (IID) Laplace noise to the query results to guarantee its rigorous definition. However, this assumption breaks down for correlated tuple records: part of the IID Laplace noise can be sanitized from the query results, so the expected privacy degree is not achieved. For example, the relationships among members of the same family may provide helpful background knowledge for violating users’ private information [5], as illustrated in Fig. 1 and Example 1.

Example 1

Consider a database containing six people together with their inspection results. An adversary could ascertain their relationships by using a web crawler. The database publishes users’ data to a third party via differential privacy to protect the 6th person’s health information. Given query results such as “the number of persons uninfected with flu”, an adversary may attempt to infer the health condition of the 6th person. He/she can launch an attack by asking “how many persons in this database are uninfected among the first five, and among all six, respectively?”. To protect the 6th person’s health information, a small amount of IID Laplace noise is added to the true answers, and noisy values around the true answer 4 (e.g., 3 or 5) are returned. However, if the attacker knows that users 2, 4, 5 and 6 are members of the same family, then the attacker learns that they share the same health condition and can infer that the 6th person is uninfected.
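The following is a minimal sketch (not from the paper) of how such counting queries would be answered under the standard Laplace mechanism. The inspection values and the epsilon are hypothetical, chosen only so that the true answers match Example 1; the point is that noise calibrated to sensitivity 1 cannot hide what the family correlation already reveals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inspection results: 1 = uninfected, 0 = infected.
# Users 2, 4, 5 and 6 (1-indexed) belong to the same family.
records = np.array([0, 1, 1, 1, 1, 1])

def laplace_count(true_count, epsilon):
    """Standard Laplace mechanism for a counting query (sensitivity 1)."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

epsilon = 0.5
noisy_first_five = laplace_count(records[:5].sum(), epsilon)  # true answer 4
noisy_all_six    = laplace_count(records.sum(), epsilon)      # true answer 5

# An attacker who knows the family relationship does not even need the
# difference of the two answers: a noisy count near 4 among the first five
# already implies that family members 2, 4 and 5 are uninfected, and hence
# that the correlated 6th person is uninfected as well.
print(noisy_first_five, noisy_all_six)
```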

The above example illustrates that the IID Laplace noise of standard DP cannot achieve the expected privacy guarantee in correlated tuple data release: a user’s sensitive information can still be inferred from the relationships among users. The goal of this paper is to allow third parties (who may be untrustworthy) to analyze and mine useful information from published records, while preserving the privacy of the tuple data collected from individuals.

Current approaches attempt to solve this problem along two natural directions: one category is based on correlation models, and the other redefines the sensitivity function of DP with consideration of correlation levels. Approaches in the first category establish models to describe the relations among the correlated tuple data, e.g., a Gaussian model that reflects the tuple data correlations, or a Markov [6] or Bayesian [7], [8] model that represents the transition probability distribution of the publishing process. After establishing a model, these mechanisms generate noise that conforms to the model and add it to the query result. In the second category, the number of correlated tuples or a correlation coefficient matrix is used to describe the correlation level, which serves as a weight in a redefined sensitivity function that determines the noise level added to the original outputs.
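As a rough illustration of the second category (a hedged sketch, not any specific published scheme), a correlation-weighted sensitivity for a counting query might be computed as follows; the correlation matrix and the weighting rule are hypothetical.

```python
import numpy as np

def correlated_sensitivity(corr, per_record_sensitivity=1.0):
    """Correlation-weighted sensitivity: a record's change is assumed to
    propagate to other records in proportion to |correlation|, so the
    worst-case row sum of |corr| scales the per-record sensitivity.
    (Illustrative formulation only.)"""
    return per_record_sensitivity * np.abs(corr).sum(axis=1).max()

# Hypothetical 6-record correlation matrix: records 2, 4, 5, 6 (1-indexed)
# are fully correlated family members, the rest are independent.
corr = np.eye(6)
family = [1, 3, 4, 5]            # 0-indexed positions of the family members
for i in family:
    for j in family:
        corr[i, j] = 1.0

epsilon = 0.5
independent_scale = 1.0 / epsilon                       # standard Laplace scale
correlated_scale  = correlated_sensitivity(corr) / epsilon
print(independent_scale, correlated_scale)              # 2.0 vs. 8.0: 4x more noise
```

The stronger the correlation, the larger the weighted sensitivity, and the noise scale grows with it, which is exactly the utility problem described below.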

Although various solutions for the differentially private release of correlated tuple data have been proposed, current schemes are still afflicted with the following challenges:

  • Rigorous restriction: Model-based methods assume that the dependence of tuple data conforms to a specific model, e.g., a Gaussian or Markov model. However, this is a strong assumption in real-world applications. In reality, tuple data take diverse forms that can rarely be described by a single model. Noise generated from a model that does not match the true data structure not only limits the scope of use but may also reveal privacy.

  • Low-level utility: Resizing the sensitivity function of DP to account for tuple data correlations must be done carefully, since a tiny change in correlation may have a serious effect on the noise: the stronger the correlations of the tuple data, the larger the correlated sensitivity function becomes compared with that of independent records (as the sketch above suggests). A larger sensitivity function directly increases the noise level and destroys data utility.

These challenges imply that a novel mechanism for the differentially private release of correlated tuple data is in high demand. With respect to the first challenge, to lift the rigorous model restriction, we attempt to devise a generalized mechanism applicable to any dependence form of tuple data. To this end, we propose a generalized Laplace mechanism that can handle arbitrarily correlated tuple data while ensuring differential privacy. For the second challenge, we observe that whether a perturbation preserves privacy depends on (i) how much perturbation is added, and (ii) how the perturbation is added. State-of-the-art schemes mainly improve their algorithms from the first aspect and introduce extra noise to offset the privacy leakage caused by IID Laplace noise. Based on this observation, we address the challenge from the second aspect and explore how the perturbation should be added. Our idea is that if the correlation of the perturbation is “similar” to that of the query results, an adversary cannot sanitize the perturbation and the privacy leakage is alleviated. Therefore, there is no need to generate extra noise to compensate for the sanitized part.
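To see why matching the noise correlation to the output correlation matters, the toy experiment below (not from the paper; the signal, filter, and noise-mixing rule are all hypothetical) compares how much noise a simple moving-average filter can strip from a smooth query sequence when the noise is IID versus when it shares the sequence’s temporal correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
t = np.arange(n)

# A smooth (hence strongly time-correlated) sequence of true query answers.
true = 50 + 10 * np.sin(2 * np.pi * t / 500)

def attack_error(noise, window=25):
    """MSE between the true answers and the adversary's moving-average
    estimate of them; a small value means the filter sanitized the noise."""
    published = true + noise
    kernel = np.ones(window) / window
    estimate = np.convolve(published, kernel, mode="same")
    return np.mean((estimate - true) ** 2)

scale = 5.0
iid_noise = rng.laplace(0.0, scale, n)

# Noise that shares the sequence's slow temporal correlation (its marginal is
# no longer exactly Laplace; this only illustrates the filtering effect).
rho = 0.98
corr_noise = np.empty(n)
corr_noise[0] = iid_noise[0]
for i in range(1, n):
    corr_noise[i] = rho * corr_noise[i - 1] + np.sqrt(1 - rho**2) * iid_noise[i]

print(attack_error(iid_noise))   # small: the filter strips most of the IID noise
print(attack_error(corr_noise))  # much larger: correlated noise survives the filter
```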

Based on these considerations, we propose an effective differentially private release solution for correlated tuple data, including a novel concept of “correlated tuple differential privacy” (CTDP) and a generalized Laplace mechanism (GLM) to realize it. To the best of our knowledge, CTDP is the first differential privacy technique for correlated tuple data release that renders the dependence between the noise and the outputs indistinguishable (similar) to an adversary. Our contributions are fourfold:

  • Filtering Attack: In our previous work, we demonstrated the possibility of a filtering attack on individual sensitive information that exploits, from a signal-processing perspective, the different dependence between the original outputs and the perturbation. The inference attack shows that some of the IID Laplace noise can be sanitized by a linear filter, reducing the privacy degree in practice. In this paper, we generalize this attack to test whether an individual’s sensitive information can be inferred from the difference between the correlations of the original data and the published data, as described in Section 4.

  • Correlated Differential Privacy: To defend against inference attacks launched by adversaries who have prior knowledge about the correlation of tuple data, we first formalize the notion of CTDP. We then show that CTDP guarantees can be achieved by a generalized Laplace mechanism (GLM), which makes the correlation of the noise similar to that of the original outputs to be protected. GLM provides a differential privacy guarantee for dependent tuple data without requiring additional noise, maintaining the same noise scale as in standard DP.

  • Iteration Mechanism: Since queries in practice are continuous and large scale, to keep the correlation between the noise and the original outputs consistent under continuous queries, we propose a practical and efficient iteration mechanism. We first initialize a Laplace variable, and then regenerate a new variable according to the properties of the bivariate GLM each time a new query is answered. In this way, GLM can be applied in practice to answer continuous, large quantities of queries.

  • Experimental Evaluation: Extensive evaluation involving different query functions over multiple large-scale real-world datasets illustrates that our solution consistently outperforms state-of-the-art mechanisms in data utility while providing the same privacy guarantee as other approaches for correlated tuple data.

The remainder of this paper is organized as follows. Section 2 briefly reviews the related work. Notations and preliminaries are described in Section 3. Section 4 demonstrates a possible filtering attack that obtains an estimate of an individual’s sensitive data. Our solution, consisting of the formal notion of CTDP and a generalized Laplace mechanism that achieves its privacy guarantee, is described in Section 5. Section 6 proposes an iteration mechanism to address continuous queries, including an update function for repeated queries. Theoretical analysis of privacy and data utility is presented in Section 7. Experimental evaluation is conducted in Section 8, followed by the conclusions and future work in Section 9.

Section snippets

Related work

Existing differential privacy mechanisms for correlated tuple data publishing can be classified into two types. The first type is model based: a correlation model of the tuple data is established, and noise conforming to this model is generated to perturb the outputs. The other type optimizes the sensitivity function using the number of correlated tuples or a correlation coefficient matrix.

Notations and Preliminaries

This section describes the notations and preliminaries underlying our problem. Specifically, we first give the necessary notations for our work. Secondly, we provide some assumptions and formal definitions, including tuple data correlation, the adversary, and the perturbation mechanism. Finally, the preliminaries of DP and its elaboration associated with our work are presented.

Inference Attack

In this section, we illustrate a possible inference attack on correlated data using a simplified schematic diagram and analyze the principle of the attack from a signal-processing perspective.

As the elaboration of DP in Section 3 shows, if the tuple data are independent, IID Laplace noise strictly guarantees ε-DP. But if the tuple data are correlated, the distinguishability may not be bounded in [e^{-ε}, e^{ε}] for adversaries who know the correlation of the records. We will analyze the privacy

Methodology

In this section, we present our methodology. We first elaborate the intuition and motivation behind our proposed solution. Then we give the formal definition of CTDP for the differentially private release of correlated tuple data, and finally we introduce a generalized Laplace mechanism, called “GLM”, that upholds the privacy guarantee of CTDP.

Iteration mechanism for continuous queries

Although Definition 11 gives the PDF of GLM, it is not easy to generate a noise sequence that admits this PDF in practice when facing continuous queries. Inspired by the work in [26], in this section we design an efficient iteration mechanism to generate variables that follow a generalized Laplace distribution with a specific dependence. We first generate bivariate GL variables and then apply an iteration mechanism that takes advantage of the useful properties of the Gaussian distribution to produce noise
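The paper’s exact update function is not shown in this preview. As a hedged stand-in, the sketch below generates Laplace-marginal noise whose consecutive samples are correlated through a latent Gaussian AR(1) process, which mimics the role such an iteration plays under continuous queries (a Gaussian-copula construction, not the paper’s bivariate GL recipe).

```python
import numpy as np
from scipy.stats import laplace, norm

def correlated_laplace_stream(n, rho, scale, seed=0):
    """Noise stream with Laplace marginals and lag-1 dependence controlled by
    the latent Gaussian correlation rho. Illustrative only: the paper's GLM
    iteration is based on bivariate generalized Laplace variables instead."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal()          # latent Gaussian state
    out = np.empty(n)
    for i in range(n):
        # AR(1) update keeps the latent state stationary with unit variance
        z = rho * z + np.sqrt(1.0 - rho**2) * rng.standard_normal()
        # probability integral transform: Gaussian -> uniform -> Laplace
        out[i] = laplace.ppf(norm.cdf(z), loc=0.0, scale=scale)
    return out

# Example: noise for 1000 consecutive queries, strongly correlated in time,
# with the per-query scale one would use for sensitivity 1 and epsilon 0.5.
noise = correlated_laplace_stream(n=1000, rho=0.95, scale=1.0 / 0.5)
```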

Mechanism analysis

Since our proposed solution CTDP aims to retain high utility under the same privacy budget as current schemes, in this section we first prove that GLM satisfies the notion of CTDP, and then theoretically analyze its utility.

Experimental evaluation

In this section, we evaluate the performance of the proposed solution CTDP on multiple real-world datasets. Specifically, we aim to explore the answers to the following questions:

  • How resistant is CTDP to the filtering attack described in Section 4?

    Since the proposed CTDP mechanism aims to resist the filtering attack, in Section 8.2 we will evaluate its performance before and after the filtering attack and compare it with state-of-the-art representative approaches.

  • How does the

Conclusions and future works

Differential privacy provides a better trade-off between privacy preservation and data utility. For this reason, an emerging consensus around its application and possible extensions is being pursued in academic institutions and the privacy community. However, standard DP retains a limiting assumption and serves well only for independent tuple data release. In this paper, we analyze the properties of current mechanisms for the differentially private publication of correlated tuple data

CRediT authorship contribution statement

Hao Wang: Methodology, Validation, Formal analysis, Writing - original draft, Visualization. Huan Wang: Conceptualization, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (42001398), Chongqing Natural Science Foundation (cstc2020jcyj-msxmX0635), Science and Technology Research Project of Chongqing Education Commission (KJQN201900612), Open Fund of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University (20S02) and the PhD Start-up Fund Project of Chongqing University of Posts and Telecommunications (A2020-46).

References (32)

  • D. Lv et al., Achieving correlated differential privacy of big data publication, Computers & Security (2019).
  • B. Odelson et al., A new autocovariance least-squares method for estimating noise covariances, Automatica (2006).
  • T. Kozubowski et al., Multivariate generalized Laplace distribution and related random fields, J. Multivariate Anal. (2013).
  • A. Evfimievski, J. Gehrke, R. Srikant, Limiting privacy breaches in privacy preserving data mining, in: Proc. 22nd ACM...
  • C. Dwork, A firm foundation for private data analysis, Commun. ACM (2011).
  • C. Dwork, Differential privacy, in: Proc. 33rd Int. Conf. Autom., Lang. Programm. (ICALP), 2006, pp....
  • C. Dwork, Differential privacy: a survey of results, in: Proc. 5th Int. Conf. Theory Appl. Mod. Comp. (TAMC) (2008).
  • E. Cho, S. Myers, J. Leskovec, Friendship and mobility: user movement in location-based social networks, in: Proc. 17th...
  • E. Shen, T. Yu, Mining frequent graph patterns with differential privacy, in: Proc. 19th ACM SIGKDD Int. Conf. Knowl....
  • X. He, A. Machanavajjhala, B. Ding, Blowfish privacy: tuning privacy-utility trade-offs using policies, in: Proc. 35th...
  • B. Yang, I. Sato, H. Nakagawa, Bayesian differential privacy on correlated data, in: Proc. 36th ACM SIGMOD Int. Conf....
  • D. Kifer, A. Machanavajjhala, No free lunch in data privacy, in: Proc. 32nd ACM SIGMOD Int. Conf. Manage. Data...
  • L. Bo, Z. Tianqing, W. Zhou, K. Wang, Protecting privacy-sensitive locations in trajectories with correlated positions,...
  • G. Liao et al., Social-aware privacy-preserving mechanism for correlated data, IEEE/ACM Trans. Networking (2020).
  • X. Ju, Z. Xiaofeng, W.K. Cheung, Generating synthetic graphs for large sensitive and correlated social networks, in:...
  • R. Chen et al., Correlated network data publication via differential privacy, VLDB J. (2014).