Preserving confidentiality when sharing medical database with the Cellsecu system

doi:10.1016/S1386-5056(03)00030-3

International Journal of Medical Informatics

Volume 71, Issue 1, August 2003, Pages 17-23

https://doi.org/10.1016/S1386-5056(03)00030-3

Abstract

We propose a computer system called Cellsecu that maintains the anonymity and the confidentiality of each cell containing sensitive information in a medical database. Cellsecu attains this by automatically removing, generalizing, and expanding information. It is designed to enhance data privacy protection so that a data warehouse can automatically handle queries. In most cases, health organizations collect medical data with explicit identifiers, such as name, address and phone number. Simply removing all explicit identifiers prior to release of the data is not enough to preserve data confidentiality. The remaining data can be used to re-identify individuals by linking or matching the data to other databases, or by looking at unique characteristics found in the database. A formal model based on Modal logic is the theoretical foundation of Cellsecu. In addition, a new confidentiality criterion called “non-uniqueness” is defined and implemented. We believe modeling this problem formally can clarify the issue as well as clearly identify the boundary of current technology. Based on our preliminary performance evaluation, the confidentiality check module and the confidentiality enhancing module only slightly degrade system performance.

Introduction

In Taiwan, almost all citizens are covered by the national health insurance plan. The National Health Insurance Bureau (NHIB) collects and maintains a huge database containing high-quality health-related data. It is a gold mine for many researchers working in health care-related areas, as sharing data can significantly benefit the public. The existence of such a database would enable researchers to track certain diseases and patients' responses to certain drugs. However, with the rapid advances in the computerization of medical data, the question of protecting privacy has begun to arise. Storing a large amount of sensitive information in a central database “could open the door to privacy invasion”. In order to better utilize medical data, NHIB has authorized the National Health Research Institute (NHRI) to handle the release of the database. Currently, NHRI accepts applications from researchers requesting data contained in the databases. A review committee grants access based on the purpose and relevance of the research. The process takes some time and may fail to distinguish requests for highly sensitive data from those for general statistical data.

How to publish a database while preserving confidentiality is a far-reaching problem. The United States Social Security Administration employs bin size as the measurement for anonymity [2]. Two recent systems, Datafly [19] and μ-argus [16], also use bin size as an anonymity measurement. By anonymity we mean that a record cannot be connected to any individual. Data confidentiality assures that a field value cannot be connected to an individual. Data confidentiality is a fine-grained method to ensure privacy protection and secure anonymity. It corresponds naturally to Modal logic [6], [8], a mathematical framework that defines the idea of “knowing”. A formal framework for data confidentiality is developed in [14]. The development of the Cellsecu system is based on that framework. Once confidentiality criteria are defined, the question is how to assure confidentiality if releasing certain data violates the criteria. We follow the idea of using generalization to enhance confidentiality, as proposed in the Datafly system. A lattice framework is developed to facilitate a search for the least generalized yet confidential data set. Cellsecu is a web-based prototype system developed based on the above-mentioned framework. We envision Cellsecu as a gate-keeper (Fig. 1) such that users can freely query a data center, while confidentiality is preserved. The rest of the paper is organized as follows. In Section 2 we give a brief review of previous work in database confidentiality. In Section 3 we present the system architecture of Cellsecu. Section 4 contains performance evaluation. Section 5 discusses practical issues. We conclude with future research directions in Section 6.
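
The lattice search mentioned above can be illustrated with a minimal sketch. It is not the Cellsecu algorithm: it substitutes a minimum bin size for the paper's logic-based confidentiality criterion, and the two generalization hierarchies (`generalize_age`, `generalize_zip`), the sample records, and the threshold `k` are all hypothetical. Each lattice node is a tuple of generalization levels, one per field, and nodes are visited from least to most generalized.

```python
from collections import Counter
from itertools import product


# Hypothetical generalization hierarchies (not from the paper):
# level 0 is the raw value, higher levels are coarser.
def generalize_age(age, level):
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"


def generalize_zip(zip_code, level):
    # Replace the last `level` digits with '*'.
    keep = len(zip_code) - level
    return zip_code[:keep] + "*" * level


# One (function, maximum level) pair per field, in record order.
HIERARCHIES = [(generalize_age, 2), (generalize_zip, 3)]


def min_bin_size(records, levels):
    """Size of the smallest group of records that look identical
    after generalizing each field to the given level."""
    bins = Counter(
        tuple(fn(value, lvl) for (fn, _), value, lvl in zip(HIERARCHIES, rec, levels))
        for rec in records
    )
    return min(bins.values())


def least_generalized(records, k):
    """Return the first lattice node, ordered by total generalization
    height, whose smallest bin has at least k records (None if no node does)."""
    nodes = product(*(range(max_level + 1) for _, max_level in HIERARCHIES))
    for levels in sorted(nodes, key=lambda node: (sum(node), node)):
        if min_bin_size(records, levels) >= k:
            return levels
    return None


# Made-up (age, zip code) records for illustration only.
records = [(34, "10617"), (36, "10611"), (35, "10617"), (52, "11529"), (58, "11521")]
print(least_generalized(records, 2))  # (1, 1): ten-year age bands, last zip digit masked
```

Visiting nodes in order of total height is a simple exhaustive strategy that suits small lattices; a real system would likely prune the search, since generalizing further never shrinks a bin.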

Section snippets

Related work

Starting with a study by Hoffman and Miller [13], statistical database inference has been a subject of intensive research for three decades. A statistical database is a system that enables its users to retrieve only aggregate statistics (e.g. sample mean and count) for a subset of the entities represented in the database. Many data-collecting agents face the dilemma that, on one hand, database systems are expected to satisfy user requests of aggregate statistics related to non-confidential and

System architecture and methods

In this section we focus on relational databases. For each data table, fields are partitioned into the following three sets: identifying (ID), easily-known (EK) and unknown (U) fields. ID fields, which contain data such as social security numbers, are those that can be used to identify an individual, so they cannot be released for any queries. EK fields contain data such as the height and eye color of an individual. This type of data can be easily obtained from sources other than the database
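
As a rough illustration of this partition (a sketch under stated assumptions, not the paper's implementation), the snippet below classifies the columns of a made-up table, drops the ID fields before release, and groups the released rows by their EK values. The column names, their classification in `FIELD_CLASSES`, and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical schema: these column names and their ID/EK/U
# classification are assumptions for illustration only.
FIELD_CLASSES = {
    "national_id": "ID",
    "height_cm": "EK",
    "eye_color": "EK",
    "diagnosis": "U",
}


def fields_of(kind):
    """Names of all fields in the given class ('ID', 'EK' or 'U')."""
    return [name for name, cls in FIELD_CLASSES.items() if cls == kind]


def release_view(rows):
    """Drop every ID field; keep the EK and U fields."""
    id_fields = set(fields_of("ID"))
    return [{k: v for k, v in row.items() if k not in id_fields} for row in rows]


def candidate_values(rows, u_field):
    """Group released rows by their EK values and collect the values of
    one U field seen in each group."""
    ek_fields = fields_of("EK")
    groups = defaultdict(set)
    for row in release_view(rows):
        groups[tuple(row[f] for f in ek_fields)].add(row[u_field])
    return dict(groups)


# Made-up rows for illustration only.
rows = [
    {"national_id": "A1", "height_cm": 170, "eye_color": "brown", "diagnosis": "flu"},
    {"national_id": "B2", "height_cm": 170, "eye_color": "brown", "diagnosis": "HIV"},
    {"national_id": "C3", "height_cm": 182, "eye_color": "black", "diagnosis": "flu"},
]
print(candidate_values(rows, "diagnosis"))
```

In this toy data the group with EK values (182, "black") has a single candidate diagnosis, so anyone who knows that person's height and eye color learns the U value even though the ID field was removed. This is the kind of leak that removing explicit identifiers alone does not prevent.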

Preliminary system performance evaluation

Cellsecu runs on a Celeron 400 with 256 MB SDRAM and Windows NT Server 4.0. We use Apache 1.3.9 (win32) and Apache JServ 1.1 as the web server system. Ten rounds of tests are conducted with various file and database sizes.

For each round of tests, we measure the following performance indicators:

• Query time: the execution time for querying the databases.
• Test time: the execution time for checking the confidentiality criteria.
• Generalize time: the execution time for the generalize module.
• Write time: the

Discussions

There are several practical issues to consider when using our system. First, assume that there are several sensitive U fields in a record. Different fields may pose different levels of threat. For example, a test result for HIV and a test result for the flu obviously have different levels of sensitivity. Our system can be extended to assign different weights to fields to address this problem. Such an extension can be found in [5].

Second, our system uses generalization as a way to remove

Conclusion

We have developed a prototype system called Cellsecu to protect data confidentiality while sharing health databases. Our system can be used to automatically assess whether a released data set is safe after removing explicit identifiers as required by legislation. Our system is based on a formal model, so we can mathematically define the meaning of data confidentiality. Whether our formal definition captures the intuitive idea of data confidentiality is an obvious question. Roughly, the formal

References (21)

N.R. Adam et al., Security-control methods for statistical databases: a comparative study, ACM Computing Surveys (1989)
L. Alexander, T. Jabine, Access to social security microdata files for research and statistical purposes, Social...
Y.-C. Chiang, Protecting Privacy in Public Database, Master's thesis, Graduate Institute of Information Management,...
Y.-C. Chiang, T.-S. Hsu, S. Kuo, D.-W. Wang, Preserving confidentially when sharing medical data, in: Proceedings Asia...
Y.T. Chiang et al., How much privacy?—A system to safe guard personal privacy while releasing database
B.F. Chellas, Modal Logic (1980)
D.E. Denning et al., Views for multilevel database security, IEEE Transactions on Software Engineering (1987)
R. Fagin et al., Reasoning About Knowledge (1995)
W.R. Ford et al. (1990)
T.D. Garvey, T.F. Lunt, X. Quin, M.E. Stickel, Inference Channel Detection and Elimination in Knowledge-Based Systems,...

There are more references available in the full text version of this article.

Cited by (23)

A logical framework for privacy-preserving social network publication
2014, Journal of Applied Logic
Social network analysis is an important methodology in sociological research. Although social network data are valuable resources for data analysis, releasing the data to the public may cause an invasion of privacy. In this paper, we consider privacy preservation in the context of publishing social network data. To address privacy concerns, information about a social network can be released in two ways. Either the global structure of the network can be released in an anonymized way; or non-sensitive information about the actors in the network can be accessed via a query-answering process. However, an attacker could re-identify the actors in the network by combining information obtained in these two ways. The resulting privacy risk depends on the amount of detail in the released network structure and expressiveness of the admissible queries. In particular, different sets of admissible queries correspond to different types of attacks. In this paper, we propose a logical framework that can represent different attack models uniformly. Specifically, in the framework, individuals that satisfy the same subset of admissible queries are considered indiscernible by the attacker. By partitioning a social network into equivalence classes (i.e., information granules) based on the indiscernibility relation, we can generalize the privacy criteria developed for tabulated data to social network data. To exemplify the usability of the framework, we consider two instances of the framework, where the sets of admissible queries are the $ALCI$ and $ALCQI$ concept terms respectively; and we exploit social position analysis techniques to compute their indiscernibility relations. We also show how the framework can be extended to deal with the privacy-preserving publication of weighted social network data. 
The uniformity of the framework provides us with a common ground to compare existing attack models; while its generality could extend the scope of research to meet privacy concerns in the era of social semantic computing.
Formal anonymity models for efficient privacy-preserving joins
2009, Data and Knowledge Engineering
Organizations, such as federally-funded medical research centers, must share de-identified data on their consumers to publicly accessible repositories to adhere to regulatory requirements. Many repositories are managed by third-parties and it is often unknown if records received from disparate organizations correspond to the same individual. Failure to resolve this issue can lead to biased (e.g., double counting of identical records) and underpowered (e.g., unlinked records of different data types) investigations. In this paper, we present a secure multiparty computation protocol that enables record joins via consumers’ encrypted identifiers. Our solution is more practical than prior secure join models in that data holders need to interact with the third party one time per data submission. Though technically feasible, the speed of the basic protocol scales quadratically with the number of records. Thus, we introduce an extended version of our protocol in which data holders append k-anonymous features of their consumers to their encrypted submissions. These features facilitate a more efficient join computation, while providing a formal guarantee that each record is linkable to no less than k individuals in the union of all organizations’ consumers. Beyond a theoretical treatment of the problem, we provide an extensive experimental investigation with data derived from the US Census to illustrate the significant gains in efficiency such an approach can achieve.
A computational model to protect patient data from location-based re-identification
2007, Artificial Intelligence in Medicine
Health care organizations must preserve a patient's anonymity when disclosing personal data. Traditionally, patient identity has been protected by stripping identifiers from sensitive data such as DNA. However, simple automated methods can re-identify patient data using public information. In this paper, we present a solution to prevent a threat to patient anonymity that arises when multiple health care organizations disclose data. In this setting, a patient's location visit pattern, or “trail”, can re-identify seemingly anonymous DNA to patient identity. This threat exists because health care organizations (1) cannot prevent the disclosure of certain types of patient information and (2) do not know how to systematically avoid trail re-identification. In this paper, we develop and evaluate computational methods that health care organizations can apply to disclose patient-specific DNA records that are impregnable to trail re-identification.
To prevent trail re-identification, we introduce a formal model called k-unlinkability, which enables health care administrators to specify different degrees of patient anonymity. Specifically, k-unlinkability is satisfied when the trail of each DNA record is linkable to no less than k identified records. We present several algorithms that enable health care organizations to coordinate their data disclosure, so that they can determine which DNA records can be shared without violating k-unlinkability. We evaluate the algorithms with the trails of patient populations derived from publicly available hospital discharge databases. Algorithm efficacy is evaluated using metrics based on real world applications, including the number of suppressed records and the number of organizations that disclose records.
Our experiments indicate that it is unnecessary to suppress all patient records that initially violate k-unlinkability. Rather, only portions of the trails need to be suppressed. For example, if each hospital discloses 100% of its data on patients diagnosed with cystic fibrosis, then 48% of the DNA records are 5-unlinkable. A naïve solution would suppress the 52% of the DNA records that violate 5-unlinkability. However, by applying our protection algorithms, the hospitals can disclose 95% of the DNA records, all of which are 5-unlinkable. Similar findings hold for all populations studied.
This research demonstrates that patient anonymity can be formally protected in shared databases. Our findings illustrate that significant quantities of patient-specific data can be disclosed with provable protection from trail re-identification. The configurability of our methods allows health care administrators to quantify the effects of different levels of privacy protection and formulate policy accordingly.
An epistemic framework for privacy protection in database linking
2007, Data and Knowledge Engineering
In this paper, we present an epistemic framework for privacy protection in the database linking context, whereby the user’s knowledge and the individuals’ confidential information are represented by propositional sentences. In the framework, the concept of safety is rigorously defined, and an effective approach for testing the safety of released data is provided. It is shown that some generalization operations can be applied to original data to make it less specific so that the release of generalized data does not violate privacy. Two kinds of generalization operation are considered: attribute-oriented generalization (AOG) and cell-oriented generalization (COG). AOG is more restrictive, but a bottom-up search algorithm can be used to find the maximally informative AOG that satisfies the safety requirement. We investigate the properties of AOG that can be used to improve the search efficiency. COG, on the other hand, is more flexible. However, it necessitates searching through the whole space, so its computational complexity is much higher. Although graph theory can be used to simplify the search procedure, heuristic methods are needed to improve its efficiency. Easy extensibility is one of the main advantages of our framework. It is shown that the framework can be extended to accommodate probabilistic inference attacks and alternative protection techniques.
Value versus damage of information release: A data privacy perspective
2006, International Journal of Approximate Reasoning
We assume that a database of personal information comprises records of individuals that contain confidential or sensitive fields. Queries about the distribution of a sensitive field within a selected population in the database can be submitted to the data center. However, the answers to the queries may leak confidential information about some individuals, even though no identification information is provided. Inspired by decision theory, we present two quantitative models for privacy protection in such a database query or linkage environment. One models the value of information from the viewpoint of the querier, while the other models the damage caused by and compensation for privacy leakage.
In both models, we define the information state by a class of probability distributions on a set of possible confidential values. These states can be modified and refined by the user’s knowledge acquisition behavior. In the first model, the value of information is defined as the expected gain of the querier, and privacy is protected by imposing costs on the answers to the queries to balance any potential gain. In the second model, the safety of information is guaranteed by ensuring that anyone misusing private information must pay more compensation than the value of the possible gain.
Medical privacy protection based on granular computing
2004, Artificial Intelligence in Medicine
Based on granular computing methodology, we propose two criteria to quantitatively measure privacy invasion. The total cost criterion measures the effort needed for a data recipient to find private information. The average benefit criterion measures the benefit a data recipient obtains when he received the released data. These two criteria remedy the inadequacy of the deterministic privacy formulation proposed in Proceedings of Asia Pacific Medical Informatics Conference, 2000; Int J Med Inform 2003;71:17–23. Granular computing methodology provides a unified framework for these quantitative measurements and previous bin size and logical approaches. These two new criteria are implemented in a prototype system Cellsecu 2.0. Preliminary system performance evaluation is conducted and reviewed.


^☆: An abstract of this paper appears in [Proc. Asia Pacific Med. Inform. Conf. (APAMI-MIC), 2000] [4].

¹: This work was done while this author was with Department of Information Management, National Taiwan University, Taiwan, ROC.
