Preserving confidentiality when sharing medical database with the Cellsecu system

https://doi.org/10.1016/S1386-5056(03)00030-3

Abstract

We propose a computer system called Cellsecu that maintains the anonymity and confidentiality of each cell containing sensitive information in a medical database. Cellsecu attains this by automatically removing, generalizing, and expanding information. It is designed to enhance data privacy protection so that a data warehouse can handle queries automatically. In most cases, health organizations collect medical data with explicit identifiers, such as name, address, and phone number. Simply removing all explicit identifiers prior to release of the data is not enough to preserve data confidentiality. The remaining data can be used to re-identify individuals by linking or matching the data to other databases, or by looking at unique characteristics found in the database. A formal model based on modal logic is the theoretical foundation of Cellsecu. In addition, a new confidentiality criterion called “non-uniqueness” is defined and implemented. We believe that modeling this problem formally can clarify the issue and clearly identify the boundary of current technology. Based on our preliminary performance evaluation, the confidentiality check module and the confidentiality enhancing module only slightly degrade system performance.

Introduction

In Taiwan, almost all citizens are covered by the national health insurance plan. The National Health Insurance Bureau (NHIB) collects and maintains a huge database containing high-quality health-related data. It is a gold mine for many researchers working in health care related areas, as sharing data can significantly benefit the public. The existence of such a database enables researchers to track certain diseases and patients' responses to certain drugs. However, with the rapid advances in the computerization of medical data, the question of protecting privacy has begun to arise. Storing a large amount of sensitive information in central databases “could open the door to privacy invasion”. In order to better utilize medical data, NHIB has authorized the National Health Research Institute (NHRI) to handle the release of the database. Currently, NHRI accepts applications from researchers requesting data contained in the databases. A review committee grants access based on the purpose and relevance of the research. The process takes some time and may fail to distinguish requests for highly sensitive data from those for general statistical data.

How to publish a database while preserving confidentiality is a far-reaching problem. The United States Social Security Administration employs bin size as the measurement for anonymity [2]. Two recent systems, Datafly [19] and μ-argus [16], also use bin size as an anonymity measurement. By anonymity we mean that a record cannot be connected to any individual. Data confidentiality assures that a field value cannot be connected to an individual. Data confidentiality is a fine-grained way to ensure privacy protection and secure anonymity. It corresponds naturally to modal logic [6], [8], a mathematical framework that defines the idea of “knowing”. A formal framework for data confidentiality is developed in [14]. The development of the Cellsecu system is based on that framework. Once confidentiality criteria are defined, the question is how to assure confidentiality if releasing certain data violates those criteria. We follow the idea of using generalization to enhance confidentiality, as proposed in the Datafly system. A lattice framework is developed to facilitate a search for the least generalized yet confidential data set. Cellsecu is a web-based prototype system developed based on the above-mentioned framework. We envision Cellsecu as a gate-keeper (Fig. 1) such that users can freely query a data center, while confidentiality is preserved. The rest of the paper is organized as follows. In Section 2 we give a brief review of previous work in database confidentiality. In Section 3 we present the system architecture of Cellsecu. Section 4 contains the performance evaluation. Section 5 discusses practical issues. We conclude with future research directions in Section 6.
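The bin-size measure and the search for a least generalized data set mentioned above can be illustrated with a small sketch. This is not Cellsecu's implementation: the toy records, the two fields (age and zip code), the three generalization levels per field, and the function names are all assumptions made for illustration, and the check used here is a plain minimum bin size rather than the paper's confidentiality criteria. Bin size is taken to mean the number of records that share the same combination of generalized field values.

```python
from collections import Counter
from itertools import product

# Hypothetical toy records (age, zip_code); not data from the paper.
RECORDS = [
    (34, "10617"),
    (36, "10617"),
    (35, "10629"),
    (52, "10629"),
    (58, "10699"),
    (51, "10617"),
]


def generalize_age(age, level):
    """Level 0: exact age; level 1: 10-year band; level 2: suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = age // 10 * 10
        return f"{low}-{low + 9}"
    return "*"


def generalize_zip(zip_code, level):
    """Level 0: full code; level 1: first three digits; level 2: suppressed."""
    if level == 0:
        return zip_code
    if level == 1:
        return zip_code[:3] + "**"
    return "*"


def min_bin_size(records, levels):
    """Smallest number of records sharing one generalized (age, zip) pair."""
    age_level, zip_level = levels
    bins = Counter(
        (generalize_age(age, age_level), generalize_zip(zip_code, zip_level))
        for age, zip_code in records
    )
    return min(bins.values())


def least_generalized(records, k, max_level=2):
    """Walk the level vectors from least to most generalized (by total level)
    and return the first one whose minimum bin size is at least k."""
    candidates = sorted(
        product(range(max_level + 1), repeat=2),
        key=lambda levels: (sum(levels), levels),
    )
    for levels in candidates:
        if min_bin_size(records, levels) >= k:
            return levels
    return None  # even full suppression cannot reach bin size k
```

With these toy records, `least_generalized(RECORDS, 2)` returns `(1, 1)`: ten-year age bands combined with three-digit zip prefixes are the first level vector at which every bin holds at least two records. Ordering candidates by total level is only one way to rank nodes of a generalization lattice; a real system might weigh fields differently.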

Section snippets

Related work

Starting with a study by Hoffman and Miller [13], statistical database inference has been a subject of intensive research for three decades. A statistical database is a system that enables its users to retrieve only aggregate statistics (e.g. sample mean and count) for a subset of the entities represented in the database. Many data-collecting agents face the dilemma that, on one hand, database systems are expected to satisfy user requests for aggregate statistics related to non-confidential and

System architecture and methods

In this section we focus on relational databases. For each data table, fields are partitioned into the following three sets: identifying (ID), easily-known (EK) and unknown (U) fields. ID fields, which contain data such as social security numbers, are those that can be used to identify an individual, so they cannot be released for any query. EK fields contain data such as the height and eye color of an individual. This type of data can be easily obtained from sources other than the database
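The ID/EK/U partition can be made concrete with a short sketch. Everything named here is hypothetical: the field names, the sample records, and the helper functions are illustrative assumptions, not part of Cellsecu. The check shown is one plausible reading of "a field value cannot be connected to an individual" (it flags EK value combinations for which every matching record carries the same U value); it is not the paper's "non-uniqueness" criterion, which this excerpt does not spell out.

```python
from collections import defaultdict

# Hypothetical schema partition; field names are illustrative only.
ID_FIELDS = {"ssn", "name"}
EK_FIELDS = ("height_cm", "eye_color")
U_FIELDS = ("diagnosis",)

# Hypothetical sample records; not data from the paper.
RECORDS = [
    {"ssn": "A1", "name": "P1", "height_cm": 170, "eye_color": "brown", "diagnosis": "flu"},
    {"ssn": "A2", "name": "P2", "height_cm": 170, "eye_color": "brown", "diagnosis": "asthma"},
    {"ssn": "A3", "name": "P3", "height_cm": 182, "eye_color": "blue", "diagnosis": "flu"},
]


def strip_id_fields(record):
    """Drop identifying fields, which are never released."""
    return {key: value for key, value in record.items() if key not in ID_FIELDS}


def exposed_cells(records, u_field):
    """Return the EK value combinations whose records all share one U value.

    Anyone who knows a person's EK values could then infer that person's
    U value from the released table.
    """
    groups = defaultdict(set)
    for record in records:
        key = tuple(record[field] for field in EK_FIELDS)
        groups[key].add(record[u_field])
    return sorted(key for key, values in groups.items() if len(values) == 1)


released = [strip_id_fields(record) for record in RECORDS]
```

Here `exposed_cells(released, "diagnosis")` flags `(182, "blue")`: only one diagnosis is associated with that height and eye color, so the diagnosis cell would need to be generalized or withheld before release.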

Preliminary system performance evaluation

Cellsecu runs on a Celeron 400 with 256 MB SDRAM and Windows NT Server 4.0. We use Apache 1.3.9 (Win32) and Apache JServ 1.1 as the web server system. Ten rounds of tests are conducted with various file and database sizes.

For each round of tests, we measure the following performance indicators:

  • Query time: the execution time for querying the databases.

  • Test time: the execution time for checking the confidentiality criteria.

  • Generalize time: the execution time for the generalize module.

  • Write time: the

Discussions

There are several practical issues to consider when using our system. First, assume that there are several sensitive U fields in a record. Different fields may pose different levels of threat. For example, an HIV test result and a flu test result obviously have different levels of sensitivity. Our system can be extended to assign different weights to fields to address this problem. Such an extension can be found in [5].

Second, our system uses generalization as a way to remove

Conclusion

We have developed a prototype system called Cellsecu to protect data confidentiality while sharing health databases. Our system can be used to automatically assess whether a released data set is safe after removing explicit identifiers as required by legislation. Our system is based on a formal model, so we can mathematically define the meaning of data confidentiality. Whether our formal definition captures the intuitive idea of data confidentiality is an obvious question. Roughly, the formal

References (21)

  • N.R. Adam et al., Security-control methods for statistical databases: a comparative study, ACM Computing Surveys (1989)
  • L. Alexander, T. Jabine, Access to social security microdata files for research and statistical purposes, Social...
  • Y.-C. Chiang, Protecting Privacy in Public Database, Master's thesis, Graduate Institute of Information Management,...
  • Y.-C. Chiang, T.-S. Hsu, S. Kuo, D.-W. Wang, Preserving confidentially when sharing medical data, in: Proceedings Asia...
  • Y.T. Chiang et al., How much privacy?—A system to safe guard personal privacy while releasing database
  • B.F. Chellas, Modal Logic (1980)
  • D.E. Denning et al., Views for multilevel database security, IEEE Transactions on Software Engineering (1987)
  • R. Fagin et al., Reasoning About Knowledge (1995)
  • W.R. Ford et al. (1990)
  • T.D. Garvey, T.F. Lunt, X. Quin, M.E. Stickel, Inference Channel Detection and Elimination in Knowledge-Based Systems,...
There are more references available in the full text version of this article.

An abstract of this paper appears in [Proc. Asia Pacific Med. Inform. Conf. (APAMI-MIC), 2000] [4].

1 This work was done while this author was with the Department of Information Management, National Taiwan University, Taiwan, ROC.
