Confidentiality risks in fine scale aggregations of health data

https://doi.org/10.1016/j.compenvurbsys.2010.08.002

Abstract

Spatial confidentiality concerns limit the sharing of data between health data guardians and other researchers. This reduces the contribution GIScience might make in understanding spatial patterns of poor health. This paper takes a first step towards easing data sharing by investigating the confidentiality risks in releasing aggregated data at a fine spatial resolution. A randomly generated cancer map is exported as a graduated color overlay to Google Earth, and test subjects are asked to locate where they believe the disease cases reside. Risk is measured by both the separating distance and the number of alternate parcels between the “choice” and a randomly generated disease case. The paper also develops a simulation approach that can be used to test the level of risk involved with these choices. Results suggest that across the scales of aggregation tested in this paper, the finest of which is a 0.5 km grid, there was relatively little risk in revealing sensitive information. In addition, the closest student choice to a disease case was not consistent across aggregations, suggesting no underlying geographic vulnerability. Although the results presented here are encouraging, a series of subsequent investigations is needed before data sharing guidelines can be proposed.

Research highlights

  • There is relatively little confidentiality risk in releasing spatial information in mapped aggregations as fine as 0.5 km.

  • Potential spatial vulnerabilities were not consistent across aggregations.

  • Spatial confidentiality “risk” can be assessed using comparisons to a simulated distribution.

Introduction

The use of geographic information systems (GIS) and other geospatial technologies in health research is now well established. Applications include, but are not limited to, analyzing spatial patterns and processes of disease, identifying service provision deficiencies, supporting surveillance and outbreak response, and even tracking patients and identifying activity spaces for healthy living (see Cromley and McLafferty, 2002, Gatrell and Loytonen, 1998). For many of these tasks, individual level data are generated, stored, analyzed, and potentially warehoused for subsequent analysis. These data create confidentiality challenges in both tabular and visual displays (Gutmann et al., 2008), especially because maps have the potential to be reengineered to reveal sensitive information about a person (Curtis, Mills, & Leitner, 2006). Although there has been a steady stream of research into confidentiality violation in the cartographic presentation of data, there is general agreement that more work is needed to help guide researchers and Institutional Review Boards (IRB) (Kamel Boulos, Curtis, & AbdelMalik, 2009). For example, with regard to cancer data, researchers must comply with arbitrary “rules” for presenting or sharing data to protect the geoconfidentiality of disease (e.g. do not show maps of individual points, or do not present or share any aggregation with fewer than five observations). These guidelines are not based on replicable science, and therefore it is debatable whether they protect confidentiality and/or mask important geospatial associations. As a result, academics often feel shackled by overzealous privacy rules which have arguably limited public health research advances (Wartenberg & Thompson, 2010). The question of vulnerability in mapped output was also raised with the release of the New York State Department of Health interactive cancer map, which was mandated by the state legislature. This cartographic resource shows cancer counts at a fine spatial resolution, often down to a census block (Hakim, 2010). Both of these examples show that there is still a need to bridge the work on confidentiality between academia and data providers, with clearly defined rules to guide visualization, data sharing (and therefore research collaboration) and policy changes. More specifically:

  • It is important that “real world” guidelines for both visualizing and sharing confidential data are provided, and that those guidelines do not dramatically limit analyses.

  • The confidentiality risk in releasing or displaying data according to those guidelines needs to be quantified.

  • Potential errors in spatial analysis outcomes that result from using masked or manipulated data under those guidelines need to be clearly stated.

This paper addresses the first two bullets by combining three interrelated areas of investigation: the ease with which the public can reengineer information from a map, the additional reengineering risk posed by ubiquitous access to online geospatial engines with high resolution aerial photography, and how to measure risk from aggregated maps at a finer resolution than a census block group. The rationale for this work is both to promote ways in which data can be released for collaboration purposes and, possibly, to inform future policy changes such as those that led to the New York State cancer map.

More specifically, this paper will assess the confidentiality risk involved in displaying an aggregated (simulated) cancer map in Google Earth. The success of test subjects who are tasked with locating where these cancer patients reside will be measured by both the separating distance and the number of alternative locations between the choice and the actual address. This paper will also present a method of assessing what is an acceptable level of risk for these measurements. Results will show that there is generally little risk involved in releasing data at the aggregations tested here, and that a lack of consistency in test subject choices across aggregations does not point to any common geographic vulnerability. Finally, the discussion will outline the next steps needed in order to construct universal guidelines for data sharing.
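
As a rough illustration of the two measurements described above, the sketch below computes the separating distance between a test subject's choice and a case address, and counts how many alternative parcels lie at least as close to that choice. The coordinates, the parcel list, and the projected units are hypothetical and are not the study data.

```python
import math

def separating_distance(choice, case):
    """Euclidean distance (in projected map units, e.g. meters)
    between a test subject's chosen point and the case address."""
    return math.hypot(choice[0] - case[0], choice[1] - case[1])

def alternative_parcels(choice, case, parcels):
    """Number of other residential parcels that lie at least as close
    to the choice as the actual case does; more alternatives imply
    lower reengineering risk."""
    d = separating_distance(choice, case)
    return sum(1 for p in parcels
               if p != case and separating_distance(choice, p) <= d)

# Hypothetical example: coordinates in meters in a projected system.
case = (1250.0, 980.0)
choice = (1310.0, 1005.0)
parcels = [(1200.0, 950.0), (1290.0, 1010.0), (1500.0, 700.0), case]

print(separating_distance(choice, case))           # about 65 m
print(alternative_parcels(choice, case, parcels))  # parcels as close or closer
```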

There have been numerous attempts to summarize the academic body of work on spatial confidentiality, including potential strategies to reduce reengineering risk. These range from national organization guidelines, such as a North American Association of Central Cancer Registries (NAACCR) handbook on (spatial) data and a National Academies Press summary by the panel on Confidentiality Issues Arising from the Integration of Remotely Sensed and Self-Identifying Data (Guttman and Stern, 2007), to workshops organized at GIS conferences and several journal articles (Kamel Boulos et al., 2009). However, still more is needed to synthesize these insights into practical guidelines that can be implemented at all scales of health data collection, not just for Federal agencies but also for city health offices and nonprofit organizations, both of which may generate sensitive data and both of which may benefit from academic research into those data.

Spatial confidentiality risk varies according to the sensitivity of the data (including the “uniqueness” of the subject), the type of research (and the complexity of the spatial analysis needed), the availability of other quasi-identifiers (attributes associated with the individual or disease type that can help narrow identification within society), the availability of previously generated maps, who the end user is, and the nature of the underlying geography, especially population density (for a seminal reference see Armstrong, Rushton, & Zimmerman, 1999). An additional concern is the temporal synergy of the data, as multiple versions of the same information, even if perturbed or aggregated, may be combined to improve the chance of reengineering (Zimmerman & Pavlik, 2008). This situation is particularly challenging for data archivists involved with geoportals and clearinghouses where data or map products are stored for future access. It is probable that more effective reengineering methods will be developed in the future (Gutmann et al., 2008). For example, in the late 1990s there was no ubiquitous online access to easily manipulated aerial photography that also allows for the uploading and overlay of maps and spatial references. Certainly Google Earth has led to unforeseen confidentiality risks in previously published maps.

Given the propensity for address level data in health records, and the fact that point level information is often preferred for analysis, it is not surprising that much of the recent confidentiality work has involved the dangers of point level display and the solutions offered by different masks (Armstrong et al., 1999, Kwan et al., 2004, Leitner and Curtis, 2004, Leitner and Curtis, 2006). The risks involved were highlighted by Curtis et al. (2006), who showed that a protagonist with relatively little skill could scan a printed map, import it into a GIS, and extract coordinate locations that could be used to guide field teams to actual addresses. Similarly, Brownstein, Cassa, Kohane and Mandl (2006) managed to reverse engineer over 50% of simulated patient addresses to within five buildings from a publication quality map. Unfortunately, such fine scale data are needed to investigate patterns and processes at the sub-neighborhood level, especially with regard to linking disease to characteristics of the built environment. Advances in understanding these relationships are likely to result from multi-disciplinary research teams; therefore data need to be shared at the safest and finest geographic scale possible.

In general, there are three ways to reduce confidentiality risk in a fine scale spatial data set: first, perturb a point layer (such as coordinates or geocoded addresses) to reduce the possibility of a displayed point being reattached to its original location (Cassa et al., 2006, Leitner and Curtis, 2004, Olson et al., 2006); second, aggregate the points as finely as possible so that individual locations cannot be identified; and third, use simulated data that mirror the real spatial distribution. A further subcategory of the first two masks is the addition, or subtraction, of records to change the true landscape. For example, false points can be added (with associated attribute information) and displayed as perturbed locations, or included in the aggregation process. This addition (or subtraction) adds a degree of uncertainty not only as to how a data point has been spatially changed, but also as to whether or not any particular location is real.
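
The first two masks can be sketched in a few lines of code. The perturbation distances, the grid cell size, and the coordinates below are illustrative assumptions only; they are not the parameters used in this paper.

```python
import math
import random
from collections import Counter

def perturb(point, min_r=50.0, max_r=250.0):
    """Displace a point by a random distance (between min_r and max_r
    map units) in a random direction: a simple donut-style mask."""
    angle = random.uniform(0.0, 2.0 * math.pi)
    r = random.uniform(min_r, max_r)
    return (point[0] + r * math.cos(angle), point[1] + r * math.sin(angle))

def aggregate(points, cell_size=500.0):
    """Count points per square grid cell (e.g. a 0.5 km grid),
    returning {(col, row): count} for a choropleth-style overlay."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return counts

# Hypothetical case locations in projected meters.
cases = [(120.0, 80.0), (460.0, 510.0), (470.0, 520.0), (900.0, 240.0)]
print([perturb(p) for p in cases])       # masked point layer
print(aggregate(cases, cell_size=500.0)) # counts per 0.5 km cell
```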

This paper will focus on a fine scale aggregation of data, as this is arguably the easiest approach for local data providers, who are more likely to be technically and resource limited. Unfortunately there are few guidelines with regard to the release of aggregated data. A commonly discussed threshold among researchers is that health data should only be visualized for ZIP codes with a base population of no less than 20,000 (see Leitner and Curtis, 2006, and the Office for Civil Rights Privacy Brief, Summary of the HIPAA Privacy Rule, 2005, for an explanation). If this same guideline is used as the benchmark for data sharing between entities, then the size of the ZIP code would be too large to adequately capture fine scale processes related to the physical or cultural neighborhood. For example, if a researcher was hoping to link pre-Hurricane Katrina health outcomes in the Lower Ninth Ward to the stresses of a prolonged recovery process, the relevant ZIP code would cover several culturally different neighborhoods, and even a wide array of initial damage and recovery outcomes (Curtis, 2008). An additional criticism of this ZIP code denominator threshold is that the underlying geography changes the degree of risk involved, and therefore appropriate denominators should also vary across space. El Emam, Brown, and AbdelMalik (2009) eloquently prove this point for urban environments in Canada.

Given such a lack of guidelines, the following questions should be asked: how much perturbation, or what aggregation scale and zone, are necessary to reduce reengineering risk to an acceptable level? A further question, from a practical perspective, is how to define this acceptable level. In order to answer this we need to produce benchmarks of spatial confidentiality risk. Is it enough to prevent an “intruder” from reengineering to “identity disclosure”, which means linking information back to an actual case (Gutmann et al., 2008, Zimmerman and Pavlik, 2008), or is the risk still too great if the reengineered location is spatially proximate? Again, how do we define spatially proximate? How many houses need to separate an actual case from a reengineered location? How many alternatives are needed to satisfy and reassure researchers, data custodians, subjects of the research or the public in general?

One suggestion, made by VanWey and colleagues and sometimes implemented in mask investigations, is a 1 in 20 risk (approximating a 0.05 probability) (VanWey, Rindfuss, Gutmann, Entwisle, & Balk, 2005). In other words, there should be no greater chance of an “intruder” finding the “event”, whether a clinic or patient, than a 1 in 20 selection (Sherman & Fetters, 2007). For example, if the investigation is considering poor pregnancy outcomes, the study space should be large enough to cover at least 20 pregnant women at any one time. Obviously the dimensions of this spatial mask will again vary according to the underlying social and physical geography.
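
A minimal sketch of how such a 1 in 20 guideline might be operationalized is given below: a buffer around a case is sized so that it covers at least 20 at-risk residences, leaving the case no more identifiable than a 1 in 20 selection. The residence list, the function name, the step size, and the study area are assumptions for illustration, not part of the paper's method.

```python
import math
import random

def k_anonymous_radius(case, residences, k=20, step=25.0):
    """Smallest buffer radius, rounded up to a multiple of `step`, around
    the case that contains at least k at-risk residences."""
    dists = sorted(math.hypot(x - case[0], y - case[1]) for x, y in residences)
    if len(dists) < k:
        raise ValueError("fewer than k residences in the study area")
    kth = dists[k - 1]                   # distance to the k-th nearest residence
    return math.ceil(kth / step) * step  # round up to the next step

# Hypothetical residences scattered over a 2 km x 2 km area (meters);
# the case's own residence counts toward the 20.
random.seed(1)
residences = [(random.uniform(0, 2000), random.uniform(0, 2000)) for _ in range(400)]
case = residences[0]
print(k_anonymous_radius(case, residences, k=20))
```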

Section snippets

Material and methods

This paper adopts the empirical approach of Leitner and Curtis (2004, 2006), who used test subjects to investigate maps designed to reduce confidentiality concerns in spatial data. For the current paper all test subjects will attempt to locate actual residences from different aggregated maps, with their choices being informed by Google Earth, which will allow the test subject to drill through a graduated color output to see the underlying geography, and make mental adjustments in
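
For context, one way to produce the kind of graduated color overlay viewed in Google Earth is to write the aggregated grid cells as semi-transparent KML polygons. The cells, counts, color ramp, and coordinates below are invented for illustration and do not reproduce the study's map or workflow.

```python
# Write aggregated grid cells as semi-transparent KML polygons so they can be
# opened in Google Earth as a graduated color overlay. Cells and counts are
# hypothetical.
cells = {  # (min_lon, min_lat, max_lon, max_lat): case count
    (-118.300, 33.800, -118.295, 33.805): 1,
    (-118.295, 33.800, -118.290, 33.805): 4,
}

def kml_color(count):
    # KML colors are aabbggrr; higher counts get a more opaque red.
    alpha = min(255, 60 + 40 * count)
    return f"{alpha:02x}0000ff"

placemarks = []
for (x0, y0, x1, y1), n in cells.items():
    ring = f"{x0},{y0},0 {x1},{y0},0 {x1},{y1},0 {x0},{y1},0 {x0},{y0},0"
    placemarks.append(
        f"<Placemark><name>{n} cases</name>"
        f"<Style><PolyStyle><color>{kml_color(n)}</color></PolyStyle></Style>"
        f"<Polygon><outerBoundaryIs><LinearRing><coordinates>{ring}"
        f"</coordinates></LinearRing></outerBoundaryIs></Polygon></Placemark>"
    )

kml = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
       + "".join(placemarks) + "</Document></kml>")

with open("aggregated_overlay.kml", "w") as f:
    f.write(kml)
```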

Results

A total of 104 different students placed coordinates on five different but overlapping aggregations for the same area of Los Angeles. Table 1 displays the breakdown of students tested for each aggregation level, with the two smallest aggregations, grid cells of 0.75 km and 0.5 km, being over-sampled.

One measure of reengineering risk is the total separating distance between the randomly generated disease case (RDC) and the student choice. Table 2 displays this vulnerability as measured by the buffer radius from the RDC to the
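
A minimal sketch of the kind of simulation benchmark used to judge whether observed choices are "risky" is shown below: the closest observed choice-to-RDC distance is compared against the closest distances achieved by many sets of randomly placed choices, giving an empirical probability of doing as well by chance. The study-area extent, the number of simulated sets, and the example coordinates are assumptions for illustration, not the paper's data.

```python
import math
import random

def nearest_distance(choices, case):
    """Distance from the case to the closest of a set of choices."""
    return min(math.hypot(x - case[0], y - case[1]) for x, y in choices)

def chance_of_doing_as_well(observed, case, n_choices, extent=2000.0, runs=999):
    """Proportion of simulated runs in which randomly placed choices get
    at least as close to the case as the observed choices did."""
    obs = nearest_distance(observed, case)
    hits = 0
    for _ in range(runs):
        simulated = [(random.uniform(0, extent), random.uniform(0, extent))
                     for _ in range(n_choices)]
        if nearest_distance(simulated, case) <= obs:
            hits += 1
    return hits / runs

# Hypothetical case location and student choices (projected meters).
random.seed(7)
case = (830.0, 1140.0)
observed = [(400.0, 900.0), (1200.0, 1300.0), (850.0, 1100.0)]
print(chance_of_doing_as_well(observed, case, n_choices=len(observed)))
```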

Discussion

The risk of revealing confidential information through the release of spatial data, either in raw, manipulated, or mapped format, rightly continues to be a major concern for data producers and users. However, the belief of doing no (spatial) harm has to be balanced by the insights that might be revealed through research of these data (Wartenberg & Thompson, 2010). The ubiquitous use of geospatial technologies and an increase in geographic appreciation mean that there has never been a better

Acknowledgements

Data collection was supported by the SEER Rapid Response Surveillance Study program by a supplement to contract N01-PC-95002 (USC) with the National Cancer Institute. Dr. Cockburn was additionally supported by Grant Number U55/CCU921930-02 from the Centers for Disease Control and Prevention, NIEHS Grant 5P30 ES07048 and by R01 CA121052. The collection of data used in this publication was supported by the California Department of Health Services as part of the statewide cancer reporting program

References (22)

  • C.A. Cassa et al.

    A context-sensitive approach to anonymizing spatial surveillance data: Impact on outbreak detection

    Journal of the American Medical Informatics Association

    (2006)
  • K. El Emam et al.

    Evaluating predictors of geographic area population size cut-offs to manage re-identification risk

    Journal of the American Medical Informatics Association

    (2009)
  • M.P. Armstrong et al.

    Geographically masking health data to preserve confidentiality

    Statistics in Medicine

    (1999)
  • J.S. Brownstein et al.

    An unsupervised classification method for inferring original case locations from low-resolution disease maps

    International Journal of Health Geographics

    (2006)
  • E.K. Cromley et al.

    GIS and public health

    (2002)
  • A. Curtis

    From healthy start to Hurricane Katrina: Using GIS to eliminate disparities in perinatal health

    Statistics in Medicine

    (2008)
  • A. Curtis et al.

    Spatial confidentiality and GIS: Re-engineering mortality locations from published maps about Hurricane Katrina

    International Journal of Health Geographics

    (2006)
  • Gatrell, A. C., Loytonen, M. (1998). GIS and health (pp. 3–16). London: Taylor and...
  • M. Gutmann et al.

    Issues and current practices relating to confidentiality

    Population Research and Policy Review

    (2008)
  • M.P. Guttman et al.

    Putting people on the map: Protecting confidentiality with linked social–spatial data

    (2007)
  • Hakim, D. (2010). Map tracks incidences of cancer throughout New York State, New York Times (May 11, 2010 A18 New York...