
1 Introduction

Sharing human activity data can offer important benefits to society. For example, mining human mobility data derived from cell phone usage can reveal timely information about traffic conditions, and "smart city" demonstrations show how human activity data can improve city services. Often, these models require sharing beyond the individual or even the government, and a common vision is that some of the greatest benefits arise when personal activity data is shared among organizations. However, the greater the sharing, the greater the risk of personal harm, so privacy concerns may hinder widespread data sharing. These concerns are not unfounded. For example, by correlating an individual's location at a given time of the week, it may be possible to infer that person's religion. Similarly, privacy attacks against human mobility data, ranging from stalking to sensitive information disclosure, have been widely reported in practice [5, 10].

Of course, personal data can be shared widely if it cannot be attributed to a specific person: the idea is that no one can be harmed if their information cannot be isolated in the shared data. To address these privacy challenges, computer scientists have proposed mathematically rigorous techniques in the framework of differential privacy [7]. The main idea of differential privacy is that disclosed results do not change noticeably with the inclusion or exclusion of any single individual's data. Recently, differential privacy has been applied in many settings, ranging from answering basic count queries [2] to building support vector machines [8]. In almost all of these cases, the underlying differential privacy tools are designed for specific use cases, and utility is defined and tested for that use case (e.g., measuring the utility of differentially private count queries by the Euclidean distance between original and differentially private results). It is usually unclear whether a given approach can support the wide range of uses to which an actual data scientist may put the data. In this work, we investigate whether we can provide differentially private synthetic data sets that can be shared instead of an original dataset, with confidence that the resulting data will retain utility across different usage scenarios.

One challenge in understanding all the potential uses of a given dataset is that it is impossible to model human imagination: different data scientists may want to use the data in very different ways. To address this challenge, we examine an existing data set disclosed as part of the Hubway Data Visualization Challenge [1]. Hubway is a public bicycle sharing system with stations throughout Boston, Cambridge, Somerville and Brookline, designed to provide a convenient form of active public transportation through access to bicycles. The Hubway system stores users' information and generates trip data every day. Hubway data contains users' ride history and some personal information, so if it is released publicly or shared with other stakeholders, an adversary can take advantage of it and may potentially infer private information about a target. In 2012, Hubway and the Metropolitan Area Planning Council (MAPC) jointly hosted the Hubway Data Visualization Challenge, asking participants to come up with projects involving visualizations, animations, artistic representations or interactive data analysis tools. After this challenge, there were reports that some of the disclosed data could be used to identify individuals via location information disclosed on Twitter [9]. Still, submissions to the competition give us a good understanding of what data scientists may want to do with a given human mobility dataset.

To answer this question, we propose a model built under differential privacy. Our model generates a differentially private synthetic human mobility dataset from an original dataset while preserving users' privacy. Furthermore, our synthetic dataset shows very good accuracy on the most important statistical comparisons with the original dataset. Finally, we analyze whether the disclosed differentially private synthetic dataset can adequately provide what data scientists need, by evaluating the utility of our disclosed data against the Hubway challenge submissions. The main contributions of this paper are:

  • We present a generic Sanitization Model built under differential privacy for resource-sharing-based human mobility services, which generates a differentially private synthetic dataset that preserves users' trip-level privacy while sacrificing as little data utility as possible.

  • To show the applicability of the generated synthetic data, we compute and compare the most compelling statistics from both the synthetic and original datasets, and observe that the synthetic data retains very high accuracy.

  • Moreover, we conduct a thorough and extensive utility evaluation of the synthetic distributions with respect to four different utility metrics.

In Sect. 2, we review preliminaries on differential privacy. Our sanitization model is described in Sect. 3. The experimental evaluation and possible applications are discussed in Sects. 4 and 5. Section 6 discusses related work, and Sect. 7 concludes.

2 Preliminaries

Definition 2.1

Differential Privacy [7]: A privacy mechanism \(\mathcal {A}\) gives \(\epsilon \)-differential privacy if for any databases D and \(\hat{D}\) differing on at most one record, and for any possible output \(O \in Range(\mathcal {A})\),

$$\begin{aligned} Pr[\mathcal {A}(D)=O] \le e^{\epsilon } \times Pr[\mathcal {A}(\hat{D})=O] \end{aligned}$$
(1)

where the probability is taken over the randomness of \(\mathcal {A}\).

Definition 2.2

Global Sensitivity [7]: For any function \(f:D \rightarrow \mathbb {R}^{d}\), the L1-sensitivity of f is,

$$\begin{aligned} \nabla f = \max \limits _{D,\hat{D}} \parallel f(D) - f(\hat{D}) \parallel _{1} \end{aligned}$$
(2)

for all D, \(\hat{D}\) differing on at most one record.
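As an illustrative example of ours (matching the contingency-table setting of Sect. 3.1): if f reports the vector of cell counts of a contingency table, changing one record moves a single count from one cell to another, so

$$\begin{aligned} \nabla f = \parallel f(D) - f(\hat{D}) \parallel _{1} = |-1| + |+1| = 2, \end{aligned}$$

whereas adding or removing a single record changes only one cell and yields \(\nabla f = 1\).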

Theorem 2.3

Laplace Mechanism [7]: For any function \(f:D \rightarrow \mathbb {R}^{d}\), and \(\epsilon > 0\), the following mechanism \(\mathcal {A}\), called the Laplace Mechanism, is \(\epsilon \)-differentially private: \(\mathcal {A}_{f}(D) = f(D) + \big \langle Lap(\nabla f/\epsilon )\big \rangle ^{d}\).
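A minimal sketch of the Laplace Mechanism in Python (our own illustration; the function name is hypothetical, not from the paper):

```python
import numpy as np

def laplace_mechanism(true_values, sensitivity, epsilon):
    """Return f(D) with Lap(sensitivity/epsilon) noise added to each component."""
    scale = sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=scale, size=np.shape(true_values))
    return np.asarray(true_values, dtype=float) + noise

# Example: privatize a small vector of counts with sensitivity 2 and epsilon = 0.9
noisy_counts = laplace_mechanism([120, 45, 3, 0], sensitivity=2, epsilon=0.9)
```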

3 Sanitization Model

In this section, we propose a Sanitization Model, shown in Algorithm 1, which is built under differential privacy. The model takes the original dataset D and the privacy budget \(\epsilon \) as input and generates \(\epsilon \)-differentially private synthetic data \(\hat{D}\) as output. First, it removes invalid entries and outliers from the original dataset by applying the statistical 3·IQR rule [11] to the attributes. Second, the attributes are formed into non-disjoint groups based on their associativity, which can be examined using the well-known Chi-square test for independence/homogeneity or the G-test. We emphasize that no statistics are disclosed as part of this step; we assume that the grouping of attributes is public information. Third, the model builds the desired synthetic distributions for each group, as described in Sect. 3.1. Finally, the synthetic dataset is generated by taking samples from these distributions and aggregating them, as discussed in Sect. 3.2.

Algorithm 1. Sanitization Model.
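As an illustration of the preprocessing step, the sketch below removes outliers from a numeric attribute with the 3·IQR rule (a minimal example of ours using pandas, not the authors' code; the column name is hypothetical):

```python
import numpy as np
import pandas as pd

def remove_3iqr_outliers(df, column):
    """Keep rows whose value lies within [Q1 - 3*IQR, Q3 + 3*IQR] for `column`."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
    return df[df[column].between(lower, upper)]

# Example: drop trips with implausible durations before building contingency tables
# trips = remove_3iqr_outliers(trips, "duration")
```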

3.1 Constructing Synthetic Distributions

The first step in constructing a distribution is to create a contingency table (CT). For any particular group \(\mathcal {G}\), the frequency distribution of all possible distinct combinations of values of the attributes belonging to the group represents the \(CT \) of that group. Adding Laplace noise to each frequency, as in Theorem 2.3, turns it into a synthetic CT. Negative values are set to zero, since frequencies cannot be negative. The frequencies are then normalized to obtain the respective probability distribution.
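A minimal sketch of this step (ours, using numpy and pandas; column and function names are hypothetical): build the contingency table for a group of attributes, add Laplace noise of scale \(\nabla f/\epsilon \), clip negative counts to zero, and normalize.

```python
import numpy as np
import pandas as pd

def noisy_distribution(df, attributes, epsilon, sensitivity=2):
    """Contingency table over `attributes` -> epsilon-DP normalized distribution."""
    counts = df.groupby(attributes).size()   # CT over observed combinations
    # (a full CT would also enumerate zero-count combinations; omitted for brevity)
    noisy = counts + np.random.laplace(0.0, sensitivity / epsilon, size=len(counts))
    noisy = noisy.clip(lower=0)               # frequencies cannot be negative
    return noisy / noisy.sum()                # normalize to a probability table

# Example for group G1 = {start_station_id, end_station_id}:
# p_trip = noisy_distribution(trips, ["start_station_id", "end_station_id"], epsilon=0.9)
```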

In our case, the Hubway dataset has nine attributes: (1) id (trip id), I, (2) start_station_id, S, (3) end_station_id, E, (4) start_time, ST, (5) duration, L, (6) end_time, ET, (7) zip_code, Z, (8) subscription_type, U and (9) gender, G. In step 2 of Algorithm 1, we divide the attributes into five groups: \(\mathcal {G}_{1}\)- \(\{S, E\}\), \(\mathcal {G}_{2}\)- \(\{S, ST\}\), \(\mathcal {G}_{3}\)- \(\{S, E, L\}\), \(\mathcal {G}_{4}\)- \(\{S, Z\}\), and \(\mathcal {G}_{5}\)- \(\{Z, U, G\}\). Note that we do not include ET in any group because it can be calculated from ST and L. For \(\mathcal {G}_{1}\), we first build a CT, add Laplace noise, and finally convert the resulting synthetic CT to a CDF, the Trip CDF \(\hat{\varPhi }_{\mathcal {T}}\). For \(\mathcal {G}_{2}\), \(\mathcal {G}_{3}\), \(\mathcal {G}_{4}\) and \(\mathcal {G}_{5}\), rather than a single CDF, we build a number of CDFs per group. More specifically, for \(\mathcal {G}_{2}\) and \(\mathcal {G}_{4}\) we build |S| CDFs, one for each station in S. Likewise, for \(\mathcal {G}_{5}\) we build |Z| CDFs, one for each zip code. \(\mathcal {G}_{2}\) contains the attribute ST, which is in essence a combination of year, month, date and hour sub-attributes (we ignore minutes and seconds here). Taking these four sub-attributes into account, we build a CDF for each start station in S; the StartTime CDF is denoted by \(\hat{\varPhi }_{\mathcal {ST}}\). For \(\mathcal {G}_{4}\), we build a zip distribution for each start station, and the CDF for this group is denoted by \(\hat{\varPhi }_{\mathcal {SZ}}\). Similarly, for \(\mathcal {G}_{5}\) we construct a Subscription-Gender distribution for each zip, represented by \(\hat{\varPhi }_{\mathcal {ZUG}}\). In the case of \(\mathcal {G}_{3}\), instead of fitting duration to an existing parametric distribution, we build \(|S| \times |S|\) empirical distributions for Duration, one for each combination of start and end stations. We use empirical distributions because they show better results than fitted parametric Exponential, Normal and Log-normal distributions, and we build \(|S| \times |S|\) duration distributions rather than only |S| (as for \(\mathcal {G}_{2}\) and \(\mathcal {G}_{4}\)) because duration mainly depends on the distance between the two stations. The number of bins is set to 7. These empirical distributions are in fact CTs for duration, and their corresponding synthetic CDF is denoted by \(\hat{\varPhi }_{\mathcal {L}}\). Note that we add Laplace noise with the scale required by Theorem 2.3 to make the synthetic distributions \(\epsilon \)-differentially private. In all cases, the global sensitivity \(\nabla f\) is 2, since adding, removing or changing an entry can change the function value by at most 2.
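The per-route duration distributions can be sketched in the same style (ours; 7 equal-width bins per (start, end) pair, noised and normalized as above; column names are hypothetical):

```python
import numpy as np

def duration_distributions(trips, epsilon, bins=7, sensitivity=2):
    """One noisy binned duration distribution per (start_station, end_station) pair."""
    dists = {}
    for (s, e), route in trips.groupby(["start_station_id", "end_station_id"]):
        counts, edges = np.histogram(route["duration"], bins=bins)
        noisy = np.clip(counts + np.random.laplace(0.0, sensitivity / epsilon, size=bins), 0, None)
        if noisy.sum() > 0:
            dists[(s, e)] = (noisy / noisy.sum(), edges)  # (bin probabilities, bin edges)
    return dists
```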

3.2 Differentially Private Synthetic Data Generation

In this section we describe how to generate differentially private synthetic data for the Hubway from the distributions constructed in Sect. 3.1. Among the nine attributes, the id I is unique in the original dataset, so we assign a unique id to each newly generated trip entry. The steps to generate the other attributes of a particular trip i are as follows. First, select a trip \((s_{i}, e_{i})\) by random sampling from the Trip CDF, \(\hat{\varPhi }_{\mathcal {T}}\). Second, the start time \(st_{i}\) of trip i is randomly sampled from \(\hat{\varPhi }_{\mathcal {ST}}(s_{i})\), which returns a sample from the start time CDF of station \(s_{i}\). Since this gives only year, month, date and hour, we add minutes and seconds by sampling from a Uniform(0, 59) distribution. Third, to get the duration \(l_{i}\) of trip i, we take a random sample from the duration distribution of trip \((s_{i}, e_{i})\), \(\hat{\varPhi }_{\mathcal {L}}(s_{i},e_{i})\). Each sample taken from \(\hat{\varPhi }_{\mathcal {L}}(s_{i},e_{i})\) returns a bin with its start value a and end value b; to get the exact duration of trip i, we take a random sample from a Uniform(a, b) distribution. The end time \(et_{i}\) is calculated by adding \(l_{i}\) to \(st_{i}\). Fourth, we obtain \(z_{i}\) from a random sample taken from the station-zip CDF of \(s_{i}\), \(\hat{\varPhi }_{\mathcal {SZ}}(s_{i})\). Finally, a sample \((z_{i}, u_{i}, g_{i})\) is taken from \(\hat{\varPhi }_{\mathcal {ZUG}}\), where \(z_{i}\) is restricted to the value computed in the fourth step.
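A minimal sketch (ours) of this sampling procedure, where each \(\hat{\varPhi }\) is represented as a plain probability table (dict of key \(\rightarrow \) probability) or, for durations, as (bin probabilities, bin edges); all names and data structures are our own illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng()

def sample_from(prob_table):
    """Draw one key from a dict-like {key: probability} table."""
    keys = list(prob_table)
    probs = np.array([prob_table[k] for k in keys], dtype=float)
    return keys[rng.choice(len(keys), p=probs / probs.sum())]

def generate_trip(trip_id, phi_T, phi_ST, phi_L, phi_SZ, phi_ZUG):
    s, e = sample_from(phi_T)                        # 1. (start, end) from the Trip CDF
    year, month, day, hour = sample_from(phi_ST[s])  # 2. coarse start time for station s
    minute, second = rng.integers(0, 60), rng.integers(0, 60)
    probs, edges = phi_L[(s, e)]                     # 3. binned duration for this route
    b = rng.choice(len(probs), p=probs / probs.sum())
    duration = rng.uniform(edges[b], edges[b + 1])   #    exact value uniform within the bin
    z = sample_from(phi_SZ[s])                       # 4. zip code given start station
    u, g = sample_from({k[1:]: v for k, v in phi_ZUG.items() if k[0] == z})  # 5. (U, G) | z
    return dict(id=trip_id, start_station_id=s, end_station_id=e,
                start_time=(year, month, day, hour, minute, second),
                duration=duration, zip_code=z, subscription_type=u, gender=g)
```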

4 Experimental Evaluation

In our experiments, we use the Hubway trip history data released in [1] in February 2014, which contains a total of 1,029,739 entries. For simplicity and without loss of generality, we work on nine of its attributes. We show some aspects of the applicability and effective use of synthetic data in this section. Our experiments show that 0.9 is the lowest value of \(\epsilon \) at which we obtain maximum utility. We run the experiment 20 times and compute the following statistics in each run; in Fig. 1, we report the average and standard deviation \(\sigma \) over the 20 runs.

Fig. 1. Statistical Analysis: (a) Top 5 popular stations (outgoing trips), (b) Top 5 popular stations (incoming trips), (c) Top 5 popular days and (d) Top 5 popular trip routes in original and synthetic (\(\epsilon =0.9\)) data [20 runs].

For each station, there are two types of trips: outgoing and incoming. A trip is considered outgoing at its start station and incoming at its destination. Both statistics are important in practice, so we study both cases here. Figure 1(a) shows the percentage of outgoing trips for the top 5 popular stations in the original data and their corresponding percentages in the synthetic data with \(\sigma \). As we observe, the trip percentages are essentially identical in both datasets, with very low deviation. Similar statistics for incoming trips are shown in Fig. 1(b), with a similar observation. Besides popular stations, finding popular days is another essential statistic for planning purposes. Figure 1(c) shows the top 5 popular days in the original dataset with their trip percentages and the corresponding percentages in the synthetic data, along with their standard deviations. The percentage of each popular day in the original and synthetic data is almost the same, and the corresponding \(\sigma \) is very low. Finding popular trip routes is another statistic that carries important information. In Fig. 1(d), we show the top 5 popular trip routes in the original data and the corresponding statistics in the synthetic data; popularity is measured as the percentage of trips in the entire dataset. The results show that all top 5 popular trip routes in the original data have nearly the same percentage of trips in the synthetic data as well, with very low \(\sigma \).
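For reference, such summary statistics are simple to recompute on either dataset; a minimal pandas sketch (ours, with a hypothetical column name) for the outgoing case:

```python
def top5_outgoing_share(trips):
    """Percentage of all trips that start at each of the 5 most popular start stations."""
    shares = trips["start_station_id"].value_counts(normalize=True) * 100
    return shares.head(5)

# Compare: top5_outgoing_share(original_trips) vs. top5_outgoing_share(synthetic_trips)
```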

We also compute some other statistics but, due to space constraints, the corresponding figures are not shown in the paper; we briefly discuss them here. Comparing the trip percentages in different time periods between the two datasets is another important measure of utility. The results show that the noise impact is negligible and that in all cases (Morning, Afternoon and Night) the synthetic data preserves the original statistics almost precisely; for example, the difference in the morning trip percentage is only 0.15 with \(\sigma \) 0.103. The gender distribution for each station may be useful in some practical applications (e.g., targeted ads for given stations). We pick a few stations randomly to examine their gender distributions, and the synthetic data again shows promising results. For example, station 67 has gender distribution Male \(62.68\,\%\), Female \(14.04\,\%\), X \(23.28\,\%\) in the original data, and \(57.74\,\%\), \(19.00\,\%\), \(23.26\,\%\) in the synthetic data with \(\sigma \) 0.25, 0.22 and 0.20, respectively. The subscription distribution per station is another relevant statistic with a significant impact on resource optimization (e.g., for Smart City applications). The subscription distribution of (Registered, Casual) for station 67 in the original data is (\(76.72\,\%\), \(23.28\,\%\)) and in the synthetic data it is (\(76.74\,\%\), \(23.26\,\%\)) with \(\sigma \) 0.20, which is nearly identical to the original statistics; the results are very similar for other stations as well. However, the comparison of original and synthetic trip durations shows that the synthetic data measures the overall average duration almost accurately but fails to capture the maximum and minimum durations precisely. The result is even worse if we use a parametric distribution for duration. We notice that a significant number of cells in the empirical distribution have very low frequency, which is why a notable noise impact is reflected in the synthetic results.

Furthermore, we study four utility metrics (Average Relative Error (Avg. RLE), Earth Mover's Distance (EMD), True Positive (TP) and Utility Loss (UL)) to compare the synthetic distributions with their original distributions. The results show that the range of Avg. RLE is \([0.06, 0.30]\) with a \(\sigma \) range of \([0.003, 0.015]\) for \(\epsilon \) from 0.1 to 1. The corresponding ranges of EMD, TP and UL are \([0.06, 0.99]\), \([95.25, 97.5]\) and \([4.41, 10.1]\), with \(\sigma \) ranges \([0.021, 0.124]\), \([2.22, 3.08]\) and \([2.22, 3.03]\), respectively. Due to space constraints, the figures are not shown here.
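As a sketch of how two of these metrics can be computed (ours; one common formulation of average relative error with a sanity bound on the denominator, and EMD via scipy's one-dimensional Wasserstein distance; the exact definitions used in our evaluation may differ):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def avg_relative_error(original, synthetic, sanity_bound=1.0):
    """Mean relative error between matched counts, denominator bounded away from zero."""
    o, s = np.asarray(original, dtype=float), np.asarray(synthetic, dtype=float)
    return float(np.mean(np.abs(o - s) / np.maximum(o, sanity_bound)))

def emd_1d(original_samples, synthetic_samples):
    """Earth Mover's Distance between two one-dimensional empirical distributions."""
    return wasserstein_distance(original_samples, synthetic_samples)
```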

5 Discussion

In 2012, Hubway and the Metropolitan Area Planning Council (MAPC) jointly hosted the Hubway Data Visualization Challenge [1], asking participants to come up with projects involving visualizations, animations, artistic representations or interactive data analysis tools. The challenge received a total of 67 projects that used the original data provided by the host. We went through the short description and/or demo provided with each of these projects to identify the statistics computed by most of the participants. Comparisons for the top 7 of these statistics are shown in Sect. 4, and releasing differentially private data appears to preserve utility in each case.

Due to the way the original data is released, we do not provide user-level privacy; our synthetic data provides trip-level privacy instead (i.e., the sensitivity is computed based on adding or removing one trip, not on adding or removing an individual). The synthetic data also provides more protection than the original dataset with respect to the Intimate Stalker threat [10]. First, unlike the original data set, the synthetic data does not release real-time visit information. Moreover, it is \(\epsilon \)-differentially private, which means it hides any particular trip with \(\epsilon \) privacy. As a result, identity as well as location resolution would be much harder for an intimate stalker using the synthetic data than using the original data.

6 Related Work

A few works [2–4, 6] address publishing and characterizing human mobility based on cellular network and other spatio-temporal data, all building their models under differential privacy. Chen et al. [4] study the problem of publishing trajectory data of commuters in Montreal. In [3], the authors make use of a variable-length n-gram model. Mir et al. [6] model human mobility based on Call Detail Records from a cellular telephone network. Acs et al. [2] present a new anonymization scheme to release the spatio-temporal density of Paris, France. All of these papers address a specific utility goal, which inspires us to study the possible utility for arbitrary queries.

7 Conclusion

In this paper, we propose a sanitization model for the Hubway dataset, built under differential privacy, that preserves users' trip-level privacy. To show the applicability and utility of the generated synthetic data for an arbitrary range of queries, we compare the most essential and compelling statistics derived from both the synthetic and original datasets. Based on the comparison results, we conclude that most of the information required by human analysts can be provided accurately by differentially private synthetic data. We also discuss how releasing synthetic data, compared to releasing the original data, could reduce threats due to attacks such as the Intimate Stalker.