Anonymization of moving objects databases by clustering and perturbation
Introduction
With the pervasiveness of mobile phones and other location-aware devices, the volume of traces left by moving objects and collected daily by service providers is continuously increasing. The wealth of space–time trajectories left by these personal devices and their human companions is expected to enable novel classes of applications, where the discovery of consumable, concise, and applicable knowledge is the key step. These mobile trajectories contain detailed information about personal and vehicular mobile behavior, and therefore offer practical opportunities to find behavioral patterns, to be used for instance in traffic and sustainable mobility management, e.g., to study the accessibility of services.
Clearly, privacy is a concern in these applications, since location data enable intrusive inferences, which may reveal habits, social customs, and religious and sexual preferences of individuals, and can be used for unauthorized advertisement and user profiling. As an example, consider a traffic control application that collects vehicle movements. In a naïve attempt to preserve anonymity, the car identifiers are not disclosed but instead replaced with pseudonyms. However, as shown in [4], such an operation is insufficient to guarantee anonymity, as location is a property that in some circumstances can lead to the identification of the individual. If one is known to follow the same route almost every morning, it is very likely that the starting point is the home of the individual and the ending point is the workplace. By joining this information with a telephone directory, we can easily link the trajectory to its owner.
In this paper we study the problem of preserving privacy when publishing data from moving objects databases. We extend the classical concept of k-anonymity [1] to deal with this particular form of data, and to exploit its inherent uncertainty [5], [6], [7]. Because the energy available to a mobile device is very limited, a mobile object cannot continuously send out its location information. To reduce energy consumption, many methods [8], [9] predict the expected location of a mobile object at a given time t using a predictive model, e.g., a Kalman filter or a linear model. If the actual location of the mobile object differs from the predicted location by more than an uncertainty threshold $\delta$, the mobile object reports its new location; otherwise it does not. The threshold $\delta$ is defined by an agreement between the server and the moving object. For the sake of presentation, in the following we assume a common $\delta$, although our framework can easily handle different values of $\delta$, as discussed in Section 9.
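The threshold-based reporting policy described above can be illustrated with a minimal sketch. It assumes a linear motion model; the function names (`predict_linear`, `should_report`) and the tuple layouts are hypothetical choices for illustration and do not come from the cited methods [8], [9].

```python
import math

def predict_linear(last_report, velocity, t):
    """Expected (x, y) at time t under a linear motion model.

    last_report is (x, y, t0); velocity is (vx, vy). Both layouts are
    assumptions made for this sketch.
    """
    x0, y0, t0 = last_report
    vx, vy = velocity
    dt = t - t0
    return (x0 + vx * dt, y0 + vy * dt)

def should_report(actual_xy, last_report, velocity, t, delta):
    """True if the actual position deviates from the prediction by more than delta."""
    px, py = predict_linear(last_report, velocity, t)
    return math.hypot(actual_xy[0] - px, actual_xy[1] - py) > delta

# An object last reported at the origin, moving along the x axis at unit speed.
last = (0.0, 0.0, 0.0)
vel = (1.0, 0.0)
on_course = should_report((5.0, 0.5), last, vel, 5.0, delta=1.0)   # deviation 0.5
off_course = should_report((5.0, 2.0), last, vel, 5.0, delta=1.0)  # deviation 2.0
```

As long as no report is sent, the server only knows that the object lies within distance delta of the predicted location, which is the uncertainty the rest of the paper exploits.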
The basic idea underlying our proposal is to exploit this inherent position uncertainty in moving objects data in order to reduce the distortion needed to anonymize such data before their release.
Following Trajcevski et al. [6], an uncertain trajectory is defined as a cylindrical volume of radius $\delta$.
Definition 1 (Uncertain trajectory [6]) A trajectory of a moving object is a polyline in three-dimensional space represented as a sequence of spatio-temporal points: $(x_1,y_1,t_1), (x_2,y_2,t_2), \ldots, (x_n,y_n,t_n)$ with $t_1 < t_2 < \cdots < t_n$. During the time segment $[t_i,t_{i+1}]$ the object is assumed to move along a straight line from $(x_i,y_i)$ to $(x_{i+1},y_{i+1})$ at a constant speed. Given a trajectory $\tau$ between times $t_1$ and $t_n$, and an uncertainty threshold $\delta$, the pair $(\tau,\delta)$ defines an uncertain trajectory. For each point $(x,y,t)$ along $\tau$, its uncertainty area is the horizontal disk (i.e., circle and its interior) with radius $\delta$ and centered at $(x,y,t)$, where $(x,y)$ is the expected location at time $t \in [t_1,t_n]$. The trajectory volume of $(\tau,\delta)$ is the union of all such disks for all $t \in [t_1,t_n]$. A possible motion curve of $(\tau,\delta)$ is any continuous function $f$ defined on the interval $[t_1,t_n]$ such that for any $t \in [t_1,t_n]$, the spatio-temporal point $(f(t),t)$ is inside the uncertainty area at time $t$.
Definition 1 is graphically represented in Fig. 1. Intuitively, two trajectories are indistinguishable if they are defined in the same time interval and follow almost the same route w.r.t. the uncertainty threshold.
Definition 2 (Co-localization) Two trajectories $\tau_1$, $\tau_2$ defined in $[t_1,t_n]$ are said to be co-localized w.r.t. $\delta$ iff for each point $(x_1,y_1,t)$ along $\tau_1$ and $(x_2,y_2,t)$ along $\tau_2$ with $t \in [t_1,t_n]$, it holds that $Dist((x_1,y_1),(x_2,y_2)) \le \delta$, where Dist is the Euclidean distance: $Dist((x_1,y_1),(x_2,y_2)) = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$. We write $Coloc_\delta(\tau_1,\tau_2)$, omitting the time interval $[t_1,t_n]$.
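The co-localization test of Definition 2 can be sketched as follows, under the assumption that both trajectories are sampled at exactly the same timestamps. In that case checking the sample points suffices: between two consecutive samples both objects move linearly, so their distance is a convex function of time and reaches its maximum at a segment endpoint. The function name `colocalized` and the `(x, y, t)` tuple layout are illustrative choices, not the authors' code.

```python
import math

def colocalized(traj_a, traj_b, delta):
    """Check co-localization w.r.t. delta for two lists of (x, y, t) samples.

    Assumes identical timestamps in both trajectories (raises otherwise).
    """
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must have the same number of samples")
    for (xa, ya, ta), (xb, yb, tb) in zip(traj_a, traj_b):
        if ta != tb:
            raise ValueError("trajectories must share the same timestamps")
        # Euclidean distance between the two positions at the same instant.
        if math.hypot(xa - xb, ya - yb) > delta:
            return False
    return True

a = [(0.0, 0.0, 0), (1.0, 0.0, 1), (2.0, 0.0, 2)]
b = [(0.0, 0.5, 0), (1.0, 0.8, 1), (2.0, 0.5, 2)]  # never farther than 0.8 from a
c = [(0.0, 0.5, 0), (1.0, 1.5, 1), (2.0, 0.5, 2)]  # 1.5 away from a at t = 1

a_b = colocalized(a, b, delta=1.0)
a_c = colocalized(a, c, delta=1.0)
```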
Note that the definition above requires two trajectories to be defined over exactly the same time interval in order to be co-localized. Although in real-world data it is quite unusual to have two trajectories starting and ending at exactly the same time instants, in practice this problem can be tackled by allowing small time gaps, by selecting coarser time samplings, or, more generally, by introducing a small information loss. We will discuss later how we deal with this constraint.
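Another way to express the co-localization of trajectories is to say that each one is a possible motion curve of the other.
Proposition 1 $Coloc_\delta(\tau_1,\tau_2)$ holds iff $\tau_1$ is a possible motion curve of $(\tau_2,\delta)$, and iff $\tau_2$ is a possible motion curve of $(\tau_1,\delta)$.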
It is worth noting that co-localization does not induce a partition of a set of trajectories.
Proposition 2 $Coloc_\delta$ is a reflexive and symmetric relation. For $\delta > 0$, it is not transitive, and thus it does not induce equivalence classes; while for $\delta = 0$, $Coloc_\delta$ is also transitive and thus it is an equivalence relation.
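A small worked counterexample, constructed here for illustration and not taken from the paper, shows why transitivity fails for a positive threshold. Writing $\delta$ for the uncertainty threshold and Dist for the Euclidean distance, consider three stationary objects on a line, observed over the same time interval:

```latex
% Three stationary objects, each at a fixed position for every t in [t_1, t_n]
\tau_1(t) = (0, 0), \qquad \tau_2(t) = (\delta, 0), \qquad \tau_3(t) = (2\delta, 0)

% Pairwise distances at any instant t
Dist(\tau_1(t), \tau_2(t)) = \delta \le \delta, \qquad
Dist(\tau_2(t), \tau_3(t)) = \delta \le \delta, \qquad
Dist(\tau_1(t), \tau_3(t)) = 2\delta > \delta \quad \text{for } \delta > 0
```

The first and second trajectories are co-localized, and so are the second and third, but the first and third are not. When $\delta = 0$, co-localization forces the two trajectories to coincide at every instant, and equality is transitive.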
Given an anonymity threshold k, we can define an anonymity set as a set of at least k trajectories that are pairwise co-localized.
Definition 3 (Anonymity set of trajectories) Given an uncertainty threshold $\delta$ and an anonymity threshold k, a set S of trajectories is a $(k,\delta)$-anonymity set iff $|S| \ge k$ and $Coloc_\delta(\tau_i,\tau_j)$ holds for all $\tau_i, \tau_j \in S$.
The following properties further characterize an anonymity set of trajectories.
Proposition 3 A set of trajectories S, with $|S| \ge k$, is a $(k,\delta)$-anonymity set iff there exists a trajectory $\tau_c$ such that all the trajectories in S are possible motion curves of $\tau_c$ within an uncertainty radius of $\delta/2$. Given a $(k,\delta)$-anonymity set S, the trajectory $\tau_c$ is obtained by taking, for each $t \in [t_1,t_n]$, the point (x,y) that is the center of the minimum bounding circle of all the points at time t of all trajectories in S.
Therefore, an anonymity set of trajectories can be bounded by a cylindrical volume of radius $\delta/2$. In Fig. 2, we graphically represent this property.
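The problem we study in this article is that of $(k,\delta)$-anonymizing a database of trajectories of moving objects.
Problem 1 Given a dataset of trajectories $D$, an uncertainty threshold $\delta$ and an anonymity threshold k, the problem of $(k,\delta)$-anonymity requires transforming $D$ into a $(k,\delta)$-anonymized dataset $D'$, such that for each trajectory $\tau \in D'$ there exists a $(k,\delta)$-anonymity set $S \subseteq D'$ with $\tau \in S$; and the distortion between $D$ and $D'$ is minimized.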
In this article we study the problem of anonymity preserving data publishing in moving objects databases, as formally defined above. In the next section we place our contribution within a heterogeneous literature ranging from anonymity and data publishing to moving objects databases and location based services.
In Section 3 we introduce the basic technique that we use to enforce $(k,\delta)$-anonymity, namely space translation; we develop a suitable measure of the information distortion introduced by space translation; and we prove that the problem of achieving $(k,\delta)$-anonymity with minimum distortion is NP-hard.
Therefore, we propose a two-step greedy method: in the first step, by means of k-member constrained clustering, we group trajectories into clusters having at least k elements; in the second step we perform the minimum space translation needed to push all the trajectories of a cluster within a cylindrical volume of radius $\delta/2$ (according to Proposition 3), making them a $(k,\delta)$-anonymity set.
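The two steps can be sketched as follows. This is a deliberately simplified illustration, not the authors' algorithm: trajectories are lists of `(x, y)` points at shared timestamps, the clustering simply takes each pivot's nearest neighbours, and the per-timestamp centroid is used as a stand-in for the center of the minimum bounding circle mentioned in Proposition 3. The names `traj_dist`, `greedy_clusters` and `translate_cluster` are hypothetical.

```python
import math

def traj_dist(a, b):
    """Sum of point-wise Euclidean distances (assumes equal length, same timestamps)."""
    return sum(math.dist(p, q) for p, q in zip(a, b))

def greedy_clusters(trajs, k):
    """Step 1 (simplified): clusters of trajectory indices, each of size >= k."""
    if len(trajs) < k:
        raise ValueError("need at least k trajectories")
    unassigned = list(range(len(trajs)))
    clusters = []
    while len(unassigned) >= k:
        pivot = unassigned.pop(0)
        # The pivot takes its k-1 nearest unassigned neighbours.
        unassigned.sort(key=lambda i: traj_dist(trajs[pivot], trajs[i]))
        clusters.append([pivot] + unassigned[:k - 1])
        unassigned = unassigned[k - 1:]
    # Fewer than k leftovers: attach each to the cluster with the nearest pivot.
    for i in unassigned:
        nearest = min(clusters, key=lambda c: traj_dist(trajs[c[0]], trajs[i]))
        nearest.append(i)
    return clusters

def translate_cluster(cluster_trajs, delta):
    """Step 2 (simplified): move points so each lies within delta/2 of a center."""
    radius = delta / 2.0
    out = [list(t) for t in cluster_trajs]
    for j in range(len(cluster_trajs[0])):
        # Assumption: centroid instead of the minimum-bounding-circle center.
        cx = sum(t[j][0] for t in cluster_trajs) / len(cluster_trajs)
        cy = sum(t[j][1] for t in cluster_trajs) / len(cluster_trajs)
        for t in out:
            x, y = t[j]
            d = math.hypot(x - cx, y - cy)
            if d > radius:
                s = radius / d  # shrink towards the center onto the delta/2 circle
                t[j] = (cx + (x - cx) * s, cy + (y - cy) * s)
    return out

trajs = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(10.0, 10.0), (11.0, 10.0)],
    [(0.0, 2.0), (1.0, 2.0)],
    [(10.0, 13.0), (11.0, 13.0)],
]
clusters = greedy_clusters(trajs, k=2)
anonymized = [translate_cluster([trajs[i] for i in c], delta=1.0) for c in clusters]
```

Whatever center is chosen, every translated point ends up within delta/2 of it, so any two trajectories in a cluster are at most delta apart at each timestamp. The choice of center only affects how much distortion the translation introduces.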
During a preliminary investigation (not reported in this article but available in our technical report [10]) we adapted many classical clustering methods to deal with the k-member constraint and to handle trajectory data. Through an experimental comparison we selected greedy clustering as a good trade-off between minimization of information distortion and efficiency. This simple method is the basis of all the algorithms that we present in this paper.
In Section 4 we recall our previous method, namely Never Walk Alone (NWA) [2]. NWA is obtained by enhancing the basic greedy clustering algorithm with ad hoc pre-processing and with techniques to produce compact clusters at the price of suppressing some outlier trajectories.
Starting from a discussion of the limits of NWA, in Section 5 we develop a novel method that is based on the EDR distance [3] (instead of the Euclidean distance, as NWA was) and therefore has the important feature of being time-tolerant. The novel method is named Wait for Me (W4M). To the best of our knowledge our algorithm is the first to use EDR as distance function within a trajectory clustering framework, thus providing further assessment of the merits of this distance.
Another novel idea that we introduce is to exploit the EDR computation also as a guide on how to perform the last step of the anonymization process. After having clustered trajectories we need to modify each cluster to make it an anonymity set. Since EDR is an edit distance, EDR itself suggests how to do this spatio-temporal editing: this means that we are able to reuse the computation done during the clustering phase in the point translation phase as well. The technical details on how to do spatio-temporal editing efficiently are reported in Section 5.2.
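For readers unfamiliar with EDR, the following is a minimal sketch of the standard dynamic program for Edit Distance on Real sequence as introduced in [3], restricted to the spatial coordinates. Two points match when they differ by at most a tolerance in each coordinate; a mismatch, an insertion, or a deletion costs 1. The parameter name `eps` and the function name `edr` are choices made here, and the time-tolerant variant actually used in the paper is not reproduced.

```python
def edr(r, s, eps):
    """EDR between two lists of (x, y) points, with matching tolerance eps."""
    n, m = len(r), len(s)
    # dp[i][j] = EDR between the first i points of r and the first j points of s.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all i points of r
    for j in range(m + 1):
        dp[0][j] = j  # insert all j points of s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = (abs(r[i - 1][0] - s[j - 1][0]) <= eps
                     and abs(r[i - 1][1] - s[j - 1][1]) <= eps)
            subcost = 0 if match else 1
            dp[i][j] = min(dp[i - 1][j - 1] + subcost,  # match or substitute
                           dp[i - 1][j] + 1,            # skip a point of r
                           dp[i][j - 1] + 1)            # skip a point of s
    return dp[n][m]

r = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
s = [(0.0, 0.0), (2.0, 2.0)]            # r with its middle point missing
u = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0)]  # r with an outlier as last point

d_gap = edr(r, s, eps=0.25)
d_outlier = edr(r, u, eps=0.25)
```

The table costs O(n·m) time for sequences of lengths n and m. Backtracking through such a table recovers a sequence of edit operations, which is one plausible reading of how the clustering-phase computation can guide the later editing step.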
Our experiments in Section 6 confirm that W4M produces higher quality data than NWA. However, it might be prohibitively expensive for large and complex datasets. Thus, in Section 7 we develop techniques to make W4M scalable. In particular, in Section 7.1 we introduce a novel O(n) spatio-temporal distance function, named LSTD (linear spatio-temporal distance). Being linear, LSTD has the same computational cost as the Euclidean distance, but it does not share its limitations: LSTD is time-tolerant, can be applied to trajectories of different lengths, and is tolerant to outliers. In practice, it represents a good trade-off between the Euclidean distance and EDR. In Section 8 all the variants of W4M are empirically evaluated in terms of data quality and efficiency, and compared to their predecessor NWA. Data quality is assessed both by means of objective measures of information distortion and by more usability-oriented measures, i.e., by comparing the results of (i) spatio-temporal range queries and (ii) frequent pattern mining, executed on the original database and on the anonymized one.
The experiments show that the new techniques make W4M scalable to very large datasets, without giving up the quality of the anonymization.
Relational data anonymity
The traditional k-anonymity framework [1] focuses on relational tables: the basic assumptions are that the table to be anonymized contains entity-specific information, that each tuple in the table corresponds uniquely to an individual, and that attributes are divided into quasi-identifiers (i.e., a set of attributes whose values in combination can be linked to external information to reidentify the individual to whom the information refers); and sensitive attributes (publicly unknown and that
Techniques for trajectory anonymity
In the following we discuss various techniques that could be used to enforce trajectory anonymity. We start by discussing the basic techniques used in the classical k-anonymity setting, generalization and suppression [48]; then we discuss the condensation approach [35]; and finally we introduce the main technique adopted in this paper, namely space translation.
According to Definition 2, two trajectories to be co-localized must be defined over the same time interval. Informally we call this
The NWA algorithm
In order to assess which clustering approach is most suitable for our purposes, we have extended a large variety of well known clustering schemes to make them handle trajectories and the constraint that each cluster must have a population of at least k and at most 2k−1 elements.
We have prototyped and experimentally compared: hierarchical divisive and hierarchical agglomerative clustering, k-means, greedy clustering, mixture of Gaussians, and density-based clustering. Entering in the details of
W4M: time-tolerant anonymization
We conducted various experiments in order to assess the effectiveness and efficiency of NWA (these experiments are thoroughly reported in [2], [10], and later in Section 6). The experimental assessment confirmed that NWA, despite its efficiency, is able to provide $(k,\delta)$-anonymization with very low distortion for a wide range of values of $\delta$ and k, with distortion rising only for high values of k in combination with small values of $\delta$.
However, there is still plenty of room for improvement. In particular, the main
Experimental comparison with NWA
In this section we report an empirical comparison between our new proposal W4M and its predecessor NWA, in the same experimental setting as [2].
Scaling to large databases
Algorithm W4M faces two main computational challenges: (i) pairwise distance computation between trajectories, and (ii) quadratic time (w.r.t. trajectory length) of each pairwise distance computation. In other terms, the algorithm is quadratic both in the number of trajectories (which we call the horizontal dimension) and in the mean trajectory length (the vertical dimension). Although some optimizations are already applied to W4M, as described in Section 5.3, the algorithm is still impractical with very large
More experiments
In this section we report the final experimentation comparing the various members of the W4M family of methods and NWA. We coded our methods in C, and they are available on the corresponding webpages (links are on the first page of this article). All the experiments were performed on an Intel Xeon 2 GHz processor with 1 GB of RAM on a Linux 2.6.14 platform. The settings of the experiments on the Oldenburg dataset that we report in the following are the same as described previously in Section 6.
Conclusions, extensions and open issues
We studied the problem of anonymity preserving data publishing in moving objects databases and introduced the concept of $(k,\delta)$-anonymity, which exploits the inherent uncertainty of location in order to reduce the amount of distortion needed to anonymize data. We characterized the problem in depth and developed various methods to solve it. In particular we first recalled our previous proposal NWA, which is essentially a greedy clustering method for trajectories equipped with ad hoc pre-processing
Acknowledgements
The authors express their gratitude to the GeoPKDD project for the Milan dataset, and to Kristen LeFevre for providing the implementation of the Mondrian algorithm.
This work was started during the tenure of Osman Abul's ERCIM fellowship at ISTI-CNR. Mirco Nanni is supported by the EU project GeoPKDD (IST-6FP-014915). Osman Abul is supported by TUBITAK, project number 108E016.
References (56)
- P. Samarati, L. Sweeney, Generalizing data to provide anonymity when disclosing information (abstract), in: Proceedings...
- O. Abul, F. Bonchi, M. Nanni, Never Walk Alone: uncertainty for anonymity in moving objects databases, in: Proceedings...
- L. Chen, M.T. Özsu, V. Oria, Robust and fast similarity search for moving object trajectories, in: Proceedings of the...
- C. Bettini, X.S. Wang, S. Jajodia, Protecting privacy against location-based personal identification, in: Proceedings...
- O. Wolfson, S. Chamberlain, S. Dao, L. Jiang, G. Mendez, Cost and imprecision in modeling the position of moving...
- et al., Managing uncertainty in moving objects databases, ACM Transactions on Database Systems (2004)
- D. Pfoser, C.S. Jensen, Capturing the uncertainty of moving-object representations, in: Proceedings of the 6th...
- et al., Updating and querying databases that track mobile units, Distributed and Parallel Databases (1999)
- A. Jain, E.Y. Chang, Y.-F. Wang, Adaptive stream resource management using Kalman filters, in: Proceedings of the 2004...
- O. Abul, F. Bonchi, M. Nanni, Never Walk Alone: trajectory anonymity via clustering, Technical Report ISTI-007/2007,...
- Anonymity preserving pattern discovery, VLDB Journal
- Protecting privacy in continuous location-tracking applications, IEEE Security & Privacy Magazine
- Moving Objects Databases
- Indexing spatiotemporal archives, VLDB Journal
1 Software freely available at: www-kdd.isti.cnr.it/W4M/.
2 Software freely available at: www-kdd.isti.cnr.it/NWA/.