Elsevier

Information Systems

Volume 35, Issue 8, December 2010, Pages 884-910
Anonymization of moving objects databases by clustering and perturbation

https://doi.org/10.1016/j.is.2010.05.003

Abstract

Preserving individual privacy when publishing data is a problem that is receiving increasing attention. Thanks to its simplicity the concept of k-anonymity, introduced by Samarati and Sweeney [1], established itself as one fundamental principle for privacy preserving data publishing. According to the k-anonymity principle, each release of data must be such that each individual is indistinguishable from at least k−1 other individuals.

In this article we tackle the problem of anonymization of moving objects databases. We propose a novel concept of k-anonymity based on co-localization, which exploits the inherent uncertainty of the moving object's whereabouts. Due to sampling and the imprecision of positioning systems (e.g., GPS), the trajectory of a moving object is no longer a polyline in three-dimensional space; instead it is a cylindrical volume whose radius δ represents the possible location imprecision: we know that the trajectory of the moving object is within this cylinder, but we do not know exactly where. If another object moves within the same cylinder, the two are indistinguishable from each other. This leads to the definition of (k,δ)-anonymity for moving objects databases. We first characterize the (k,δ)-anonymity problem, then we recall NWA (NeverWalkAlone), a method that we introduced in [2] based on clustering and spatial perturbation. Starting from a discussion of the limits of NWA, we develop a novel clustering method that, being based on EDR distance [3], has the important feature of being time-tolerant. As a consequence it perturbs trajectories both in space and time. The novel method, named W4M (WaitforMe), is empirically shown to produce higher quality anonymization than NWA, at the price of higher computational requirements. Therefore, in order to make W4M scalable to large datasets, we introduce two variants based on a novel (and computationally cheaper) time-tolerant distance function, and on chunking.

All the variants of W4M¹ are empirically evaluated in terms of data quality and efficiency, and thoroughly compared to their predecessor NWA.² Data quality is assessed both by means of objective measures of information distortion, and by more usability-oriented measures, i.e., by comparing the results of (i) spatio-temporal range queries and (ii) frequent pattern mining, executed on the original database and on the (k,δ)-anonymized one.

Experimental results over both real-world and synthetic mobility data confirm that, for a wide range of values of δ and k, the relative distortion introduced by our anonymization methods is kept low. Moreover, the techniques introduced to make W4M scalable to large datasets achieve their goal without giving up data quality in the anonymization process.

Introduction

With today's pervasiveness of mobile phones and other location-aware devices, the amount of traces left by moving objects and collected daily by service providers is continuously increasing. The wealth of space–time trajectories left by these personal devices and their human companions is expected to enable novel classes of applications, where the discovery of consumable, concise, and applicable knowledge is the key step. These mobile trajectories contain detailed information about personal and vehicular mobile behavior, and therefore offer interesting practical opportunities to find behavioral patterns, to be used for instance in traffic and sustainable mobility management, e.g., to study the accessibility of services.

Clearly, in these applications privacy is a concern, since location data enable intrusive inferences, which may reveal habits, social customs, religious and sexual preferences of individuals, and can be used for unauthorized advertisement and user profiling. As an example, consider a traffic control application that collects vehicle movements. In a naïve attempt to preserve anonymity, the car identifiers are not disclosed but instead replaced with pseudonyms. However, as shown in [4], such an operation is insufficient to guarantee anonymity, as location is a property that in some circumstances can lead to the identification of the individual. If one is known to follow the same route almost every morning, it is very likely that the starting point is the home of the individual and the ending point is the workplace. Joining this information with some telephone directories, we can easily link the trajectory to its owner.

In this paper we study the problem of preserving privacy when publishing data from moving objects databases. We extend the classical concept of k-anonymity [1] to deal with this particular form of data, and to exploit its inherent uncertainty [5], [6], [7]. In fact the energy in a mobile device is very limited, so it is impractical for a mobile object to continuously send out its location information. To reduce the energy consumption, many methods [8], [9] have been developed for predicting the expected location of a mobile object at a given time t, using some predictive model, e.g., Kalman filter, linear model, etc. If the actual location of the mobile object differs by more than an uncertainty threshold δ from the predicted location, then the mobile object reports the new location, otherwise it does not. The threshold δ is defined by an agreement between the server and the moving object. For the sake of presentation, in the following we assume a common δ, although our framework can easily handle different δ's, as discussed in Section 9.
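The reporting policy just described can be illustrated with a minimal sketch. It assumes the simplest predictive model mentioned above (a linear model with a fixed velocity); the function names and the fixed-velocity simplification are ours, introduced only for illustration, and are not taken from [8], [9].

```python
import math

def linear_prediction(last_report, velocity, t):
    """Predict the position at time t from the last reported (x, y, t0),
    assuming constant velocity (vx, vy). Hypothetical helper."""
    x0, y0, t0 = last_report
    vx, vy = velocity
    dt = t - t0
    return (x0 + vx * dt, y0 + vy * dt)

def should_report(actual, predicted, delta):
    """The device transmits only when it deviates from the prediction by more than delta."""
    return math.dist(actual, predicted) > delta

def simulate(track, velocity, delta):
    """Replay a list of (x, y, t) samples and return the ones the device would transmit.
    The first sample is always sent; the velocity is kept fixed for simplicity."""
    reports = [track[0]]
    for x, y, t in track[1:]:
        predicted = linear_prediction(reports[-1], velocity, t)
        if should_report((x, y), predicted, delta):
            reports.append((x, y, t))
    return reports

track = [(0, 0, 0), (1, 0, 1), (2, 0.5, 2), (3, 3, 3)]
sent = simulate(track, velocity=(1, 0), delta=1.0)
```

Between two transmitted samples the server only knows that the object is within δ of the predicted position, which is the uncertainty that the rest of the paper exploits.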

The basic idea underlying our proposal is to exploit this inherent position uncertainty in moving objects data, in order to keep the distortion needed for anonymizing such data before their release low.

Following Trajcevski et al. [6] an uncertain trajectory is defined as a cylindrical volume of radius δ.

Definition 1 Uncertain trajectory [6]

A trajectory of a moving object is a polyline in three-dimensional space represented as a sequence of spatio-temporal points: $(x_1,y_1,t_1),(x_2,y_2,t_2),\ldots,(x_n,y_n,t_n)$ $(t_1<t_2<\cdots<t_n)$. During the time segment $[t_i,t_{i+1}]$ the object is assumed to move along a straight line from $(x_i,y_i)$ to $(x_{i+1},y_{i+1})$ at a constant speed. Given a trajectory $\tau$ between times $t_1$ and $t_n$, and an uncertainty threshold $\delta$, the pair $\langle\tau,\delta\rangle$ defines an uncertain trajectory. For each point $(x,y,t)$ along $\tau$, its uncertainty area is the horizontal disk (i.e., circle and its interior) with radius $\delta$ and centered at $(x,y,t)$, where $(x,y)$ is the expected location at time $t\in[t_1,t_n]$. The trajectory volume of $\langle\tau,\delta\rangle$, denoted $Vol(\tau,\delta)$, is the union of all such disks for all $t\in[t_1,t_n]$. A possible motion curve of $\tau$ is any continuous function $f_{PMC^{\tau}}:\mathit{Time}\rightarrow\mathbb{R}^2$ defined on the interval $[t_1,t_n]$ such that for any $t\in[t_1,t_n]$, the spatio-temporal point $(f_{PMC^{\tau}}(t),t)$ is inside the uncertainty area at time $t$: we also adopt the notation $f_{PMC^{\tau}}\in Vol(\tau,\delta)$.

Definition 1 is graphically represented in Fig. 1. Intuitively two trajectories are indistinguishable if they are defined in the same time interval, and they follow almost the same route w.r.t. the uncertainty threshold.

Definition 2 Co-localization

Two trajectories $\tau_1$, $\tau_2$ defined in $[t_1,t_n]$ are said to be co-localized w.r.t. $\delta$, iff for each point $(x_1,y_1,t)$ along $\tau_1$ and $(x_2,y_2,t)$ along $\tau_2$ with $t\in[t_1,t_n]$, it holds that $Dist((x_1,y_1),(x_2,y_2))\leq\delta$, where $Dist$ is the Euclidean distance: $Dist((x_1,y_1),(x_2,y_2))=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$. We write $Coloc_\delta(\tau_1,\tau_2)$, omitting the time interval $[t_1,t_n]$.

Note that the definition above requires two trajectories to be defined exactly in the same time interval in order to be co-localized. Although in real-world data it is quite unusual to have two trajectories starting and ending at the exact same time instants, in practice this problem can be tackled by allowing small time gaps, or by selecting coarser time samplings, or more in general, by introducing small information loss. We will discuss later how we deal with this constraint.
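A minimal sketch of a co-localization test for two polyline trajectories follows. It assumes each trajectory is a list of at least two (x, y, t) samples with strictly increasing t and linear motion between samples, as in Definition 1; the function names are ours. Because both trajectories are piecewise linear, the distance between them is a convex function of t between consecutive breakpoints, so it suffices to check the union of the two sets of timestamps.

```python
import math

def interpolate(traj, t):
    """Position at time t on a polyline trajectory [(x, y, t), ...] with increasing t."""
    for (x1, y1, t1), (x2, y2, t2) in zip(traj, traj[1:]):
        if t1 <= t <= t2:
            a = (t - t1) / (t2 - t1)
            return (x1 + a * (x2 - x1), y1 + a * (y2 - y1))
    raise ValueError("t outside the trajectory's time interval")

def colocalized(tr1, tr2, delta):
    """True iff tr1 and tr2 share the same time interval and stay within delta of each other."""
    # Definition 2 requires both trajectories to be defined on exactly the same interval.
    if tr1[0][2] != tr2[0][2] or tr1[-1][2] != tr2[-1][2]:
        return False
    # Checking every breakpoint of either polyline is enough (see the lead-in).
    times = sorted({p[2] for p in tr1} | {p[2] for p in tr2})
    return all(
        math.dist(interpolate(tr1, t), interpolate(tr2, t)) <= delta
        for t in times
    )

a = [(0, 0, 0), (10, 0, 10)]
b = [(0, 1, 0), (5, 1, 5), (10, 1, 10)]
c = [(0, 0, 1), (10, 0, 10)]  # starts later than a
```

With these inputs, `colocalized(a, b, 1.0)` holds, `colocalized(a, b, 0.5)` does not, and `colocalized(a, c, 1.0)` fails only because the time intervals differ.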

Another way to express the co-localization of trajectories is to say that each one is a possible motion curve of the other.

Proposition 1

$Coloc_\delta(\tau_1,\tau_2)\Leftrightarrow\tau_1\in Vol(\tau_2,\delta)\Leftrightarrow\tau_2\in Vol(\tau_1,\delta)$

It is worth noting that co-localization does not induce a partition of a set of trajectories.

Proposition 2

$Coloc_\delta(\tau_1,\tau_2)$ is a reflexive and symmetric relation. For $\delta>0$, it is not transitive, and thus it does not induce equivalence classes; while for $\delta=0$, $Coloc_\delta(\tau_1,\tau_2)$ is also transitive and thus it is an equivalence relation.
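A small worked example (ours, not taken from the source) shows why transitivity fails for $\delta>0$. Consider three stationary objects placed on a line, all defined over the same interval $[t_1,t_n]$:

```latex
\begin{align*}
\tau_1(t) &= (0,0), \quad \tau_2(t) = (\delta,0), \quad \tau_3(t) = (2\delta,0)
  \qquad \forall t \in [t_1,t_n] \\
Dist(\tau_1(t),\tau_2(t)) &= \delta \leq \delta
  \;\Rightarrow\; Coloc_\delta(\tau_1,\tau_2) \\
Dist(\tau_2(t),\tau_3(t)) &= \delta \leq \delta
  \;\Rightarrow\; Coloc_\delta(\tau_2,\tau_3) \\
Dist(\tau_1(t),\tau_3(t)) &= 2\delta > \delta
  \;\Rightarrow\; \neg\, Coloc_\delta(\tau_1,\tau_3)
\end{align*}
```

For $\delta=0$ the same construction collapses to three identical trajectories, which is consistent with co-localization being an equivalence relation in that case.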

Given an anonymity threshold k, we can define an anonymity set as a set of at least k trajectories that are co-localized.

Definition 3 Anonymity set of trajectories

Given an uncertainty threshold $\delta$ and an anonymity threshold $k$, a set $S$ of trajectories is a $(k,\delta)$-anonymity set iff $|S|\geq k$ and $\forall\tau_i,\tau_j\in S.\ Coloc_\delta(\tau_i,\tau_j)$.

The following properties further characterize an anonymity set of trajectories.

Proposition 3

A set of trajectories $S$, with $|S|\geq k$, is a $(k,\delta)$-anonymity set iff there exists a trajectory $\tau_c$ such that all the trajectories in $S$ are possible motion curves of $\tau_c$ within an uncertainty radius of $\delta/2$: i.e., $\forall\tau\in S.\ \tau\in Vol(\tau_c,\delta/2)$.

Given a $(k,\delta)$-anonymity set $S$, the trajectory $\tau_c$ is obtained by taking, for each $t\in[t_1,t_n]$, the point $(x,y)$ that is the center of the minimum bounding circle of all the points at time $t$ of all trajectories in $S$.

Therefore, an anonymity set of trajectories can be bounded by a cylindrical volume of radius δ/2. In Fig. 2, we graphically represent this property.

The problem we study in this article is that of (k,δ)-anonymizing a database of trajectories of moving objects.

Problem 1

Given a dataset of trajectories $D$, an uncertainty threshold $\delta$ and an anonymity threshold $k$, the problem of $(k,\delta)$-anonymity requires transforming $D$ into a dataset $D'$, such that for each trajectory $\tau\in D'$ there exists a $(k,\delta)$-anonymity set $S\subseteq D'$, $\tau\in S$; and the distortion between $D$ and $D'$ is minimized.

In this article we study the problem of anonymity preserving data publishing in moving objects databases, as formally defined above. In the next section we position our contribution within a heterogeneous literature ranging from anonymity and data publishing to moving objects databases and location based services.

In Section 3 we introduce the basic technique that we use to enforce (k,δ)-anonymity, we develop a suitable measure of the information distortion introduced by space translation, and we prove that the problem of achieving (k,δ)-anonymity with minimum distortion is NP-hard.

Therefore, we propose a two-step greedy method: in the first step by means of k-member constrained clustering we group trajectories in clusters having at least k elements; in the second step we perform the minimum space translation needed to push all the trajectories of a cluster within a cylindrical volume of radius δ/2 (according to Proposition 3), making them a (k,δ)-anonymity set.
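The second step can be sketched as follows, under simplifying assumptions that are ours: all trajectories in a cluster are sampled at the same timestamps, and the per-timestamp center is the centroid of the points rather than the center of the minimum bounding circle used in the text. Each point farther than δ/2 from the center is pulled along the line towards the center until it lies on the disk boundary; points already inside are left untouched.

```python
import math

def translate_cluster(cluster, delta):
    """Push every point of every trajectory within delta/2 of a per-timestamp center.

    cluster: list of trajectories, each a list of (x, y, t) with identical timestamps.
    The center is the centroid (an assumption of this sketch, not the paper's choice).
    """
    radius = delta / 2
    result = [[] for _ in cluster]
    for i in range(len(cluster[0])):
        pts = [traj[i] for traj in cluster]
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        for out, (x, y, t) in zip(result, pts):
            d = math.hypot(x - cx, y - cy)
            if d > radius:
                # Shrink the offset from the center so the point lands on the disk boundary.
                s = radius / d
                x, y = cx + (x - cx) * s, cy + (y - cy) * s
            out.append((x, y, t))
    return result

cluster = [
    [(0, 0, 0), (0, 0, 1)],
    [(4, 0, 0), (0.5, 0, 1)],
    [(2, 3, 0), (0, 0.5, 1)],
]
anonymized = translate_cluster(cluster, delta=2.0)
```

After the translation every pair of points at the same timestamp is at most δ apart (by the triangle inequality through the center), so the cluster satisfies the co-localization condition at the sampled timestamps.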

During a preliminary investigation (not reported in this article but available in our technical report [10]) we adapted many classical clustering methods to deal with the k-member constraint and to handle trajectory data. Through an experimental comparison we selected greedy clustering as a good trade-off between minimization of information distortion and efficiency. This simple method is the basis of all the algorithms that we present in this paper.
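To make the k-member constraint concrete, here is a generic greedy sketch. It is not the procedure of NWA or W4M (whose details are given in later sections); the pivot choice, the handling of leftovers and the function name are assumptions made for illustration. Any distance function between items can be plugged in.

```python
def greedy_k_member_clusters(items, k, dist):
    """Group item indices into clusters of at least k members each.

    Repeatedly takes a pivot and its k-1 nearest unassigned items; fewer than k
    leftovers are attached to the cluster whose pivot is closest.
    """
    if len(items) < k:
        raise ValueError("need at least k items to form one cluster")
    unassigned = list(range(len(items)))
    clusters = []
    while len(unassigned) >= k:
        pivot = unassigned.pop(0)
        # Order the remaining items by distance from the pivot.
        unassigned.sort(key=lambda j: dist(items[pivot], items[j]))
        clusters.append([pivot] + unassigned[:k - 1])
        unassigned = unassigned[k - 1:]
    for j in unassigned:
        # The first element of each cluster is its pivot.
        nearest = min(clusters, key=lambda c: dist(items[c[0]], items[j]))
        nearest.append(j)
    return clusters

points = [0, 1, 2, 10, 11, 12, 13]
clusters = greedy_k_member_clusters(points, k=3, dist=lambda a, b: abs(a - b))
```

The example uses one-dimensional points to keep the distances easy to follow; with trajectories, `dist` would be a trajectory distance such as the Euclidean distance or EDR.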

In Section 4 we recall our previous method, namely NWA (NeverWalkAlone) [2]. NWA is obtained by enhancing the basic greedy clustering algorithm by equipping it with ad hoc pre-processing, and techniques to produce compact clusters at the price of suppressing some outlier trajectories.

Starting from a discussion of the limits of NWA, in Section 5 we develop a novel method that, being based on EDR distance [3] (instead of the Euclidean distance, as in NWA), has the important feature of being time-tolerant. The novel method is named W4M (WaitforMe). To the best of our knowledge our algorithm is the first to use EDR as distance function within a trajectory clustering framework, thus providing further assessment of the merits of this distance.
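For readers unfamiliar with EDR, the following sketch computes it with the usual dynamic program, following the general form of the distance in [3] as we understand it: two points match when both coordinates differ by at most a tolerance, a match costs 0, and a substitution, insertion or deletion costs 1. The tolerance name `eps` and the purely spatial matching are assumptions of this sketch; it does not reproduce how W4M handles the time dimension.

```python
def edr(r, s, eps):
    """Edit Distance on Real sequence between two sequences of (x, y) points."""
    n, m = len(r), len(s)
    # prev[j] is the distance between the first i-1 points of r and the first j points of s.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            match = (abs(r[i - 1][0] - s[j - 1][0]) <= eps
                     and abs(r[i - 1][1] - s[j - 1][1]) <= eps)
            sub_cost = 0 if match else 1
            curr[j] = min(prev[j - 1] + sub_cost,  # match or substitute
                          prev[j] + 1,             # delete a point of r
                          curr[j - 1] + 1)         # insert a point of s
        prev = curr
    return prev[m]

path = [(0, 0), (1, 1), (2, 2)]
shorter = [(0, 0), (2, 2)]
```

Here `edr(path, shorter, 0.1)` is 1, since deleting the middle point of `path` aligns the two sequences. The two nested loops make the quadratic cost in trajectory length explicit, and the fact that sequences of different length can be compared is what allows time shifts to be tolerated.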

Another novel idea that we introduce is to exploit the EDR computation also as a guide on how to perform the last step of the anonymization process. After having clustered trajectories we need to modify each cluster to make it an anonymity set. Being an edit distance, EDR itself suggests how to do this spatio-temporal editing: this means that we are able to reuse the computation done during the clustering phase also in the point translation phase. The technical details on how to do spatio-temporal editing efficiently are reported in Section 5.2.

Our experiments in Section 6 confirm that W4M produces higher quality (k,δ)-anonymized data than NWA. However, it might be prohibitively expensive for large and complex datasets. Thus, in Section 7 we develop techniques to make W4M scalable. In particular, in Section 7.1 we introduce a novel O(n) spatio-temporal distance function, named LSTD (linear spatio-temporal distance). Being linear, LSTD has the same computational cost as Euclidean distance, but does not share its limitations: LSTD is time-tolerant, can be applied to trajectories of different lengths, and is tolerant to outliers. In practice, it represents a good trade-off between Euclidean distance and EDR. In Section 8 all the variants of W4M are empirically evaluated in terms of data quality and efficiency, and compared to their predecessor NWA. Data quality is assessed both by means of objective measures of information distortion, and by more usability-oriented measures, i.e., by comparing the results of (i) spatio-temporal range queries and (ii) frequent pattern mining, executed on the original database and on the (k,δ)-anonymized one.

The experiments show that the new techniques make W4M scalable to very large datasets, without giving up quality of the anonymization.

Section snippets

Relational data anonymity

The traditional k-anonymity framework [1] focuses on relational tables: the basic assumptions are that the table to be anonymized contains entity-specific information, that each tuple in the table corresponds uniquely to an individual, and that attributes are divided into quasi-identifiers (i.e., a set of attributes whose values in combination can be linked to external information to reidentify the individual to whom the information refers); and sensitive attributes (publicly unknown and that

Techniques for trajectory anonymity

In the following we discuss various techniques that could be used to enforce trajectory anonymity. We start discussing the basic techniques used in the classical k-anonymity setting, generalization and suppression [48], then we discuss the condensation approach [35], and finally we introduce the main technique adopted in this paper, namely space translation.

According to Definition 2, two trajectories to be co-localized must be defined over the same time interval. Informally we call this

The NWA algorithm

In order to assess which clustering approach is most suitable for our purposes, we have extended a large variety of well known clustering schemes to make them handle trajectories and the constraint that each cluster must have a population of at least k and at most 2k−1 elements.

We have prototyped and experimentally compared: hierarchical divisive and hierarchical agglomerative clustering, k-means, greedy clustering, mixture of Gaussian, and density based clustering. Entering in the details of

W4M: time-tolerant anonymization

We conducted various experiments in order to assess the effectiveness and efficiency of NWA (these experiments are thoroughly reported in [2], [10], and later in Section 6). The experimental assessment confirmed that NWA, besides being efficient, is able to provide (k,δ)-anonymization with very low distortion for a wide range of values of δ and k, with distortion rising only for high values of k in combination with small values of δ.

However, there is still plenty of room for improvements. In particular, the main

Experimental comparison with NWA

In this section we report an empirical comparison between our new proposal W4M and its predecessor NWA, in the same experimental setting as [2].

Scaling to large databases

Algorithm W4M faces two main computational challenges: (i) pairwise distance computation between trajectories, and (ii) quadratic time (w.r.t. trajectory length) of each pairwise distance computation. In other terms, the algorithm is quadratic both in the number of trajectories (that we call the horizontal dimension) and in the mean trajectory length (vertical dimension). Although some optimizations are already done on W4M as given in Section 5.3, the algorithm is still impractical with very large

More experiments

In this section we report the final experimentation comparing the various members of the W4M family of methods and NWA. We coded our methods in C, and they are available in the corresponding webpages (links are on the first page of this article). All the experiments were performed on an Intel Xeon 2 GHz processor with 1 GB of RAM over a Linux 2.6.14 platform. The settings of the experiments on the Oldenburg dataset that we report in the following are the same as described previously in Section 6.

Conclusions, extensions and open issues

We studied the problem of anonymity preserving data publishing in moving objects databases and introduced the concept of (k,δ)-anonymity, that exploits the inherent uncertainty of location in order to reduce the amount of distortion needed to anonymize data. We deeply characterized the problem and developed various methods to solve it. In particular we first recalled our previous proposal NWA which is essentially a greedy clustering method for trajectories equipped with ad hoc pre-processing

Acknowledgements

The authors express their gratitude to the GeoPKDD project for the Milan dataset, and to Kristen LeFevre for providing the implementation of the Mondrian algorithm.

This work was started during the tenure of Osman Abul's ERCIM fellowship at ISTI-CNR. Mirco Nanni is supported by the EU project GeoPKDD (IST-6FP-014915). Osman Abul is supported by TUBITAK, project number 108E016.

References (56)

  • P. Samarati, L. Sweeney, Generalizing data to provide anonymity when disclosing information (abstract), in: Proceedings...
  • O. Abul, F. Bonchi, M. Nanni, Never Walk Alone: uncertainty for anonymity in moving objects databases, in: Proceedings...
  • L. Chen, M.T. Özsu, V. Oria, Robust and fast similarity search for moving object trajectories, in: Proceedings of the...
  • C. Bettini, X.S. Wang, S. Jajodia, Protecting privacy against location-based personal identification, in: Proceedings...
  • O. Wolfson, S. Chamberlain, S. Dao, L. Jiang, G. Mendez, Cost and imprecision in modeling the position of moving...
  • G. Trajcevski et al., Managing uncertainty in moving objects databases, ACM Transactions on Database Systems (2004)
  • D. Pfoser, C.S. Jensen, Capturing the uncertainty of moving-object representations, in: Proceedings of the 6th...
  • O. Wolfson et al., Updating and querying databases that track mobile units, Distributed and Parallel Databases (1999)
  • A. Jain, E.Y. Chang, Y.-F. Wang, Adaptive stream resource management using Kalman filters, in: Proceedings of the 2004...
  • O. Abul, F. Bonchi, M. Nanni, Never Walk Alone: trajectory anonymity via clustering, Technical Report ISTI-007/2007,...
  • A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, l-diversity: privacy beyond k-anonymity, in:...
  • G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, A. Zhu, Anonymizing tables, in: Proceedings...
  • A. Meyerson, R. Williams, On the complexity of optimal k-anonymity, in: Proceedings of the 23rd ACM Symposium on...
  • K. LeFevre, D.J. DeWitt, R. Ramakrishnan, Incognito: efficient full-domain k-anonymity, in: Proceedings of the 2005 ACM...
  • R. Bayardo, R. Agrawal, Data privacy through optimal k-anonymity, in: Proceedings of the 21st IEEE International...
  • K. LeFevre, D.J. DeWitt, R. Ramakrishnan, Mondrian multidimensional k-anonymity, in: Proceedings of the 22nd IEEE...
  • M. Atzori et al., Anonymity preserving pattern discovery, VLDB Journal (2008)
  • M. Gruteser, D. Grunwald, Anonymous usage of location-based services through spatial and temporal cloaking, in:...
  • B. Gedik, L. Liu, Location privacy in mobile systems: a personalized anonymization model, in: Proceedings of the 25th...
  • H. Kido, Y. Yanagisawa, T. Satoh, Protection of location privacy using dummies for location-based services, in:...
  • A.R. Beresford, F. Stajano, Mix zones: user privacy in location-aware services, in: Proceedings of the Second IEEE...
  • M. Gruteser et al., Protecting privacy in continuous location-tracking applications, IEEE Security & Privacy Magazine (2004)
  • B. Hoh, M. Gruteser, H. Xiong, A. Alrabady, Preserving privacy in gps traces via uncertainty-aware path cloaking, in:...
  • B. Hoh, M. Gruteser, R. Herring, J. Ban, D. Work, J.C. Herrera, A.M. Bayen, M. Annavaram, Q. Jacobson, Virtual trip...
  • R.H. Güting et al., Moving Objects Databases (2005)
  • G. Kollios, D. Gunopulos, V.J. Tsotras, On indexing mobile objects, in: Proceedings of the 18th ACM Symposium on...
  • P.K. Agarwal, L. Arge, J. Erickson, Indexing moving points, in: Proceedings of the 19th ACM Symposium on Principles of...
  • M. Hadjieleftheriou et al., Indexing spatiotemporal archives, VLDB Journal (2006)
¹ Software freely available at: www-kdd.isti.cnr.it/W4M/.

² Software freely available at: www-kdd.isti.cnr.it/NWA/.
