Abstract
Living in the information society facilitates the automatic collection of huge amounts of data on individuals, organizations, etc. Publishing such data for secondary analysis (e.g. learning models and finding patterns) can be extremely useful to policy makers, planners, marketing analysts, researchers and others. Yet, data publishing and mining do not come without dangers, namely privacy invasion and potential discrimination against the individuals whose data are published. Discrimination may ensue from training data mining models (e.g. classifiers) on data that are biased against certain protected groups (ethnicity, gender, political preferences, etc.). The objective of this paper is to describe how to obtain data sets for publication that are: (i) privacy-preserving; (ii) unbiased regarding discrimination; and (iii) as useful as possible for learning models and finding patterns. We present the first generalization-based approach that simultaneously offers privacy preservation and discrimination prevention. We formally define the problem, give an optimal algorithm to tackle it and evaluate the algorithm in terms of both general and specific data analysis metrics (i.e. various types of classifiers and rule induction algorithms). It turns out that the impact of our transformation on data quality is the same as, or only slightly higher than, the impact of achieving just privacy preservation. In addition, we show how to extend our approach to different privacy models and anti-discrimination legal concepts.
Notes
The use of PD (resp., PND) attributes in decision making does not necessarily lead to (or exclude) discriminatory decisions (Ruggieri et al. 2010).
In full-domain generalization if a value is generalized, all its instances are generalized. There are alternative generalization schemes, such as multi-dimensional generalization or cell generalization, in which some instances of a value may remain ungeneralized while other instances are generalized.
Although algorithms using multi-dimensional or cell generalizations (e.g. the Mondrian algorithm, Lefevre et al. 2006) cause less information loss than algorithms using full-domain generalization, the former suffer from the problem of data exploration (Fung et al. 2010). This problem is caused by the co-existence of specific and generalized values in the generalized data set, which make data exploration and interpretation difficult for the data analyst.
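The distinction between full-domain and finer-grained generalization can be illustrated with a minimal sketch. The taxonomy, attribute name and function below are hypothetical and not taken from the paper; they only show that, under full-domain generalization, every instance of an attribute is mapped to the same taxonomy level.

```python
# Toy value generalization hierarchy for a hypothetical attribute "Job":
# level 0 (original value) -> level 1 -> level 2 (most general).
JOB_TAXONOMY = {
    "Nurse":     ["Medical", "Any"],
    "Surgeon":   ["Medical", "Any"],
    "Teacher":   ["Education", "Any"],
    "Professor": ["Education", "Any"],
}

def full_domain_generalize(values, level):
    """Full-domain generalization: ALL instances of the attribute are
    replaced by their ancestor at the chosen taxonomy level, so specific
    and generalized values never co-exist in the released data."""
    if level == 0:
        return list(values)
    return [JOB_TAXONOMY[v][level - 1] for v in values]

jobs = ["Nurse", "Surgeon", "Nurse", "Teacher"]
print(full_domain_generalize(jobs, 1))
# -> ['Medical', 'Medical', 'Medical', 'Education']
```

A multi-dimensional or cell-level scheme, by contrast, could leave one "Nurse" record intact while generalizing another to "Medical", which is exactly the mix of specific and generalized values that hampers data exploration.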
On the legal side, different measures are adopted worldwide; see Pedreschi et al. (2013) for parallels between different measures and anti-discrimination acts.
Discrimination occurs when a group is treated “less favorably” than others.
Discrimination of a group occurs when a higher proportion of people not in the group is able to comply with a qualifying criterion.
\(\alpha \) states an acceptable level of discrimination according to laws and regulations. For example, the U.S. Equal Pay Act (United States Congress 1963) states that “a selection rate for any race, sex, or ethnic group which is less than four-fifths of the rate for the group with the highest rate will generally be regarded as evidence of adverse impact”. This amounts to using clift with \(\alpha =1.25\).
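The four-fifths rule can be restated as a lift threshold: a group's selection rate below 4/5 (i.e. \(1/\alpha\)) of the highest group's rate is evidence of adverse impact. The following sketch illustrates only this threshold check; the function name and example rates are illustrative, not the paper's formal measure definitions.

```python
def adverse_impact(rate_group, rate_highest, alpha=1.25):
    """Four-fifths rule as a lift threshold: flag adverse impact when the
    highest group's selection rate exceeds alpha (= 1 / 0.8) times the
    rate of the group under consideration."""
    return rate_highest / rate_group > alpha

# Hypothetical example: a protected group is selected at a 30% rate,
# while the group with the highest rate is selected at 50%.
print(adverse_impact(0.30, 0.50))  # 0.50/0.30 ≈ 1.67 > 1.25 -> True
```

With rates of 45% versus 50% the ratio is about 1.11, below the 1.25 threshold, so no adverse impact would be flagged.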
References
Aggarwal CC, Yu PS (eds) (2008) Privacy preserving data mining: models and algorithms. Springer, Berlin
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on very large data bases, VLDB, pp 487–499
Agrawal R, Srikant R (2000) Privacy preserving data mining. In: ACM SIGMOD 2000, pp 439–450
Australian Legislation (2008) (a) Equal Opportunity Act—Victoria State, (b) Anti-Discrimination Act—Queensland State
Bache K, Lichman M (2013) UCI machine learning repository. University of California, School of Information and Computer Science, Irvine, CA. http://archive.ics.uci.edu/ml. Accessed 20 Jan 2014
Bayardo RJ, Agrawal R (2005) Data privacy through optimal k-anonymization. In: ICDE 2005: IEEE, pp 217–228
Berendt B, Preibusch S (2012) Exploring discrimination: a user-centric evaluation of discrimination-aware data mining. In: IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 344–351
Calders T, Verwer S (2010) Three naive Bayes approaches for discrimination-free classification. Data Min Knowl Disc 21(2):277–292
Custers B, Calders T, Schermer B, Zarsky TZ (eds) (2013) Discrimination and privacy in the information society—data mining and profiling in large databases. Studies in applied philosophy, epistemology and rational ethics, vol 3. Springer, Berlin
Domingo-Ferrer J, Torra V (2005) Ordinal, continuous and heterogeneous \(k\)-anonymity through microaggregation. Data Min Knowl Disc 11(2):195–212
Dwork C (2006) Differential privacy. In: ICALP 2006, LNCS 4052, Springer, pp 1–12
Dwork C (2011) A firm foundation for private data analysis. Commun ACM 54(1):86–95
Dwork C, Hardt M, Pitassi T, Reingold O, Zemel RS (2012) Fairness through awareness. In: ITCS 2012, ACM, pp 214–226
European Union Legislation (1995) Directive 95/46/EC
European Union Legislation (2009) (a) Race Equality Directive, 2000/43/EC, 2000; (b) Employment Equality Directive, 2000/78/EC, 2000; (c) Equal Treatment of Persons, European Parliament legislative resolution, P6\_TA(2009) 0211
Fung BCM, Wang K, Yu PS (2005) Top-down specialization for information and privacy preservation. In: ICDE 2005, IEEE, pp 205–216
Fung BCM, Wang K, Fu AW-C, Yu P (2010) Introduction to privacy-preserving data publishing: concepts and techniques. Chapman & Hall/CRC, New York
Hajian S, Domingo-Ferrer J, Martínez-Ballesté A (2011) Rule protection for indirect discrimination prevention in data mining. In: MDAI 2011, LNCS 6820, Springer, pp 211–222
Hajian S, Domingo-Ferrer J (2013) A methodology for direct and indirect discrimination prevention in data mining. IEEE Trans Knowl Data Eng 25(7):1445–1459
Hajian S, Monreale A, Pedreschi D, Domingo-Ferrer J, Giannotti F (2012) Injecting discrimination and privacy awareness into pattern discovery. In: IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 360–369
Hajian S, Domingo-Ferrer J (2012) A study on the impact of data anonymization on anti-discrimination. In: 2012 IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 352–359
Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Schulte-Nordholt E, Spicer K, de Wolf P-P (2012) Statistical disclosure control. Wiley, Chichester
Iyengar VS (2002) Transforming data to satisfy privacy constraints. In: SIGKDD 2002, ACM, pp 279–288
Kamiran F, Calders T (2011) Data preprocessing techniques for classification without discrimination. Knowl Inf Syst 33(1):1–33
Kamiran F, Calders T, Pechenizkiy M (2010) Discrimination aware decision tree learning. In: ICDM 2010, IEEE, pp 869–874
Kamishima T, Akaho S, Asoh H, Sakuma J (2012) Fairness-aware classifier with prejudice remover regularizer. In: ECML/PKDD, LNCS 7524, Springer, pp 35–50
Lefevre K, Dewitt DJ, Ramakrishnan R (2005) Incognito: efficient full-domain k-anonymity. In: SIGMOD 2005, ACM, pp 49–60
Lefevre K, Dewitt DJ, Ramakrishnan R (2006) Mondrian multidimensional k-anonymity. In: ICDE 2006, IEEE, p 25
Li N, Li T, Venkatasubramanian S (2007) \(t\)-Closeness: privacy beyond \(k\)-anonymity and \(l\)-diversity. In: IEEE ICDE 2007, IEEE, pp 106–115
Lindell Y, Pinkas B (2000) Privacy preserving data mining. In: Bellare M (ed) Advances in cryptology-CRYPTO’00, LNCS 1880, Springer, Berlin, pp 36–53
Luong BT, Ruggieri S, Turini F (2011) k-NN as an implementation of situation testing for discrimination discovery and prevention. In: KDD 2011, ACM, pp 502–510
Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M (2007) \(l\)-Diversity: privacy beyond \(k\)-anonymity. ACM Trans Knowl Discov Data (TKDD) 1(1):Article 3
Mohammed N, Chen R, Fung BCM, Yu PS (2011) Differentially private data release for data mining. In: KDD 2011, ACM, pp 493–501
Pedreschi D, Ruggieri S, Turini F (2008) Discrimination-aware data mining. In: KDD 2008, ACM, pp 560–568
Pedreschi D, Ruggieri S, Turini F (2009) Measuring discrimination in socially-sensitive decision records. In: SDM 2009, SIAM, pp 581–592
Pedreschi D, Ruggieri S, Turini F (2009) Integrating induction and deduction for finding evidence of discrimination. In: ICAIL 2009, ACM, pp 157–166
Pedreschi D, Ruggieri S, Turini F (2013) The discovery of discrimination. In: Custers BHM, Calders T, Schermer BW, Zarsky TZ (eds) Discrimination and privacy in the information society: studies in applied philosophy, epistemology and rational, ethics. Springer, Berlin, pp 91–108
Ruggieri S, Pedreschi D, Turini F (2010) Data mining for discrimination discovery. ACM Trans Knowl Discov Data (TKDD) 4(2):Article 9
Samarati P (2001) Protecting respondents’ identities in microdata release. IEEE Trans Knowl Data Eng 13(6):1010–1027
Samarati P, Sweeney L (1998) Generalizing data to provide anonymity when disclosing information. In: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems (PODS 98), Seattle, WA, p 188
Statistics Sweden (2001) Statistisk röjandekontroll av tabeller, databaser och kartor (Statistical disclosure control of tables, databases and maps, in Swedish). Statistics Sweden, Örebro. http://www.scb.se/statistik/_publikationer/OV9999_2000I02_BR_X97P0102. Accessed 20 Jan 2014
Sweeney L (1998) Datafly: a system for providing anonymity in medical data. In: Proceedings of the IFIP TC11 WG11.3 11th international conference on database security XI: status and prospects, pp 356–381
Sweeney L (2002) k-Anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl Based Syst 10(5):557–570
United States Congress (1963) US Equal Pay Act (EPA) (Pub. L. 88-38). http://www.eeoc.gov/eeoc/history/35th/thelaw/epa.html. Accessed 20 Jan 2014
Wang K, Yu PS, Chakraborty S (2004) Bottom-up generalization: a data mining solution to privacy protection. In: ICDM 2004, IEEE, pp 249–256
Willenborg L, de Waal T (1996) Elements of statistical disclosure control. Springer, Berlin
Witten I, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco
Zliobaite I, Kamiran F, Calders T (2011) Handling conditional discrimination. In: ICDM 2011, IEEE, pp 992–1001
Acknowledgments
The authors wish to thank Kristen LeFevre for providing the implementation of the Incognito algorithm and Guillem Rufian-Torrell for helping in the implementation of the algorithm proposed in this paper. This work was partly supported by the Government of Catalonia under Grant 2009 SGR 1135, by the Spanish Government through projects TIN2011-27076-C03-01 “CO-PRIVACY”, TIN2012-32757 “ICWT” and CONSOLIDER INGENIO 2010 CSD2007-00004 “ARES”, and by the European Commission under FP7 projects “DwB” and “INTER-TRUST”. The second author is partially supported as an ICREA Acadèmia researcher by the Government of Catalonia. The authors are with the UNESCO Chair in Data Privacy, but they are solely responsible for the views expressed in this paper, which do not necessarily reflect the position of UNESCO nor commit that organization.
Additional information
Responsible editor: Guest Editors of PKDD 2014 (Dr. Toon Calders, Prof. Floriana Esposito, Prof. Eyke Hüllermeier and Dr. Rosa Meo).
Cite this article
Hajian, S., Domingo-Ferrer, J. & Farràs, O. Generalization-based privacy preservation and discrimination prevention in data publishing and mining. Data Min Knowl Disc 28, 1158–1188 (2014). https://doi.org/10.1007/s10618-014-0346-1