Abstract
Privacy must be carefully considered when publishing data collected from individuals (e.g. database records), to avoid disclosing identities or revealing confidential information. Anonymisation methods aim to achieve a certain degree of privacy by transforming non-anonymous data while minimising, as far as possible, the distortion (i.e. information loss) that these transformations introduce. k-anonymity is a property typically considered when masking data: it states that each record (corresponding to an individual) is indistinguishable from at least k-1 other records in the anonymised dataset. Many anonymisation methods have been developed, but most focus solely on numerical attributes. Non-numerical values (e.g. categorical attributes such as job or country of birth, or unbounded textual ones such as user preferences) are more challenging because arithmetic operations cannot be applied to them. Properly managing and interpreting this kind of data requires operators that can deal with data semantics. In this paper, we propose an anonymisation method based on a classic data re-sampling algorithm that guarantees the fulfilment of the k-anonymity property and is able to deal with non-numerical data from a semantic perspective. Our method has been applied to anonymise the well-known Adult Census dataset, showing that a semantic interpretation of non-numerical values better minimises the information loss of the masked data file.
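The k-anonymity property stated in the abstract can be checked mechanically. The following is a minimal sketch, not the method proposed in the paper: it assumes that every attribute of a record is part of the quasi-identifier and that records are compared by exact equality. The function name `is_k_anonymous` and the toy records are hypothetical and serve only as illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def is_k_anonymous(records: Sequence[Sequence[Hashable]], k: int) -> bool:
    """Return True if every distinct record occurs at least k times.

    Assumption: all attributes are quasi-identifiers, so two records are
    indistinguishable exactly when all of their values are equal.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Count how many times each distinct combination of values appears.
    counts = Counter(tuple(record) for record in records)
    # Each record must share its values with at least k-1 other records.
    return all(count >= k for count in counts.values())


# Hypothetical masked dataset with two categorical attributes
# (job, country of birth).
masked = [
    ("teacher", "Spain"),
    ("teacher", "Spain"),
    ("engineer", "France"),
    ("engineer", "France"),
]

print(is_k_anonymous(masked, 2))
print(is_k_anonymous(masked, 3))
```

Each distinct record in the toy dataset appears twice, so the check holds for k = 2 and fails for k = 3. A check of this kind only verifies the privacy property; it says nothing about the information loss that the abstract identifies as the quantity to minimise.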
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Martínez, S., Sánchez, D., Valls, A. (2012). Towards k-Anonymous Non-numerical Data via Semantic Resampling. In: Greco, S., Bouchon-Meunier, B., Coletti, G., Fedrizzi, M., Matarazzo, B., Yager, R.R. (eds) Advances in Computational Intelligence. IPMU 2012. Communications in Computer and Information Science, vol 300. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31724-8_54
DOI: https://doi.org/10.1007/978-3-642-31724-8_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31723-1
Online ISBN: 978-3-642-31724-8
eBook Packages: Computer Science, Computer Science (R0)