Applying multi-label and multi-class classification to enhance K-anonymity in sequential releases

Tran, Dung; Sokolova, Marina

doi:10.1007/s13748-016-0096-y

Regular Paper
Published: 08 July 2016

Progress in Artificial Intelligence, volume 5, pages 277–288 (2016)

Abstract

Privacy-preserving data mining is gaining prominence as data containing personal information accumulates. Data holders in healthcare, finance and other sectors that collect person-specific information are challenged to publish useful data while meeting ever-increasing demands for privacy protection of data subjects. K-anonymity is a popular technique for preserving privacy in published data by anonymizing quasi-identifiers (QI), e.g., race, gender and age. However, K-anonymized data can be at risk of temporal attacks that target multiple versions of the released data, also called sequential releases. The objective of this study is to develop a model that uses multi-class and multi-label classifiers to evaluate the risk of re-identifying QI information in previous data releases by learning from the current data release. In our empirical study, we use five healthcare and financial data sets to compare the performance of the binary relevance and label powerset problem transformations and of the Naïve Bayes, C4.5, random tree and kNN learning algorithms. Our empirical results show that multi-label classification is a powerful tool for enhancing K-anonymity of sequential data releases. Statistical analysis of the classification results shows that RAkEL (random k-labelsets) outperforms the other transformation methods in predicting demographic information and hence can be useful in assessing the risk of QI re-identification.
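To make the terms in the abstract concrete, the following minimal Python sketch shows a K-anonymity check over quasi-identifier columns and the two problem transformations named above, binary relevance and label powerset. It is an illustration only, not the authors' implementation: the function names, the toy records and the choice of QI values as labels are all assumptions made for this example.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of QI values is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def binary_relevance_targets(label_sets, labels):
    """Binary relevance: one independent 0/1 target vector per label."""
    return {lab: [int(lab in s) for s in label_sets] for lab in labels}

def label_powerset_targets(label_sets):
    """Label powerset: each distinct label set becomes one multi-class id."""
    class_ids = {}
    targets = []
    for s in label_sets:
        key = frozenset(s)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # ids assigned in order of first appearance
        targets.append(class_ids[key])
    return targets, class_ids

# Hypothetical toy release: "age" and "gender" are treated as the QI columns.
records = [
    {"age": "30-39", "gender": "F", "diagnosis": "flu"},
    {"age": "30-39", "gender": "F", "diagnosis": "asthma"},
    {"age": "40-49", "gender": "M", "diagnosis": "flu"},
    {"age": "40-49", "gender": "M", "diagnosis": "diabetes"},
]

# Hypothetical multi-label targets: the QI values attached to each record.
label_sets = [{r["age"], r["gender"]} for r in records]

if __name__ == "__main__":
    print(is_k_anonymous(records, ["age", "gender"], 2))     # True
    print(is_k_anonymous(records, ["age", "gender"], 3))     # False
    print(binary_relevance_targets(label_sets, ["F", "M"]))  # {'F': [1, 1, 0, 0], 'M': [0, 0, 1, 1]}
    print(label_powerset_targets(label_sets)[0])             # [0, 0, 1, 1]
```

Binary relevance trains one binary classifier per label and ignores label co-occurrence, whereas label powerset treats each observed label combination as a single class. RAkEL, as its name (random k-labelsets) indicates, builds an ensemble of label powerset classifiers over random subsets of k labels.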

Acknowledgments

We thank Nathalie Japkowicz for fruitful suggestions on an early study. We thank anonymous reviewers for helpful comments.

Author information

Authors and Affiliations

University of Ottawa, Ottawa, Canada
Dung Tran & Marina Sokolova
Institute for Big Data Analytics, Dalhousie University, Halifax, Canada
Marina Sokolova

Corresponding author

Correspondence to Marina Sokolova.

About this article

Cite this article

Tran, D., Sokolova, M. Applying multi-label and multi-class classification to enhance K-anonymity in sequential releases. Prog Artif Intell 5, 277–288 (2016). https://doi.org/10.1007/s13748-016-0096-y

Received: 02 March 2016
Accepted: 24 June 2016
Published: 08 July 2016
Issue Date: November 2016
DOI: https://doi.org/10.1007/s13748-016-0096-y
