ABSTRACT
Personal information protection is becoming increasingly important for individuals. Besides personally identifiable information (PII), quasi-identifier information (QII) also needs protection, as the community has argued, and methods for providing it have attracted much research. Most existing methods for protecting QII focus on structured data organized as tables of records. However, free-text data containing QII is common in application domains such as a company's data lakes, so protecting QII in free text requires new methods. Solutions based on supervised machine learning are promising but usually require a large-scale dataset to train the model. Here we propose a novel method for building such a dataset. Our method exploits an existing structured dataset, a table-to-sentence deep learning generation model, and the idea of the Piecewise Convolutional Neural Network (PCNN). The resulting dataset contains more than 120,000 free-text sentences, many of which contain QII.
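The core of the pipeline described above is turning a structured table row into a free-text sentence whose QII values can still be located for supervision. As a minimal sketch of that input/output relationship (the paper uses a learned table-to-sentence model; a fixed template stands in for it here, and the field names `age`, `occupation`, and `city` are hypothetical):

```python
# Illustrative sketch, not the authors' code: render one table row as a
# free-text sentence and record which character spans hold quasi-identifier
# (QII) values, so the sentence can serve as labeled training data.

def record_to_sentence(record):
    """Return (sentence, {field: (start, end)}) for one structured record."""
    # A learned seq2seq generator would produce varied phrasings; a template
    # stands in for it in this sketch.
    template = "The subject is a {age}-year-old {occupation} living in {city}."
    sentence = template.format(**record)
    spans = {}
    for field, value in record.items():
        start = sentence.find(str(value))
        spans[field] = (start, start + len(str(value)))  # QII span in the text
    return sentence, spans

row = {"age": 34, "occupation": "nurse", "city": "Porto"}
text, qii_spans = record_to_sentence(row)
```

In a real pipeline the recorded spans would be converted to token-level labels, and a PCNN-style encoder would pool the sentence piecewise around those QII positions.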
- El Emam, Khaled, and Fida Kamal Dankar. 2008. Protecting privacy using k-anonymity. Journal of the American Medical Informatics Association, 15, 5, 627-637.
- Sweeney, Latanya. 2002. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10, 05, 557-570.
- Šarčević, Tanja, David Molnar, and Rudolf Mayer. 2020. An Analysis of Different Notions of Effectiveness in k-Anonymity. In Proceedings of the International Conference on Privacy in Statistical Databases. Springer, Cham, 121-135.
- Machanavajjhala, Ashwin, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. 2007. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1, 1, 3-es.
- Nininahazwe, Franck Seigneur. 2019. Studying L-Diversity and K-Anonymity Over Datasets with Sensitive Fields. In Proceedings of the International Conference on Artificial Intelligence and Security. Springer, Cham, 63-73.
- Neamatullah, Ishna, Margaret M. Douglass, H. Lehman Li-wei, Andrew Reisner, Mauricio Villarroel, William J. Long, Peter Szolovits, George B. Moody, Roger G. Mark, and Gari D. 2008. Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making, 8, 1, 1-17.
- Iwendi, Celestine, Syed Atif Moqurrab, Adeel Anjum, Sangeen Khan, Senthilkumar Mohan, and Gautam Srivastava. 2020. N-sanitization: A semantic privacy-preserving framework for unstructured medical datasets. Computer Communications, 161, 160-171.
- Liu, Zengjian, Buzhou Tang, Xiaolong Wang, and Qingcai Chen. 2017. De-identification of clinical notes via recurrent neural network and conditional random field. Journal of Biomedical Informatics, 75, S34-S42.
- Yogarajan, Vithya, Bernhard Pfahringer, and Michael Mayo. 2020. A review of automatic end-to-end de-identification: Is high accuracy the only metric? Applied Artificial Intelligence, 34, 3, 251-269.
- Zeng, Daojian, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 1753-1762.
- Liu, Tianyu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018. Table-to-text generation by structure-aware seq2seq learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). AAAI Press, New Orleans, Louisiana, USA, 4881-4888.
- Puduppully, Ratish, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19). AAAI Press, Honolulu, Hawaii, USA, 33, 01, 6908-6915.
- UCI Machine Learning Repository. [Online]. Available: http://archive.ics.uci.edu/ml/datasets.php.