research-article

Deep-Confidentiality: An IoT-Enabled Privacy-Preserving Framework for Unstructured Big Biomedical Data

Authors:

Syed Atif Moqurrab,

Gwanggil Jeon

ACM Transactions on Internet Technology (TOIT), Volume 22, Issue 2

Article No.: 42, Pages 1 - 21

https://doi.org/10.1145/3421509

Published: 10 November 2021

Abstract

With the evolution of the Internet of Things, clinical data is growing exponentially and is increasingly produced by smart technologies. The resulting big biomedical data is confidential because it contains patients' personal information and findings. It is usually stored in the cloud, which makes it convenient to access and share. Data shared for research purposes helps reveal useful and previously unexposed aspects. Unfortunately, sharing such sensitive data also introduces privacy threats. Clinical data is generally available in textual format (e.g., perception reports). Within natural language processing, many studies have been published on mitigating privacy breaches in textual clinical data. However, the current studies still have limitations and shortcomings that need to be addressed. This article proposes Deep-Confidentiality, a novel framework for textual medical data privacy. Compared with current state-of-the-art techniques, the proposed framework improves both Medical Entity Recognition (MER), using deep neural networks, and sanitization. A new, generic utility metric is also proposed; it overcomes the shortcomings of the existing utility metric and gives a truer representation of sanitized documents relative to the original documents. The framework is evaluated on the i2b2-2010 NLP challenge dataset, which is considered one of the more complex medical datasets for MER. It improves MER by 7.8% in recall, 7% in precision, and 3.8% in F1-score compared with existing deep learning models. It also improves the data utility of sanitized documents by up to 13.79% when k = 3.
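The recall, precision, and F1-score figures quoted for MER can be read against the usual definitions of those metrics. The following is a minimal sketch, assuming exact-match scoring over (start, end, label) entity spans; the abstract does not state the matching rule used in the article, and the function name `entity_prf`, the span encoding, and the sample entities are hypothetical and chosen only for illustration.

```python
from typing import Set, Tuple

# Assumption: an entity is a (start_token, end_token, label) triple and a
# prediction counts as correct only if all three fields match exactly.
Entity = Tuple[int, int, str]


def entity_prf(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-label matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative data only; these are not results from the article.
gold = {(0, 1, "problem"), (4, 4, "test"), (7, 9, "treatment")}
predicted = {
    (0, 1, "problem"),     # correct
    (4, 4, "treatment"),   # right span, wrong label
    (7, 9, "treatment"),   # correct
    (11, 11, "test"),      # spurious entity
}

precision, recall, f1 = entity_prf(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact matching is the stricter of the common conventions; a relaxed (overlap-based) rule would credit the prediction with the right span but wrong boundaries differently, so reported scores are only comparable when the same rule is used.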

Index Terms

Deep-Confidentiality: An IoT-Enabled Privacy-Preserving Framework for Unstructured Big Biomedical Data
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Security and privacy
  1. Human and societal aspects of security and privacy
    1. Privacy protections


Published In


ACM Transactions on Internet Technology Volume 22, Issue 2

May 2022

582 pages

ISSN:1533-5399

EISSN:1557-6051

DOI:10.1145/3490674

Editor:
Ling Liu
Georgia Institute of Technology, USA


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 November 2021

Accepted: 01 August 2020

Revised: 01 August 2020

Received: 01 June 2020

Published in TOIT Volume 22, Issue 2

Qualifiers

Research-article
Refereed
