research-article

Deep-Confidentiality: An IoT-Enabled Privacy-Preserving Framework for Unstructured Big Biomedical Data

Published: 10 November 2021

Abstract

With the evolution of the Internet of Things, clinical data is growing exponentially and is increasingly generated using smart technologies. This big biomedical data is confidential because it contains patients' personal information and findings. It is usually stored in the cloud, which makes it convenient to access and share. Data shared for research purposes can reveal useful and previously unexposed aspects. Unfortunately, sharing such sensitive data also creates privacy threats. Clinical data is generally available in textual format (e.g., perception reports). Within natural language processing, many studies have been published on mitigating privacy breaches in textual clinical data. However, the current studies still have limitations and shortcomings that need to be addressed. This article proposes Deep-Confidentiality, a novel framework for textual medical data privacy. The proposed framework improves both Medical Entity Recognition (MER), using deep neural networks, and sanitization compared to current state-of-the-art techniques. A new, generic utility metric is also proposed, which overcomes the shortcomings of the existing utility metric and gives a true representation of sanitized documents relative to the original documents. To assess its effectiveness, the framework is evaluated on the i2b2-2010 NLP challenge dataset, which is considered one of the complex medical datasets for MER. The proposed framework improves MER by 7.8% in recall, 7% in precision, and 3.8% in F1-score compared to existing deep learning models. It also improves the data utility of sanitized documents by up to 13.79% when k = 3.
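For readers unfamiliar with the MER metrics quoted in the abstract, the following is a minimal sketch of how precision, recall, and F1-score are commonly computed for entity recognition under exact span matching. It is not the authors' evaluation code: the `(start, end, label)` span format, the function name `prf1`, and the toy annotations are assumptions made purely for illustration.

```python
from typing import Set, Tuple

# Hypothetical entity format: (start_token, end_token, label).
Entity = Tuple[int, int, str]


def prf1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Return exact-match precision, recall, and F1 over entity spans."""
    # A prediction counts as correct only if span and label both match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy annotations (invented for this example, not taken from any dataset).
gold = {(0, 1, "problem"), (4, 4, "treatment"), (7, 9, "test")}
predicted = {(0, 1, "problem"), (4, 4, "test"), (7, 9, "test")}

p, r, f = prf1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case two of the three predicted entities match the gold annotations exactly; the middle one has the right span but the wrong label, so it counts against both precision and recall.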

Published In

ACM Transactions on Internet Technology, Volume 22, Issue 2
May 2022
582 pages
ISSN:1533-5399
EISSN:1557-6051
DOI:10.1145/3490674
Editor: Ling Liu

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 November 2021
Accepted: 01 August 2020
Revised: 01 August 2020
Received: 01 June 2020
Published in TOIT Volume 22, Issue 2

Author Tags

  1. Deep neural network
  2. LSTM
  3. CNN
  4. textual data privacy
  5. big biomedical data

Qualifiers

  • Research-article
  • Refereed
