Using machine learning to detect PII from attributes and supporting activities of information assets

Wei, Yu-Chih; Liao, Tzu-Yin; Wu, Wei-Chen

doi:10.1007/s11227-021-04239-9

Published: 17 January 2022

Volume 78, pages 9392–9413, (2022)

The Journal of Supercomputing

701 Accesses
3 Citations

Abstract

Since the EU General Data Protection Regulation (“GDPR”) and similar personal data protection legislation in Taiwan came into force, enterprises have been required to provide adequate protection for their customers’ personal data. Many enterprises use automated scanning systems for personally identifiable information (“PII”) to locate the PII they hold and ensure compliance with the law. However, these scanners cannot detect personal data stored in non-electronic form, so some PII goes unidentified. To close this gap, we propose a random forest (“RF”) approach for detecting PII that automated scanners miss. We identify peripheral attributes of information assets that are associated with PII and use them to train a machine learning model. Our study shows that the proposed model achieves an F1-measure of at least 90%, which is more accurate than automated scanners at detecting PII in an enterprise’s inventory of information assets. Experimental results further show that, compared with manual PII detection, the proposed model reduces detection time by a factor of 100 and increases the F1-measure by 2%.
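The abstract reports results as an F1-measure but does not define it here. As a minimal sketch, assuming the standard definition (the harmonic mean of precision and recall for the positive class), the metric can be computed as follows. The function name `f1_measure` and the label vectors are hypothetical illustrations, not data or code from the study.

```python
def f1_measure(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the positive class.

    Assumes the standard definitions:
      precision = TP / (TP + FP)
      recall    = TP / (TP + FN)
      F1        = 2 * precision * recall / (precision + recall)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = the information asset contains PII, 0 = it does not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

precision, recall, f1 = f1_measure(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 balances missed PII (false negatives) against false alarms (false positives), it is a more informative single figure than plain accuracy when assets containing PII are a minority of the inventory.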

Acknowledgements

This work was partially supported by Onward Security (No. 209A136), the National Taipei University of Technology-Beijing University of Technology Joint Research Program (No. NTUT-BJUT-110-01), and the Ministry of Science and Technology (No. 110-2637-H-027-004-).

Author information

Authors and Affiliations

National Taipei University of Technology, Taipei, Taiwan
Yu-Chih Wei & Tzu-Yin Liao
National Taipei University of Business, Taipei, Taiwan
Wei-Chen Wu

Corresponding author

Correspondence to Tzu-Yin Liao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wei, YC., Liao, TY. & Wu, WC. Using machine learning to detect PII from attributes and supporting activities of information assets. J Supercomput 78, 9392–9413 (2022). https://doi.org/10.1007/s11227-021-04239-9

Accepted: 30 November 2021
Published: 17 January 2022
Issue Date: May 2022
DOI: https://doi.org/10.1007/s11227-021-04239-9
