Using machine learning to detect PII from attributes and supporting activities of information assets

The Journal of Supercomputing

Abstract

Since the EU General Data Protection Regulation (“GDPR”) and similar personal data protection legislation in Taiwan came into force, enterprises have been required to protect their customers’ personal data adequately. Many enterprises use automated scanning systems to locate personally identifiable information (“PII”) and help ensure compliance with the law. However, these scanners cannot detect personal data stored in non-electronic form, so some PII goes unidentified. To close this gap, we propose a random forest (“RF”) approach for detecting such unidentified PII. We identify relevant peripheral attributes of PII and use them to train a machine learning model that detects PII automated scanners would otherwise miss. Our study shows that the proposed model achieves an F1-measure of at least 90%, a higher accuracy than automated scanners achieve when detecting PII in an enterprise’s inventory of information assets. Finally, our experimental results show that, compared with manual PII detection, the proposed model reduces the time required to detect PII by a factor of 100 and increases the F1-measure by 2%.
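
The general idea of training a random forest on peripheral asset attributes and scoring it with the F1-measure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three feature columns, the synthetic labels, and all parameter values are assumptions invented for the example, since the abstract does not list the actual attributes or dataset. It relies on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_assets = 400

# Hypothetical peripheral attributes of each information asset (all invented):
# column 0: asset is kept in non-electronic form (0/1)
# column 1: owning department handles customer records (0/1)
# column 2: number of supporting activities linked to the asset
X = np.column_stack([
    rng.integers(0, 2, n_assets),
    rng.integers(0, 2, n_assets),
    rng.integers(0, 10, n_assets),
])

# Synthetic label "asset contains PII": loosely tied to the attributes, plus noise.
score = 1.5 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0, 0.5, n_assets)
y = (score > 1.6).astype(int)

# Hold out a quarter of the assets to estimate detection quality.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the random forest and predict PII presence for the held-out assets.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# F1-measure: harmonic mean of precision and recall on the positive (PII) class.
f1 = f1_score(y_test, y_pred)
print(f"F1-measure on held-out assets: {f1:.3f}")
```

Because the data here is synthetic, the printed score says nothing about the results reported in the abstract; the sketch only shows how such a model could be trained and evaluated.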

Acknowledgements

This work was partially supported by Onward Security (No. 209A136), the National Taipei University of Technology-Beijing University of Technology Joint Research Program (No. NTUT-BJUT-110-01), and the Ministry of Science and Technology (No. 110-2637-H-027-004-).

Author information

Corresponding author

Correspondence to Tzu-Yin Liao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wei, YC., Liao, TY. & Wu, WC. Using machine learning to detect PII from attributes and supporting activities of information assets. J Supercomput 78, 9392–9413 (2022). https://doi.org/10.1007/s11227-021-04239-9

  • DOI: https://doi.org/10.1007/s11227-021-04239-9
