Abstract
Identifying sensitive information, whether personal or institutional, is a fundamental step in dealing with information leakage. Information leakage is among the most pressing problems facing companies and research centers, which dedicate considerable material and intellectual resources to it, in particular to developing new methods, or applying known ones, for identifying sensitive information. This effort has produced a growing number of proposals with promising results, but none yet offers a fully satisfactory solution. Under these conditions, a critical analysis of the existing methods and techniques, and of their future projections, is needed. This paper reviews the proposals for determining sensitivity in textual documents and introduces a taxonomy that clarifies the approaches taken to this problem in the context of information leakage. Starting from this critical analysis and from the practical needs raised by experts in the possible application areas, the paper outlines lines of research on the subject, including the development of methods for automating the classification of sensitive textual documents. Possible extensions of these studies to similar application areas are also proposed, based on other information carriers such as images, recordings, and other forms of information object, each of which entails levels of complexity that merit studies analogous to the one carried out in this work.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Morales Escobar, S.J., Ruiz Shulcloper, J., Juárez Landín, C., Pérez García, O.A., Ruiz Castilla, J.S. (2021). A Survey on the Methods to Determine the Sensitivity of Textual Documents: Solutions and Problems to Solve. In: Hernández Heredia, Y., Milián Núñez, V., Ruiz Shulcloper, J. (eds) Progress in Artificial Intelligence and Pattern Recognition. IWAIPR 2021. Lecture Notes in Computer Science, vol 13055. Springer, Cham. https://doi.org/10.1007/978-3-030-89691-1_28
DOI: https://doi.org/10.1007/978-3-030-89691-1_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89690-4
Online ISBN: 978-3-030-89691-1
eBook Packages: Computer Science, Computer Science (R0)