Abstract
Loss of sensitive data is a common problem with potentially severe consequences. When documents are categorized according to their sensitivity, security controls can be applied based on this classification. However, errors in the classification process may effectively result in information leakage. While automated classification techniques can be used to mitigate this risk, little work has been done to evaluate the effectiveness of such techniques when sensitive content has been transformed (e.g., when a document has been summarized or rewritten, or when paragraphs have been copy-pasted into a new document). To better handle these more difficult data leaks, this paper proposes the use of controlled environments to detect misclassification. The incoming information flow is monitored so that the documents imported into a controlled environment can be used to better determine the sensitivity of the document(s) created within the same environment. Our evaluation results show that this approach, using techniques from machine learning and information retrieval, provides improved detection of incorrectly classified documents that have been subject to more complex data transformations.
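The idea of comparing a newly created document against the documents imported into the same environment can be illustrated with a small sketch. This is not the method evaluated in the paper; it only assumes one common tf-idf weighting with cosine similarity, and the names `LEVELS`, `estimate_level`, `is_misclassified`, the label set, and the threshold of 0.3 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical sensitivity levels, ordered from least to most sensitive.
LEVELS = ["UNCLASSIFIED", "RESTRICTED", "SECRET"]


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: one of many heuristic variants, chosen here as an assumption.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def estimate_level(created, imported, threshold=0.3):
    """Estimate the level of `created` from `imported`, a list of (text, level).

    The estimate is the highest level among imported documents whose
    similarity to the created document reaches the (arbitrary) threshold.
    """
    vectors = tfidf_vectors([created] + [text for text, _ in imported])
    created_vec, imported_vecs = vectors[0], vectors[1:]
    level, best = LEVELS[0], 0.0
    for vec, (_, doc_level) in zip(imported_vecs, imported):
        sim = cosine(created_vec, vec)
        best = max(best, sim)
        if sim >= threshold and LEVELS.index(doc_level) > LEVELS.index(level):
            level = doc_level
    return level, best


def is_misclassified(assigned, created, imported, threshold=0.3):
    """Flag a document whose assigned level is below the estimated level."""
    estimated, _ = estimate_level(created, imported, threshold)
    return LEVELS.index(assigned) < LEVELS.index(estimated)
```

In this sketch, a document labeled at a lower level than a sufficiently similar imported document is flagged for review. A realistic system would need a tuned threshold and a similarity measure that is more robust to summarization and rewriting than plain term overlap.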
Notes
- 1. Transcendental meditation.
- 2.
- 3.
- 4.
- 5. It should be noted that there exist many heuristic variations of the tf and idf formulas.
- 6. The Church of Jesus Christ of Latter-day Saints.
- 7. Transcendental meditation.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Kongsgård, K.W., Nordbotten, N.A., Mancini, F., Engelstad, P.E. (2016). Data Loss Prevention Based on Text Classification in Controlled Environments. In: Ray, I., Gaur, M., Conti, M., Sanghi, D., Kamakoti, V. (eds) Information Systems Security. ICISS 2016. Lecture Notes in Computer Science, vol 10063. Springer, Cham. https://doi.org/10.1007/978-3-319-49806-5_7
DOI: https://doi.org/10.1007/978-3-319-49806-5_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49805-8
Online ISBN: 978-3-319-49806-5
eBook Packages: Computer Science, Computer Science (R0)