Data Loss Prevention Based on Text Classification in Controlled Environments

Kongsgård, Kyrre Wahl; Nordbotten, Nils Agne; Mancini, Federico; Engelstad, Paal E.

doi:10.1007/978-3-319-49806-5_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 10063)

Included in the following conference series:

International Conference on Information Systems Security

Abstract

Loss of sensitive data is a common problem with potentially severe consequences. When documents are categorized according to their sensitivity, security controls can be applied based on this classification. However, errors in the classification process may effectively result in information leakage. While automated classification techniques can be used to mitigate this risk, little work has been done to evaluate the effectiveness of such techniques when sensitive content has been transformed (e.g., a document can be summarized, rewritten, or have paragraphs copy-pasted into a new document). To better handle these more difficult data leaks, this paper proposes the use of controlled environments to detect misclassification. By monitoring the incoming information flow, the documents imported into a controlled environment can be used to better determine the sensitivity of the document(s) created within the same environment. Our evaluation results show that this approach, using techniques from machine learning and information retrieval, provides improved detection of incorrectly classified documents that have been subject to more complex data transformations.
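
The idea of using imported documents to judge the sensitivity of a newly created one can be illustrated with a small sketch. The code below is not the authors' implementation; it is a minimal example that assumes tf-idf weighting with cosine similarity as the retrieval technique, a fixed similarity threshold, and a three-level label scheme. The names LEVELS, inferred_level, is_misclassified, and the threshold value 0.3 are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical sensitivity levels, ordered from least to most sensitive.
LEVELS = {"unclassified": 0, "restricted": 1, "secret": 2}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per document.

    Uses raw term counts and a smoothed idf; this is only one of
    many tf-idf variants.
    """
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for counts in counts_per_doc for term in counts)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in counts.items()}
        for counts in counts_per_doc
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def inferred_level(new_doc, imported, threshold=0.3):
    """Highest level among imported documents similar to new_doc.

    imported is a list of (text, level) pairs that entered the
    controlled environment. The threshold is an assumed value.
    """
    vectors = tfidf_vectors([text for text, _ in imported] + [new_doc])
    new_vec = vectors[-1]
    best = "unclassified"
    for vec, (_, level) in zip(vectors, imported):
        if cosine(new_vec, vec) >= threshold and LEVELS[level] > LEVELS[best]:
            best = level
    return best


def is_misclassified(new_doc, assigned_level, imported, threshold=0.3):
    """Flag a document whose assigned level is below the inferred one."""
    return LEVELS[assigned_level] < LEVELS[inferred_level(new_doc, imported, threshold)]
```

In this sketch a document created inside the environment is flagged when it closely resembles an imported document that carries a higher label than the one assigned to it. A real system would likely need a learned classifier and a more robust similarity measure to cope with summarized or heavily rewritten text.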

Notes

1. Transcendental meditation.
2. http://www.abbyy.com/.
3. https://www.spinrewriter.com/.
4. https://cloud.google.com/translate/docs.
5. It should be noted that there exists a great number of heuristic variations of the tf and idf formulas.
6. The Church of Jesus Christ of Latter-day Saints.
7. Transcendental meditation.

Author information

Authors and Affiliations

Norwegian Defence Research Establishment (FFI), P.O. Box 25, 2027, Kjeller, Norway
Kyrre Wahl Kongsgård, Nils Agne Nordbotten, Federico Mancini & Paal E. Engelstad
Department of Informatics, University of Oslo, Blindern, 0316, Oslo, Norway
Kyrre Wahl Kongsgård & Nils Agne Nordbotten
Oslo and Akershus University College of Applied Sciences (HiOA), 0130, Oslo, Norway
Paal E. Engelstad
University Graduate Center Kjeller, UNIK, Kjeller, Norway
Kyrre Wahl Kongsgård & Nils Agne Nordbotten

Corresponding author

Correspondence to Kyrre Wahl Kongsgård.

Editor information

Editors and Affiliations

Colorado State University, Fort Collins, Colorado, USA
Indrajit Ray
Malaviya National Institute of Technology, Jaipur, India
Manoj Singh Gaur
University of Padua, Padua, Italy
Mauro Conti
IIIT Delhi, Delhi, India
Dheeraj Sanghi
IIT Madras, Madras, India
V. Kamakoti

About this paper

Cite this paper

Kongsgård, K.W., Nordbotten, N.A., Mancini, F., Engelstad, P.E. (2016). Data Loss Prevention Based on Text Classification in Controlled Environments. In: Ray, I., Gaur, M., Conti, M., Sanghi, D., Kamakoti, V. (eds) Information Systems Security. ICISS 2016. Lecture Notes in Computer Science, vol 10063. Springer, Cham. https://doi.org/10.1007/978-3-319-49806-5_7

DOI: https://doi.org/10.1007/978-3-319-49806-5_7
Published: 24 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49805-8
Online ISBN: 978-3-319-49806-5
eBook Packages: Computer Science, Computer Science (R0)
