Abstract
With the accelerating generation of digital content, it is often impractical at the point of creation to manually segregate sensitive information from information that can be shared. As a result, a great deal of useful content becomes inaccessible simply because it is intermixed with sensitive content. This paper compares traditional and neural techniques for detecting sensitive content, finding that the two techniques used together can yield improved results. Experiments with two test collections, one in which sensitivity is modeled as a topic and a second in which sensitivity is annotated directly, show consistent improvements on an intrinsic (classification effectiveness) measure. Extrinsic evaluation is conducted using a recently proposed learning-to-rank framework for sensitivity-aware ranked retrieval, with a measure that rewards finding relevant documents but penalizes revealing sensitive documents.
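The extrinsic measure described in the abstract can be illustrated with a minimal sketch: a rank-discounted gain that credits relevant documents and charges a penalty for each sensitive document a ranking reveals. The function name, the penalty weight, and the log2 discount here are illustrative assumptions, not the exact measure used in the paper.

```python
import math

def sensitivity_aware_dcg(ranking, relevant, sensitive, penalty=2.0):
    """Rank-discounted score that rewards retrieving relevant documents
    but subtracts a penalty each time a sensitive document is exposed."""
    score = 0.0
    for rank, doc in enumerate(ranking, start=1):
        discount = 1.0 / math.log2(rank + 1)
        if doc in sensitive:
            score -= penalty * discount   # revealing sensitive content is costly
        elif doc in relevant:
            score += discount             # finding relevant content is rewarded
    return score

# Example: d2 is relevant, d3 is sensitive; a ranking that surfaces d3
# late is hurt less than one that surfaces it early.
print(sensitivity_aware_dcg(["d1", "d2", "d3"], {"d2"}, {"d3"}))
```

With a penalty weight greater than the relevance gain, a system maximizing this score is pushed to suppress sensitive documents even at some cost in recall, which is the trade-off the paper's extrinsic evaluation is designed to expose.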
Acknowledgments
This work has been supported in part by NSF grant 1618695.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sayed, M.F., Mallekav, N., Oard, D.W. (2022). Comparing Intrinsic and Extrinsic Evaluation of Sensitivity Classification. In: Hagen, M., et al. Advances in Information Retrieval. ECIR 2022. Lecture Notes in Computer Science, vol 13186. Springer, Cham. https://doi.org/10.1007/978-3-030-99739-7_25
DOI: https://doi.org/10.1007/978-3-030-99739-7_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-99738-0
Online ISBN: 978-3-030-99739-7
eBook Packages: Computer Science