Document Sanitization: Measuring Search Engine Information Loss and Risk of Disclosure for the Wikileaks cables

  • Conference paper
Privacy in Statistical Databases (PSD 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7556)

Abstract

In this paper we evaluate the effect of a document sanitization process on a set of information retrieval metrics, in order to measure information loss and risk of disclosure. As an example document set, we use a subset of the Wikileaks cables, made up of documents relating to five key news items that were revealed by the cables. To sanitize the documents we have developed a semi-automatic anonymization process following the guidelines of Executive Order 13526 (2009) of the US Administration. The process (i) identifies and anonymizes specific person names and data, and (ii) applies concept generalization based on WordNet categories to identify words categorized as classified. Finally, we manually revise the text from a contextual point of view to eliminate complete sentences, paragraphs and sections where necessary. We show that significant sanitization can be applied while maintaining the relevance of the documents to the queries corresponding to the five key news items.
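The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the name list, the small generalization map (a hand-written stand-in for a WordNet category lookup), the toy documents, the boolean retrieval function and the choice of precision and recall as metrics are all assumptions made for illustration. The sketch sanitizes a tiny document set, runs the same queries before and after sanitization, and compares the results. A query on a masked name returning nothing suggests reduced risk of disclosure, while a drop in recall for a legitimate query suggests information loss.

```python
import re

# Hypothetical inputs for illustration only; none of these come from the paper.
PERSON_NAMES = ["John Doe"]
# Hand-written stand-in for a WordNet-based concept generalization step.
GENERALIZE = {"missile": "weapon", "uranium": "material"}


def sanitize(text):
    """Mask listed person names, then replace listed terms with a broader concept."""
    for name in PERSON_NAMES:
        text = text.replace(name, "[PERSON]")

    def swap(match):
        word = match.group(0)
        return GENERALIZE.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, text)


def retrieve(query, docs):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = set(query.lower().split())
    return {
        doc_id
        for doc_id, text in docs.items()
        if terms <= set(re.findall(r"[a-z]+", text.lower()))
    }


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a known relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy document set (assumed), sanitized copy, and per-query relevance judgments.
docs = {
    "d1": "John Doe discussed the missile sale with the embassy",
    "d2": "The embassy reported on uranium shipments",
    "d3": "Trade talks continued at the embassy",
}
sanitized = {doc_id: sanitize(text) for doc_id, text in docs.items()}
relevant = {"embassy sale": {"d1"}, "embassy missile": {"d1"}, "john doe": {"d1"}}

for query, rel in relevant.items():
    before = precision_recall(retrieve(query, docs), rel)
    after = precision_recall(retrieve(query, sanitized), rel)
    print(query, "before:", before, "after:", after)
```

In this toy setting, a query that avoids the masked or generalized terms still finds its document after sanitization, whereas queries on the person name or on a generalized term no longer do. A realistic evaluation would use a ranked retrieval model and the actual relevance judgments for the five news items.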

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nettleton, D.F., Abril, D. (2012). Document Sanitization: Measuring Search Engine Information Loss and Risk of Disclosure for the Wikileaks cables. In: Domingo-Ferrer, J., Tinnirello, I. (eds) Privacy in Statistical Databases. PSD 2012. Lecture Notes in Computer Science, vol 7556. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33627-0_24

  • DOI: https://doi.org/10.1007/978-3-642-33627-0_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33626-3

  • Online ISBN: 978-3-642-33627-0

  • eBook Packages: Computer Science, Computer Science (R0)
