
Anonymizing bag-valued sparse data by semantic similarity-based clustering

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Web query logs are a rich source of information, but publishing them poses serious privacy risks. We preserve privacy when publishing vocabularies extracted from a web query log by introducing vocabulary k-anonymity, which prevents re-identification attacks that reveal the real identities associated with vocabularies. A vocabulary is a bag of query terms extracted from the queries issued by a user at a specified granularity. Such bag-valued data are extremely sparse, which makes it hard to retain enough utility while enforcing k-anonymity. To the best of our knowledge, prior work does not solve this problem: some approaches achieve a different privacy principle (for example, differential privacy), some deal with a different type of data (for example, set-valued or relational data), and some consider a different publication scenario (for example, publishing frequent keywords). To retain enough data utility, we propose a semantic similarity-based clustering approach that measures the semantic similarity between a pair of terms by the minimum path distance over a semantic network of terms such as WordNet, computes the semantic similarity between two vocabularies by a weighted bipartite matching, and publishes the typical vocabulary for each cluster of semantically similar vocabularies. Extensive experiments on the AOL query log show that our approach retains enough data utility in terms of loss metrics and in frequent pattern mining.
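
The two similarity measures named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy semantic network `EDGES` stands in for WordNet, the function names are hypothetical, the `unmatched_cost` penalty for leftover terms is an assumption, and the matching is done by brute force rather than by a dedicated weighted bipartite matching algorithm. A smaller distance means the two vocabularies are more similar.

```python
from collections import Counter, deque
from itertools import permutations

# Hypothetical toy semantic network (undirected edges); a stand-in for WordNet.
EDGES = [
    ("entity", "animal"), ("entity", "vehicle"),
    ("animal", "dog"), ("animal", "cat"),
    ("vehicle", "car"), ("vehicle", "truck"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of term pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_distance(graph, src, dst):
    """Minimum number of edges between two terms, found by breadth-first search."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two terms are not connected


def vocabulary_distance(graph, bag_a, bag_b, unmatched_cost):
    """Minimum-cost matching between the term occurrences of two bags.

    Brute force over all assignments, so only suitable for tiny bags.
    Occurrences left unmatched (or matched to an unreachable term) are
    charged `unmatched_cost`, which is an assumption of this sketch.
    """
    a = sorted(bag_a.elements())
    b = sorted(bag_b.elements())
    if len(a) > len(b):
        a, b = b, a  # always match the smaller bag into the larger one
    leftover = unmatched_cost * (len(b) - len(a))
    best = None
    for perm in permutations(b, len(a)):
        cost = leftover
        for s, t in zip(a, perm):
            d = path_distance(graph, s, t)
            cost += unmatched_cost if d is None else d
        if best is None or cost < best:
            best = cost
    return best


GRAPH = build_graph(EDGES)
v1 = Counter({"dog": 2, "car": 1})            # bag: a term may occur more than once
v2 = Counter({"cat": 1, "dog": 1, "truck": 1})
v3 = Counter({"car": 1, "truck": 1})

print(vocabulary_distance(GRAPH, v1, v2, unmatched_cost=5))
print(vocabulary_distance(GRAPH, v1, v3, unmatched_cost=5))
```

In this toy setting `v1` is closer to `v2` than to `v3`, so a clustering step would prefer to group `v1` with `v2`. For realistic bag sizes the brute-force loop would have to be replaced by a polynomial-time assignment algorithm.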

Author information

Corresponding author

Correspondence to Junqiang Liu.

About this article

Cite this article

Liu, J., Wang, K. Anonymizing bag-valued sparse data by semantic similarity-based clustering. Knowl Inf Syst 35, 435–461 (2013). https://doi.org/10.1007/s10115-012-0515-8


  • DOI: https://doi.org/10.1007/s10115-012-0515-8
