Abstract
Web query logs provide a wealth of information but also pose serious privacy risks. We preserve privacy when publishing vocabularies extracted from a web query log by introducing vocabulary k-anonymity, which prevents re-identification attacks that reveal the real identities behind vocabularies. A vocabulary is a bag of query terms extracted from the queries issued by a user at a specified granularity. Such bag-valued data are extremely sparse, which makes it hard to retain enough utility while enforcing k-anonymity. To the best of our knowledge, prior work does not solve this problem: some approaches achieve a different privacy principle (for example, differential privacy), some deal with a different type of data (for example, set-valued or relational data), and some consider a different publication scenario (for example, publishing frequent keywords). To retain enough data utility, we propose a semantic similarity-based clustering approach, which measures the semantic similarity between a pair of terms by the minimum path distance over a semantic network of terms such as WordNet, computes the semantic similarity between two vocabularies by a weighted bipartite matching, and publishes a typical vocabulary for each cluster of semantically similar vocabularies. Extensive experiments on the AOL query log show that our approach retains enough data utility in terms of loss metrics and in frequent pattern mining.
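The two distance computations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy semantic network `EDGES` stands in for WordNet, the function names are hypothetical, the matching is done by brute force rather than by an efficient weighted bipartite matching algorithm, and the `unmatched_cost` penalty for leftover terms is an assumption made only so that bags of different sizes can be compared. Smaller distances mean greater semantic similarity.

```python
from collections import deque
from itertools import permutations

# Hypothetical toy semantic network (undirected edges); a stand-in for WordNet.
EDGES = [
    ("entity", "animal"), ("entity", "vehicle"),
    ("animal", "dog"), ("animal", "cat"),
    ("vehicle", "car"), ("vehicle", "truck"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of term pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_distance(graph, src, dst):
    """Minimum number of edges between two terms, found by breadth-first search."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"no path between {src!r} and {dst!r}")


def vocabulary_distance(graph, bag_a, bag_b, unmatched_cost=4):
    """Cost of the cheapest one-to-one matching between two bags of terms.

    Brute force over all matchings, so only suitable for tiny bags.
    Each term left unmatched in the larger bag is charged `unmatched_cost`
    (an assumption of this sketch, not taken from the paper).
    """
    small, large = sorted((list(bag_a), list(bag_b)), key=len)
    best = None
    for perm in permutations(large, len(small)):
        cost = sum(path_distance(graph, s, t) for s, t in zip(small, perm))
        if best is None or cost < best:
            best = cost
    return best + unmatched_cost * (len(large) - len(small))


graph = build_graph(EDGES)
print(path_distance(graph, "dog", "cat"))                            # 2
print(vocabulary_distance(graph, ["dog", "car"], ["cat", "truck"]))  # 4
```

For vocabularies of realistic size, the brute-force loop would presumably be replaced by a polynomial-time assignment solver; the sketch only shows how term-level path distances feed into a vocabulary-level matching cost.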
Cite this article
Liu, J., Wang, K. Anonymizing bag-valued sparse data by semantic similarity-based clustering. Knowl Inf Syst 35, 435–461 (2013). https://doi.org/10.1007/s10115-012-0515-8
DOI: https://doi.org/10.1007/s10115-012-0515-8