ABSTRACT
The question of how to publish an anonymized search log was brought to the forefront by a well-intentioned but privacy-unaware AOL search log release. Since then, a series of ad-hoc techniques have been proposed in the literature, though none are known to be provably private. In this paper, we take a major step towards a solution: we show how queries, clicks, and their associated perturbed counts can be published in a manner that rigorously preserves privacy. Our algorithm is decidedly simple to state, but non-trivial to analyze. On the opposite side of privacy lies the question of whether the data we can safely publish is of any use. Our findings offer a glimmer of hope: through a collection of experiments on a real search log, we demonstrate that a non-negligible fraction of queries and clicks can indeed be safely published. In addition, we select one application, keyword generation, and show that the keyword suggestions generated from the perturbed data resemble those generated from the original data.
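The abstract does not spell out the mechanism, but the general shape of a perturb-and-threshold release of this kind can be sketched as follows. This is a minimal illustration, assuming Laplace noise calibrated to a privacy parameter epsilon and a release threshold; the function names, parameters, and details are illustrative, not the paper's exact algorithm.

```python
import math
import random

def sample_laplace(scale):
    """Draw one sample from a zero-mean Laplace distribution
    via inverse-CDF sampling."""
    u = random.random() - 0.5  # uniform in [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def publish_counts(counts, epsilon=1.0, threshold=10.0):
    """Add Laplace noise of scale 1/epsilon to each raw count and
    release only the items whose noisy count clears the threshold,
    so rare (potentially identifying) queries are suppressed."""
    released = {}
    for item, count in counts.items():
        noisy = count + sample_laplace(1.0 / epsilon)
        if noisy >= threshold:
            released[item] = noisy
    return released
```

The thresholding step is what keeps rare, user-identifying queries out of the output: a query issued by only one or two users almost never survives, while popular queries are released with counts close to their true values.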
Index Terms
- Releasing search queries and clicks privately