Elsevier

Knowledge-Based Systems

Volume 33, September 2012, Pages 124-135

Privacy-preserving SOM-based recommendations on horizontally distributed data

https://doi.org/10.1016/j.knosys.2012.02.013

Abstract

To produce predictions with decent accuracy, collaborative filtering algorithms need sufficient data. Due to the nature of online shopping and the increasing number of online vendors, different customers’ preferences about the same products can be distributed among various companies, even competing vendors. Therefore, companies holding data from too few users might decide to combine their data so that they can offer accurate predictions with acceptable online performance. However, they do not want to divulge their data, because such data are considered confidential and valuable. Furthermore, disclosing users’ preferences is not legal; nevertheless, if privacy is protected, the companies can collaborate to produce accurate predictions.

We propose a privacy-preserving scheme to provide recommendations on data horizontally partitioned among multiple parties. To improve online performance, the parties cluster their distributed data off-line without greatly jeopardizing their secrecy. They then estimate predictions using a k-nearest neighbor approach while preserving their privacy. We demonstrate that the proposed method preserves data owners’ privacy and is able to produce predictions efficiently. By performing several experiments using real data sets, we analyze our scheme in terms of accuracy. Our empirical outcomes show that it is still possible to estimate accurate predictions efficiently from horizontally distributed data while maintaining data owners’ confidentiality.

Introduction

Rapid improvements in Internet technology help people purchase many kinds of products online. Because of the attractiveness of online shopping, many online vendors have been founded. To help their customers choose the right products, e-commerce sites employ Collaborative Filtering (CF) schemes, because selecting appropriate products to purchase becomes a challenging problem as the number of choices increases [1]. In addition to recommending various products like books, movies, music CDs, and so on, CF systems are also used to suggest web pages.

The basic steps in the CF process are as follows [10], [17]: After collecting users’ preferences about various items, an n × m user-item matrix (D) is created, where n and m represent the number of users and items, respectively. CF schemes then estimate similarities between the users in their database and an active user (a) who is looking for a prediction for a target item (q). Next, they determine the neighbors of the active user a (the k most similar users) according to the similarity weights. Finally, a weighted average of the neighbors’ ratings on the target item q is calculated.
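The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scheme evaluated in the paper: the similarity measure (Pearson correlation over co-rated items), the use of None for unrated items, the restriction to positively correlated neighbors, and the names pearson and predict are all choices made for this sketch.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation over the items both users rated (None = unrated)."""
    common = [i for i in range(len(u)) if u[i] is not None and v[i] is not None]
    if len(common) < 2:
        return 0.0
    mean_u = sum(u[i] for i in common) / len(common)
    mean_v = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mean_u) * (v[i] - mean_v) for i in common)
    den = sqrt(sum((u[i] - mean_u) ** 2 for i in common)) * sqrt(
        sum((v[i] - mean_v) ** 2 for i in common)
    )
    return num / den if den else 0.0


def predict(D, a, q, k):
    """Predict the active user's rating on item q from the k best neighbors."""
    # Similarity between a and every user in D who rated q.
    scored = [(pearson(a, u), u[q]) for u in D if u[q] is not None]
    # Assumption of this sketch: only positively correlated users are neighbors.
    scored = [(w, r) for w, r in scored if w > 0]
    # Keep the k most similar users.
    neighbors = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    total = sum(w for w, _ in neighbors)
    if total == 0:
        return None  # no usable neighbor rated q
    # Similarity-weighted average of the neighbors' ratings on q.
    return sum(w * r for w, r in neighbors) / total


# Hypothetical 3 x 4 user-item matrix D and active user a (item 3 unrated).
D = [[5, 1, 3, 4], [4, 2, 3, 2], [1, 5, 3, 5]]
a = [5, 1, 3, None]
print(predict(D, a, q=3, k=2))
```

In this toy data the third user is negatively correlated with a and is therefore ignored, so the prediction is driven by the first two users only.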

One of the main purposes of CF systems is to offer accurate and reliable referrals. To produce precise and dependable predictions, such systems should collect ratings from a sufficient number of users. When online vendors own data from only a limited number of users, it becomes a challenge to form reliable and large enough neighborhoods, which might lead to low-quality CF services. Additionally, an inadequate amount of user data leads to the cold start problem, where e-commerce sites can produce predictions for only a limited number of items. Vendors might then lose customers due to the lack of accuracy in the recommendations received [6]. Therefore, holding a sufficient number of users’ ratings is imperative for the overall success of CF systems.

Some companies, especially recently established ones, might not have enough user data for recommendation purposes. Moreover, customers may prefer different online vendors for shopping. In other words, different users purchase the same products from different companies, and they can request referrals from the corresponding vendors. Consequently, ratings of the same items collected from many users for CF purposes might be horizontally partitioned among multiple vendors. For example, some clients purchase books from Amazon.com and some prefer Barnes & Noble.com, while others get them from Borders, and so on. These book sellers’ databases may include ratings for the same books recorded from disjoint sets of customers, and these can be jointly used for better referrals. Notice that this does not mean that online vendors sell exactly the same items; however, it is reasonable to expect their databases to contain a large number of identical items. Such data distribution leads to Horizontally Distributed Data (HDD). Formally, D is partitioned between C companies, where C is a constant representing the number of collaborating sites and C ≪ n. Each collaborating party j holds D_j, where D_j is an n_j × m matrix, j = 1, 2, …, C, and n_j is the number of users whose data are held by retailer j. Thus, each party j holds the ratings of n_j users for the same m items. We assume that each collaborating party’s database includes ratings for exactly the same products. In other words, D is the most comprehensive intersection of products in the cooperating companies’ databases, and it is updated periodically by inserting new users and/or items.
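The notation can be made concrete with a small Python sketch. The vendor names and ratings below are invented for illustration, and None is assumed to mark an unrated item. The stack function pools the rows in the clear only to show how the per-party matrices relate to D; the point of a privacy-preserving scheme is precisely that the parties never pool their data this way.

```python
# Hypothetical ratings held by C = 3 vendors for the same m = 4 items.
partitions = {
    "vendor_1": [[5, 3, None, 1], [4, None, None, 1]],          # n_1 = 2 users
    "vendor_2": [[1, 1, None, 5]],                              # n_2 = 1 user
    "vendor_3": [[None, 3, 4, 4], [2, 2, 5, None], [5, None, 1, 3]],  # n_3 = 3
}


def shape(partitions):
    """Return (n, m) for the integrated matrix D.

    Horizontal partitioning requires every party to hold the same m items,
    so all rows must have the same length; n is the sum of the n_j.
    """
    widths = {len(row) for D_j in partitions.values() for row in D_j}
    if len(widths) != 1:
        raise ValueError("parties do not hold the same set of items")
    m = widths.pop()
    n = sum(len(D_j) for D_j in partitions.values())
    return n, m


def stack(partitions):
    """Row-wise concatenation of the D_j (illustration only, not private)."""
    return [row for D_j in partitions.values() for row in D_j]


print(shape(partitions))
```

Each party contributes whole rows (users) and shares the column set (items), which is what distinguishes horizontal from vertical partitioning.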

There are numerous opportunities in e-commerce for beneficial collaboration. When data are distributed, data owners want to produce predictions on their integrated data in order to overcome accuracy and cold start problems [7]. Privacy-Preserving Collaborative Filtering (PPCF) on distributed data is important for both online companies and users because both benefit from it. However, companies do not want to share confidential data with each other, because they do not want to give up competitive knowledge advantages or violate anti-trust law [10]. As in e-commerce applications, data integration is becoming imperative in healthcare applications, in sharing scientific research data, and in solving life-threatening problems like efficient disease control and effective public safety [4]. Integrating data is unavoidable in such applications, but it is only possible if privacy is preserved. Without secrecy, the parties hesitate to collaborate due to privacy, financial, and legal reasons.

Data collected for prediction purposes are considered companies’ secret information because they can be used to profile their customers. Such data are also a valuable asset that can be transferred or sold in case of bankruptcy. Users’ ratings could be utilized to recruit new customers and increase sales through advertising based on users’ profiles. Online vendors are also obliged to protect the collected data. It is not legal to transfer users’ preferences. According to reports published by the Organisation for Economic Co-operation and Development (OECD) [32], [33], exposure of customers’ private data is a very serious issue, and companies are obliged to protect the data. Therefore, utilizing privacy-preserving measures is vital for alleviating privacy, financial, and legal concerns.

In this study, we propose a privacy-preserving method for providing k-nearest neighbor (k-nn)-based predictions on HDD without jeopardizing data owners’ confidentiality. In addition to preserving privacy, offering predictions within a limited time during an online interaction is also essential for the overall success of CF schemes. Since determining the nearest neighbors is difficult and time consuming when predictions are produced in a distributed manner, we propose to cluster the distributed data using Self-Organizing Map (SOM) clustering. E-commerce sites can cluster their split data off-line using SOM clustering while preserving their confidentiality, so that they are able to improve online performance. Moreover, besides privacy and performance, the parties are also able to improve accuracy through data integration. They can protect their secrecy during data clustering and prediction estimation without disclosing their data to each other. Since precision, privacy, and performance conflict with each other, we aim to provide a solution that strikes a balance among them.
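The performance argument can be illustrated with a short Python sketch: once users have been grouped off-line, the online neighbor search only needs to look at the active user's own cluster. The sketch assumes that SOM training has already produced one weight vector per map unit (the units list below is invented), that None marks an unrated item, and that a user is assigned to the unit with the smallest squared Euclidean distance over rated items. It leaves out the privacy protocol entirely, and the function names are hypothetical.

```python
def distance(x, w):
    """Squared Euclidean distance over the items that x has rated."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w) if xi is not None)


def nearest_unit(x, units):
    """Index of the best-matching unit for rating vector x."""
    return min(range(len(units)), key=lambda j: distance(x, units[j]))


def build_clusters(D, units):
    """Off-line step: group user indices by their best-matching unit."""
    clusters = {j: [] for j in range(len(units))}
    for idx, row in enumerate(D):
        clusters[nearest_unit(row, units)].append(idx)
    return clusters


def candidate_neighbors(a, units, clusters):
    """Online step: only users in the active user's cluster are candidates."""
    return clusters[nearest_unit(a, units)]


# Hypothetical unit weight vectors, assumed to come from off-line SOM training.
units = [[5, 5, 1, 1], [1, 1, 5, 5]]
D = [[5, 4, 1, None], [4, 5, None, 1], [1, 2, 5, 5], [None, 1, 4, 5]]
clusters = build_clusters(D, units)
print(candidate_neighbors([5, None, 1, 2], units, clusters))  # prints [0, 1]
```

With this arrangement the online cost of finding neighbors depends on the size of one cluster rather than on n, at the price of possibly missing similar users who fell into a different cluster.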

We explain related studies conducted so far in Section 2. Section 3 presents SOM clustering and gives a brief description of CF based on k-nn. After extensively presenting our privacy-preserving scheme for providing predictions on HDD in Section 4, we scrutinize our scheme in terms of privacy in Section 5. Section 6 presents additional costs like storage, computation, and communication costs caused by privacy-preserving measures. After presenting our real data-based trials, empirical results, and discussion about the outcomes in Section 7, we finally elucidate our conclusions and briefly present future work in Section 8.

Section snippets

Methods for enhancing online performance of collaborative filtering schemes

Various approaches have been proposed to enhance the online performance of CF systems. Goldberg et al. [12] make use of Principal Component Analysis (PCA) to generate constant time predictions. Their method can produce a referral for a single item in constant time O(1). Sarwar et al. [39] propose to reduce the dimensions of data by using Singular Value Decomposition (SVD), and they improve the performance of producing referrals. Clustering is also among the methods that are applied to CF to improve

SOM clustering and k-nn-based collaborative filtering

Roh et al. [38] apply SOM clustering to CF for better predictions. According to their empirical results, the SOM-based CF scheme provides higher quality predictions than the other models compared. In addition to providing high quality recommendations, SOM is capable of clustering large-scale databases, which is important for handling high-dimensional data in recommender systems [31]. SOM was introduced by Kohonen [25]. SOM reduces dimensions into a one- or two-dimensional lattice by

Privacy-preserving SOM-based predictions on distributed data

The companies, especially malicious ones, participating in distributed CF services might try to derive information about each other’s data. They can try to obtain useful information from interim results or final predictions. To protect data owners’ confidentiality, our proposed scheme has to withstand privacy attacks. We define privacy as follows: The parties should not be able to learn the true rating values and the rated and/or unrated items held by each other. Besides protecting

Privacy analysis

Lindell and Pinkas [29] define privacy in terms of distributed data-based data mining, as follows: “No party should learn anything more than its prescribed output. In particular, the only information that should be learned about other parties’ inputs is what can be derived from the output itself.” Similarly, in our proposed scheme, the companies should not be able to learn the true ratings and the rated and/or unrated items held by each other. To analyze our proposed approach in terms of

Supplementary costs analysis

Due to privacy protection measures, extra costs like storage, communication, and computation costs are inevitable because privacy, accuracy, and performance conflict with each other. Note that off-line computation and communication costs are not critical for overall performance. Therefore, it is better to conduct as many computations as possible off-line in order to improve online efficiency. However, in order to provide new recommendations after users provide new ratings, the collaborating

Accuracy and overall performance analysis

To test our scheme in terms of accuracy and investigate its overall performance, we perform various experiments on real data sets. Accuracy shows how precise the recommendations produced by our privacy-preserving scheme are. We conduct trials to test how the proposed scheme affects the quality of the predictions.

Conclusions and future work

We presented a privacy-preserving scheme to provide recommendations based on data horizontally distributed among multiple parties using a clustering-based collaborative filtering algorithm. Accuracy, performance, and privacy are major goals that recommender systems want to accomplish. Since they are conflicting goals, we provided a scheme that balances them. To improve online performance, clustering is widely used. We also applied clustering in our proposed scheme. Data collected for

Acknowledgment

This work was supported by the Grant 108E221 from TUBITAK.

References (44)

  • M. Berry et al., Mastering Data Mining, 2000
  • S.S. Bhowmick, L. Gruenwald, M. Iwaihara, S. Chatvichienchai, PRIVATE-IYE: a framework for privacy-preserving data...
  • J. Canny, Collaborative filtering with privacy, in: Proceedings of IEEE Symposium on Security and Privacy, CA, USA,...
  • J. Canny, Collaborative filtering with privacy via factor analysis, in: Proceedings of the 25th Annual International...
  • C. Clifton, M. Kantarcioglu, A. Doan, G. Schadow, J. Vaidya, A. Elmagarmid, D. Suciu, Privacy-preserving data...
  • G. Gan, C. Ma, J. Wu, Data Clustering: Theory, Algorithms, and Applications, SIAM,...
  • K. Goldberg et al., Eigentaste: a constant time collaborative filtering algorithm, Information Retrieval, 2001
  • D. Gupta, M. Digiovanni, H. Narita, K. Goldberg, Jester 2.0: evaluation of a new linear time collaborative filtering...
  • S. Haykin, Neural Networks: A Comprehensive Foundation, 1999
  • J.L. Herlocker et al., Evaluating collaborative filtering recommender systems, ACM Transactions on Information Systems (TOIS), 2004
  • C. Kaleli, H. Polat, Providing naive Bayesian classifier-based private recommendations on partitioned data, in: J.N....
  • C. Kaleli, H. Polat, Providing private recommendations using naive Bayesian classifier, in: K.M. WegrzynWolska, P.S....

Cited by (27)

    • An efficient multi-party scheme for privacy preserving collaborative filtering for healthcare recommender system

      2018, Future Generation Computer Systems
      Citation Excerpt :

      Yakut and Polat proposed PPCF scheme based on Singular Value Decomposition for HDD and VDD [12]. Kaleli and Polat proposed PPCF technique for HDD on multiple parties based on clustering [13]. Different parties perform clustering of their distributed data off-line securely and then generate recommendations online, based on k-nearest neighbor approach.

    • Robustness analysis of arbitrarily distributed data-based recommendation methods

      2016, Expert Systems with Applications
      Citation Excerpt :

      However, due to commercial concerns and obligations arising from the regulations published by OECD (2005), they might hesitate to collaborate. Hence, researchers propose methods enabling data holders’ collaboration without jeopardizing their privacy (Kaleli & Polat, 2012a; Yakut & Polat, 2012a). In the proposed studies, researchers consider two or more parties collaboration on three different data distribution scenarios, i.e., horizontal, vertical, and arbitrarily.

    • Privacy preserving sub-feature selection in distributed data mining

      2015, Applied Soft Computing Journal
      Citation Excerpt :

      The techniques of above paper indicate the importance of the role of participating parties in distributed environment where each party never wants to release their private data without protection. So many researchers develop the different privacy preservation model to protect the individual or organizational data such as secure multiparty computation [13], multiparty privacy preservation distributed data mining [14], privacy preserving data publishing [15], k-anonymity and l-diversity approach for privacy preservation in social network [16], privacy preserving SOM-based recommendations on horizontally distributed data [17], etc. Although each feature in a database has important role for different data mining computation such as classification, clustering, etc., yet it needs to filter or to select the required sensitive feature from different database.

    • Fast clustering-based anonymization approaches with time constraints for data streams

      2013, Knowledge-Based Systems
      Citation Excerpt :

      With the advance of the data mining techniques and people’s increasing concerns about the personal privacy, how to share the information without disclosing the personal privacy has become an important research topic in recent years [1]. Extensive research work has been done on the protection of static data [2–14]. k-anonymity [15,16], ℓ-diversity [17], t-closeness [18], ε-differential privacy [19] and other principles are widely applied in designing the privacy preserving methods.

    • Recommender systems survey

      2013, Knowledge-Based Systems
      Citation Excerpt :

      For privacy preservation in RS, a certain level of uncertainty must be introduced into the predictions [156], primarily through tradeoffs between accuracy and privacy [146]. Furthermore, privacy can be preserved when different RS companies share information (combining their data) [116,242]. Privacy becomes more important as RS increasingly incorporate social information.

    • Estimating NBC-based recommendations on arbitrarily partitioned data with privacy

      2012, Knowledge-Based Systems
      Citation Excerpt :

      Yakut and Polat [33] propose a privacy-protecting solution for two parties desiring to provide SVD-based CF services on alternatively HPD or VPD. Kaleli and Polat [16] focus on privacy-concerning multiparty scenarios on CF. They examine how to produce SOM-based recommendations on data horizontally distributed among multiple parties. NBC also takes attention of P2D2M community.
