1 Introduction

Electronic commerce has become widespread due to the rapid development of Internet technologies, and a large number of products are now sold via the Internet. Recommender systems have been developed to help customers select appropriate products. One of the most commonly used recommender systems is collaborative filtering (CF) (Bobadilla et al. 2013; Jia and Liu 2015). However, customers may not want their product preferences and the products they rate to be known. To protect such personal preferences, privacy-preserving collaborative filtering (PPCF) methods have been developed (Bilge et al. 2013a; Casino et al. 2013, 2015; Jeckmans et al. 2013; Ozturk and Polat 2015). The goal of PPCF schemes is to provide recommendations with acceptable accuracy while preserving private data.

One common privacy-preserving technique utilized in PPCF algorithms is randomized perturbation (Bilge et al. 2013a), which disguises data by adding random noise. Thus, data collectors storing the disguised data cannot learn actual ratings but can still produce accurate predictions. A Gaussian or uniform distribution with a zero mean (μ) and a standard deviation (σ) is used to produce the random numbers. To hide which products are rated and/or unrated, some uniformly randomly selected unrated item cells are also filled with random numbers.

Malicious users who attempt to manipulate the outcomes of CF and PPCF systems can attack such systems by injecting fake profiles into their databases. The purpose of these attacks is usually to increase the popularity of a target product (push attack) or reduce it (nuke attack). These types of attacks are known as shilling attacks (O’Mahony et al. 2004; Gunes et al. 2014). Mobasher et al. (2007) show that CF schemes are vulnerable to shilling attacks. Various PPCF systems are also vulnerable to such attacks, as shown by Gunes et al. (2013a, b) and Bilge et al. (2014a). Shilling attacks might significantly affect the accuracy of the estimated predictions in PPCF schemes. Thus, it is very important to detect these types of attacks and reduce their effects so that recommendation systems function correctly. Different detection methods have been developed and applied to CF algorithms to identify fake profiles (Li and Luo 2011; Zhang and Zhou 2014; Zhou et al. 2014a, b; Xia et al. 2015; Zhang and Zhou 2015). However, only a single study by Gunes and Polat (2015a) has focused on detecting shilling profiles in PPCF algorithms. The authors propose a hierarchical clustering-based scheme to detect fake profiles in private environments, but they consider a single detection method. Hence, we modify the four most commonly used detection methods so that they are applicable to PPCF schemes and conduct experiments with real data sets. Specifically, six attack models previously developed for attacking PPCF algorithms are employed, the four most common detection schemes are adapted as detection techniques, and two common data sets are used for the empirical analysis.

The contributions of this article in general can be summarized as follows:

  1. Four widely used shilling attack detection techniques are modified to detect shilling profiles generated by six shilling attacks (four push and two nuke attacks) in the masked databases of PPCF systems.

  2. Different sets of experiments are performed using two real data sets to evaluate the success of the detection techniques for PPCF systems.

  3. The modified detection methods are compared with the scheme proposed by Gunes and Polat (2015a) with respect to detection performance, as well as with their counterparts used in non-private environments.

The rest of the paper is structured as follows. In Sect. 2, related studies are reviewed, and the differences between this work and existing work are briefly presented. Preliminary work is concisely explained in Sect. 3. In Sect. 4, we explain how to apply these four detection methods to PPCF algorithms and describe the modified detection techniques. We explain in detail real data-based trials and their outcomes in Sect. 5. Empirical results are discussed in Sect. 6. We conclude the paper and provide future research directions in Sect. 7.

2 Related work

Chirita et al. (2005) performed the first work on the detection of shilling profiles by checking profile properties in CF systems. They considered the simplest attacking models, namely the random and average attacks. Burke et al. (2006) studied the effectiveness of different characteristics derived from user profiles for attack detection. Their study demonstrated that a machine learning classification method that included attributes derived from attack models was more successful than more widespread detection algorithms. To detect attack profiles in CF systems, variable selection based on principal component analysis (PCA) can be utilized (Mehta et al. 2007). This approach can only be applied to a dense user-item matrix because PCA cannot tolerate null values. Mehta (2007) and Mehta and Nejdl (2009) attempted to detect attack profiles using a PLSA-based clustering algorithm, in which attack profiles were distributed to the same clusters based on their similarities. Bhaumik et al. (2011) reported that the similarity of the values of the detection attributes depends on the similarity of the attack profiles. Based on these attributes, the profiles were separated into clusters using a k-means algorithm.

Bhaumik et al. (2006) studied two statistical process control techniques and proposed statistical process control-based methods as alternative solutions. Zhang et al. (2006b) suggested an attack detection method based on a low-dimensional linear model. Hurley et al. (2009) discussed the detection of standard attack models using statistical approaches; using the Neyman–Pearson method, they distinguished real and attack profiles. Li and Luo (2011) established a probability model within the framework of a Bayesian network to detect possible attacks. Zhou et al. (2014a, b) reported the utilization of statistical metrics to determine the rating patterns of attackers and employed these metrics to examine differences in rating configurations between shilling and genuine profiles. Su et al. (2005) focused on detecting groups of attack users rather than individual attackers using a similarity algorithm that evaluated the consistency of a user’s votes for similar products. O’Mahony et al. (2006) recommended a signal detection approach for determining shilling profiles. Zhang et al. (2006a) suggested constructing a time series of the votes for a product, in which a frame is used to group successive votes for the product; in each frame, the sample average and entropy are calculated, and the results are interpreted to identify suspicious behavior. Zhang et al. (2009) suggested using trust values as a metric to protect trust-based systems.

Tang and Tang (2011) analyzed the time gaps between voting times to identify suspicious behavior affecting top-N lists in prediction systems. Zhang (2011) focused on protecting trust-based recommendation systems from attacks; a data genealogical tree method was proposed to defend against attacks by tracing the recommendation history and finding the victim nodes. Noh et al. (2014) proposed a novel robust recommendation algorithm, RobuNec, that provides admission control as a defense mechanism against shilling attacks. Cao et al. (2013) applied semi-supervised learning to shilling attack detection to identify attack profiles. Zhang et al. (2013) proposed two methods, CluTr and WCluTr, for building robust CF systems that resist shilling attacks: CluTr filters out suspicious fake users in the formed clusters, and WCluTr uses trust information to strengthen similarities among genuine users. Morid et al. (2013) proposed a new attack detection method that examines influential users instead of the whole user set to improve detection performance. Xia et al. (2015) proposed a novel detection scheme based on dynamic time interval segmentation, which is able to detect fake profiles regardless of the specific attack type.

Zhang and Zhou (2014) constructed a rating series for each profile based on originality and product reputation. They employed an empirical mode decomposition method to decompose each rating series and extracted Hilbert spectrum-based characteristics to describe shilling attacks. Zhang and Zhou (2015) also proposed a shilling attack detection method based on a back-propagation neural network and an ensemble learning method. Bilge et al. (2014b) recommended an original shilling attack identification technique for specific attacks based on the bisecting k-means clustering method; the results indicated that the technique was successful exclusively against bandwagon, segment, and average attacks. Li (2014) proposed a method that uncovers the latent factors underlying missing ratings under the not-missing-at-random mechanism and further combines these hidden factors using a Dirichlet process within the framework of a probabilistic generative model. Zhuo and Kulkarni (2014) presented a technique to make CF systems resistant to shilling attacks, in which the maximal sub-matrix is found by converting the problem into a graph and merging nodes using heuristic functions. Chung et al. (2013) suggested a Beta-protection approach to address the drawbacks of current detection techniques.

Our study differs from those described above. We attempted to detect attacks on PPCF schemes (private environments), whereas other studies have focused on attack detection for CF schemes without privacy (non-private environments). A literature review revealed that only one detection method has been applied to attacks on PPCF. Gunes and Polat (2015a) proposed a hierarchical clustering-based shilling attack detection method for private environments. They scrutinized the ratings of target items to improve the overall performance of their scheme, and their empirical results revealed that the proposed method could identify shilling profiles with decent accuracy. We examined the detection methods applied to CF algorithms described above and selected the four that are most widely used and most applicable to PPCF schemes. We used each of these four methods to detect fake profiles in the databases of PPCF systems.

In addition to detecting shilling attacks, robust PPCF schemes have been proposed (Bilge et al. 2013b). Also, there are various studies focusing on the robustness analysis of PPCF schemes, where the authors show how robust the schemes are against different shilling attacks (Gunes et al. 2013a, b; Bilge and Polat 2013b; Bilge et al. 2014a; Gunes and Polat 2015b; Yurekli and Kaleli 2016). Our study presented here is different from the abovementioned ones because we focus on attack detection while they focus on robustness analysis. Attack detection and robustness analysis require different approaches.

3 Preliminaries

3.1 Shilling attacks on disguised data

Applying traditional shilling attacks against PPCF systems is difficult due to disguised data. Therefore, attackers must modify conventional shilling attacks. Gunes et al. (2013a, b) redesigned well-known shilling attacks to enable their application to masked databases. Attackers must decide on either a uniform or Gaussian random number distribution to generate the random numbers used to mask their data. Moreover, standard deviations (σ values) are uniformly randomly selected from the range (0, σ max ] for each attack profile prior to generating fake profiles, where σ max is a privacy parameter. The modified attacks can be briefly explained as follows (Gunes et al. 2013a, b). In a typical shilling attack, one item is selected as the target item, whose popularity is manipulated by the attackers. A set of items called filler items is selected to be filled with bogus ratings. Another set of items, referred to as selected items, is chosen to enable specific shilling attacks by filling these items with values estimated from knowledge about the attacked system. Finally, the remaining items in a profile are left unrated.

In the random attack model, the set of selected items is empty. Randomly selected filler items are filled with random values, and the target item is assigned the highest possible random value. Similarly, in the average attack model, the set of selected items is also empty. Randomly selected filler items are filled with the corresponding items’ means; recall that such mean values are estimated from perturbed data due to privacy concerns in PPCF schemes. The target item is given the maximum random value to increase its popularity. In the bandwagon attack model, the selected items are chosen from among popular items that are densely rated and have high means. The selected items and randomly chosen filler items are filled with random values. The selected items are given the largest random numbers because they are popular items and are expected to have higher ratings. Similarly, the target item is assigned the maximum possible random value to push its popularity. The segment attack model is designed to attack a specific group or segment of users. The selected items are chosen from among high-average products in a specific segment, and the selected items and the target item are filled in the same manner as in the bandwagon attack model.

The reverse bandwagon and love/hate attack models are nuke attacks. In a reverse bandwagon attack, the selected items are chosen from among unpopular items that are densely rated and have low means. The selected items and randomly chosen filler items are filled with random values; in this case, the selected items are given low values, and the target item is assigned the minimum random value so that it is nuked. In a love/hate attack, the set of selected items is again empty. Randomly determined filler items are filled with high random values, and the target item is assigned the minimum random value to decrease its popularity.
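To make these modified models concrete, the following Python sketch illustrates how a masked push profile might be assembled under the random and average attack models. The function, its parameters, and the use of a Gaussian masking distribution are our own illustrative assumptions rather than the exact procedure of Gunes et al. (2013a, b).

```python
import numpy as np

def make_push_profile(n_items, target, filler_ratio, item_means, sigma_max,
                      model="random", r_max=5.0, rng=None):
    """Sketch of a masked push-attack profile (random or average model)."""
    rng = rng or np.random.default_rng()
    profile = np.full(n_items, np.nan)            # unrated cells stay empty
    sigma = rng.uniform(0.0, sigma_max)           # per-profile noise level from (0, sigma_max]

    candidates = np.setdiff1d(np.arange(n_items), [target])
    fillers = rng.choice(candidates, size=int(filler_ratio * n_items), replace=False)

    if model == "random":
        # filler items receive zero-mean random values, mimicking disguised ratings
        profile[fillers] = rng.normal(0.0, sigma, size=len(fillers))
    else:  # "average"
        # filler items receive the item means estimated from perturbed data, plus noise
        profile[fillers] = item_means[fillers] + rng.normal(0.0, sigma, size=len(fillers))

    profile[target] = r_max                       # push: target gets the highest possible value
    return profile
```

A nuke variant would assign the minimum rating to the target item instead.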

3.2 Shilling attack detection methods

We explain the Chirita algorithm, kNN classifier, k-means clustering, and PCA-based variable selection methods as shilling attack detection methods. Chirita et al. (2005) attempted to classify profiles using eight generic attributes described as follows:

  1. Number of prediction differences (TFS): A prediction is computed for each user; TFS describes the net change in predictions after removing the user from the system.

  2. Standard deviation in user’s ratings: This metric measures how much a user’s ratings deviate from her average rating.

  3. Degree of agreement with other users: This metric measures how much each of a user’s ratings differs from the average rating of the corresponding item.

  4. Degree of similarity with top neighbors: The average similarity between a user and her k closest neighbors.

  5. Rating deviation from mean agreement (RDMA): This metric measures the average deviation of a user’s ratings from the items’ mean ratings, weighted by the inverse of the number of ratings each item has received (see the formulas sketched after this list).

  6. Weighted deviation from mean agreement (WDMA): Similar to RDMA, but each item’s deviation is divided by the square of the number of ratings for that item, placing more weight on sparsely rated items.

  7. Weighted degree of agreement (WDA): This metric differs from RDMA in that it is not divided by the total number of ratings given by the user.

  8. Length variance (lengthVar): This metric measures the extent to which the length of the investigated profile differs from the average profile length.
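
For reference, the following display gives the formulations of RDMA, WDMA, WDA, and lengthVar that are commonly used in the shilling attack detection literature (e.g., Burke et al. 2006); the notation is ours and should be read as a sketch rather than as the exact expressions used in the cited works.

```latex
\mathrm{RDMA}_u = \frac{1}{N_u}\sum_{i \in I_u}\frac{|r_{u,i}-\bar{r}_i|}{NR_i}, \qquad
\mathrm{WDMA}_u = \frac{1}{N_u}\sum_{i \in I_u}\frac{|r_{u,i}-\bar{r}_i|}{NR_i^{2}}, \qquad
\mathrm{WDA}_u  = \sum_{i \in I_u}\frac{|r_{u,i}-\bar{r}_i|}{NR_i}, \qquad
\mathrm{lengthVar}_u = \frac{|N_u-\overline{N}|}{\sum_{v}\bigl(N_v-\overline{N}\bigr)^{2}}
```

Here I_u is the set of items rated by user u, N_u = |I_u|, \bar{r}_i and NR_i are the mean rating and the number of ratings of item i, and \overline{N} is the average profile length.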

In addition to the generic attributes, Burke et al. (2006) subsequently used five model-specific attributes for classifying profiles, described as follows:

  1. Filler mean variance (FMV): FMV calculates the variance between each item’s rating and that item’s mean rating over the hypothesized filler item set I F of each profile (a formulation is sketched after this list).

  2. Filler mean difference (FMD): FMD differs from FMV mainly in that it uses the absolute value of the difference between the user’s rating and the item mean instead of the squared difference.

  3. Filler average correlation: This metric calculates the correlation between the ratings in the filler item set of the investigated profile and the corresponding item averages.

  4. Filler mean target difference (FMTD): FMTD calculates the difference between the average rating of the hypothesized filler item set and the average rating of the possible target item set.

  5. Profile variance: This metric calculates the variance of the ratings within a profile, which tends to be lower for attack profiles than for authentic users.
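
A commonly cited formulation of FMV and FMTD, again written in our own notation and offered only as a sketch, is:

```latex
\mathrm{FMV}_u = \frac{1}{|P_{u,F}|}\sum_{i \in P_{u,F}}\bigl(r_{u,i}-\bar{r}_i\bigr)^{2}, \qquad
\mathrm{FMTD}_u = \left|\frac{\sum_{i \in P_{u,T}} r_{u,i}}{|P_{u,T}|}-\frac{\sum_{k \in P_{u,F}} r_{u,k}}{|P_{u,F}|}\right|
```

where P_{u,F} and P_{u,T} denote the hypothesized filler and target item sets of profile u.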

3.2.1 Chirita algorithm

Chirita et al. (2005) proposed an algorithm (referred to as the Chirita algorithm) based on RDMA for detecting shilling attackers. The proposed algorithm proceeds in two steps (a Python sketch is given after the steps):

  1. The algorithm computes the average similarity with the top neighbors for all users using the Pearson correlation coefficient; thus, only a subset of the total users is used for the subsequent computations. The algorithm then selects only those users whose average similarity is less than 0.5 of the maximum average similarity in the system and computes the RDMA for them.

  2. The algorithm associates with each RDMA value a function that evaluates the probability (PA u ) that the respective user is a shilling attacker. The first s profiles, sorted by PA u , are considered attack profiles: higher PA u values mean that the related profiles are more likely to be attack profiles.
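
The following Python sketch implements these two steps under several assumptions of ours: unrated cells are coarsely zero-filled before computing Pearson correlations, and the mapping from RDMA to PA u is only a monotone placeholder, not the exact formula (with its α parameter) given by Chirita et al. (2005).

```python
import numpy as np

def rdma(R, u, item_mean, item_count):
    """RDMA of user u on a (possibly disguised) rating matrix R (NaN = unrated)."""
    rated = ~np.isnan(R[u])
    if not rated.any():
        return 0.0
    dev = np.abs(R[u, rated] - item_mean[rated]) / item_count[rated]
    return dev.sum() / rated.sum()

def chirita_detect(R, s, k=25, alpha=1.0):
    """Two-step Chirita-style detection sketch; returns indices of s suspected profiles."""
    filled = np.nan_to_num(R)                       # crude handling of unrated cells
    sim = np.corrcoef(filled)                       # Pearson correlations between profiles
    np.fill_diagonal(sim, -np.inf)
    avg_sim = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # mean similarity with top-k neighbors

    item_mean = np.nanmean(R, axis=0)
    item_count = np.maximum((~np.isnan(R)).sum(axis=0), 1)
    suspects = np.where(avg_sim < 0.5 * avg_sim.max())[0]  # step 1: keep low-similarity users

    # step 2: monotone placeholder mapping from RDMA to an attack likelihood PA_u
    pa = {u: 1.0 - np.exp(-alpha * rdma(R, u, item_mean, item_count)) for u in suspects}
    return sorted(pa, key=pa.get, reverse=True)[:s]         # top-s suspected attack profiles
```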

3.2.2 kNN classifier-based detection algorithm

Mobasher et al. (2006, 2007) proposed a classification-based method that utilizes a total of 15 detection attributes: six generic attributes (WDMA, RDMA, WDA, LengthVar, DegSim with k = 450 and DegSim′ with k = 2, d = 963, where k is the number of neighbors and d is the co-rate factor); six average attack model attributes (filler mean variance, filler mean difference, and profile variance; computed for both push and nuke); two bandwagon attack model attributes (FMTD; computed for both push and nuke); and one target detection model attribute (TMF). Class labels and detection attributes are generated for the entire data set, which is divided into two equal-sized training and test subsets. A kNN classifier with k = 9 is used and implemented in Weka. For each test, the second half of the data is injected with attack profiles and then run through the classifier built on the augmented first half of the data.
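
As an illustration of the classification step only, the Python sketch below trains a kNN classifier with k = 9 on a precomputed attribute table; scikit-learn is used here as a stand-in for the Weka pipeline of Mobasher et al. (2007), and the random attribute matrix merely marks where the 15 detection attributes and class labels would go.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score

# X: (n_profiles, 15) detection attributes (WDMA, RDMA, WDA, LengthVar, DegSim, ...)
# y: 1 for injected attack profiles, 0 for genuine profiles.
# Random placeholders are used here; in practice both come from the (masked) rating data.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 15))
y = rng.integers(0, 2, size=2000)

# split into two equal halves: train on one, test on the attack-injected other
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = KNeighborsClassifier(n_neighbors=9).fit(X_tr, y_tr)   # kNN with k = 9
pred = clf.predict(X_te)
print(precision_score(y_te, pred), recall_score(y_te, pred))
```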

3.2.3 k-means clustering-based detection algorithm

The attack profiles are similar to each other because they are generated by known algorithms. Consequently, when a k-means clustering algorithm is employed on a data set to which attack profiles have been added, most of the attack profiles should fall into the same cluster. The most important issue at this stage is to identify the cluster in which the attack profiles are gathered. Mehta and Nejdl (2009) aimed to find the tightest cluster (the one whose elements are most similar to each other) in their study of clustering-based detection. Consequently, for each cluster, the distance of the profiles from the center is calculated, and then the average distance is computed. The cluster with the shortest average distance to the center is defined as the attack cluster, and this cluster is finally isolated.

3.2.4 PCA-based variable selection detection algorithm

In a recommendation system, if users are considered variables, the data have as many dimensions as there are users. Dimensionality reduction discards dimensions with low covariance; such low covariance is observed not only among shilling users but also between shilling users and normal users. PCA computes principal components that are oriented more toward real users, who exhibit the maximum variance of the data. Consequently, those users who display the least covariance with all other users should be selected. This quantity is used to select some variables (users) from the original data to which PCA is applied. Algorithm 1 below depicts the variable selection approach proposed by Mehta and Nejdl (2009). The first s users are selected as the attack profiles and are isolated from the system, where s is the number of attack profiles added to the system.

Algorithm 1 PCA-based variable selection detection method (pseudocode figure not reproduced here)

4 Shilling detection methods for PPCF schemes

The detection algorithms described above have been used to detect attacks on CF. Here, we adapt these algorithms so that they can identify shilling attacks on masked data in PPCF schemes. There are two confidential data types in PPCF schemes: the actual rating values and the rated and/or unrated items. To protect these private data, random numbers are generated using either a uniform or Gaussian distribution with zero mean (μ) and standard deviation σ, which is uniformly randomly selected from (0, σ max ]. This noise is added to the actual votes. Additionally, some uniformly randomly selected unrated item cells are filled with noise data. To select unrated cells, a β value is uniformly randomly selected from (0, β max ], where β max is a privacy parameter representing the upper bound of β values; then, β percent of the empty cells are filled with random numbers. The values of σ max and β max depend on the privacy and accuracy levels required by the CF users (Bilge et al. 2014a).
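
A minimal Python sketch of this masking procedure is shown below, assuming the Gaussian branch; the function name and the way β is interpreted as a percentage are our own illustrative choices.

```python
import numpy as np

def mask_profile(ratings, sigma_max, beta_max, rng=None):
    """Disguise one user's rating vector (NaN = unrated) as described above."""
    rng = rng or np.random.default_rng()
    masked = ratings.copy()
    sigma = rng.uniform(0.0, sigma_max)              # sigma drawn from (0, sigma_max]
    beta = rng.uniform(0.0, beta_max)                # percentage of empty cells to fill

    rated = ~np.isnan(masked)
    masked[rated] += rng.normal(0.0, sigma, rated.sum())   # hide actual rating values

    empty = np.where(~rated)[0]
    n_fill = int(beta / 100.0 * len(empty))
    fill = rng.choice(empty, size=n_fill, replace=False)
    masked[fill] = rng.normal(0.0, sigma, n_fill)    # hide which items were actually rated
    return masked
```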

Recall that the Chirita algorithm computes similarities using the Pearson correlation coefficient. Such similarities can also be estimated with decent accuracy from perturbed data (Bilge et al. 2014a). Similarly, RDMA can be computed from masked data. Finally, the probability that a profile is an attack profile can be calculated using RDMA on masked data. Therefore, the Chirita algorithm can be employed to determine shilling profiles in PPCF schemes, with the generic attribute values used in the algorithm calculated from the disguised data. Chirita et al. (2005) set α = 10 in the formula that calculates the probability that a profile is an attack profile. When the Chirita algorithm is used on PPCF schemes, the best result is obtained with α = 1, and this value is used in our trials.

The second detection method, the kNN classifier, is based on detection attributes, and the values of such attributes can be determined from disguised data. The modified classifier utilizes 14 detection attributes: six generic attributes (WDMA, RDMA, WDA, LengthVar, DegSim with k = 450 and DegSim′ with k = 2, d = 963); six average attack model attributes (filler mean variance, filler mean difference, and profile variance; computed for both push and nuke); and two bandwagon attack model attributes (FMTD; computed for both push and nuke). As in a non-private environment, class labels and detection attributes are generated for the whole data set. A kNN with k = 9 is used as our classifier; this is the same value used by Mobasher et al. (2007) and allows the results of our modified method to be compared with theirs. All experiments in the present study were conducted using both the modified method and the one introduced by Mobasher et al. (2007).

The k-means clustering-based detection method utilizes the Pearson correlation coefficient to group users into k clusters. As shown by Bilge and Polat (2013a), k-means clustering can group users into clusters with decent accuracy using disguised data, and the success of this method mainly depends on how accurately users are clustered. The similarity of each profile in a cluster to the cluster center is calculated to determine the attack cluster; because the similarity among attack profiles is higher than that among the other profiles, the cluster with the highest average similarity is isolated from the system. The selection of the number of clusters is important for the performance of the method, and our trials revealed that the ideal number of clusters is 12. The choice of initial cluster centers can affect the results slightly. The steps of the k-means clustering-based detection algorithm employed on perturbed data in PPCF schemes are defined as follows (a Python sketch is given after the listed steps):

Algorithm 2 k-means clustering-based detection method for masked data

Let U′ = {u1, u2, …, un} be a set of disguised data vectors and C = {c1, c2, …, ck} be a set of cluster centers.

  • 1: Randomly select k cluster centers.

  • 2: Estimate the similarity between each data vector and the cluster centers.

  • 3: Assign each data vector to the closest cluster.

  • 4: Recalculate each cluster center.

  • 5: Recalculate the similarity between each data vector and the newly obtained cluster centers.

  • 6: If no data vector is reassigned, stop; otherwise, repeat from step 3.

  • 7: Determine the cluster with the highest average similarity as the shilling cluster.
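
A compact Python rendering of Algorithm 2 is sketched below. It substitutes scikit-learn's Euclidean k-means for the similarity-driven loop of steps 1-6, an assumption on our part, and then applies step 7 by flagging the cluster whose members are on average most correlated with their center.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_detect(masked, k=12, seed=0):
    """Return indices of the suspected shilling cluster in a disguised matrix.

    masked : (n_users, n_items) disguised user-item matrix with NaNs replaced by 0.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(masked)

    best_cluster, best_sim = 0, -np.inf
    for c in range(k):
        members = masked[km.labels_ == c]
        center = km.cluster_centers_[c]
        # average Pearson correlation of the members with their cluster center (step 7)
        sims = [np.corrcoef(m, center)[0, 1] for m in members]
        avg = np.nanmean(sims)
        if avg > best_sim:
            best_sim, best_cluster = avg, c

    return np.where(km.labels_ == best_cluster)[0]   # profiles to isolate from the system
```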

The steps defined in Algorithm 1 for the PCA-based variable selection detection method are also used in PPCF. However, in PPCF schemes, disguised z-score data are used as input. The data are disguised by adding random numbers with a mean of 0 to the z-scores, yielding masked values (z′ uj  = z uj  + r uj ); similarly, the average of the z-score data is expected to be 0. Bilge and Polat (2013a) state that during the scalar product and sum operations, the effect of the random numbers can be neglected because their average is 0. However, the same random numbers are multiplied with each other while calculating the diagonal components of the matrix obtained from the COV ← DᵀD step in the third line of Algorithm 1. Multiplying the same random numbers produces r uj ² terms that create an excess value. To reduce this effect, the value nσ r ² is subtracted from the diagonal components, where n indicates the number of random numbers and σ r indicates the standard deviation of the random numbers. After modifying Algorithm 1 as described above, we utilize it as a detection method for filtering out shilling profiles in PPCF schemes’ databases.
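
The sketch below shows how this diagonal correction might be applied before the eigen-decomposition; it follows the summary of Algorithm 1 in Sect. 3.2.4, but the number of principal components used (3) and the way user contributions are scored are our own assumptions.

```python
import numpy as np

def pca_detect_masked(Z_masked, s, sigma_r, n_components=3):
    """PCA-based variable selection on disguised z-scores (sketch).

    Z_masked : (n_items, n_users) matrix of masked z-scores z' = z + r,
    sigma_r  : standard deviation of the masking noise,
    s        : number of attack profiles assumed to be present.
    """
    D = Z_masked                                   # users are the variables (columns)
    n = D.shape[0]                                 # approximate count of random numbers per user
    cov = D.T @ D                                  # COV <- D^T D, as in Algorithm 1

    # correct the inflated diagonal: the summed r^2 terms add roughly n * sigma_r^2
    cov[np.diag_indices_from(cov)] -= n * sigma_r ** 2

    # eigen-decomposition; principal components ordered by decreasing eigenvalue
    _, eigvecs = np.linalg.eigh(cov)
    pcs = eigvecs[:, ::-1][:, :n_components]

    # users contributing least to the leading components are flagged as attack profiles
    contribution = np.abs(pcs).sum(axis=1)
    return np.argsort(contribution)[:s]            # indices of the s suspected profiles
```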

5 Experiments

To demonstrate the ability of the four modified shilling attack detection methods to identify the six shilling attack models in the disguised databases of PPCF schemes, different sets of experiments were performed on real data sets. The success of shilling attacks depends on two control parameters: filler size and attack size. Filler size is the percentage of empty cells that are filled in an attacker’s profile. Attack size denotes the number of attack profiles to inject and is expressed relative to the number of users in the system; for instance, a five percent attack size indicates that there are 50 attack profiles in a system initially holding 1000 users. The privacy-preserving control parameters were first kept constant at β max  = 25 % and σ max  = 2. Then, to demonstrate how detection performance changes with these parameters, different trials were performed by varying their values.

5.1 Data sets and evaluation criteria

The MovieLens public data set (MLP) and Jester were used in the experiments. The GroupLens research team collected MLP (http://www.grouplens.org). MLP comprises 100 K ratings of 1682 movies by 943 users; the ratings are discrete values from 1 to 5, and each user rated at least 20 movies. The Jester data set was released by the Jester Joke Recommender System (http://eigentaste.berkeley.edu/dataset/). Jester includes numeric continuous ratings ranging from −10 to 10 and is relatively dense (around 56 %). There are 73,496 users and 100 jokes in the set. Although we used all users’ data in MLP, we randomly selected 1000 users from Jester.

To measure the performance of the detection methods, the standard measurements of precision and recall are used. The basic definitions of these metrics are as follows:

Precision = Number of true positives/(Number of true positives + Number of false positives)

Recall = Number of true positives/(Number of true positives + Number of false negatives)

Number of true positives is the number of correctly classified attack profiles, while number of false positives is the number of authentic profiles misclassified as attack profiles, and number of false negatives is the number of attack profiles misclassified as authentic profiles.

5.2 Methodology

Our experimental methodology was as follows. Two distinct target item sets were first formed for each real data set. In MLP, each target item set includes 50 movies for push and nuke attacks. Due to the limited number of jokes in Jester, the target item sets for push and nuke attacks include 25 jokes each. Target items were randomly selected by stratified sampling. Intuitively, attempting to push a popular item or nuke an unpopular one is considered unreasonable; thus, the push and nuke attack sets comprised unpopular and popular items, respectively. During the trials, attack profiles were created to manipulate the outcomes of the target items. We did not perform the segment attack on Jester because it has no joke categories. The tests were repeated 100 times due to the randomization in the perturbation process.

5.3 Empirical results

5.3.1 Effects of filler size parameter

Experiments were performed to illustrate the performance of the detection methods with varying filler size values while detecting fake profiles in masked databases. Filler size was varied from 5 to 50 %, while attack size, β max , and σ max were kept constant at 15 %, 25 %, and 2, respectively. The overall averages of the precision and recall values for the Chirita, kNN classifier, k-means clustering-based, and PCA-based detection algorithms with varying filler size values are presented in Tables 1 and 2 for MLP and Jester, respectively, where RB refers to the reverse bandwagon attack model.

Table 1 Performance of detection algorithms with varying filler size (MLP)
Table 2 Performance of detection algorithms with varying filler size (Jester)

As indicated in Tables 1 and 2, the empirical outcomes for precision and recall were equal for the Chirita algorithm for both data sets. The profiles are listed from top to bottom according to PA u , and the first s profiles are classified as attack profiles; since exactly s attack profiles are added to the system, the precision and recall values are equal. An increase in the filler size value did not significantly change the precision and recall values for any attack model. Therefore, the Chirita algorithm exhibited weak detection performance in private environments. The best outcomes were usually observed when the filler size was 25 and 50 % for MLP and Jester, respectively. The most successful results were obtained for the random attack on MLP and the love/hate attack on Jester. For the average attack, precision and recall were 0 for all filler size values because the Chirita algorithm performs the classification by specifically considering the RDMA attribute. The RDMA values are expected to be higher for attack profiles than for real profiles; however, because the filler items of average attack profiles are filled with the item means, their RDMA values decrease. Compared to the outcomes for a non-private environment published by Chirita et al. (2005), the Chirita algorithm provides lower results for private environments. There are a couple of reasons why our results are lower. First, in the report by Chirita et al. (2005), there were simultaneous attacks on three target items; consequently, the RDMA values were higher. Second, our values might be lower because of the use of disguised data in PPCF schemes.

The kNN classifier algorithm was quite successful in detecting the PPCF attack models. As the filler size value increased from 5 to 50 %, nearly all of the precision and recall values varied between 0.800 and 1.000 for all attack models, and the results for the two data sets were very close to each other. The precision values show some variability as a function of the filler size for the push attacks; at precision values less than 1.0, some of the real profiles were classified as attack profiles. However, in general, the kNN classifier algorithm was as successful for the PPCF schemes as for the CF schemes, and the disguise operation in the PPCF algorithm does not significantly affect the detection algorithm’s performance. Because the kNN classifier builds a model using training data, which were disguised in the PPCF schemes, the attack profiles on masked data can be detected easily in the test set using this model. As shown in Tables 1 and 2, like the precision values, the recall values of the kNN classifier algorithm indicate high success rates; for all filler size values, the algorithm performs very well with respect to recall. The recall value of the segment attack, which differs in purpose from the other attacks, might be slightly lower than those of the other attacks for MLP. Our results on MLP are similar to those calculated for the CF algorithm by Mobasher et al. (2007).

For the k-means clustering-based method, as indicated in Tables 1 and 2, the recall values were very close to each other for all attack models in MLP and identical for Jester, while the precision values decreased with increasing filler size for all attack models. Recall that the k-means clustering-based detection method performs the clustering by considering the similarities between profiles; consequently, the attack model type is not significant. Because all of the attack models are formed using defined algorithms, the resulting profiles are naturally similar to each other. Based on this similarity, the k-means clustering-based detection method identifies the tightest cluster and isolates it from the system. As the filler size value increases, the attack profiles become more similar to the real profiles, so more real profiles end up in the cluster with the attack profiles. In this situation, more real profiles were isolated from the system, leading to lower precision values. The increase in filler size only increased the number of real profiles in the cluster under search, and thus the recall value was unaffected. Consequently, the k-means clustering-based detection method classifies nearly 100 % of the attack profiles correctly but also removes many of the real profiles from the system.

As shown in Table 1, the best results for the average attack model were obtained when the PCA-based detection method was utilized. Because the average attack fills the filler item set with values around the item means, the attack profiles are expected to have small covariance among each other; Mehta et al. (2007) reported that the covariance among attack profiles is smaller than that among real profiles. Consequently, such attack profiles can be detected by the PCA-based variable selection technique. As shown in Table 1, both the precision and recall values for the average attack model reached 0.670. When the other attack models are established, the filler item set is filled with random numbers generated with a known standard deviation. In this set of experiments, σ max was set to 2, so σ was randomly selected from the range (0, 2]. When σ is high, the covariance among the profiles is also high, and the PCA-based detection algorithm did not yield successful results. For Jester, we observed more successful results, as shown in Table 2. This phenomenon can be explained by the larger rating range in Jester, which leads to a higher covariance among the genuine profiles; the covariance among the attack profiles is smaller because similar ratings are used to create them. Hence, the PCA-based detection method successfully determines the attack profiles and separates them from the genuine ones.

5.3.2 Effects of the attack size parameter

Various sets of experiments were conducted to scrutinize the success of the shilling attack detection methods with changing attack size values in private environments. The attack size determines the number of bogus profiles inserted into a database and thus the utility of an attack; it also establishes a compromise between the detectability and the impact of the applied attack model. Therefore, we performed experiments while varying the attack size from 1 to 15 % with a constant filler size of 25 %. The overall averages of precision and recall with varying attack size values for the Chirita, kNN classifier, k-means clustering-based, and PCA-based detection schemes are presented in Tables 3 and 4 for MLP and Jester, respectively.

Table 3 Performance of detection algorithms with varying attack size (MLP)
Table 4 Performance of detection algorithms with varying attack size (Jester)

The Chirita algorithm successfully detected shilling attacks with dense attacker profiles but was unsuccessful against attacks with small size and high sparsity (Williams et al. 2007). As shown in Table 3, as the attack size increased, this algorithm became more successful against all attacks except the average attack. Because the RDMA values of the random attack profiles were higher, the most successful precision and recall values were obtained for the random attack. By contrast, the average attack profiles had lower RDMA values because of the way they are constructed, and the Chirita algorithm could not detect these attack profiles. Although the detection performance improved with increasing attack size, the algorithm was still not successful in detecting shilling profiles. As seen from Table 4, the detection performances of all methods for all attack size values are lower than those for MLP. This phenomenon can be explained by the density of the Jester data set; as stated by Williams et al. (2007), the Chirita algorithm does not perform well for dense data sets.

As shown in Tables 3 and 4, the precision and recall values usually ranged between 0.8 and 1.0 for the kNN classifier for both data sets, and the success of the algorithm increased when the attack size reached 15 %. The number of attack profiles in the training and test data was not sufficient for stable classification when the attack size was low. Thus, zero precision and recall values were obtained at an attack size of 1 % for the random, bandwagon, and segment attacks on MLP, whereas better results were obtained for the other attack types. For Jester, the detection performance of the kNN classifier is zero for all attack types at this attack size. Because a training data set is used for all attack models in this method, it does not matter which attack model is employed; as long as there are sufficient training data, the method will be successful. Therefore, the attack size parameter plays an important role in the success of this method. Moreover, since similar training data were used for both data sets, similar results were observed.

As indicated in Tables 3 and 4, there was a direct correlation between attack size and the precision of the k-means clustering-based detection algorithm for both data sets. As the attack size increased, the number of attack profiles in the cluster of interest increased, thus improving the precision value. Because the tightest cluster was identified and isolated from the database, many real profiles falling into this cluster were also removed; since real profiles were removed from the user-item matrix, the precision value was lower than the recall value. Because the attack profiles were so similar, they were located in the same cluster, and the recall values were therefore high for all attack models. As shown in the tables, the recall values varied between approximately 0.9 and 1.0 for all attacks for both data sets. Differences in attack size can only increase the number of attack profiles in the cluster and thus did not significantly affect the recall value.

As in the case of the filler size experiments described above, the attack size experiments yielded the best results for the average attack with the PCA-based detection scheme, as shown in Tables 3 and 4. Since fake profiles are generated using mean values in the average attack, the covariance is smaller for this attack, which leads to better results. The increase in the attack size value resulted in an increase in the precision and recall values of nearly all of the attack models. For MLP, the precision and recall values of the average attack reached 0.65, an acceptable and successful value. Better results were obtained for Jester due to the higher covariance of the genuine profiles; shilling profiles have lower covariance, so it becomes easier to distinguish fake profiles from genuine ones. This method was unsuccessful for the other attack models because the covariance values of those profiles were higher due to the generating algorithms.

5.3.3 Effects of the β max parameter

We performed another set of experiments to illustrate how changing the values of β max affects the success of detection schemes. We fixed the filler size, attack size, and σ max at 25, 15, and 2 %, respectively, while β max parameter was varied from 5 to 25 %. The overall averages of the precision and recall values for the four detection algorithms on six attack models are presented in Tables 5 and 6 for MLP and Jester, respectively.

Table 5 Performance of detection algorithms with varying β max (MLP)
Table 6 Performance of detection algorithms with varying β max (Jester)

As shown in Tables 5 and 6, in general, the success of the detection algorithms decreased with increasing β max values. The only exception was the recall values for the k-means clustering-based detection method, which slightly improved with increasing β max values due to the increasing similarity among shilling profiles. Although the detection performance of the algorithms decreased with increasing β max values, the outcomes remained close to each other for both data sets. The value of β max had a greater effect on the Chirita algorithm than on the other algorithms. The best outcomes for precision and recall were observed for the Chirita and kNN classifier algorithms at a β max value of 5 %. For the k-means clustering-based method, β max  = 25 % provided the best recall values; however, the best precision values for this algorithm differed with varying β max values. Similar trends in precision and recall were observed for the PCA-based detection scheme. As indicated in Tables 5 and 6, the precision and recall values for varying β max values were closer to each other for the k-means clustering-based scheme than for the other algorithms.

Increasing β max values significantly affected the rating distributions of genuine user profiles due to their sparse nature. In this case, it might become difficult to differentiate fake profiles from genuine profiles because the filled profiles become more similar to each other; creating fake profiles by filling some filler items does not change the general rating distribution of the filled profiles. The detection schemes, in general, then have difficulties detecting bogus profiles due to the increase in β max values. For example, in the Chirita algorithm, increasing β max increases the number of random values inserted into genuine profiles, and this noise increases the RDMA values of genuine profiles due to the larger dispersion. As stated by Chirita et al. (2005), the RDMA attribute value is expected to be higher for attack profiles than for real profiles due to the high standard deviation of the rating dispersions in attack profiles. When genuine profiles also exhibit larger RDMA values, it becomes difficult to detect fake profiles using the Chirita algorithm. Hence, the detection rate of the Chirita algorithm diminishes with increasing β max .

Noise data are inserted into genuine profiles for privacy protection, and shilling profiles are also created using noise data. This leads to very similar results for the k-means and PCA-based detection algorithms for both data sets, as seen from Tables 5 and 6. Therefore, it is difficult to determine the optimum β max values for the k-means and PCA-based detection algorithms for either data set.

5.3.4 Effects of the σ max parameter

To illustrate how the detection methods perform with varying values of σ max , we performed another set of experiments. We fixed filler size, attack size, and β max at 25, 15, and 25 %, respectively, while varying the value of σ max from 0.5 to 2. The overall averages of precision and recall values for all detection methods on the six attack models after running the trials 100 times are presented in Tables 7 and 8 for MLP and Jester, respectively.

Table 7 Performance of detection algorithms with varying σ max (MLP)
Table 8 Performance of detection algorithms with varying σ max (Jester)

The results in Tables 7 and 8 show that the performance of the Chirita algorithm with respect to both precision and recall improved with increasing σ max . Smaller σ max values result in smaller RDMA values for the attack profiles, which makes them more difficult to detect with the Chirita algorithm; consequently, the performance of the Chirita algorithm decreases with decreasing σ max .

In contrast to the Chirita algorithm, the performance of the PCA-based detection scheme with respect to both precision and recall generally increased with decreasing σ max values for both data sets. As σ max increases, the covariance among profiles also increases; therefore, the PCA-based variable selection detection method may not be able to detect these attack profiles. In studies by Mehta (2007), Mehta et al. (2007), and Mehta and Nejdl (2009), the authors state that attack profiles are expected to have lower covariance values than real profiles because the filler item set is generally filled with the item mean or the overall system mean when creating attack profiles. Hence, the success of the PCA-based detection algorithm can be improved if smaller σ max values are used.

The kNN classifier and k-means clustering-based methods behaved similarly with varying σ max values. The best precision values were observed when σ max was 0.25 for both detection algorithms and both data sets. In contrast to the precision values, both schemes produced the most promising outcomes with respect to recall when σ max was 2. Although the recall values seemed to decrease with decreasing σ max values for MLP, this change was very small, especially for the k-means clustering-based method, and the recall values for Jester were almost the same for both methods. Smaller σ max values decrease the amount of noise inserted into profiles during data disguising. The reduced amount of noise does not weaken the effects of filler items during fake profile generation and facilitates the successful retrieval of fake profiles by k-means clustering and the kNN classifier.

6 Discussion

An examination of the empirical results presented in the tables above revealed that the kNN classifier method is the most successful method for all attack models. The disguise operation in private environments does not have a significant effect on detection algorithm performance. The kNN classifier detection algorithm calculates a number of generic and model-specific attribute values for each profile, creates a new data table, and performs classifications using this new data table. The kNN classifier detection algorithm divides this attribute table into two groups under the headings of training and test data. Because it creates a model using training data, data masking does not have significant effects for the kNN classifier. The use of a training set generated from perturbed data enables the creation of a new model to detect PPCF attack profiles in the test set.

The performance of the Chirita algorithm against attacks in private environments was not highly successful. Chirita et al. (2005) stated that due to the high standard deviation among the rating dispersions in attack profiles, the RDMA attribute values of attack profiles will be higher than those of real profiles. The attack profiles generated for the PPCF schemes were filled with random numbers, so when the σ max value was small, the RDMA values of the attack profiles were small, and the Chirita algorithm could not detect these PPCF attacks successfully. We hypothesized that at higher σ max values, the Chirita algorithm may become more successful; this hypothesis was verified by the empirical outcomes presented in Tables 7 and 8, where larger σ max values significantly enhanced the performance of the Chirita algorithm. Because the item mean was used to fill the profiles in the average attack, the RDMA value was low even when larger σ max values were used. In this case, as shown in Tables 7 and 8, the Chirita algorithm remains unsuccessful for the average attack.

Although the precision of the k-means algorithm was not very good, its recall was highly successful. Our k-means algorithm performs clustering by considering the similarities between profiles. A characteristic profile is created by each attack model in both non-private and private environments; therefore, these profiles are similar to each other and are expected to end up in the same cluster. Moreover, shilling profiles mimic real profiles to effectively manipulate the outcomes. Consequently, in k-means clustering, many real profiles may fall into the same cluster as the attack profiles and be omitted from the database upon isolation of the identified attack cluster, reducing the precision.

With σ max  = 2, the PCA-based detection method was successful only for the average attack, owing to the smaller covariance among the average attack profiles, whose filler item set is filled with the item means. In the other attack models, the filler items specified in the profiles are filled with random numbers generated with a certain σ max value; if the σ max values are high, the covariance among profiles also becomes high, and the PCA algorithm may not be able to detect these attack profiles. Consequently, the success of the PCA-based detection algorithm might be improved using smaller σ max values. The method performs better for Jester due to the higher covariance of the genuine profiles, which makes the lower-covariance fake profiles easier to single out.

We compared the detection methods used for PPCF with those used in CF schemes under the same conditions. In Table 9, the precision and recall values are compared for corresponding algorithms in non-private and private environments. The results of the experiments conducted for CF schemes were compiled from the related studies, all of which used MLP. A comparison with the results of Burke et al. (2006) for the Chirita algorithm reveals precision values similar to ours; however, the CF setting is more successful with respect to recall. Our results for the kNN classifier algorithm and those of Mobasher et al. (2007) differ but become similar at higher filler and attack sizes. Bhaumik et al. (2011) utilized the k-means algorithm on CF in a different manner than in the present study: they defined generic attribute values and performed the clustering using these values. A comparison of the results indicates that their approach was more successful than ours. The precision and recall values achieved for the PCA method in the study by Mehta and Nejdl (2009) are considerably higher than the values we obtained because we selected higher σ max values when generating attack profiles.

Table 9 Comparison of detection algorithms in non-private and private environments

We finally compared the precision and recall of the four detection algorithms with those of the algorithm proposed by Gunes and Polat (2015a) under the same conditions. In that scheme, the profiles in a perturbed user-item matrix are clustered using hierarchical clustering, and the cluster that probably contains fake profiles is identified as the attack cluster. The authors also scrutinized the ratings of target items, which slightly improved the performance of their scheme. The results of the four detection methods presented in this study and the hierarchical clustering-based method are compared in Table 10.

Table 10 Comparison of detection algorithms for PPCF

As shown in Table 10, the kNN classifier performs better than the hierarchical clustering-based method in terms of precision for the random, average, and love/hate attack models. However, the hierarchical clustering-based scheme provides more promising results than our methods with respect to precision for the bandwagon, segment, and reverse bandwagon attack models. In terms of recall, our k-means clustering-based scheme provides the best outcomes, and the hierarchical clustering-based method performs very similarly to it for the average, bandwagon, segment, and reverse bandwagon attacks. All of our algorithms achieve better outcomes than the hierarchical clustering-based method for the random attack model.

7 Conclusions and future work

Detecting shilling profiles is important for privacy-preserving collaborative filtering methods. We modified four widely used detection algorithms, originally proposed for detecting shilling profiles in non-private environments, so that they can determine fake profiles created using six shilling attacks in private environments. We compared the modified methods with their counterparts in non-private environments, and we also compared them with the hierarchical clustering-based detection method proposed for private environments. We evaluated the schemes in terms of precision and recall by conducting experiments on two real data sets.

Our key findings can be summarized as follows:

  1. The most successful detection methods are the kNN classifier and the k-means clustering-based methods. However, the kNN classifier requires a training data set, and the k-means method might isolate a great number of real profiles, which negatively affects system accuracy.

  2. The PCA-based algorithm performs better for the data set whose ratings span a larger range. It classifies profiles according to their covariance: when the rating range is larger, the covariance of genuine profiles becomes larger while shilling profiles retain a smaller covariance, so PCA can successfully differentiate them.

  3. Although the kNN classifier requires a training set for detection, the empirical outcomes indicate that it is the best of the presented detection algorithms for both non-private and private environments.

  4. The success of the Chirita algorithm is generally low compared to the other algorithms, particularly for attacks with a smaller attack size.

  5. With increasing filler size values, the detection performance of the methods generally decreases for both data sets.

  6. The methods generally perform better for larger attack size values due to the increasing number of attack profiles.

  7. Increasing β max values negatively affect the performance of the algorithms.

  8. Smaller σ max values improve the detection performance of all algorithms except the Chirita method due to the reduced randomness.

We are planning to develop new detection methods to reduce the disadvantages of these algorithms and to investigate how to further improve the success of the existing detection algorithms. In addition to numeric ratings-based recommendation schemes, there are binary ratings-based prediction algorithms; therefore, we are planning to develop detection algorithms that can filter out binary ratings-based shilling profiles. Another important future research direction is to study how shilling attacks affect top-N recommendation lists. In other words, we plan to scrutinize how shilling attacks change the position or rank of the targeted items in top-N recommendation lists.