1 Introduction

The popularization of information technologies and the Internet in the last decades has completely and irreversibly changed the way of life of most people on Earth. Nowadays, on the Internet we read the daily news, we communicate with our friends and family, we search for the information we need to get our job done, and we even go shopping! Recommender systems are thus the perfect complement to search engines, and many companies around the world are using them to help their customers find the products they are looking for and, at the same time, to increase their sales. In particular, the technique of collaborative filtering (CF) has become very popular because of its good results in domains such as e-commerce. A great advantage of this approach is that it is based on user opinions, which leads to high-quality recommendations because the products are evaluated by real people. Such opinions are usually stored in the form of numeric ratings, in a data structure called the rating matrix.

Collaborative filtering techniques are usually classified according to how they process the information in this matrix. Memory-based approaches directly use the ratings to generate recommendations, while model-based techniques (Marlin 2004) use them to train a model which is then used to compute the recommendations. In general, model-based algorithms are more accurate as long as the model fits the actual data (Cacheda et al. 2011), but the training step can be extremely costly. On the other hand, memory-based techniques are simpler, intuitive, and do not require a training step, which makes them suitable for many real applications (Desrosiers and Karypis 2011). A popular memory-based technique is the k-Nearest Neighbors algorithm (k-NN), which recommends to a given user those items that other users with similar tastes have liked in the past (Resnick et al. 1994).

Given that real applications can have millions of users and items, recommender systems must be fast and scalable in order to succeed outside the research lab. Therefore, the data structures used to store and access the rating matrix play an important role in the system design. In particular, k-NN algorithms can be efficiently implemented using inverted indexes, as shown by Cöster and Svensson (2002).

However, with the amount of information available in many real applications, additional techniques should be employed in order to reduce the rating matrix size even further. Index compression (Witten et al. 1999) is usually used in Information Retrieval (IR) in order to reduce the index size. Those techniques reduce storage costs and at the same time they speed up system operation because less storage space leads to faster data transfer and better memory cache usage.

In this paper, we study the application of compression techniques to CF, in order to reduce the rating matrix size and the recommendation time. First, we propose an implementation of k-NN algorithms using indexes, by extending Cöster and Svensson’s proposal, where indexes were only used for neighborhood selection, to the recommendation step. The details of our approach are described in Sect. 3. In Sect. 4, we introduce compression techniques and we study how they can be used in CF. We also evaluate the performance of several methods, showing that compression can significantly reduce the matrix size. Then, in Sect. 5, we show how those techniques also have an important impact on the algorithm efficiency, proving that compression not only reduces the matrix size but also speeds up the recommendation. Additionally, in Sect. 6 we propose and evaluate a novel id reassignment technique that leads to further improvements. It is based on assigning identifiers in descending frequency order, so commonly rated items (and users) will receive the smallest identifiers, thus increasing matrix compression rates. Finally, we present our conclusions and outline the future lines of research.

2 Characterizing the recommendation problem

A recommender system can be characterized by its recommendation function, \(rec : \mathcal{U} \times \mathcal{P} \to \mathcal{I}^{N}, \) where \(\mathcal{U}\) is the set of system users, \(\mathcal{I}\) the set of available items, \(\mathcal{P}\) the set of user profiles and N the number of items to recommend. The user profiles represent the information the system keeps about users.

In a CF algorithm, the user profile is composed of user opinions about items. Such opinions are stored as user ratings, where \(\mathcal{R}\) is the set of valid ratings. Usually, \(\mathcal{R}\) is a finite set (for example, integer numbers from 1 to 5) and can be represented as a subset of the natural numbers, so \({\mathcal{R} \subset \mathbb{N}}\). We can also define the rating function, \(r : \mathcal{U} \times \mathcal{I} \to \mathcal{R} \cup \varnothing, \) such that for a given \(u \in \mathcal{U}\) and a given \(i \in \mathcal{I},\, r(u,i)\) represents the rating of user u to item i, with \(r(u,i) = \varnothing\) if the user has not rated the item yet. For simplicity, in the following we will denote it as \(r_{ui}\).

Therefore, in a CF system, \(\mathcal{P} = \mathcal{M}, \) where \(\mathcal{M} \in (\mathcal{R} \cup \varnothing)^{|\mathcal{U}|\times|\mathcal{I}|}\) is the rating matrix, which contains all the opinions available in the system. Thus, given a user \(u \in \mathcal{U}\) and the rating matrix \(\mathcal{M}, \) the recommendation function returns an ordered set of N items not rated by u yet, that is, \(rec(u,\mathcal{M}) = \{i_1, i_2, \ldots, i_N \mid i_j \in \mathcal{I} \wedge \mathcal{M}_{u i_j} = \varnothing \}\). By definition, \(\mathcal{M}_{ui} = r_{ui}\).

Of course, not all ratings are equally interesting in order to generate the recommendation. Depending on the user tastes or interests, represented by her profile, some ratings can be more meaningful than others. For example, it is more useful to recommend items liked by a user with similar tastes than items liked by completely unrelated users. A good CF algorithm should infer these relationships from the rating matrix in order to generate really useful recommendations. As introduced before, both model-based and memory-based approaches have been used to solve this problem. Among memory-based algorithms, the user-based k-NN approach (Resnick et al. 1994) is very popular. Given a user \(a \in \mathcal{U}, \) usually named the active user, the algorithm first selects a set of other users with a similar rating pattern. These users comprise the neighborhood. Items the neighbors have liked are then recommended.

The operation of the algorithm is pretty straightforward. First, the similarity between the active user and the remaining users is computed, using a given similarity function \({s: \mathcal{U} \times \mathcal{U} \rightarrow \mathbb{R}}\). In the literature several similarity measures have been used, such as Pearson correlation (Resnick et al. 1994) or cosine vector similarity (Breese et al. 1998). We will denote the similarity between two users \(a \in \mathcal{U}\) and \(u \in \mathcal{U}, \) that is, s(a,u), as \(s_{au}\). The k most similar users are then selected, obtaining the neighborhood \(N(a)=\{ n_{1},\ldots,n_{j} \mid j\leq k \wedge n_{i}\in \mathcal{U} \}\) (satisfying that \(\forall n \in N(a): \nexists x\in \mathcal{U} \setminus N(a) \text{ with } s_{ax}>s_{an}\)).

The recommendation list is computed based on the neighbors’ ratings: the items with the highest ratings are recommended. Of course, items already rated by the active user are discarded. If an item has been rated by several neighbors, the ratings are combined, for example using the average rating weighted by the similarity of each neighbor (Resnick et al. 1994), or more complex approaches such as z-score normalization (Herlocker et al. 2002).

A similar alternative is the item-based approach (Linden et al. 2003; Sarwar et al. 2001), which looks for items similar to those the active user has rated highly. Similarity is computed between items instead of users, using measures such as cosine. In this paper we have focused on the user-based approach, but the methods proposed can be easily adapted to the item-based algorithm.

3 Computing recommendations efficiently

k-NN algorithms can be efficiently implemented using an index structure. To compute recommendations, two indexes are actually needed. First, the user profile index, which maps each user to her profile, that is, the items she has rated together with the ratings. Second, the inverted user profile index, which maps each item to the users that have rated it, together with the given rating. Formally, the user profile index is a function \({f_u : \mathcal{U} \to 2^{\mathcal{I} \times \mathcal{R}}}\) such that \(f_u(u) = \{ (i, r_{ui} ) | i \in \mathcal{I}_u \}, \) where \(u \in \mathcal{U}\) is a user and \(\mathcal{I}_u \subseteq \mathcal{I}\) is the set of items rated by u, that is, \(\mathcal{I}_u = \{i \in \mathcal{I} | r_{ui} \neq \varnothing \}\). Similarly, the inverted user profile index is a function \({f_i : \mathcal{I} \to 2^{\mathcal{U} \times \mathcal{R}}}\) such that \(f_i(i) = \{ (u, r_{ui} ) | u \in \mathcal{U}_i \}, \) where \(i \in \mathcal{I}\) is an item and \(\mathcal{U}_i \subseteq \mathcal{U}\) is the set of users who have rated the item i, that is, \(\mathcal{U}_i = \{u \in \mathcal{U} | r_{ui} \neq \varnothing \}\).
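As an illustration, the two indexes can be sketched as simple in-memory maps from identifiers to posting lists of (id, rating) pairs. The following Java sketch is only illustrative: class and field names (RatingIndexes, Posting, and so on) are ours and do not correspond to any particular implementation, and a production index would keep the posting lists compressed as discussed in Sect. 4.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory sketch of the two index structures described above.
public class RatingIndexes {

    // One (id, rating) entry of a posting list.
    public static class Posting {
        public final int id;      // item id in f_u, user id in f_i
        public final int rating;  // rating value, e.g. 1..5
        public Posting(int id, int rating) { this.id = id; this.rating = rating; }
    }

    // User profile index f_u: user -> list of (item, rating).
    public final Map<Integer, List<Posting>> userProfiles = new HashMap<>();
    // Inverted user profile index f_i: item -> list of (user, rating).
    public final Map<Integer, List<Posting>> invertedProfiles = new HashMap<>();

    // Register a single rating r_ui in both indexes.
    public void addRating(int user, int item, int rating) {
        userProfiles.computeIfAbsent(user, k -> new ArrayList<>()).add(new Posting(item, rating));
        invertedProfiles.computeIfAbsent(item, k -> new ArrayList<>()).add(new Posting(user, rating));
    }
}
```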

Recommendations can be easily computed using both indexes. First, the neighborhood needs to be selected, which requires computing the similarity between the active user a and the remaining users. As an example, we will use the cosine similarity, defined as in Eq. 1.

$$ s(a,u)=\frac{\sum_{j\in {\mathcal{I}}}{r_{aj}r_{uj}}}{\sqrt{\sum_{k\in {\mathcal{I}}_{a}}r_{ak}^{2}}\sqrt{\sum_{k\in {\mathcal{I}}_{u}}r_{uk}^{2}}} $$
(1)

Algorithm 1 shows how it can be computed using an item-at-a-time approach. For each item in the user profile, its ratings are obtained by querying the inverted user profile index, and the similarity computation is updated with the contribution of that item. The denominator of Eq. 1 is a normalization factor, constant for each user, that can be computed at index build time and stored together with the index.
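A minimal Java sketch of this item-at-a-time computation (the essence of Algorithm 1) could look as follows, reusing the illustrative RatingIndexes structure above and assuming the per-user norms of Eq. 1 (the square roots in the denominator) have been precomputed at index build time. It is a simplified illustration, not the exact implementation evaluated in this paper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the item-at-a-time neighborhood computation (Algorithm 1).
public class NeighborhoodComputation {

    public static Map<Integer, Double> similarities(RatingIndexes idx,
                                                    Map<Integer, Double> norms,
                                                    int activeUser) {
        Map<Integer, Double> dot = new HashMap<>();
        // For each item rated by the active user...
        for (RatingIndexes.Posting p : idx.userProfiles.get(activeUser)) {
            // ...read its posting list in the inverted user profile index and
            // accumulate the contribution r_aj * r_uj of this item (numerator of Eq. 1).
            for (RatingIndexes.Posting q : idx.invertedProfiles.get(p.id)) {
                if (q.id == activeUser) continue;
                dot.merge(q.id, (double) p.rating * q.rating, Double::sum);
            }
        }
        // Normalize by the precomputed norms to obtain the cosine of Eq. 1.
        Map<Integer, Double> sim = new HashMap<>();
        double normA = norms.get(activeUser);
        for (Map.Entry<Integer, Double> e : dot.entrySet()) {
            sim.put(e.getKey(), e.getValue() / (normA * norms.get(e.getKey())));
        }
        return sim;
    }

    // Keep only the k most similar users, i.e. the neighborhood N(a).
    public static List<Map.Entry<Integer, Double>> topK(Map<Integer, Double> sim, int k) {
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>(sim.entrySet());
        entries.sort(Map.Entry.<Integer, Double>comparingByValue().reversed());
        return entries.subList(0, Math.min(k, entries.size()));
    }
}
```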

Actually, the item-at-a-time technique is the adaptation to CF of the term-at-a-time (TAAT) approach used in IR (Turtle and Flood 1995). Alternatively, a document-at-a-time (DAAT) (Turtle and Flood 1995) approach could be used. For CF, that would imply computing the similarity user by user, reading the inverted user profiles for each item in parallel. In IR, several optimizations of these basic techniques have been proposed (Ding and Suel 2011).

Once the neighborhood has been computed, the items to recommend are selected among the items rated by the neighbors. A weight w is assigned to each item, and the N top-weighted items are recommended. In this case, we use as weight the average of the rating given to the item by all neighbors, weighted by the neighbor similarity previously computed. The actual computation is shown in Eq. 2.

$$ w_{i}=\frac{\sum_{u\in N(a)}{s_{au}r_{ui}}}{\sum_{u\in N(a)}{s_{au}}} $$
(2)

This step is also done using an index, in this case the user profile index, following a process similar to the neighborhood computation step. Here, we follow a user-at-a-time approach, where the weight for each item is computed neighbor by neighbor. The details of the process are shown in Algorithm 2.
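A possible Java sketch of this user-at-a-time recommendation step (the essence of Algorithm 2) is shown below, again over the illustrative RatingIndexes structure and the neighborhood produced by the previous sketch. Items already rated by the active user are discarded, and the remaining candidates are ranked by the weight of Eq. 2.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the user-at-a-time recommendation step (Algorithm 2).
public class Recommendation {

    public static List<Integer> recommend(RatingIndexes idx,
                                          List<Map.Entry<Integer, Double>> neighborhood,
                                          int activeUser, int n) {
        Set<Integer> alreadyRated = new HashSet<>();
        for (RatingIndexes.Posting p : idx.userProfiles.get(activeUser)) alreadyRated.add(p.id);

        Map<Integer, Double> num = new HashMap<>();  // sum of s_au * r_ui per item
        Map<Integer, Double> den = new HashMap<>();  // sum of s_au per item
        for (Map.Entry<Integer, Double> neighbor : neighborhood) {
            double sim = neighbor.getValue();
            // Read the neighbor's profile from the user profile index.
            for (RatingIndexes.Posting p : idx.userProfiles.get(neighbor.getKey())) {
                if (alreadyRated.contains(p.id)) continue;
                num.merge(p.id, sim * p.rating, Double::sum);
                den.merge(p.id, sim, Double::sum);
            }
        }
        // Rank candidate items by the weight w_i of Eq. 2 and return the top N.
        List<Integer> candidates = new ArrayList<>(num.keySet());
        candidates.sort(Comparator.comparingDouble(
                (Integer i) -> num.get(i) / den.get(i)).reversed());
        return candidates.subList(0, Math.min(n, candidates.size()));
    }
}
```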

For an item-based approach, the process is similar. First, the profile of the active user is retrieved from the index. For each item, similar items are computed following a process similar to Algorithm 1, but where the users who have rated the item are retrieved first (using the inverted user profile index). Then, for each user, the user profile index is used in order to retrieve items with a similar rating. The contribution of each user to the item-item similarity is accumulated and, finally, most similar items are recommended.

4 Compressing the rating matrix

4.1 Index compression: benefits and techniques

Compression techniques have been successfully used in IR for a long time, where they are essential for efficient query processing. Index compression has several benefits (Manning et al. 2008):

  • Reduced storage costs, which is obvious as the index will be smaller after being compressed. In IR, compression ratios of 1:4 are easily achieved (Manning et al. 2008).

  • Better utilization of fast memory storage. A smaller index means that more data can be kept in the different cache levels of modern computers, so commonly used data can be directly retrieved from a fast storage instead of the slower (but larger) magnetic disks.

  • Faster data transfer. As compressed data needs less space than uncompressed data, the cost of reading it from the disk is smaller.

Compression techniques take advantage of the probability distribution of the elements (often denoted as symbols) to be compressed, in order to reduce the required space. If the symbols that appear frequently are encoded with fewer bits, the total number of bits required for storing the list is expected to decrease. Variable-length codes exploit this fact in order to achieve high compression rates.

In IR, the inverted index structure commonly used (Manning et al. 2008) keeps, for each term, a list with the identifiers of those documents where the term appears. To compress it, the best results have been obtained when the document identifiers are stored as an initial position followed by a list of d-gaps, that is, the difference between the identifier of a document and the following one (Witten et al. 1999). The distribution of d-gaps has been widely studied, and several specific models have been proposed. They can be classified into two classes (Witten et al. 1999): global methods, where a single model is used to compress all the lists, and local methods, where the model is adjusted by some parameter, computed independently for each list (for example, the term frequency), and stored in the index. Global methods can also be parametrized.
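For illustration, the d-gap transformation itself is straightforward; the following Java sketch converts a sorted list of identifiers into gaps (the first gap being taken from zero) and back. For example, the list [3, 7, 8, 15] becomes [3, 4, 1, 7].

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the d-gap transformation over a sorted list of identifiers.
public class DGaps {

    public static List<Integer> toGaps(List<Integer> sortedIds) {
        List<Integer> gaps = new ArrayList<>();
        int previous = 0;
        for (int id : sortedIds) {
            gaps.add(id - previous);  // gaps are small for densely populated lists
            previous = id;
        }
        return gaps;
    }

    public static List<Integer> fromGaps(List<Integer> gaps) {
        List<Integer> ids = new ArrayList<>();
        int current = 0;
        for (int gap : gaps) {
            current += gap;
            ids.add(current);
        }
        return ids;
    }
}
```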

Nonparametrized global models are based on the observation that gap distribution depends on term frequency. For frequent terms gaps are small (which is obvious, since they are frequent precisely because they appear in many documents), while for rare terms big gaps are more common. Thus, a variable-length code that assigns a shorter representation to small gap values is a good choice. For example, Elias’ γ and δ codes (Elias 1975) work much better than a fixed-length binary code. However, they assume a fixed probability distribution, regardless of the actual data.
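As a concrete illustration of such a nonparametrized code, the following Java sketch encodes and decodes a positive integer with Elias’ γ code: a value x is written as ⌊log2 x⌋ zero bits followed by the binary representation of x, so small gaps need very few bits. This is a didactic bit-string version only; a real index would pack the bits into machine words.

```java
// Didactic bit-string sketch of Elias' gamma code for positive integers (gaps).
public class EliasGamma {

    public static String encode(int x) {
        if (x < 1) throw new IllegalArgumentException("gamma code requires x >= 1");
        String binary = Integer.toBinaryString(x);                  // floor(log2 x) + 1 bits
        StringBuilder out = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) out.append('0');  // unary length prefix
        return out.append(binary).toString();
    }

    public static int decode(String bits) {
        int zeros = 0;
        while (bits.charAt(zeros) == '0') zeros++;                  // read the length prefix
        return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
    }

    public static void main(String[] args) {
        for (int x : new int[] {1, 2, 5, 9}) {                      // 1 -> "1", 5 -> "00101", ...
            System.out.println(x + " -> " + encode(x) + " -> " + decode(encode(x)));
        }
    }
}
```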

A better alternative is to parametrize the model according to some characteristics of the data being indexed. A particularly successful method is the Golomb code (Golomb 1966), which assumes that gaps follow a geometric distribution corresponding to Bernoulli trials with a given probability p of appearance of a term in a document. This probability is computed from the number of terms, documents, and term-document pairs in the indexed data. The code requires a parameter b that is directly computed from this probability. Although this model makes some assumptions that are obviously wrong (for example, independence of term-document pairs), it works pretty well in practice.
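The following Java sketch illustrates Golomb coding of a d-gap x ≥ 1 with parameter b, as a unary quotient followed by a truncated binary remainder, together with the usual choice of b derived from the Bernoulli probability p. It is a simplified bit-string illustration written under those assumptions, not the implementation used in our experiments.

```java
// Didactic bit-string sketch of Golomb coding for d-gaps (x >= 1) with parameter b.
public class GolombCode {

    // b = ceil( log(2 - p) / (-log(1 - p)) ), the usual choice under the Bernoulli model.
    public static int parameter(double p) {
        return (int) Math.ceil(Math.log(2.0 - p) / -Math.log(1.0 - p));
    }

    public static String encode(int x, int b) {
        int q = (x - 1) / b;                                  // quotient, stored in unary
        int r = (x - 1) % b;                                  // remainder, truncated binary
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < q; i++) out.append('1');
        out.append('0');                                      // unary terminator
        if (b == 1) return out.toString();                    // pure unary, no remainder bits
        int c = 32 - Integer.numberOfLeadingZeros(b - 1);     // c = ceil(log2 b)
        int threshold = (1 << c) - b;                         // number of (c - 1)-bit codewords
        if (r < threshold) appendBits(out, r, c - 1);
        else appendBits(out, r + threshold, c);
        return out.toString();
    }

    private static void appendBits(StringBuilder out, int value, int bits) {
        for (int i = bits - 1; i >= 0; i--) out.append(((value >> i) & 1) == 1 ? '1' : '0');
    }
}
```

The local Bernoulli model discussed next follows the same scheme, but computes p (and hence b) independently for each posting list instead of once for the whole index.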

Local methods can be used to further improve this model. For example, the local Bernoulli model, which also makes use of a Golomb code (Witten et al. 1999), uses a different probability for each term. Term-document pairs will have a greater probability for frequent terms than for rare terms, which leads to higher compression rates, at the cost of needing to store a model parameter for each term. Another method that also performs very well is the skewed Golomb model (Witten et al. 1999), which assumes a skewed Bernoulli distribution to obtain small codes for small gaps without overpenalizing the large gaps. Many other techniques have also been proposed.

4.2 Compression techniques for collaborative filtering

Although compression is a fundamental technique in IR, its benefits for CF have not been deeply studied. Cöster and Svensson (2002) did make use of γ, δ and Golomb compression techniques for item identifiers, but they did not analyze the actual benefit of using them, or which one performs better. Moreover, compression was not used in the experiments they carried out.

In this Section we study the utility of compression techniques for CF, analyzing their actual impact on reducing the rating matrix size. Next, in Sect. 5, we also study the performance benefits.

In our study we have focused on three commonly used datasets, two from the movie recommendation domain, Netflix (Bennett and Lanning 2007) and Movielens 10M, and another containing real data from an online dating service, LibimSeTi (Brozovsky and Petricek 2007). A summary of the three datasets is shown in Table 1. It can be seen that both Movielens 10M and Netflix have many more users than items, while LibimSeTi has a similar number of both.

Table 1 Characteristics of the datasets used

Compression techniques for CF should address the particularities of this new problem, and its differences from traditional IR indexes. It should be studied whether the d-gap technique used in IR can be successfully applied to compress item and user identifiers in CF, or if a different method should be used instead. Additionally, a technique for compressing the ratings (which do not exist in IR indexes) should be chosen. Finally, it should be noted that, as explained in Sect. 3, an index structure for CF actually consists of two indexes: a user profile index and an inverted user profile index.

Let us first consider how to compress the identifiers. The rating distribution for the three datasets studied is plotted in Fig. 1. If we focus on the distribution across items, we can see a great variability. First, 1 % of the items concentrate over 20 % of the ratings. At the other end, around 60 % of the items only have 5 % of the ratings. Clearly, items can be divided into a small set of popular items, a set of known-but-not-popular items and finally a large set of rare items (often known as the long tail).

Fig. 1 Rating distribution according to the percentage of items (left) and users (right)

A similar behavior can be seen for users, but in this case the differences are smoother. 10 % of the users concentrate almost half of the ratings. Those are the most active users, who use the system frequently. On the other end, some users interact much less with the system, and thus they have very few ratings.

In either case, it seems reasonable to store identifiers in ascending order, using a d-gap technique like the one used in IR, and a compression method that assigns fewer bits to smaller gaps. The rating list for popular items (or active users) will obviously contain small gaps, and thus a big percentage of the identifiers will be stored using very few bits.

In Tables 2 and 3 we can see the average number of bits per identifier for the user profile index and the inverted user profile index, respectively. As expected from the discussion above, the benefit of compression is huge, over 50 % size reduction in most cases, and even around 75 % in some situations. In general, local methods obtain higher compression rates, which is expected as the assumed probability distribution, parametrized for each list, is closer to reality.

Table 2 Average number of bits per identifier for the user profile index of different datasets
Table 3 Average number of bits per identifier for the inverted user profile index of different datasets
Algorithm 1 Neighborhood computation for a user \(a \in \mathcal{U}\)
Algorithm 2 Recommendation for a user \(a \in \mathcal{U}\)

Regarding the performance of global methods, the good behavior of γ and δ codes in some cases may seem surprising. For example, in the user profile index for Movielens 10M, they perform very close to (or even better than) local methods. This is related to the particular rating distribution of this dataset. In Fig. 2, left chart, we can see how in the Movielens 10M dataset the most popular items are those with the smallest identifiers. Thus, the gaps will tend to be small, making both γ and δ codes a good alternative. This is, however, a rather unusual situation, derived from the fact that Movielens is a research dataset. Most real applications will behave like Netflix (Fig. 2, right chart), where there is no relationship between the item identifier and its popularity. In Sect. 6 we propose a novel technique that takes advantage of this observation to significantly increase the compression rate on real datasets.

Fig. 2 Number of ratings by item id, for Movielens 10M (left) and Netflix (right)

In addition to the identifiers, in CF the index also contains the actual rating values. In general, ratings are real numbers, so they would need to be stored as a floating point value. However, in most cases there is a finite set of possible ratings, so the rating index can be stored instead.

Given the small number of possible ratings in most applications, a simple fixed-length encoding can store a rating in a few bits. For example, only 3 bits are required to store the 5 possible ratings of the Netflix dataset.

However, this assumption of uniform distribution is far from the real distribution of ratings, which, as shown in Fig. 3, is different for each domain. In movie recommendation, for example, positive ratings are more probable than negative ratings. A variable-length code can take advantage of this and obtain better compression rates. If the rating distribution is known in advance (for example, if we use a two-pass indexing), a Huffman code (Huffman 1952) can be efficiently used. It reduces the required number of bits per rating to 2.15 (Netflix), 2.79 (Movielens 10M) and 3.24 (LibimSeTi). This is an improvement of 28.44, 30.23 and 18.99 %, respectively, compared to a fixed-length encoding.
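As an illustration of this idea, the following Java sketch builds a Huffman tree over observed rating frequencies and returns the codeword length of each rating value; the average number of bits per rating is then the frequency-weighted mean of those lengths. It is a generic Huffman construction under our own naming, not the exact code used in our experiments.

```java
import java.util.PriorityQueue;

// Sketch of building a Huffman code over a small set of rating values.
public class RatingHuffman {

    private static class Node implements Comparable<Node> {
        final long weight;
        final Node left, right;
        final int symbol;  // rating value for leaves, -1 for internal nodes
        Node(int symbol, long weight) { this(symbol, weight, null, null); }
        Node(int symbol, long weight, Node left, Node right) {
            this.symbol = symbol; this.weight = weight; this.left = left; this.right = right;
        }
        public int compareTo(Node other) { return Long.compare(weight, other.weight); }
    }

    // counts[v] = number of times rating value v was observed.
    // Returns the Huffman codeword length (in bits) for each rating value.
    public static int[] codeLengths(long[] counts) {
        PriorityQueue<Node> heap = new PriorityQueue<>();
        for (int v = 0; v < counts.length; v++) {
            if (counts[v] > 0) heap.add(new Node(v, counts[v]));
        }
        while (heap.size() > 1) {  // repeatedly merge the two lightest subtrees
            Node a = heap.poll(), b = heap.poll();
            heap.add(new Node(-1, a.weight + b.weight, a, b));
        }
        int[] lengths = new int[counts.length];
        assignDepths(heap.poll(), 0, lengths);
        return lengths;
    }

    private static void assignDepths(Node node, int depth, int[] lengths) {
        if (node == null) return;
        if (node.symbol >= 0) {
            lengths[node.symbol] = Math.max(depth, 1);  // degenerate single-symbol case
            return;
        }
        assignDepths(node.left, depth + 1, lengths);
        assignDepths(node.right, depth + 1, lengths);
    }
}
```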

Fig. 3 Rating frequency for different datasets

Using both identifier and rating compression techniques can yield significant improvements regarding space requirements. For example, an uncompressed index for the Netflix dataset requires 216 MiB for the user profile index, plus 264 MiB for the inverted user profile index. Using a Huffman code for the ratings and a local Golomb for the identifiers reduces the required space to 113 and 95 MiB respectively, 43 % of the original size.

For such a small index, the actual benefit of compression may seem unclear. However, in many real applications the uncompressed index size can grow up to several gigabytes or even terabytes, and compression rates around 60 % can represent an important saving in storage costs. While we have not obtained compression rates as good as those commonly obtained in IR, it should be noted that the datasets we have used are smaller than those used in IR. With a bigger number of items or users, a fixed-length code will require more bits per pointer, but the variable-length codes we have used will stay more or less the same, increasing the compression rate.

5 Improving k-NN performance through compression

As explained in Sect. 4.1, in addition to reduced storage requirements, index compression can lead to performance gains. We study those benefits in this Section.

In this case, we have used the Netflix dataset, as it is the biggest dataset we have access to. Still, it is very small compared to real application requirements. On powerful modern machines with gigabytes of RAM, the entire rating matrix fits in main memory, and the benefits of compression would be unclear. Thus, for our experiments we have set up a PC that is some years old, with an Intel Pentium 4 CPU at 3.20 GHz and just 256 MiB of RAM. In order to evaluate how compression techniques improve recommendation efficiency, the relation between the size of the Netflix dataset and the hardware capabilities of our test environment is more useful than if we had used a modern PC. This approach is commonly used for efficiency evaluation in IR (Badue et al. 2007).

To perform the evaluation, we have randomly selected 500 users, computing a recommendation for each one. Recommendations were computed one after another, so they could benefit from the possible presence in the memory cache of data already used for previous recommendations, thus simulating a real environment where recommendations are being computed continually. We have measured the time spent on both the user profile index and the inverted user profile index for each user, using a Java implementation of the k-NN algorithm described in Sect. 3. Statistical significance of the results has been tested using Analysis of Variance (ANOVA), and the Scheffé method for multiple testing. The significance level was set to 5 %.

In Fig. 4 we plot the user profile index access time for the different compression methods studied, as well as the access time with a fixed-length code. It can be seen that compression methods offer a great performance improvement. Without compression, the average access time per recommendation is 127 ms. With the best performing method (local skewed Golomb), access time is reduced to 68 ms, almost a 50 % improvement. Results are statistically significant. Other compression methods show similar improvements: γ and local Golomb, with 70 and 71 ms respectively, do not show statistically significant differences with the skewed Golomb method. The δ code is only slightly worse. Among the studied techniques, global Golomb is the worst alternative, but it is still significantly better than no compression at all, with an improvement of around 25 % on average. Access time is related to the index size (more data in main memory and less data to be transferred), but also to the decoding speed. That is the reason both γ and δ coding outperform global Golomb despite having worse compression rates (see Table 2).

Fig. 4 Boxplot of user profile index access time by coding method

On the other hand, for the inverted user profile index, access times do not show significant differences among techniques, as shown in Fig. 5. This might seem surprising, given the performance improvement obtained in the user profile index, as well as the compression rates obtained in the inverted index (Table 3). To explain this behavior we should analyze the index access pattern and how it could take advantage of the memory cache, which is the main factor of performance improvement due to compression.

Fig. 5 Boxplot of inverted user profile index access time (per neighborhood computation) by coding method

In Fig. 6, we plot the number of accesses per user (and item), sorted by access frequency. If we look at the user accesses (left figure), we can see that there are a few users (around 100) which are frequently accessed, and then a huge number of users (around 480,000) which are rarely accessed, if accessed at all. In this kind of access pattern, the usage of a memory cache becomes very important. If the profiles of those commonly accessed users were kept in main memory, the performance gain would be high. Compression techniques can help reduce the size of the user profiles, so most of the profiles will fit in main memory, therefore approaching that ideal situation. On the other hand, the item accesses (right figure) show a slightly different pattern. A significant number of items are frequently accessed (2,000 of about 17,000), and, additionally, the average size of an item profile is much bigger than the size of a user profile, since there are many more users than items. In this case, commonly accessed data does not fit in the cache, even compressed, so the performance improvement is negligible.

Fig. 6 Number of accesses to the profile of each user and item

To summarize, in the hardware configuration we have chosen for our experiments, the usage of compression significantly improves the usage of the cache in the user profile index, but not in the inverted user profile index. In real applications, cache size and replacement algorithm should be carefully designed in order to keep as many commonly accessed profiles as possible in the cache. Index compression can be a key technique to achieve this.
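As a simple illustration of the kind of cache we have in mind, a profile cache with LRU replacement can be sketched in Java on top of LinkedHashMap in access order; the capacity is left as a parameter and would be tuned to the available memory and to the size of the compressed profiles. This is only a sketch, and real systems may prefer other replacement policies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an LRU cache for (compressed) profiles, keyed by user or item id.
public class ProfileCache<K, V> extends LinkedHashMap<K, V> {

    private final int capacity;

    public ProfileCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used profile
    }
}
```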

6 Optimizing the assignment of identifiers in collaborative filtering

The compression techniques studied so far use fewer bits to code small gaps than large gaps. Given that for most applications document identifiers are meaningless, the technique of document id reassignment (Blandford and Blelloch 2002) attempts to reassign the document identifiers in order to reduce the gaps and, thus, the index size. In general, the identifier reassignment problem can be characterized as a minimization problem, where we have to find the optimal arrangement of document identifiers in order to minimize the index size. It has been proved that this problem is NP-complete (Blanco and Barreiro 2005), but in practice different techniques to approximate the solution have been proposed. Most of them assume that if similar documents (documents that share many terms) are assigned close identifiers, d-gaps will be smaller. Proposed techniques can be classified as follows (Ding et al. 2010):

  • Top-down. They begin with the collection of documents and split it according to document similarity. Graph (Blandford and Blelloch 2002) and cluster (Silvestri et al. 2004) techniques have been proposed.

  • Bottom-up. They consider each document separately, and group together similar documents. Both clustering (Silvestri et al. 2004) and graph techniques based on the Traveling Salesman Problem (Blanco and Barreiro 2006; Ding et al. 2010) have been proposed, too.

  • Sorting. They just sort the collection by document URL (Silvestri 2007).

In this Section we analyze how CF can take advantage of identifier reassignment techniques in order to reduce the rating matrix size. However, instead of complex techniques like most of the methods just presented, we use a simple and efficient approach that can be effectively applied in applications with millions of users and items.

Our idea is based on a really intuitive and obvious observation. Whatever kind of item you attempt to recommend in your system (books, movies, web pages, persons, etc), there will always be some particular items that are very popular, and many items few people know. The reason for that is beyond the scope of the recommender system itself. An item can become very popular because of its quality or usefulness, an aggressive advertising campaign, a good evaluation by a famous critic, etc. Similarly, many items (even good ones) remain mostly unknown due to a variety of reasons. Thus, a recommender system usually contains a few very popular items, rated by a lot of users, and many rare items very few people have rated.

In fact, if we take another look at Fig. 1, we can see that both users and items follow this kind of pattern, as already discussed in Sect. 4.2. There, we observed that, for example, if an item is very popular, its profile will have a lot of users and thus the gaps between them will be small. We showed how compression techniques can take advantage of this.

In this Section, however, we look at the same fact from a different point of view. If an item is very popular, it will not only have a big profile: it will also appear in many user profiles. That observation leads to the key point of our approach: if popular items are assigned the smallest identifiers, user profiles will likely contain small identifiers in the first positions, and thus the gaps between them will be small. This is how our approach works: we reassign identifiers according to item (or user) frequency, in descending order. The most popular items receive the smallest identifiers.
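The reassignment itself reduces to sorting identifiers by frequency, as in the following Java sketch (method and variable names are ours). Counting frequencies and sorting take O(n log(n)) time and O(n) space, which anticipates the complexity discussed at the end of this Section.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the proposed id reassignment: the most frequently rated items
// (or the most active users) receive the smallest new identifiers.
public class IdReassignment {

    // frequency.get(oldId) = number of ratings of that item (or user).
    // Returns the mapping oldId -> newId, with newId = 1 for the most frequent one.
    public static Map<Integer, Integer> byDescendingFrequency(Map<Integer, Long> frequency) {
        List<Integer> ranked = frequency.entrySet().stream()
                .sorted(Map.Entry.<Integer, Long>comparingByValue(Comparator.reverseOrder()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        Map<Integer, Integer> newIds = new HashMap<>();
        for (int rank = 0; rank < ranked.size(); rank++) {
            newIds.put(ranked.get(rank), rank + 1);
        }
        return newIds;
    }
}
```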

Figure 7 shows the item id frequency before and after identifier reassignment, on the Netflix dataset. In the original distribution, there is no relation between item identifiers and frequency. Thus, it is highly probable that a user rates both an item with a small identifier and an item with a big identifier, leading to big gaps. On the other hand, after reassignment most ratings will correspond to small item identifiers, and the gaps are expected to be small.

Fig. 7 Number of ratings per item, before (left) and after id reassignment (right), on the Netflix dataset

To confirm this, in Fig. 8 we plot the probability distribution of gap size, before and after reassignment. It can be seen that after reassignment the smallest gaps are more probable, which will reduce the average number of bits required. In particular, if we use a δ code, the average number of bits per gap is reduced from 9.10 to 5.35 (41 % improvement), and with γ code, from 9.57 to 5.25 (45 % improvement). After reassignment, compression rates of 1:3 are achieved even using a global coding, significantly outperforming local methods without reassignment.

Fig. 8 Probability mass function of item d-gap size, originally and after id reassignment. Logarithmic scale on both axes

Moreover, we can observe that after reassignment the distribution of identifier frequencies follows a Zipf’s law, as shown in Fig. 8. It turns out that integer values distributed according to a Zipf’s law with an exponent smaller than 2 (in this case, the exponent α is 1.44) can be efficiently stored using a \(\zeta\) code (Boldi and Vigna 2005). So, if we use this coding technique, we can obtain further improvements. In particular, the average number of bits per gap is reduced to 4.97, far below the 15 bits required without compression.

Of course, if the identifiers are already sorted by frequency (see, for example, the distribution of item identifiers on Movielens 10M in Fig. 2, which follows a similar pattern), the gain is smaller. In real applications, however, identifiers are not expected to be already sorted, so a reassignment step before indexing can significantly reduce the storage costs.

Furthermore, the higher compression rates also improve the system performance, reducing the average user profile index access time from 6.78 ms (skewed Golomb without reassignment) to 6.05 ms, a small but relevant improvement. Results, which are shown in Fig. 9, are statistically significant. With the reassignment technique just proposed, a global \(\zeta\) code outperforms local codes in both compression rates and access time.

Fig. 9 Boxplot of user profile index access time by coding method

Finally, our technique is faster than most of the alternatives proposed in IR, with a computational complexity of O(n log(n)) in time and O(n) in space. For large-scale applications, the benefit of id reassignment in terms of reduced recommendation and/or neighborhood computation time is huge compared to the time required for the reassignment. Moreover, user and especially item frequencies are not expected to change dramatically over short periods of time, so in applications that require rebuilding the index frequently, the reassignment does not need to be done every time.

7 Conclusions and future work

In this paper we have shown how compression techniques can be effectively used in order to increase the performance of CF algorithms. First, we have shown how to efficiently implement k-NN algorithms using an index structure, extending previous research in the field.

Then, we have analyzed the benefits of using compression techniques, proposing a coding scheme for item and user identifiers, as well as for ratings. For the storage of both user and item profiles, the compression methods studied showed an important reduction in the required space. The best results were generally obtained with local codes, although simple global codes such as γ and δ yield very good compression rates, up to nearly 1:4. We have also shown how compression can reduce recommendation times by up to 50 %. We have also discussed the relation between the achieved speed-up and the cache size relative to the index size and the set of commonly accessed profiles.

Finally, we have proposed a novel id reassignment technique that can further improve both the recommendation time and the storage requirements. Our technique, based on assigning the smallest identifiers to the most frequently rated items and the most active users, is very effective and can be easily implemented using a two-pass indexing. Moreover, when used together with a \(\zeta\) encoding for identifiers, it achieves very high compression rates.

We think that real applications of CF can strongly benefit from the use of these kinds of techniques.

In the future, we plan to study the benefits of additional compression methods such as interpolative coding, as well as optimizations to the recommendation process such as the usage of index pruning techniques. We also plan to study the performance improvements on the item-based approach. Finally, it will also be interesting to further study the relationship between cache size and recommendation performance, in order to derive a general method for computing the optimal cache size given an index.