1 Introduction

The Internet has changed the way people conduct business and has created a new economy [36]. Instead of buying products in conventional physical stores, customers have become used to searching for and purchasing commercial items on online shopping websites, such as Amazon and eBay, for convenience. However, if people do not know specific information about a commercial item, e.g., its name or seller, they usually cannot find it with traditional text-based search methods [22]. For example, if a girl reading a fashion magazine likes a model’s outfit, she probably wants to know related information such as its price and brand. She can try searching these websites by typing in a few keywords, but it is often the case that she cannot get the right results because the keywords she intuitively chooses, such as “red silk”, “double-breasted”, and “long sleeves”, are ambiguous. Therefore, image-based search is arguably a more natural and easier way to search in terms of human behaviors [6, 18, 34, 37, 40].

Further, once a commercial item is found, people usually want to look at other similar items and make comparisons to decide which they really want to purchase. Therefore, developing effective recommendation technologies that help users find desired commercial items is crucial. A variety of recommendation methods are employed on current online shopping websites, such as “the most popular products”, “products from the same seller”, “customers who viewed this also viewed”, and “customers who bought this also bought”. Several examples are shown in Fig. 1. Most of these recommendation lists are produced according to external attributes of commercial items (e.g., the seller and the price) or the associated social behavior of website users (e.g., other users’ buying histories). However, without explicitly taking into account the personal preference of an individual user, many of the user’s common buying scenarios cannot be fulfilled.

Fig. 1

Examples of various recommending methods employed on online shopping websites

In particular, when a user chooses a commercial item, the visual appearance is arguably one of the most influential factors affecting the buying decision [31]. According to consumer psychology [7], people’s preferences for commercial items highly depend on their visual styles, and are sometimes driven simply by partial visual appearances. That is, a small change in product design (e.g., the color style of a T-shirt) can largely influence consumers’ product choices. For example, a user who wants to buy a handbag might consider the three handbags in Fig. 2(a) simultaneously, because they share the same visual style although their colors differ. Figure 2(b) shows another example: the three watches would probably be compared by the same user because they are all characterized by the similar partial appearance of a two-clock watch face. Content-based object retrieval [4, 16, 33, 34, 45] is a promising technique for enabling visual-appearance-based recommendation of commercial items. However, most previous work treated each single object (e.g., a watch or a handbag) as a whole, ignoring the fact that different parts of an object contribute unequally to the user’s preference when selecting commercial items of interest.

Fig. 2

Commercial item groups of similar visual appearances

In response to the above challenges, in this work we first conducted a user study to verify the hypothesis that different parts of a commercial item contribute unequally to a user’s preference when making buying selections. Based on this hypothesis, we propose a novel representation for commercial item images, named Visual Part-based Object Representation (VPOR). Every commercial item image is decomposed into a set of disjoint partitions, each of which represents a meaningful semantic part. For example, a watch image can be split into two disjoint parts, i.e., the watch face and the watch band. With the VPOR, users can assign non-uniform preferences to the different parts of a chosen commercial item to obtain personalized recommendation results. The proposed framework is shown in Fig. 3.

Fig. 3

The framework of the proposed UbiShop

Our main contributions are summarized as follows.

  • We are, to our knowledge, the first to conduct a quantitative study verifying the hypothesis that different parts of a commercial item contribute unequally to users’ preferences when making buying selections. The user study also verified that users’ preferences for commercial items highly depend on partial visual appearances.

  • Based on this hypothesis, we propose a novel representation for commercial item images, abbreviated as VPOR, that reflects the unequal significance of different meaningful parts of commercial items.

  • We propose a VPOR-generation scheme for large-scale image databases with minimal user intervention. All item images are grouped into clusters using extracted shape features such that only a small number of images (e.g., the cluster centers) have to be manually labeled; the VPORs of the other images can then be automatically transferred from the labeled images.

  • We present a mechanism of part-based feature propagation for discovering auxiliary features that suppress the visual dissimilarity contributed by a user’s less preferred parts, improving the relevance of recommended results.

The rest of this paper is organized as follows. In Section 2, we introduce related work on commercial item search and recommendation. Section 3 describes the proposed VPOR thoroughly. The VPOR based commercial item recommendation is then presented in Section 4. Section 5 reports the conducted user studies and the experimental results. Finally, we conclude this paper and give directions for future work in Section 6.

2 Related work

To help users quickly find desired items among the hundreds of thousands of items sold online, all online shopping websites, such as Amazon, provide basic keyword-based search services. However, it is difficult to find a specific commercial item if users do not know its textual information (e.g., its name or seller), especially for appearance-oriented commercial items such as handbags and clothes. To describe such items clearly, image-based search by example [5, 6, 17, 18, 34, 37] is adopted to help people obtain the information they want. Google developed a visual search engine in which users can search for information about images and photos of interest with a given input query. Taotaosou also provides a visual product search engine whereby users can obtain detailed information about target commercial items using photo-based search. In the literature, iLike [8] proposed a vertical search approach for commercial items that integrates the visual and textual features of the target item image; the experimental results showed that iLike improves product search performance for apparel and accessories. Wang et al. [42] further presented a clothes search framework leveraging both low-level features (e.g., colors) and high-level features (e.g., clothing attributes). Berg et al. proposed an approach that automatically learns common attributes (e.g., color, texture, or shape) for objects of specific types based on their visual appearances [5]; after discovering the visual attributes of specific objects, they build relationships between ambiguous keywords on the web and the learned attributes using natural language processing techniques. Kuo et al. presented a framework for semantic feature discovery [23] that propagates and selects important features according to both visual and textual graphs.
The authors of [6] developed a mobile sensing framework for robust real-scene object recognition and localization. Observing that visual objects in mobile photos are often captured under various viewpoints, positions, scales, and background clutter, they extended conventional structured output learning with a proposed grid-based representation to formulate an output structure, so that their MOSRO system can not only locate visual objects precisely but also achieve real-time performance. Girshick et al. [17] proposed a scalable object detection algorithm that achieves a 30 % relative improvement over the best previous results on PASCAL VOC 2012, adopting deep learning in combination with previous classification methods.

Instead of considering the global visual appearance of an object, as in general image retrieval, people often buy a commercial item simply because they like the appearance of specific parts of the product [7, 31]. That is, users tend to pay more attention to the visual styles of certain product parts. If two items share a similar visual style, users who like one will probably like the other as well. Since the perceived visual style of an object is dominated by its visually attractive parts, detecting the salient regions [1, 19, 28] of a commercial item may help find similar items that users may also like. Moreover, Haojie et al. [27] presented iSearch, which allows users to search commercial items based on both local and global visual features of a target item. Wu et al. [44] proposed a product image search framework for interactive commercial item search, in which users search for desired commercial items by selecting a salient region of the image. Tang et al. [41] proposed an image search approach using an intention-based weighting scheme on the visual features of query images.

However, in commercial item recommendation, a user’s preferred parts of a commercial item are subjective and inconsistent across people [11, 31], so users’ preferred parts are not always equivalent to the detected salient regions. Motivated by this observation, in this work we propose the VPOR to reflect the unequal significance of different semantic parts of commercial items. Users can choose their preferred parts as personalized preferred regions on the target commercial item to obtain a list of recommended items that all share a similar visual appearance or style in the preferred parts.

Recommender systems have been developed for years using technologies such as data mining [2, 24, 36], collaborative filtering [20, 29, 47], and user behavior analysis [21, 25, 38]. The technique of association rules is employed on well-known online shopping websites (e.g., Amazon and eBay) [2]: by finding similar customers based on their shopping behaviors, a customer can be recommended a list of commercial items that he/she is likely to buy. Kim et al. integrated K-means clustering and genetic algorithms to facilitate exploration of the online shopping market [24]; all customers in the shopping market are segmented into groups based on their profile data, and sellers can then recommend items to customers with different strategies according to their groups. Recently, with the popularity of social network services (e.g., Facebook and Twitter), user behaviors on the Internet have changed dramatically [29], and the relationships and interactions between users are believed to be an important factor for improving recommendation results [38]. Based on the insights of previous work, we will integrate these social-network-based methods with the proposed image-based method to formulate a more powerful recommendation method in the future.

3 Visual part-based object representation

According to the hypothesis that different meaningful parts of a commercial item contribute unequally to users’ preferences when making buying selections, it is crucial to recommend commercial items based on the appearance of the preferred product parts. In Section 3.1, we show how to define such meaningful parts for commercial item images (VPOR definition). Then we introduce the procedure for generating VPORs for database images (VPOR formulation) in Section 3.2.

3.1 VPOR definition

Given an item image I, the VPOR of I is composed of a set of M disjoint parts \(I^{1}, I^{2}, ..., I^{M}\), each of which represents a semantic portion of the item:

$$ I \equiv \{I^{1}, I^{2}, ..., I^{M}\}, \quad I = \bigcup_{i=1}^{M} I^{i}, \quad \forall i \neq j: I^{i} \cap I^{j} = \emptyset $$
(1)

For instance, as shown in Fig. 4, the VPOR of a helmet image is composed of three disjoint parts: the top, the shield, and the visor. In addition, for a given category, the number of parts M is fixed so that each part \(I^{i}\) has the same semantic meaning across different item images. For example, as shown in Fig. 4, all T-shirt images are defined to have three parts and all watch images are constituted by two parts.
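As an illustration, the constraints of Equation (1) — pairwise-disjoint parts whose union is the whole item — can be checked mechanically. The following is a minimal numpy sketch; the function name `make_vpor` and the toy watch masks are illustrative, not part of the proposed system:

```python
import numpy as np

def make_vpor(item_mask, part_masks):
    """Check that part_masks form a valid VPOR of item_mask per Eq. (1):
    pairwise disjoint, and their union equals the item silhouette."""
    union = np.zeros_like(item_mask, dtype=bool)
    for k, part in enumerate(part_masks):
        if np.any(union & part):
            raise ValueError(f"part {k} overlaps an earlier part")
        union |= part
    if not np.array_equal(union, item_mask.astype(bool)):
        raise ValueError("parts do not cover the item silhouette exactly")
    return part_masks

# toy 4x4 "watch": top half = watch face, bottom half = watch band (M = 2)
item = np.ones((4, 4), dtype=bool)
face = np.zeros((4, 4), dtype=bool); face[:2] = True
band = np.zeros((4, 4), dtype=bool); band[2:] = True
vpor = make_vpor(item, [face, band])
```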

Fig. 4

Selected VPORs for commercial items in five categories

There are two fundamental principles in giving a VPOR definition. First, since the mechanical components of commercial items usually have distinct appearances due to the different materials they are made of, we can consider each mechanical component as a semantic part in a VPOR. For instance, a watch is composed of two components, the watch face and the watch band, each of which can be regarded as a meaningful VPOR part. Second, certain parts of some commercial items tend to attract people’s attention because of personal aesthetics and iconography [7, 31]. Therefore, these peculiar regions, usually lying within mechanical parts, should also be regarded as semantic parts in the item’s VPOR. For instance, the bag center is isolated from the rest of the bag face as another meaningful part. Based on these two principles, the defined VPOR parts for commercial items in five categories are illustrated in Fig. 4. With the determined VPORs, we can recommend more appropriate commercial items according to users’ personal preferences simply by controlling the relative weighting of each part.

Given a commercial item in a new category whose VPOR parts have not been defined beforehand, a general part-based model [12] is adopted to initially segment it into non-overlapping portions. A user study is then employed to collect users’ levels of interest in each portion, using a user interface similar to the one for the “assign personal preferences” step (cf. Fig. 7) in Section 4.4. Finally, we apply the clustering algorithm of [13] to group the image portions into semantic groups and formulate the resulting VPOR parts. Note that the number of clusters can be chosen automatically by the adopted clustering algorithm [13], so the number of VPOR parts is determined dynamically for the new commercial item.

3.2 VPOR formulation

Although VPORs are useful for representing commercial item images for personalized commercial item recommendation, manually labeling the VPORs of all commercial item images is a hard task. Therefore, it is crucial to generate the VPORs for all commercial item images in a large-scale database automatically. In the literature, a variety of methods for automatic image segmentation in large-scale databases have been proposed [3, 9, 12, 48]; they can be classified into two main categories: inter-object segmentation and intra-object segmentation. Inter-object segmentation methods usually consider a complete daily-life object (e.g., a watch or a car) as the smallest unit, which cannot be further partitioned; VPOR parts are thus an undefined concept for these methods. In contrast, intra-object segmentation methods partition an object into finer portions, but these portions usually correspond to perceptually consistent units rather than semantically consistent ones, because this category of methods is mostly designed for general purposes [12, 14]. Based on these observations, it is difficult to automatically generate the VPOR of a commercial item without any manual intervention.

Therefore, in this work we propose a VPOR-generation scheme for large-scale databases with minimal user intervention, as shown in Fig. 5. We observed that if two item images in the same category have similar shapes, they will have similar VPORs [31]. Thus, if we label the VPOR of a selected item image, the VPORs of other images with similar shapes can be adapted from that of the selected one. The VPOR formulation is detailed as follows.

Fig. 5

The VPOR formulation of UbiShop

The first step is to extract silhouettes for all database images. Edge detection [10] is first employed on item images to obtain edge segments. Then, image dilation [43] is adopted to complete broken edges. Finally, a hole-filling algorithm [39] is applied to generate the resulting silhouettes. The second step is to extract shape features [15] from the extracted silhouettes. We use Poisson solvers to extract Poisson features from the silhouettes and then formulate the feature representation as 30-dimensional moments [15]. The third step is to cluster the database images of each category based on the extracted shape features. Affinity propagation [13] is employed for shape clustering because it does not require the number of clusters to be specified in advance and directly returns the cluster centroids, which are used in the next step. We tried three different distances for computing the pairwise similarity of shape features: the Euclidean distance, the Mahalanobis distance, and the Chi-square distance. In our experiments, the Chi-square distance tends to cause over-clustering, and the Mahalanobis distance groups more dissimilar shapes into a cluster than the Euclidean distance. We therefore chose the Euclidean distance as our pairwise similarity measure.
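The silhouette-extraction step above can be sketched as follows. This is a hedged approximation assuming scipy is available: a thresholded Sobel gradient stands in for the edge detector of [10], and scipy’s dilation and hole-filling operators play the roles of [43] and [39]; the threshold and iteration count are placeholders, not the paper’s settings.

```python
import numpy as np
from scipy import ndimage

def extract_silhouette(gray):
    """Silhouette pipeline of Section 3.2 (sketch): edge detection,
    dilation to complete broken edges, then hole filling."""
    g = gray.astype(float)
    mag = np.hypot(ndimage.sobel(g, axis=0), ndimage.sobel(g, axis=1))
    edges = mag > 0.5 * mag.max()                          # placeholder threshold
    closed = ndimage.binary_dilation(edges, iterations=2)  # complete broken edges
    return ndimage.binary_fill_holes(closed)               # filled silhouette

# synthetic item: a bright square on a dark background
img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0
sil = extract_silhouette(img)
```

On the resulting silhouettes, the 30-dimensional Poisson shape moments [15] are then extracted and clustered with affinity propagation [13] under the Euclidean distance, as described above.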

After the shape clustering, the item images within a cluster are similar to each other in terms of their silhouettes. We can thus label the VPOR manually for the centroid image of each cluster, and the VPORs of the other images in the same cluster can be generated automatically by shape matching. Sample VPORs for the cluster centroids are shown in Fig. 6. Specifically, we propose a procedure, named VPOR transfer, to automatically label the VPORs of the other item images in a cluster by transferring the VPOR from the cluster centroid. Inspired by [4], VPOR transfer works as follows. Given an item image \(I_{i}\) and its centroid item image \(I_{c}\), we first normalize their silhouettes, \(S_{i}\) and \(S_{c}\), to a common scale, obtaining \(S^{N}_{i}\) and \(S^{N}_{c}\). Next, we rotate \(S^{N}_{c}\) to find the best match between the two normalized silhouettes and take the rotation angle as the matched angle ang. The VPOR of \(I_{c}\) is then rotated by ang, rescaled to the scale of \(I_{i}\), and output as the VPOR of \(I_{i}\).
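The VPOR-transfer step can be sketched as below. This is a simplified illustration assuming already scale-normalized binary silhouettes and a coarse search over candidate angles (the actual silhouette matching inspired by [4] may differ); `transfer_vpor` is an illustrative name.

```python
import numpy as np
from scipy import ndimage

def transfer_vpor(sil_i, sil_c, vpor_c, angles=range(0, 360, 15)):
    """Find the rotation of the centroid silhouette sil_c that best overlaps
    the target silhouette sil_i, then rotate the centroid's VPOR label map
    vpor_c by the matched angle to label the target item."""
    best_ang, best_score = 0, -1
    for ang in angles:
        rot = ndimage.rotate(sil_c.astype(float), ang, reshape=False, order=0) > 0.5
        score = int(np.sum(rot & sil_i))   # silhouette overlap at this angle
        if score > best_score:
            best_ang, best_score = ang, score
    # order=0 keeps the integer part labels discrete under rotation
    return ndimage.rotate(vpor_c, best_ang, reshape=False, order=0)
```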

Fig. 6

Sample VPORs for the cluster centroids

4 VPOR based commercial item recommendation

In this section, we introduce how to adopt the proposed VPOR representation to recommend visually similar commercial items. In Section 4.1, we briefly describe the low-level feature extraction procedure for a user’s query item. Then, in Section 4.2, we describe the proposed part-based feature propagation for discovering auxiliary features, which make two items that share a similar part but differ in the remaining parts more similar, improving the relevance of recommended results. Next, in Section 4.3, we introduce the ranking scheme for personalized commercial item recommendation. Finally, we demonstrate the user interface through which users assign their personal preferences in Section 4.4.

4.1 Part-based feature extraction

Upon receiving a preferred query item from a user, we first extract low-level visual features from the given item image. We adopt an HSV color histogram [30] and a Bag-of-Visual-Words (BoVW) histogram [32] since we observed that both color and local structure are important. Since there are M parts in a query item image \(I_{i}\), we extract a regional histogram \({X_{i}^{k}}\), constituted by a color histogram \(hist^{k}_{color}\) and a BoVW histogram \(hist^{k}_{structure}\), for each k-th part \({I_{i}^{k}}\):

$$ {X^{k}_{i}} \equiv (hist^{k}_{color}, hist^{k}_{structure}) $$
(2)
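The per-part feature of Equation (2) can be sketched as follows. This is an illustrative numpy version assuming precomputed inputs: `hsv_pixels` holds the HSV values of the pixels inside one VPOR part (hue scaled to [0, 1]), and `local_descs`/`codebook` stand for the local descriptors and visual-word codebook of the BoVW model [32]; the bin and codebook sizes are placeholders.

```python
import numpy as np

def part_feature(hsv_pixels, local_descs, codebook, n_bins=8):
    """X_i^k of Eq. (2): concatenation of an HSV colour histogram and a
    BoVW histogram for one VPOR part (sketch; hue channel only)."""
    hist_color = np.histogram(hsv_pixels[:, 0], bins=n_bins, range=(0, 1))[0]
    hist_color = hist_color / len(hsv_pixels)
    # quantise each local descriptor to its nearest codeword
    d = np.linalg.norm(local_descs[:, None, :] - codebook[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist_bovw = np.bincount(words, minlength=len(codebook)).astype(float)
    hist_bovw = hist_bovw / hist_bovw.sum()
    return np.concatenate([hist_color, hist_bovw])
```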

4.2 Part-based feature propagation

Different from finding a specific item for a given query, the objective of recommendation is to retrieve items that are similar to the original item rather than the identical item. For instance, if a user likes a watch because of its two-clock watch face, the user might like another watch with a similar two-clock watch face. However, if the two watches do not share a similar watch band, their global visual similarity score is probably too low to rank the candidate watch highly.

A naïve solution is to increase the weighting of the watch-face part so that the resulting score depends more on the watch face. However, this simple method has a shortcoming: the weightings of the different parts are dominated by the users’ stated preferences. For example, if a user does not assign sufficient importance to the watch face, the resulting score may still be low, and the recommended items will not include watches that share a similar watch face but a different watch band from the query item. Therefore, in addition to the weighting scheme over visual part-based features, in this work we propose a mechanism of part-based feature propagation for discovering auxiliary features, named augmented features, which make two items that share a similar part but differ in the remaining parts more similar, improving the relevance of recommended results. In the above example, the proposed feature propagation propagates the feature of the watch face to the watch band to reformulate the visual property of the watch-band part, so the resulting visual similarity of the two watches is increased.

Here we introduce how to formulate the auxiliary features. Assume there are N images, denoted \(I_{1}, I_{2}, ..., I_{N}\), in a category C. Each image \(I_{i}\) is composed of M parts denoted \({I_{i}^{1}}, {I_{i}^{2}}, ..., {I_{i}^{M}}\). The matrix \(X_{i} \in \mathbb {R}^{M \times D}\) represents the features of image \(I_{i}\), and each row \({X_{i}^{k}}\), as defined in Equation (2), is the 1×D feature vector of \({I_{i}^{k}}\). The feature propagation is conducted by a propagation matrix \(P \in \mathbb {R}^{M \times M}\), whose l-th row \(P_{l}\) holds the contributions from the other parts to the l-th part. The augmented feature \(X_{i,aug}\) is then defined as

$$ X_{i,aug} \equiv P X_{i} $$
(3)

Given the initial propagation matrix \(P_{0}\) (the identity matrix, i.e., \(P_{0}(i, i) = 1\)), we seek a better propagation matrix P by solving

$$ f_{P} = \min_{P}\alpha \frac{{\left\Vert P X_{i} \right\Vert}_{F}^{2}}{{\left\Vert P_{0}X_{i} \right\Vert}_{F}^{2}} + (1-\alpha) \frac{{\left\Vert P-P_{0} \right\Vert}_{F}^{2}}{{\left\Vert P_{0} \right\Vert}_{F}^{2}} $$
(4)

The first term prevents propagating too many features from the other parts to the target part, while the second term keeps P close to the original propagation matrix \(P_{0}\). The parameter α balances the two terms. Previous work [26] proved that Equation (4) is a strictly convex, unconstrained quadratic problem. Solving it analytically, the final propagation matrix P can be derived as

$$ P = \alpha_{2} P_{0}(\alpha_{1} X_{i} (X_{i})^{T} + \alpha_{2} I_{M})^{-1} $$
(5)

where \( \alpha _{1} = \frac {\alpha }{ {\left \Vert P_{0} X_{i} \right \Vert }_{F}^{2}}\) and \( \alpha _{2} = \frac {1-\alpha }{{\left \Vert P_{0} \right \Vert }_{F}^{2}} \), and \(I_{M}\) is the M×M identity matrix. Finally, after solving for the propagation matrix P, the augmented feature \(X_{i,aug}\) can be calculated by Equation (3).
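In numpy, the closed-form solution (5) with \(P_{0}\) the identity is only a few lines. The sketch below assumes a single item’s M×D part-feature matrix; note that since \(X_{i} X_{i}^{T}\) is M×M, the identity inside the inverse must be the M×M identity.

```python
import numpy as np

def propagation_matrix(X, alpha):
    """Closed-form minimiser (5) of objective (4), with P0 = identity.
    X is the M x D part-feature matrix of one item; X @ X.T is M x M,
    so the identity inside the inverse is the M x M identity."""
    M = X.shape[0]
    P0 = np.eye(M)
    a1 = alpha / np.linalg.norm(P0 @ X, "fro") ** 2
    a2 = (1 - alpha) / np.linalg.norm(P0, "fro") ** 2
    return a2 * P0 @ np.linalg.inv(a1 * (X @ X.T) + a2 * np.eye(M))

X = np.random.default_rng(0).normal(size=(3, 5))   # M = 3 parts, D = 5 dims
P = propagation_matrix(X, alpha=0.3)
X_aug = P @ X                                      # augmented features, Eq. (3)
```

Setting α = 0 recovers P = P₀ (no propagation), matching the role of the second term in (4).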

4.3 VPOR based recommendation

In this section, we introduce how to apply the proposed VPOR (cf. Section 3) and the augmented features (cf. Section 4.2) to recommend a list of similar commercial items according to users’ personal preferences. Given the target commercial item image \(I_{T}\) and the user’s assigned preferences \(W = \{W^{1}, W^{2}, ..., W^{M}\}\) for the parts of its VPOR, we calculate two visual similarity scores, \(S_{base}\) and \(S_{aug}\), between \(I_{T}\) and the database images \(I_{1}, I_{2}, ..., I_{N}\) as follows

$$ S_{base} = \sum\limits_{k} W^{k} Sim({X^{k}_{T}}, {X^{k}_{i}}) $$
(6)
$$ S_{aug} = \sum\limits_{k} W^{k} Sim(X^{k}_{T,aug}, X^{k}_{i,aug}) $$
(7)

where \({X^{k}_{i}}\) and \(X^{k}_{i,aug}\) represent the original and augmented features of the k-th part of image \(I_{i}\), respectively. Since only a limited number of items can be recommended, the rank of each score matters for fusion. Therefore, we apply the Ordered Weighted Averaging (OWA) strategy [46] to generate the final fused score S, which weights \(S_{base}\) and \(S_{aug}\) according to their ranks, as follows.

$$ S = v^{r_{base}} S_{base} + v^{r_{aug}} S_{aug} $$
(8)

where \(r_{base}\) and \(r_{aug}\) are the ranks of \(S_{base}\) and \(S_{aug}\), respectively, and \(v^{r_{base}}\) and \(v^{r_{aug}}\) are their corresponding weightings, given by the following formula

$$ v^{r} = (\frac{r}{N} )^{\alpha} - (\frac{r-1}{N})^{\alpha} $$
(9)

where N is the number of scores, \(v^{r}\) is the weighting of the r-th ordered score, and α is an adjustable parameter.
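The OWA fusion of Equations (8)–(9) can be sketched as follows (function names are illustrative). The weights \(v^{r}\) telescope to 1 over r = 1, …, N, and for α < 1 the top-ranked score receives the largest weight.

```python
def owa_weight(r, n, alpha):
    """Eq. (9): weight of the r-th ordered score among n scores."""
    return (r / n) ** alpha - ((r - 1) / n) ** alpha

def fuse(s_base, s_aug, r_base, r_aug, n, alpha=0.5):
    """Eq. (8): rank-weighted fusion of the base and augmented similarities."""
    return owa_weight(r_base, n, alpha) * s_base + owa_weight(r_aug, n, alpha) * s_aug
```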

4.4 User interface

When users find a preferred commercial item on an online shopping website, our UbiShop provides a user interface that recommends a list of relevant commercial items according to their personal preferences. As shown in Fig. 7, there are yes/no options for each VPOR part to the right of the target item. Users click the yes/no options to indicate whether they prefer the associated item parts. For example, if a user prefers the watch face of the target watch, he/she can select the watch-face part to express a preference for watches with watch faces similar to the target’s. UbiShop then recommends a list of watches that share the same style of watch face. Users can browse the recommended commercial items, make comparisons, and decide which ones they really want to buy. Moreover, if users are not satisfied with the recommended results or would like to see more related items, they can repeat the above procedure with another preference assignment to obtain a new recommendation list until they find the items they want.

Fig. 7

The user interface for personalized recommendation of UbiShop

5 Experimental results

In this section, we evaluate our proposed recommendation algorithms and analyze the experimental results. We first introduce our dataset in detail in Section 5.1. Then, in Section 5.2, we evaluate the proposed VPOR based commercial item recommendation, which recommends relevant commercial items for a user’s recommendation query, through a user study. The results show not only that the concept of VPOR is helpful for users but also that our approach outperforms text-based approaches and visual-based approaches without VPOR. Finally, in Section 5.3, we describe the influential factors in the VPOR formulation process and explain how we manipulated these factors in the experiments.

5.1 Dataset

We collected a total of 23,854 commercial item images in five categories from Amazon: 4,612 images of helmets, 1,099 images of sports bottles, 494 images of T-shirts, 7,543 images of watches, and 10,106 images of handbags. Each downloaded image corresponds to an actual item sold on Amazon. We chose these five categories for two main reasons. First, these categories are popular on well-known online shopping websites such as Amazon and eBay; to some extent, the number of downloaded images reflects the popularity of the corresponding category. For example, handbags are often the most popular category, especially among female shoppers. Second, for the chosen categories, the appearance of a commercial item can be more influential in buying decisions than its functionality. For example, people consider whether a laptop computer is worth buying because of its computing performance, but tend to care more about visual style when buying clothes and handbags. Sample dataset images are shown in Fig. 8.

Fig. 8

Sample commercial item images of the chosen five categories in our dataset, all collected from Amazon

5.2 User study

5.2.1 Goal

In this experiment, we verify the feasibility of the proposed VPOR based commercial item recommendation approach. Based on our techniques, we examine the following hypotheses:

H1: It is sufficient and convenient for users to find the commercial items they want to buy using existing techniques (e.g., keyword-based search) on online shopping websites.

H2: Different parts of a commercial item contribute unequally to the user’s preference when selecting commercial items of interest.

5.2.2 Participants

Since the feasibility of the proposed approach is subjective, we invited 90 users (49 males and 41 females) to participate in our user study. They are aged 20 to 35; 29 of them are students and the rest are office workers.

5.2.3 Task

The participants were asked to answer a questionnaire containing two parts, part A and part B. The questions in part A are general questions about users’ purchase behavior on online shopping websites, as shown in Table 1 and Fig. 9; part A is designed to verify the hypotheses stated in Section 5.2.1. We would also like to compare our recommended list of visually similar commercial items with the other recommendation lists adopted on online shopping websites. In part B, there are three to five queries for each category. As shown in Fig. 10, each query contains a target commercial item image, its VPOR image, and the corresponding recommendation lists. Each recommendation list has at most 15 commercial item images. Participants label each image “O” or “X” according to the partial visual similarity between the target commercial item and the recommended item. In this experiment, “O” is scored as 1 (similar) and “X” as 0 (dissimilar).

Table 1 Part A of our questionnaire
Fig. 9

The commercial item images and their corresponding VPORs given in the part A of our questionnaire

Fig. 10

The recommending lists given in the part B of our questionnaire

5.2.4 Method

The recommendation lists are generated by three different approaches. The first list, named VPOR recommendation, is generated by our proposed VPOR based commercial item recommendation algorithm; for each recommended list, we use the user’s own answer to QA5 as the personalized preference, which is regarded as the preference assignment for each query. The second list, named specific retrieval, is generated by the visual-based image retrieval method of [26]. The third list, named association rule, is the “customers who viewed this also viewed” recommendation list, generated by text-based association rules [35], from the famous online shopping website Amazon.

5.2.5 Results

The experimental results of Part A are shown in Figs. 11 and 12. As shown in Fig. 11(a) and (b), most of the participants are used to searching for the commercial items they want with traditional keyword-based search methods. However, Fig. 11(c) shows that they would also welcome a more convenient method, such as the proposed UbiShop, if one existed. That is, although H1 is supported for most users, they still desire a more convenient way to find the commercial items they are interested in. Moreover, as shown in Fig. 11(d), people usually regard the partial visual appearance of a commercial item as an important consideration when making buying decisions. As shown in Fig. 12, the opinions of the participants support H2, which is also the main assumption of the proposed UbiShop framework.

Fig. 11

The statistics of Part A

Fig. 12

The statistics of QA5 in Part A

The experimental results of Part B are shown in Fig. 13. The results show that visual-based methods, such as the proposed VPOR-based recommendation, outperform text-based methods, such as the association rules behind Amazon's recommending list. However, Fig. 13(c) shows that the visual-based methods perform worse than the text-based methods beyond the top 10. The main reason is that the T-shirt image dataset is relatively small, containing only 494 images, so there are not enough visually similar commercial items for a query. Moreover, the visual-based methods, especially our VPOR-based recommendation, usually perform better within the top 5 than beyond it, whereas the performance of the text-based methods is almost equal at top-K for any K. This implies that visual-based methods rank visually similar commercial items more precisely than text-based methods. In Fig. 13(e), the association rule performs better than the specific retrieval, but still worse than our VPOR-based recommendation. With the concept of VPOR, the proposed VPOR-based recommendation achieves a 5.67 % to 21.43 % performance gain over the specific retrieval. Referring to Figs. 12 and 13, we find that when users' preferences over the different parts of a commercial item are non-uniform, our VPOR-based recommendation outperforms the specific retrieval by a larger margin; for example, the performance gain for helmet (11.16 %) is larger than that for sports bottle (5.67 %). Another important factor, the quality of the VPOR image, can strongly influence the performance of our VPOR-based recommendation. We take several examples for further discussion in the next section.
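The top-K comparisons above amount to a precision-at-K evaluation over the binary O/X labels. A minimal sketch of that metric (the helper name and the example label sequence are hypothetical, not data from the study):

```python
def precision_at_k(labels, k):
    """Fraction of the top-k recommended items labeled similar (1).

    labels: list of 0/1 judgments for a recommending list, in rank order.
    """
    top = labels[:k]
    return sum(top) / len(top)

# A hypothetical labeled list of 15 recommendations (1 = "O", 0 = "X"),
# with similar items concentrated near the top, as for visual-based methods:
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
p5 = precision_at_k(labels, 5)    # 0.8
p10 = precision_at_k(labels, 10)  # 0.5
```

A precision curve that drops after the top 5, as in this example, reflects the observation above that visual-based methods rank the most similar items earliest, while a flat curve corresponds to the text-based behavior.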

Fig. 13

The experimental results of Part B

5.3 Discussion of VPOR formulation

The objective of VPOR is to decompose a commercial item image into a set of disjoint parts so that users can assign their preferences to specific parts of the item and obtain personalized recommending lists of relevant commercial items. Therefore, the quality of the generated VPOR images directly affects the quality of the recommending lists. For example, if we select the commercial item in Fig. 14(a) as the target item and search for visually similar commercial items, we might not retrieve the commercial item in Fig. 14(b) because of its wrong VPOR image, which may be caused by incorrect shape clustering during VPOR formulation. With a proper VPOR image, as in Fig. 14(c), the retrieved recommending list will usually contain this item. Consequently, generating proper VPOR images for large-scale image databases is crucial for the proposed VPOR-based commercial item recommendation.
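One natural way to combine per-part similarities with user-assigned part preferences is a preference-weighted average. The sketch below illustrates this idea only; the part names, weights, and the weighting formula itself are our assumptions, not the paper's exact formulation:

```python
def vpor_similarity(part_sims, prefs):
    """Preference-weighted similarity over disjoint VPOR parts.

    part_sims: {part_name: similarity in [0, 1]} between the query item
               and a candidate item, computed per part.
    prefs:     {part_name: user-assigned preference weight}.
    Missing parts contribute similarity 0.
    """
    total = sum(prefs.values())
    return sum(prefs[p] * part_sims.get(p, 0.0) for p in prefs) / total

# E.g. a user who cares mostly about the sleeves of a shirt:
sims = {"collar": 0.9, "body": 0.4, "sleeves": 0.8}
prefs = {"collar": 1.0, "body": 1.0, "sleeves": 3.0}
score = vpor_similarity(sims, prefs)  # (0.9 + 0.4 + 3 * 0.8) / 5 ≈ 0.74
```

Under uniform weights this reduces to an ordinary average over parts, which corresponds to the non-personalized "specific retrieval" baseline; non-uniform weights are what let the recommendation reflect partial preferences.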

Fig. 14

Examples of VPOR images. a The query commercial item and its VPOR image. b A similar commercial item with a wrong VPOR image. c A similar commercial item with a proper VPOR image

Although it is hard to evaluate the generated VPOR images quantitatively, since we do not have ground truth for the images in our dataset, we can still show several representative failure examples that lead to incorrect VPOR images during VPOR formulation, and explain how we mitigate these inaccuracies to demonstrate the effectiveness of our method. Referring to the VPOR formulation (cf. Section 3.2 and Fig. 5), there are two main factors, one in each sub-process of VPOR formulation, that can lead to inaccurate VPOR images.

  • If a given target item image is clustered into an improper centroid item image, its VPOR image, which is generated by VPOR transfer from the VPOR image of that centroid, will be transferred incorrectly, as shown in Fig. 15. This error can be reduced by increasing the number of shape clusters, but doing so costs extra human effort in labeling VPORs for the centroids. To balance accuracy and efficiency, we empirically set the number of clusters to between 1.5 % and 2.5 % of the total number of images in each category. In particular, a category whose images have a large variety of shapes, such as handbag, is assigned more centroids during shape clustering, and vice versa.

  • Boundary errors may occur in the process of VPOR transfer. The error is small when the transferred image and its centroid image share a similar shape, and can then be ignored because it contributes little to the recommendation results. However, as the two shapes become more dissimilar, the boundary error grows. Reducing the intra-cluster shape difference therefore also reduces the boundary error effectively. Here we apply the same technique used against incorrect clustering results, i.e., setting the number of clusters to a range relative to the category size (e.g., 1.5 % to 2.5 %), to suppress the boundary error.
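The cluster-count heuristic shared by both items above can be sketched as follows. The interpolation by a shape-diversity score is our assumption: the paper states only the 1.5 %–2.5 % range and that shape-diverse categories (e.g., handbag) receive more centroids.

```python
def num_clusters(n_images, shape_diversity):
    """Pick a shape-cluster count for one category.

    n_images:        total number of images in the category.
    shape_diversity: assumed score in [0, 1]; 0 = uniform shapes,
                     1 = highly varied shapes (e.g., handbag).
    Interpolates the cluster ratio between 1.5 % and 2.5 %.
    """
    ratio = 0.015 + 0.01 * shape_diversity
    return max(1, round(n_images * ratio))

k_tshirt = num_clusters(494, 0.2)    # low shape variety -> 8 clusters
k_handbag = num_clusters(2000, 1.0)  # high variety -> 50 clusters
```

More clusters mean each image sits closer in shape to its centroid, shrinking both clustering and boundary errors, at the cost of more manual VPOR labeling for the centroids.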

Fig. 15

Incorrect clustering results generate incorrect VPOR images. In each sub-figure, the image pair on the left is a given item image and its VPOR image; the given item image was incorrectly clustered into the improper centroid item image shown by the image pair on the right

Samples of VPOR images are shown in Fig. 16. Based on these VPOR images, the proposed VPOR-based commercial item recommendation can achieve its objective of recommending to users a personalized list of visually similar commercial items according to the partial preferences they assign to the query item.

Fig. 16

Samples of VPOR images generated by VPOR transfer

6 Conclusions and future work

In this work, we first conducted a user study to verify the hypothesis that different parts of a commercial item contribute unequally to a user's preference when making buying decisions. Based on this hypothesis, we proposed a novel representation for commercial item images, named Visual Part-based Object Representation (VPOR), in which every commercial item image is decomposed into a set of disjoint parts, each representing a meaningful semantic part. With the VPOR, users can assign non-uniform preferences to the different parts of a chosen commercial item to obtain personalized recommendation results. Moreover, for scalability, we proposed a VPOR-generation scheme for large-scale image databases that requires minimal user intervention. Furthermore, since partial visual appearance is one of the most influential factors affecting people's buying decisions, we presented a part-based feature propagation mechanism for discovering auxiliary features that make two items which share a similar part, but differ in the remaining parts, more similar. The experimental results show that the proposed VPOR-based commercial item recommendation achieves better performance than existing text-based methods (e.g., association rules) and non-VPOR visual-based methods.

There are still limitations in this work. First, automatically generating VPOR images without manual intervention remains a challenging research issue requiring further investigation, cf. the first paragraph of Section 3.2; we will continue our study along this direction to better benefit commercial item recommendation. Second, incorrect shape clustering results may lead to inaccurate VPOR formulation, as shown in Fig. 14. That is, if the shape of an item image is not similar enough to its centroid, the resulting VPOR becomes inaccurate, which directly degrades the recommended lists of visually similar commercial items. To deal with this problem, we will further investigate the influential factors, such as the cluster size and the shape variety across categories, to find a more suitable method that makes the item images within each cluster more similar in shape. In addition, several categories remain inappropriate for the proposed VPOR technique. For example, consumer electronics, such as computers and mobile phones, as well as food products, might not be applicable, since users often care more about functionality than appearance. To support these categories, we will integrate social-network-clue based methods, such as collaborative filtering and association rules, with the VPOR to formulate a more powerful recommendation method. Moreover, a 3D alternative to the 2D silhouette matching is an interesting direction for future research; we will explicitly study automatic 3D model generation and 3D-to-3D shape matching. Finally, we will conduct more extensive user studies to determine a weighting scheme for VPOR parts that appropriately corresponds to users' preferences.