1 Introduction

In recent years, numerous tools have been developed for interior design, redecorating, and remodeling. However, how many of them truly help potential users furnish their homes remains an open question. TECHGYD.COM (Footnote 1) conducted a survey of the Top 10 best interior design apps on the Google Play Store and the Apple App Store, and Table 1 compares these 10 apps in terms of four functionalities: 1) Whether the app can inspire users by providing or recommending photos of decorated rooms, or allows users to upload their own interior design photos to share inspiration with others. 2) Whether the app allows users to imagine how a room would look by pasting 2D images of furniture onto a 2D image of the virtual room, and/or by editing the texture/color of the 2D furniture, the walls, the floor, and the ceiling of the room. 3) Whether the app allows users to imagine how the house would look by inserting 3D objects into a 3D virtual room, and/or by editing the texture/color of the 3D furniture models and the 3D room models. 4) Whether the app allows users to browse the decorated room in Virtual Reality (VR) mode or from a free perspective.

Table 1 The functionalities of the Top 10 Best Interior Design Apps reported by TECHGYD.COM

To understand users’ practical requirements for interior design and whether they are satisfied with the existing interior design apps mentioned above, we further conducted semi-structured interviews [10] with twenty Asian people aged from 20 to 40. From the interviews, we found that when creating a virtual indoor scene or doing interior design, most users consider the style compatibility of all furniture within a given space to achieve a pleasant overall appearance. Moreover, Virtual Reality (VR) or Augmented Reality (AR) technologies are indispensable for giving the user a better sense of the visual appearance of the space when adjusting furniture. However, many users lack inspiration for selecting compatible furniture items from different classes to make the overall space pleasant and harmonious. In most 3D content editors, users have to spend a lot of time searching for and evaluating suitable and/or compatible objects to compose a 3D virtual scene. Therefore, we aim to develop an interior design app with compatible furniture recommendation to help people remodel or decorate their indoor spaces more efficiently.

Two main challenges arise in our work: 1) Style compatibility is a high-level semantic concept that is difficult to measure precisely because the perception of harmony/compatibility may vary among users. There have been many works on 3D object retrieval, which measure the similarity between 3D models based on shape, color, and/or texture features [16, 29, 39]. However, finding a proper feature descriptor and similarity metric for comparing the style compatibility of two objects is still an open problem. 2) Two objects with very different geometric appearances might still look great when placed in the same scene. Therefore, rather than modeling “style similarity” based on geometric features, Liu et al. [28] proposed to model “style compatibility” to better measure stylistic harmony. Figure 1 illustrates the difference between geometry-based “style similarity” (Fig. 1a) and “style compatibility” (Fig. 1b). The bed and the dresser are stylistically similar because the corresponding structural/geometric elements (highlighted by the stars) have similar geometric features. However, it is difficult to find corresponding structural/geometric elements for a sofa and a lamp. In this case, we should model “style compatibility” rather than just global or local geometric similarity.

Fig. 1

Examples of geometry-based “style similarity” and “style compatibility”. a The bed and the dresser are stylistically similar because the corresponding structural/geometric elements (highlighted by the stars) have similar geometric features [30]. b The sofa and the lamp are considered compatible even though we cannot find exactly corresponding geometric elements between them

In this work, we try to develop a furniture compatibility recommendation method which can overcome the above two challenges. The main contributions of this work are summarized as follows:

  • Instead of using conventional metric learning methods, we utilize a Triplet CNN to measure style compatibility between 3D furniture models, even when the models belong to different classes and have no corresponding structural/geometric elements.

  • We collected a dataset containing 420 textured 3D furniture models. Taking advantage of crowdsourcing, we recruited a group of raters from Amazon Mechanical Turk (AMT) to evaluate the comparative suitability of paired models within the collected dataset. The dataset and the evaluation results were used to train and validate the proposed style compatibility measuring model. Some existing datasets of 3D furniture models have been used in Lun’s [30] and Liu’s [28] works; however, these datasets do not include texture information, which is also important for judging style compatibility.

  • We further develop a furniture recommendation system based on the proposed style compatibility measuring method to help users intuitively design their own scene/room (cf. Fig. 2). The user can use a smartphone to take a photo of a furniture item in the current indoor space and upload it to the server for further analysis. Given the uploaded photo, Faster RCNN [35] is applied to detect and classify the furniture item into a furniture class Ci, and our furniture recommendation method outputs style-compatible furniture items of the other classes Cj (j ≠ i) based on the trained Triplet CNN model. In previous works [9, 28, 30], compatibility is only evaluated for a pair of furniture items belonging to two different classes, e.g., a table and a chair, or a sofa and a coffee table. In contrast, our recommendation system provides two recommendation modes: (1) Given a furniture item of class Ci, we generate the Top 3 compatible furniture items for each class Cj (j ≠ i), and the user can choose furniture items from the recommendation lists by himself/herself. (2) Given a furniture item of class Ci, we further find the best combination of furniture items from all Top 3 lists according to the overall style compatibility score. Finally, a set of furniture items (either selected from the recommendation lists by the user or generated automatically by the system) is rendered in a virtual scene for the user to browse and adjust further.

Fig. 2

The framework of our furniture recommendation system based on the proposed style compatibility measuring method

The remainder of this paper is organized as follows: Section 2 introduces relevant literature, and Section 3 describes how we collected crowdsourcing responses on furniture compatibility. The methodology of our work is presented in Section 4. Section 5 shows the experimental results and discusses the findings of this work. Conclusions and future work are given in Section 6.

2 Related work

2.1 Recommendation system

With the advance of multimedia technology, more and more recommendation tools have been proposed to provide convenient personalization systems. For example, Cheng et al. [8] developed a location-aware music recommendation system, called Just-for-Me, which integrates both the location context and global music popularity trends to enable more accurate and robust music recommendation. Chang et al. [5] proposed a photo recommendation system that selects representative photos for restaurants from blog-based restaurant photos. Mei et al. [33] demonstrated a video-driven recommender called VideoReach, which recommends a list of the most relevant videos according to a user’s current viewing without requiring his/her profile. Wang et al. [37] presented a 3D model recommendation system for virtual house furnishing that collaboratively filters the user’s browsing history, physiological information, and furniture content information. The motivation of Wang’s work is similar to ours; however, a large number of personal user logs have to be collected over a period of time to mine the user’s preference. Moreover, their visual content analysis was not effective enough for measuring furniture compatibility.

2.2 Computer aided interior design

Researchers in the fields of computer vision and computer graphics have devoted considerable effort to facilitating the process of interior design, redecorating, and remodeling. In addition to the apps listed in Table 1, some other works focused on reconstructing the 3D scene of a room so that the user can place virtual furniture into the room with fewer artifacts [12, 15, 22, 43]. For example, Fukano et al. [12] proposed a method for understanding a room from a single spherical image, while Zhang et al. [43] introduced a whole-room context model for scene understanding in 360° full-view panoramas. Izadinia et al. [22] developed a system to automatically reconstruct a 3D CAD model as similar as possible to the real scene of a single photograph. Intelligent 3D object arrangement is another important functionality for a convenient interior design system, and different methodologies have been proposed in the past decade. Given a few user-provided examples of preferred arrangements, Fisher et al. [11] synthesized a diverse set of plausible new scenes by learning from a larger scene database. Yu et al. [41] and Merrell et al. [32] optimized furniture arrangements in a given space by defining cost functions according to interior design guidelines. On the other hand, some studies focused on style transformation of indoor objects. For example, Zhu et al. [44] presented a data-driven approach that colorizes 3D furniture models and indoor scenes by leveraging indoor images on the Internet; given an indoor scene, the system can recommend a colorization scheme that is consistent with a user-desired color theme. Chen et al. [7] proposed to transform a 3D room into another one that resembles the style of a given 2D photo in terms of layout and color. Chen et al. [6] developed a system that automatically generates material suggestions for 3D indoor scenes, where local material rules and global aesthetic rules describe typical material patterns and account for the harmony among the entire set of colors, respectively. However, none of the above works can recommend compatible furniture items for the user.

2.3 Style analysis and metric learning for 3D models

Style analysis has been studied through different methodologies in the past decade. Xu et al. [40] presented a style-content separation method that analyzes objects at the part level and treats anisotropic part scales as a shape style. Yumer et al. [42] proposed a shape editing method in which the user can create geometric deformations based on a set of semantic attributes; in other words, manual geometric manipulation is not essential. The works listed above suggested different strategies for analyzing the style of 3D models within an object class; however, these strategies cannot be employed to precisely measure the style of 3D models across different object classes.

On the other hand, a number of researchers have investigated style similarity metric learning for both 2D content and 3D shapes. For 2D content, style similarity metric learning has been used to compute style-based distances in applications such as 2D clip art [13, 14], image classification [31], font selection [34], and infographics [36]. For 3D shapes, Huang et al. [19] applied a graph-based semi-supervised classification technique to generate the final classification and jointly learned a distance metric for each class, which captures the underlying geometric similarity within that class. Inspired by the art history literature, Lun et al. [30] measured 3D style similarity based on the similarity between geometric elements. Dev et al. [9] improved geometric similarity metrics by considering the color and texture of 3D shapes, and developed user-guided metric learning of style similarity.

The above works mainly considered geometric features (e.g., curvatures, shape diameter, and surface areas) and computed a style-based distance from the geometric similarity between specific pairs of 3D objects, especially pairs with corresponding geometric elements. Very differently from these methods, our work measures the style compatibility between all paired objects of different classes within a scene, which is not limited to geometric similarity.

Our work is related to Liu’s work [28], in which the researchers introduced a part-aware geometric feature vector and a new asymmetric embedding distance metric to estimate the style compatibility between specific paired objects of different classes. In contrast, we utilize a deep learning method to measure style compatibility between 3D furniture models.

2.4 Deep metric learning

In earlier works, most researchers attempted to learn and compute a distance function among given objects [24], which is known as the conventional metric learning problem. To find a suitable data representation for distance calculation, some researchers proposed subspace learning algorithms to reduce the semantic gap between low-level visual features and high-level semantics [27]. However, the principal shortcoming of conventional metric learning is that the feature representation of the data and the distance metric are not learned jointly. In recent years, researchers have tended to explore data with deep learning techniques and avoid defining representative features in advance. For example, Li et al. [25] presented a weakly-supervised deep distance metric learning method for social image retrieval that exploits knowledge from community-contributed images associated with user-provided tags. In addition, Li et al. [26] demonstrated a Weakly-supervised Deep Matrix Factorization (WDMF) framework to collaboratively explore the heterogeneous data of social images.

CNNs have proven effective for visual data analysis, and more and more frameworks have been developed based on the CNN architecture. For example, the Siamese CNN and the Triplet CNN were introduced to learn the feature representation and the distance jointly from similar and dissimilar data. The Siamese CNN is commonly used to train on pairs of images and measure their visual similarity. For instance, Bell et al. [3] employed a Siamese CNN for visual search applications such as finding stylistically similar products and identifying products in a scene, whereas Bansal et al. [2] applied a Siamese CNN to style estimation of 3D models from an image and predicted surface normals. Hoffer et al. [17] proposed the first triplet network model, which aims to learn useful representations through distance comparisons. Later, Balntas et al. [1] proposed a framework that trains the network with positive and negative pairs in triplet form, introduced a new loss function called SoftPN, and focused on learning local descriptors for matching image patches. Guo et al. [16] used a triplet network for multi-view 3D object retrieval, in which the deep convolutional network is jointly supervised by a classification loss and a triplet loss. However, no study has particularly focused on cross-class style compatibility for 3D furniture models based on a triplet network.

3 Crowdsourcing responses collection

The concept of crowdsourcing has been widely used to collect ground truth data for supervised machine learning tasks. Liu et al. [28] used crowdsourcing data to measure the style compatibility of 3D furniture models; however, their dataset does not include texture information. In this work, we collect a set of 3D furniture models with texture information and take advantage of crowdsourcing to determine/label the compatibility of paired furniture items. We collected 22,960 3D models of 7 furniture classes from ShapeNet [4], and Fig. 3 shows some examples of each class. These 7 furniture classes are commonly placed in indoor spaces and are structurally different from one another. The number of models in each class is shown in Table 2. We manually selected 48 distinctive models from each class for training.

Fig. 3

The complete set of training models, which were selected from ShapeNet. We consider 7 classes of furniture and manually selected 48 distinctive models from each class for training

Table 2 The number of models in each class

We gather the crowdsourcing data in the triplet form (A, B, C), where objects B and C come from the same class and object A comes from another class. Each triplet represents that the reference object A is more compatible with object B than with object C. Ideally, we would sample a furniture item A from a furniture class Ci as the reference, sample two furniture items B and C from another class Cj to compose a triplet, and then ask a rater to decide whether B or C is the more compatible one. However, this would produce an enormous number of triplets, and it would be too time-consuming to evaluate the compatibility of all of them. Therefore, we adopted the grid technique proposed by Wilber et al. [38] to collect triplet responses. That is, given a reference object A from a furniture class Ci and 6 candidates from another furniture class Cj, a rater completes a task by selecting the two objects among the 6 candidates that are most compatible with the reference A (as shown in Fig. 4a). As a result, 8 triplets can be generated when the rater completes a single task. However, not all of these 8 triplets are trustworthy, since sometimes more than two of the 6 candidates are compatible and the rater may be unsure which two to select. Hence, we let multiple raters complete each task and apply a voting strategy to decide the top 2 compatible objects with respect to the reference A. We set two voting thresholds, T1 and T2, to determine whether a candidate is compatible, incompatible, or unsure with respect to the reference object. A triplet (A, B, C) is generated when the number of votes for B is larger than T1 and the number of votes for C is less than T2. We employed Human Intelligence Tasks (HITs) to evaluate style compatibility with a wide range of raters recruited from Amazon Mechanical Turk (AMT). 50 raters were recruited to conduct our tasks, and each furniture item is represented in GIF format, which displays 15 different views of the furniture. Figure 4b shows the 15 views of an example furniture item.
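To make the voting scheme concrete, the following is a minimal Python sketch of how triplets could be derived from the vote tallies of one grid task. The function and variable names are illustrative, and the default thresholds correspond to the values examined in Section 5.1.

```python
from itertools import product

def triplets_from_votes(reference, votes, t1=25, t2=15):
    """Derive triplets (A, B, C) from the vote tallies of a single grid task.

    `votes` maps each of the 6 candidate IDs to the number of raters who picked
    it as compatible with `reference`. Candidates with more than t1 votes are
    treated as compatible (B), those with fewer than t2 votes as incompatible (C),
    and the remaining candidates are considered unsure and ignored.
    """
    compatible = [c for c, v in votes.items() if v > t1]
    incompatible = [c for c, v in votes.items() if v < t2]
    return [(reference, b, c) for b, c in product(compatible, incompatible)]

# Hypothetical tallies from 50 raters for one reference item and 6 candidates.
votes = {"sofa_01": 31, "sofa_02": 27, "sofa_03": 19,
         "sofa_04": 12, "sofa_05": 8, "sofa_06": 3}
print(triplets_from_votes("chair_17", votes))
```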

Fig. 4

a An example of a crowdsourcing task. The upper object is the reference, whereas the six candidate objects are placed in the bottom part of the question. According to the reference object, the raters are asked to select the two most compatible objects out of the six candidates. b The 15 views of an example furniture item. Views (1) to (15) are regularly projected at 15 different rotation angles with a horizontal interval of 15°

4 Style compatibility learning based on cross-class triplet CNN

Hand-crafted features such as curvatures, shape diameter, and surface areas have been commonly used to measure the geometric similarity between 3D models. Compared with geometric similarity, style compatibility is a high-level semantic concept that is much harder to describe precisely with conventional hand-crafted features. Recently, Deep Learning (DL) methods have been proposed to find representative features for a variety of tasks, and their success in improving accuracy inspires us to apply DL to the style compatibility problem. In this section, we introduce the Triplet CNN and describe how it is applied to extract representative features for measuring style compatibility between 3D models.

4.1 Triplet CNN

The Triplet CNN consists of three identical CNNs trained with triplet inputs. Figure 5 illustrates the triplet network architecture with a triplet input sample. Each triplet (sr, sp, sn) represents that the reference sample sr is more compatible with sp than with sn. The difference between the basic triplet CNN and the cross-class triplet CNN is shown in Fig. 6. The symbol p denotes positive data that are compatible with the reference data, while the symbol n denotes negative data that are incompatible with the reference data. In Fig. 6b, each class indicates a different furniture category; for example, Class 1 might be “table” and Class 2 might be “chair”. The following loss function is commonly used for training a Triplet CNN:

$$ L(s_{r}, s_{p}, s_{n}) = \max \left( 0, 1- \frac{D_{n}}{D_{p} + m} \right), $$
(1)

where $D_{p} = \|f(s_{r}) - f(s_{p})\|$ is the distance between two compatible models, $D_{n} = \|f(s_{r}) - f(s_{n})\|$ is the distance between two non-compatible models, $f(s)$ is the feature vector extracted from a model $s$, and $m$ is a margin that defines the minimum required difference between $D_{n}$ and $D_{p}$. If the difference between $D_{n}$ and $D_{p}$ is larger than the margin, the two non-compatible models already have a significantly larger distance than the two compatible models, and the loss of this triplet can be considered 0. Otherwise, the loss function minimizes the distance between compatible models and maximizes the distance between non-compatible ones. In Section 5.4, we investigate the influence of applying different loss functions, and the following loss function proves to have the best performance:

$$ L(s_{r}, s_{p}, s_{n}) = \max (0, m + D_{p} - D_{n}). $$
(2)

As shown in Fig. 7, if the difference between $D_{n}$ and $D_{p}$ is larger than the margin, the two non-compatible models already have a significantly larger distance than the two compatible models. In that case, $(m + D_{p} - D_{n})$ is negative, and the loss of this triplet is considered 0. Otherwise, $(m + D_{p} - D_{n})$ is positive, which indicates that the loss of this triplet is large. The max function achieves both of these objectives.
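The following NumPy sketch illustrates the loss of Eq. (2) for a single triplet of embeddings; the embedding function is assumed to be the trained CNN, and the default margin matches the value used in Section 5.

```python
import numpy as np

def triplet_margin_loss(f_r, f_p, f_n, m=0.2):
    """Eq. (2): max(0, m + D_p - D_n) for one triplet of embedding vectors."""
    d_p = np.linalg.norm(f_r - f_p)  # distance to the compatible (positive) model
    d_n = np.linalg.norm(f_r - f_n)  # distance to the incompatible (negative) model
    return max(0.0, m + d_p - d_n)
```

In practice, the loss is averaged over a mini-batch of triplets and back-propagated through the three weight-sharing CNN branches.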

Fig. 5

The triplet network architecture with a triplet input sample

Fig. 6

a The basic triplet network. b Our proposed cross-class triplet network

Fig. 7

Illustration of the loss function used in this work

4.2 View selection

To learn the style compatibility metric with a CNN, each 3D furniture model is first projected onto a 2D viewing plane, and the corresponding projected 2D image is taken as the input of the Triplet CNN. With the trained model, the style compatibility between two furniture models can be directly measured by the Euclidean distance between the extracted feature vectors. However, a single view of a 3D model may not comprehensively represent the overall appearance of the furniture. Hence, we examine whether using multiple views instead of a single view image results in better prediction performance. Four view selection strategies are investigated in this work:

  • Using a single default view (i.e. view (1) as shown in Fig. 4b).

  • Using a single view which is voted as the best view for representing the furniture by users of the crowdsourcing platform (i.e. view (3) as shown in Fig. 4b).

  • Using a single view which contains the most details of the 3D model, i.e. has the highest color entropy [21].

  • Using all of the 15 views shown in Fig. 4b to obtain the prediction result (a minimal sketch of the resulting multi-view distance is given after Eq. (3)).

It would be intuitive to compare the compatibility of two 3D models directly from their 3D geometry. However, a deep-learning-based framework cannot be directly applied to 3D meshes because of the unknown order of vertices (i.e., the topology) in the data representation. An alternative is to consider the visual features of 2D projections from different views when evaluating the compatibility between 3D models. The stylistic distance $D$ between two furniture models $x_{A}$ and $x_{B}$ is then defined by summing the feature distance over all views:

$$ D(x_{A}, x_{B})= \sum\limits_{i = 1}^{15} \| f(v_{i}(x_{A}))-f(v_{i}(x_{B}))\|_{2}, $$
(3)

where $f(s)$ is the feature vector extracted from an image view $s$ and $v_{i}(x)$ is the $i$th image view of the 3D model $x$.
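As a minimal sketch of Eq. (3), assuming the per-view CNN embeddings have already been extracted into arrays of one row per view:

```python
import numpy as np

def multiview_distance(feats_a, feats_b):
    """Stylistic distance of Eq. (3): the sum of per-view L2 feature distances.

    feats_a, feats_b: arrays of shape (15, d), one d-dimensional embedding
    per projected view of each 3D furniture model.
    """
    return float(np.sum(np.linalg.norm(feats_a - feats_b, axis=1)))
```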

In the experiments, we will show that using multiple views achieves better compatibility prediction results. Moreover, we will investigate the influence of applying feature dimension reduction to the output features obtained by the trained CNN.

5 Experimental results

Our Triplet network was built with Caffe [23], a commonly used deep learning framework, and all experiments were run on a 64-bit Windows 7 PC with a GeForce GTX 1060 GPU and 6 GB of RAM. CaffeNet was used as the CNN module in our Triplet network (cf. Fig. 5), and the ImageNet dataset [20] was used to pre-train the Triplet network. Finally, the furniture compatibility ground truth collected from AMT was used to fine-tune the weights of the Triplet network. The base learning rate was set to 0.0001 for the lower layers, while the learning rate was multiplied by 10 in the last two fully-connected layers to speed up the adjustment of their weights. The weight decay was fixed to 0.0005, and the margin m was set to 0.2. As for the batch size, we set it to 162 to take as many triplets into account as possible in one iteration.
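For readers who do not use Caffe, the layer-wise learning rate scheme described above can be approximated as follows. This is a hedged PyTorch analogue, not the original configuration: AlexNet stands in for CaffeNet, and the momentum value is an additional assumption not specified in the text.

```python
import torch
import torchvision

# AlexNet stands in for CaffeNet here; classifier[4] and classifier[6] are its
# last two fully connected layers.
net = torchvision.models.alexnet(weights=None)
head_params = list(net.classifier[4].parameters()) + list(net.classifier[6].parameters())
head_ids = {id(p) for p in head_params}
base_params = [p for p in net.parameters() if id(p) not in head_ids]

optimizer = torch.optim.SGD(
    [{"params": base_params, "lr": 1e-4},   # base learning rate for lower layers
     {"params": head_params, "lr": 1e-3}],  # 10x rate for the last two FC layers
    momentum=0.9,                           # assumed; not specified in the text
    weight_decay=5e-4)
```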

The collected crowdsourcing data was divided into a training set (containing 13,788 triplets related to 336 models randomly selected from the 7 furniture classes) and a testing set (containing 3,685 triplets related to another 84 furniture models randomly selected from the 7 furniture classes). In the training phase, each image was first resized to 256×256 pixels. To increase the amount of training data and avoid overfitting, each image was mirrored horizontally, and both the original and the mirrored images were randomly cropped 5 times each (i.e., 10 cropped images of size 224×224 pixels were generated for each input image). In the testing phase, each image was resized to 256×256 pixels and then cropped symmetrically to 224×224 pixels before extracting the CNN features.
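A minimal sketch of the described training-time augmentation, using Pillow; the file path and sizes are illustrative.

```python
import random
from PIL import Image, ImageOps

def ten_crops(path, out_size=224):
    """Resize to 256x256, mirror horizontally, and take 5 random crops of both
    the original and the mirrored image (10 crops per input image)."""
    img = Image.open(path).convert("RGB").resize((256, 256))
    crops = []
    for version in (img, ImageOps.mirror(img)):
        for _ in range(5):
            x = random.randint(0, 256 - out_size)
            y = random.randint(0, 256 - out_size)
            crops.append(version.crop((x, y, x + out_size, y + out_size)))
    return crops
```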

For each testing triplet (A, B, C), we measured the style compatibility of the pairs (A, B) and (A, C) based on the Euclidean distance between the extracted feature vectors. The smaller the distance, the more compatible the pair. If the style compatibility ordering of the triplet (A, B, C) matches the crowdsourcing preference, the triplet is counted as a correct prediction. In the following subsections, we investigate how different aspects influence the overall correct prediction rate and discuss the findings of this work.
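The evaluation criterion can be summarized by the short sketch below, where `distance` is any learned compatibility distance, e.g., the multi-view distance of Eq. (3); the names are illustrative.

```python
def triplet_accuracy(test_triplets, distance):
    """Fraction of test triplets (A, B, C) for which the learned distance agrees
    with the crowdsourced preference, i.e., distance(A, B) < distance(A, C)."""
    correct = sum(1 for a, b, c in test_triplets if distance(a, b) < distance(a, c))
    return correct / len(test_triplets)
```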

5.1 Influence of crowdsourcing selection thresholds

As mentioned in Section 3, the voting thresholds T1 and T2 are used to determine whether a candidate furniture item is compatible, incompatible, or unsure with respect to the reference object. In this experiment, we examine whether stricter thresholds affect the accuracy, and we empirically determine a proper setting of the two thresholds. Figure 8 compares the results of using two different training sets generated by setting T1 = 25, T2 = 15 and T1 = T2 = 20. With stricter thresholds, unreliable crowdsourcing responses can be filtered out, and the experiment shows that the results are effectively improved by using the more reliable dataset.

Fig. 8

The prediction accuracy of using different training triplet sets generated by setting T1 = 25, T2 = 15 (blue curve) and T1 = T2 = 20 (red curve)

5.2 Influence of feature dimension reduction

The output of CaffeNet is a 4096-dimensional feature vector, which might contain redundant features for measuring style compatibility. We therefore investigated the influence of reducing the feature dimension of the CNN output layer. Figure 9 compares the prediction accuracy of (1) directly using the 4096-dimensional feature vector obtained from CaffeNet, (2) adding an additional fully connected layer containing N nodes after the CNN output layer, and (3) applying PCA to reduce the 4096-dimensional CNN output to N features. In this work, we empirically set N = 6, which achieves the best prediction accuracy. As shown in Fig. 9, adding an additional fully connected layer containing 6 nodes after the CNN output layer significantly improves the prediction results. Compared to PCA, adding a fully connected layer is an end-to-end training procedure that jointly optimizes the original feature weights and the feature reduction weights. In the following experiments, we used CaffeNet with an additional fully connected layer containing 6 nodes as our network architecture. Note that the prediction performance reported in Fig. 9 is based on the multi-view view selection strategy, which is validated as the best approach for predicting style compatibility in Section 5.4.
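The two reduction strategies compared in Fig. 9 can be sketched as follows; the feature data here are placeholders, and PyTorch/scikit-learn stand in for the original Caffe implementation.

```python
import numpy as np
import torch.nn as nn
from sklearn.decomposition import PCA

feats = np.random.rand(1000, 4096).astype(np.float32)  # placeholder CNN outputs

# Strategy (3): post-hoc PCA from 4096 to N = 6 dimensions.
reduced = PCA(n_components=6).fit_transform(feats)

# Strategy (2): a trainable 4096 -> 6 fully connected layer appended to the
# network, so the reduction is learned jointly with the embedding (end-to-end).
projection = nn.Linear(4096, 6)
```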

Fig. 9

The prediction accuracy of 1) directly using the 4096-dimensional feature vector, 2) adding an additional fully connected layer, and 3) applying PCA

5.3 Comparison with deep and shallow CNN models

We examined three CNN models (CaffeNet, AlexNet, and VGG19) for constructing the triplet CNN. An additional fully connected layer containing 6 nodes was added after the output layer of each CNN model. For AlexNet, whose architecture is similar to CaffeNet, the base learning rate was set to 0.0001 for the lower layers, while the learning rate was multiplied by 10 in the last two fully-connected layers to speed up the adjustment of their weights; the weight decay was fixed to 0.0005, and the batch size was set to 162 to take as many triplets into account as possible in one iteration. For VGG19, the learning rate and weight decay were the same as for the other two networks, and the only difference is that the batch size was set to 30. Figure 10 shows the compatibility prediction accuracy of these three CNN architectures. In this experiment, we found that the shallower CNN models obtain better accuracy than the deep CNN model, which suggests that a shallower CNN might be closer to how humans perceive style compatibility. Since CaffeNet and AlexNet have similar performance, we chose CaffeNet to evaluate our proposed method in the following experiments.

Fig. 10

The prediction accuracy of different CNN models

5.4 Loss function and testing view selection

For training the neural network, a loss function should be properly designed to measure the discrepancy between the desired output and the prediction of the network. Table 3 compares the results of training the CNN models with four different loss functions and testing with the four view selection strategies (cf. Section 4.2). The results indicate that applying loss function (4) to train the Triplet network and considering all of the 15 views in the testing phase achieves the best prediction accuracy (89.77%). Given a certain view of a 3D model, the human brain can imagine the occluded parts from previous experience, but it is difficult for the machine to make up for the unknown information. Besides, when only a single view is considered to measure the style compatibility between two furniture models, using the crowdsourcing-voted view yields better prediction performance than the other two single-view methods in most cases. This implies that we do not have to take extra effort to compute the entropy of all the views as suggested in [21]. In the following experiments, we applied loss function (4) to train our Triplet networks and measured style compatibility by considering all of the 15 views. Moreover, we further investigated a suitable margin m for loss function (4). Figure 11 shows that when the margin m is set to 0.2, the model achieves the highest accuracy.

Fig. 11

The prediction accuracy for different values of the margin m

Table 3 Accuracy (%) of training the CNN models with different loss functions and testing with different view selection methods

5.5 Comparison with hand-crafted features

We further compared our deep metric learning method with the two methods proposed by Liu et al. [28] and Lun et al. [30], which use hand-crafted geometric features to learn the compatibility metric. We applied our method to the two datasets used by Liu and Lun, which do not have texture/color information. In Liu’s experiment, the test set was formed by considering only the triplets on which raters had strong agreement, and their evaluation result was obtained as follows: for each testing triplet (A, B, C), they checked whether model A, B, or C appears in any triplet of the training set; any such training triplet was removed, and the remaining triplets were used to learn a new compatibility metric for this testing triplet. Therefore, given K testing triplets, K compatibility metrics need to be trained, which is time-consuming. In our experiment, we instead directly divided their dataset into a training set and a testing set with no overlapping models between the two sets. For the living room scene, 2,800 triplets were used for training and 180 triplets for testing. For the dining room scene, 920 triplets were used for training and 63 triplets for testing. As shown in Table 4, our method achieves higher accuracy than Liu’s. Lun’s dataset consists of 7 structurally diverse categories: buildings, furniture, lamps, coffee sets, architectural columns (pillars), cutlery, and dishes. We applied our method only to the furniture set and evaluated the performance with the same 10-fold cross-validation protocol they used. Table 4 shows that our proposed method also achieves higher accuracy than Lun’s method. These results imply that our method can also be applied to 3D models without texture and color information.

Table 4 The accuracy of our proposed method in comparison with Liu’s [28] and Lun’s [30] method

5.6 Application on furniture recommendation

To investigate the feasibility of the proposed method, we further developed a furniture recommendation system that finds a compatible furniture set within a scene. Our dataset contains the 7 furniture classes mentioned in Section 3. When a model is selected as the reference furniture, we compute the distances between the reference and all models from the target classes to find the 3 most compatible furniture items for each target class. Here, all classes except the reference one are used as target classes. Thus, we could obtain 36 combinations, each consisting of 1 reference and 6 target furniture items, one from each class. For each combination, we measure the overall compatibility as the sum of the distances between all pairs of the 7 furniture items and recommend the combination with the minimum distance. Figure 12 shows six results generated by our system: two results with minimum distances ((a) and (b)), two results with maximum distances ((c) and (d)), and two results randomly generated by our system ((e) and (f)). Among them, the sofa is the reference furniture in Fig. 12a, c, and e, and the chair is the reference furniture in Fig. 12b, d, and f. Note that we manually arranged the placement of each furniture item to facilitate the comparison; for automatic arrangement, details can be found in our previous work [18]. In Fig. 12a, the colors of the furniture items belong to the same color tone, and the shapes and textures are simple and modern. In Fig. 12b, all the furniture items have harmonious hues, and most of them have thin, long legs. In contrast, in Fig. 12c and d, it is difficult to find corresponding structural elements among the furniture items, and even the color harmony of the scene is imperfect.
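The second recommendation mode can be sketched as follows; `top3_by_class` maps each target class to its Top-3 candidate list and `distance` is the learned compatibility distance. The names are illustrative, and this sketch simply enumerates one candidate per class, which may differ from the exact enumeration used in our system.

```python
from itertools import combinations, product

def recommend_best_set(reference, top3_by_class, distance):
    """Pick one candidate from each target class's Top-3 list so that the sum of
    pairwise distances over the whole set (reference included) is minimal."""
    best_choice, best_cost = None, float("inf")
    for choice in product(*top3_by_class.values()):
        items = (reference,) + choice
        cost = sum(distance(a, b) for a, b in combinations(items, 2))
        if cost < best_cost:
            best_choice, best_cost = choice, cost
    return best_choice, best_cost
```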

Fig. 12

Examples of furniture recommendation results generated by our system: two results with minimum distances ((a) and (b)), two results with maximum distances ((c) and (d)), and two results randomly generated by our system ((e) and (f))

6 Conclusions and future work

In this paper, a method for measuring cross-class 3D furniture style compatibility is proposed. The method is based on a Triplet CNN, which extracts representative features to describe the style compatibility between paired 3D furniture models of different object classes. A dataset containing 420 textured 3D furniture models was collected, and a group of raters recruited from Amazon Mechanical Turk (AMT) evaluated the comparative suitability of paired models within the collected dataset. We trained and evaluated the proposed method on three datasets, including the collected textured dataset and two non-textured datasets. The experimental results indicate that our method outperforms the state-of-the-art works, which learn a metric from pre-extracted geometric features. Furthermore, the proposed method was used to develop a furniture recommendation system that helps potential users with interior design. In the future, we expect to improve the reliability of the trained model by enlarging our dataset. In addition, we would like to apply the proposed method to build a virtual reality interior design tool so that users can directly experience the designed house with our recommendation results.