
Embedded heterogeneous feature selection for conjoint analysis: A SVM approach using L1 penalty


Abstract

This paper presents a novel embedded feature selection approach for Support Vector Machines (SVM) in a choice-based conjoint context. We extend the L1-SVM formulation and adapt the RFE-SVM algorithm to conjoint analysis to encourage sparsity in consumer preferences. This sparsity can be attributed to consumers being selective about the attributes they consider when evaluating alternatives in choice tasks. Given the limited individual-level data in choice-based conjoint, we control for heterogeneity by pooling information across consumers and shrinking the individual weights of the relevant attributes towards a population mean. We tested our approach through an extensive simulation study, which shows that it can capture the sparseness implied by irrelevant attributes. We also illustrate the characteristics and use of our approach on two real-world choice-based conjoint data sets. The results show that the proposed method has better predictive accuracy than competing approaches and provides additional information at the individual level. Implications for product design decisions are discussed.


References

  1. Abernethy J, Evgeniou T, Toubia O, Vert J (2008) Eliciting consumer preferences using robust adaptive choice questionnaires. IEEE Trans Knowl Data Eng 20(2):145–155

  2. Argyriou A, Evgeniou T, Pontil M (2008) Convex multi-task feature learning. Mach Learn 73(3):243–272

  3. Arora N, Huber J (2001) Improving parameter estimates and model prediction by aggregate customization in choice experiments. J Consum Res 28:273–283

  4. Bi J, Bennett K, Embrechts M, Breneman C, Song M (2003) Dimensionality reduction via sparse support vector machines. J Mach Learn Res 3:1229–1243

  5. Blum A, Langley P (1997) Selection of relevant features and examples in machine learning. Artif Intell 97(1-2):245–271

  6. Bradley P, Mangasarian O (1998) Feature selection via concave minimization and support vector machines. In: Shavlik J (ed) Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98), Morgan Kaufmann, San Francisco, California, pp 82–90

  7. Cerrada M, Sánchez RV, Pacheco F, Cabrera D, Zurita G, Li C (2016) Hierarchical feature selection based on relative dependency for gear fault diagnosis. Appl Intell 44(3):687–703

  8. Chapelle O, Harchaoui Z (2005) A machine learning approach to conjoint analysis. Adv Neural Inf Proces Syst 17:257–264

  9. Cui D, Curry D (2005) Prediction in marketing using the support vector machine. Mark Sci 24(4):595–615

  10. Djuric N, Lan L, Vucetic S, Wang Z (2013) BudgetedSVM: a toolbox for scalable SVM approximations. J Mach Learn Res 14:3813–3817

  11. Evgeniou T, Boussios C, Zacharia G (2005) Generalized robust conjoint estimation. Mark Sci 24(3):415–429

  12. Evgeniou T, Pontil M, Toubia O (2007) A convex optimization approach to modeling heterogeneity in conjoint estimation. Mark Sci 26(6):805–818

  13. Gao S, Ye Q, Ye N (2011) 1-norm least squares twin support vector machines. Neurocomputing 74(17):3590–3597

  14. Gelman A, Pardoe I (2006) Bayesian measures of explained variance and pooling in multilevel (hierarchical) models. Technometrics 48(2):241–251

  15. Green PE, Rao VR (1971) Conjoint measurement for quantifying judgmental data. J Mark Res 8:355–363

  16. Green PE, Krieger AM, Wind Y (2001) Thirty years of conjoint analysis: reflections and prospects. Interfaces 31(3):S56–S73

  17. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182

  18. Guyon I, Gunn S, Nikravesh M, Zadeh LA (2006) Feature extraction: foundations and applications. Springer, Berlin

  19. Hensher DA, Rose JM, Greene WH (2012) Inferring attribute non-attendance from stated choice data: implications for willingness to pay estimates and a warning for stated choice experiment design. Transportation 39(2):235–245

  20. Hsu CW, Chang CC, Lin C (2010) A practical guide to support vector classification. Technical report, National Taiwan University

  21. Le Thi H, Pham Dinh T, Thiao M (2016) Efficient approaches for ℓ2-ℓ0 regularization and applications to feature selection in SVM. Appl Intell 45(2):549–565

  22. Maldonado S, Weber R, Basak J (2011) Kernel-penalized SVM for feature selection. Inf Sci 181(1):115–128

  23. Maldonado S, Flores A, Verbraken T, Baesens B, Weber R (2015) Profit-based feature selection using support vector machines: general framework and an application for customer churn prediction. Appl Soft Comput 35:740–748

  24. Maldonado S, Montoya R, Weber R (2015) Advanced conjoint analysis using feature selection via support vector machines. Eur J Oper Res 241(2):564–574

  25. Orme B (2005) The CBC/HB system for hierarchical Bayes estimation. Sawtooth Software technical paper

  26. Pan X, Xu Y (2016) Two effective sample selection methods for support vector machine. J Intell Fuzzy Syst 30:659–670

  27. Rao VR (2014) Applied conjoint analysis. Springer

  28. Rossi PE, Allenby GM, McCulloch R (2005) Bayesian statistics and marketing. Wiley, New York

  29. Shen Q, Jensen R (2008) Approximation-based feature selection and application for algae population estimation. Appl Intell 28(2):167–181

  30. Toubia O, Evgeniou T, Hauser J (2007) Optimization-based and machine-learning methods for conjoint analysis: estimation and question design. In: Conjoint Measurement: Methods and Applications. Springer, p 231

  31. Toubia O, Hauser J, Garcia R (2007) Probabilistic polyhedral methods for adaptive choice-based conjoint analysis. Mark Sci 26(5):596–610

  32. Tsai HC, Hsiao SW (2004) Evaluation of alternatives for product customization using fuzzy logic. Inf Sci 158:233–262

  33. Vapnik V, Chervonenkis A (1991) The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognit Image Anal 1(3):283–305

  34. Weston J, Elisseeff A, BakIr G, Sinz F (2005) The Spider machine learning toolbox. Software available at http://www.kyb.tuebingen.mpg.de/bs/people/spider/

  35. Zhu J, Rosset S, Hastie T, Tibshirani R (2003) 1-norm support vector machines. In: Neural Information Processing Systems, MIT Press, pp 16–23


Acknowledgments

The authors thank Olivier Toubia and Bryan Orme for providing the data for the two empirical applications. The first author was supported by FONDECYT projects 1140831 and 1160738. The second author was supported by FONDECYT project 1151395. The third author was supported by FONDECYT project 1160894 and CONICYT Anillo ACT1106. This research was partially funded by the Complex Engineering Systems Institute, ISCI (ICM-FIC: P05-004-F, CONICYT: FB0816).

Author information


Corresponding author

Correspondence to Sebastián Maldonado.

Appendices

Appendix A: HB mixed logit estimation

1.1 Prior and full conditional distributions

We denote by θ_i the set of random-effect parameters.

1.2 Priors

Random-effect parameters θ_i

$$\begin{array}{rcl} \boldsymbol{\theta}_{i} &\sim& N(\boldsymbol{\mu}_{\theta},\mathbf{\Sigma}_{\theta}) \;\Rightarrow\; P(\boldsymbol{\theta}_{i}) \propto \exp\left(-\frac{1}{2}(\boldsymbol{\theta}_{i}-\boldsymbol{\mu}_{\theta})^{\top}\mathbf{\Sigma}_{\theta}^{-1}(\boldsymbol{\theta}_{i}-\boldsymbol{\mu}_{\theta})\right) \\ \boldsymbol{\mu}_{\theta} &\sim& N(\boldsymbol{\mu}_{0},\mathbf{V}_{0}) \;\Rightarrow\; P(\boldsymbol{\mu}_{\theta}) \propto \exp\left(-\frac{1}{2}(\boldsymbol{\mu}_{\theta}-\boldsymbol{\mu}_{0})^{\top}\mathbf{V}_{0}^{-1}(\boldsymbol{\mu}_{\theta}-\boldsymbol{\mu}_{0})\right) \\ \mathbf{\Sigma}_{\theta}^{-1} &\sim& W(df_{0},\mathbf{S}_{0}) \end{array}$$

1.3 Likelihood

$$L(\text{data},\{\boldsymbol{\theta}_{i}\},\boldsymbol{\mu}_{\theta},\mathbf{\Sigma}_{\theta}) = P(\text{data}|\{\boldsymbol{\theta}_{i}\})\, P(\{\boldsymbol{\theta}_{i}\}|\boldsymbol{\mu}_{\theta},\mathbf{\Sigma}_{\theta})\, P(\boldsymbol{\mu}_{\theta})\, P(\mathbf{\Sigma}_{\theta}),$$

where P(data | {θ_i}) corresponds to the multinomial logit model.
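
For concreteness, the following minimal Python sketch evaluates the multinomial logit log-likelihood P(data_i | θ_i) for a single respondent. The function name, the data layout (one attribute matrix per choice task), and the variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mnl_log_likelihood(theta_i, X_tasks, y_choices):
    """Multinomial logit log-likelihood for one respondent.

    theta_i   : (p,) vector of individual part-worths.
    X_tasks   : list of (J, p) arrays, one per choice task
                (J alternatives described by p attribute levels).
    y_choices : list with the index of the chosen alternative per task.
    """
    ll = 0.0
    for X, y in zip(X_tasks, y_choices):
        u = X @ theta_i                      # deterministic utilities
        u = u - u.max()                      # stabilise the softmax
        log_probs = u - np.log(np.exp(u).sum())
        ll += log_probs[y]
    return ll
```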

1.4 Full conditionals

$$P(\boldsymbol{\theta}_{i}|\boldsymbol{\mu}_{\theta},\mathbf{\Sigma}_{\theta},\text{data}_{i}) \propto \exp\left(-\frac{1}{2}(\boldsymbol{\theta}_{i}-\boldsymbol{\mu}_{\theta})^{\top}\mathbf{\Sigma}_{\theta}^{-1}(\boldsymbol{\theta}_{i}-\boldsymbol{\mu}_{\theta})\right) P(\text{data}_{i}|\boldsymbol{\theta}_{i})$$
$$\boldsymbol{\mu}_{\theta} \sim N(\boldsymbol{\mu}_{i},\mathbf{V}_{i}), \qquad \mathbf{\Sigma}_{\theta}^{-1} \sim W(df_{1},\mathbf{S}_{1}),$$

where

$$\begin{array}{rcl} \mathbf{V}_{i}^{-1}&=&\mathbf{V}_{0}^{-1}+N\mathbf{\Sigma}_{\theta}^{-1}\\ \boldsymbol{\mu}_{i}&=&\mathbf{V}_{i}\left[\mathbf{V}_{0}^{-1}\boldsymbol{\mu}_{0}+N\mathbf{\Sigma}_{\theta}^{-1}\bar{\boldsymbol{\theta}}\right]\\ df_{1} &=&df_{0}+N\\ \mathbf{S}_{1} &=& \sum\limits_{i=1}^{N}(\boldsymbol{\theta}_{i}-\boldsymbol{\mu}_{\theta})(\boldsymbol{\theta}_{i}-\boldsymbol{\mu}_{\theta})^{\top} +\mathbf{S}_{0}^{-1}, \end{array}$$

and $\bar{\boldsymbol{\theta}}$ denotes the mean of the individual draws {θ_i}.
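
As an illustration of these conjugate updates, the sketch below draws μ_θ and Σ_θ^{-1} from their full conditionals, assuming `theta` stacks the current individual draws row-wise. The Wishart draw is written with scale S_1^{-1}, a common convention in HB implementations; the notation W(df_1, S_1) above may follow a different parameterization.

```python
import numpy as np
from scipy.stats import wishart

def draw_population_parameters(theta, mu_0, V_0, df_0, S_0, Sigma_inv, rng):
    """One Gibbs update of (mu_theta, Sigma_theta^{-1}) given the theta_i draws.

    theta : (N, p) array; row i holds the current draw of theta_i.
    """
    N, p = theta.shape
    theta_bar = theta.mean(axis=0)
    V_0_inv = np.linalg.inv(V_0)

    # mu_theta | rest ~ N(mu_i, V_i)
    V_i = np.linalg.inv(V_0_inv + N * Sigma_inv)
    mu_i = V_i @ (V_0_inv @ mu_0 + N * Sigma_inv @ theta_bar)
    mu_theta = rng.multivariate_normal(mu_i, V_i)

    # Sigma_theta^{-1} | rest ~ Wishart(df_1, .)
    resid = theta - mu_theta
    S_1 = resid.T @ resid + np.linalg.inv(S_0)
    df_1 = df_0 + N
    Sigma_inv_new = wishart(df=df_1, scale=np.linalg.inv(S_1)).rvs(random_state=rng)
    return mu_theta, Sigma_inv_new
```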

The MCMC procedure generates a sequence of draws from the posterior distribution of the model's parameters. Since the full conditional for θ_i does not have a closed form, the Metropolis-Hastings (M-H) algorithm is used to draw these samples. In particular, we use a Gaussian random-walk M-H in which the proposal vector of parameters φ^(t) for θ_i at iteration t is drawn from N(φ^(t−1), σ²Δ) and accepted using the M-H acceptance ratio. The tuning parameters σ and Δ are chosen adaptively to yield an acceptance rate of approximately 20%.
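
A minimal sketch of this random-walk M-H step for a single θ_i follows; it reuses `mnl_log_likelihood` from the earlier sketch, and the σ-adaptation rule is a generic heuristic targeting the stated 20% acceptance rate, not necessarily the adaptation scheme used by the authors.

```python
import numpy as np

def mh_step_theta_i(theta_i, mu_theta, Sigma_inv, X_tasks, y_choices,
                    sigma, Delta_chol, rng):
    """One Gaussian random-walk M-H update of theta_i.

    Proposal: phi ~ N(theta_i, sigma^2 * Delta), with Delta_chol a
    Cholesky factor of the tuning matrix Delta.
    """
    def log_post(t):
        diff = t - mu_theta
        return -0.5 * diff @ Sigma_inv @ diff + mnl_log_likelihood(t, X_tasks, y_choices)

    proposal = theta_i + sigma * (Delta_chol @ rng.standard_normal(theta_i.size))
    log_alpha = log_post(proposal) - log_post(theta_i)
    if np.log(rng.uniform()) < log_alpha:
        return proposal, True
    return theta_i, False

def adapt_sigma(sigma, acceptance_rate, target=0.20, step=0.05):
    """Illustrative adaptation: nudge sigma toward the target acceptance rate."""
    return sigma * np.exp(step * (acceptance_rate - target))
```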

We use the following uninformative prior hyperparameters: μ_0 = 0, V_0 = 10³ I_{N_θ×N_θ}, df_0 = N_θ + 5, and S_0 = df_0 C, where N is the number of individuals, N_θ is the number of random-effect parameters, and C is an N_θ×N_θ matrix with 2 on the diagonal and 1 off the diagonal for the levels of each attribute. We assume that the parameters are a priori uncorrelated across attributes (see e.g. [25]).
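
To make this prior specification concrete, the following sketch builds the hyperparameters under the assumption that each attribute contributes a known number of part-worth parameters (e.g. levels minus one under dummy coding); that input and the function name are illustrative.

```python
import numpy as np

def build_prior_hyperparameters(params_per_attribute):
    """Construct mu_0, V_0, df_0 and S_0 as described above.

    params_per_attribute : list with the number of estimated part-worths
                           per attribute (illustrative input).
    """
    n_theta = sum(params_per_attribute)
    mu_0 = np.zeros(n_theta)
    V_0 = 1e3 * np.eye(n_theta)
    df_0 = n_theta + 5

    # C is block-diagonal: 2 on the diagonal and 1 off the diagonal within
    # the block of each attribute, 0 across attributes (a priori uncorrelated).
    C = np.zeros((n_theta, n_theta))
    start = 0
    for n_k in params_per_attribute:
        C[start:start + n_k, start:start + n_k] = np.ones((n_k, n_k)) + np.eye(n_k)
        start += n_k
    S_0 = df_0 * C
    return mu_0, V_0, df_0, S_0
```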

Appendix B: Calibration of the model parameters

In the proposed models, three parameters need to be calibrated: the regularization parameter C, the threshold 𝜖, and the shrinkage parameter 𝜃. We analyze how the performance of each model varies as a function of each parameter. For illustration, we show the procedure used for the Camera data set; similar analyses were conducted for the other data sets. Our goal was to assess whether the results are stable across different values of these parameters, in which case a less rigorous validation strategy can be used. In contrast, high variance in performance requires an exhaustive model selection procedure, such as LOOCV, to find the best combination of parameters.

Figure 1 depicts the LOOCV hit rates as a function of C, 𝜖, and 𝜃 for the proposed feature selection approach.

Fig. 1 Leave-one-out validation hit rates for L1-SVM for different values of C, 𝜖, and 𝜃 (Camera data set)

Figure 1 reveals the influence of the parameters C, 𝜖, and 𝜃 on the predictive performance (leave-one-out validation hit rate). Results are relatively stable for small values of 𝜃 and 𝜖 and for values of C around one, although these parameters have an important influence on the final outcome of the proposed method.

Performing an adequate grid search over C, 𝜖, and 𝜃, varying them along the suggested ranges, is therefore highly recommended. Additionally, the fact that the optimal values for these parameters are always above zero confirms the importance of feature selection and shrinkage to control for potential overfitting when the number of respondents is relatively small.
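
A minimal sketch of such a grid search is shown below; `loocv_hit_rate` stands for a user-supplied routine that estimates the model for one parameter combination and returns its leave-one-out hit rate, and the grid values are illustrative, not the ones used in the paper.

```python
import itertools

def grid_search(data, loocv_hit_rate,
                C_grid=(0.1, 0.5, 1.0, 2.0, 5.0),
                eps_grid=(0.0, 0.01, 0.05, 0.1),
                theta_grid=(0.0, 0.1, 0.5, 1.0)):
    """Exhaustive search over (C, epsilon, theta) combinations."""
    best_score, best_params = -float("inf"), None
    for C, eps, theta in itertools.product(C_grid, eps_grid, theta_grid):
        score = loocv_hit_rate(data, C=C, eps=eps, theta=theta)
        if score > best_score:
            best_score, best_params = score, {"C": C, "eps": eps, "theta": theta}
    return best_score, best_params
```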


Cite this article

Maldonado, S., Montoya, R. & López, J. Embedded heterogeneous feature selection for conjoint analysis: A SVM approach using L1 penalty. Appl Intell 46, 775–787 (2017). https://doi.org/10.1007/s10489-016-0852-5
