Archetypal analysis: Contributions for estimating boundary cases in multivariate accommodation problem

https://doi.org/10.1016/j.cie.2012.12.011

Abstract

Archetypal analysis is proposed for determining a set of representative cases that cover a given percentage of the population in the accommodation problem. A well-known anthropometric database is used to compare our methodology with the commonly used PCA approach. The comparison shows the advantages of our methodology: the target level of accommodation is reached, unlike with the PCA approach; no further adjustments are necessary; and the user can either choose the number of archetypes or leave the selection to a criterion. Unlike PCA, archetypal analysis aims to obtain extreme individuals, so it is the appropriate statistical technique for this type of problem. Archetypes cannot be obtained with PCA even if all the components are considered, as we show in the application.

Highlights

► Archetypal analysis finds extreme individuals.
► With our methodology the level of accommodation is reached, unlike with the PCA approach.
► Archetypes cannot be obtained with PCA even if all the components are considered.
► The user can choose the number of archetypes or leave the selection to a criterion.
► The software for computing them is free and open.

Introduction

Products intended to “fit” their users must be designed with careful consideration of the size and shape of the user population. In ergonomic design and evaluation, a small group of human models that represents the anthropometric variability of the target population is commonly used, since it gives designers an efficient way to develop and evaluate a product design. In the multivariate accommodation problem, a set of representative cases (human models) is sought in order to cover a certain percentage of the user population. The appropriate selection of this small group is critical if that percentage of the population is to be accommodated.

Two strategies can be considered when searching for the human models, depending on the characteristics of the product being designed: searching on a boundary or over a set of grids. If the product is a one-size product (one size to accommodate people within a designated percentage of the population), such as a bus operator’s workstation or a helicopter cockpit, the cases are selected on an accommodation boundary. However, if we are designing a multiple-size product (n sizes to fit n groups of people within a designated percentage of the population), clothing being the most obvious example, the cases are selected over a set of grids formed in the distribution of anthropometric dimensions (Jung, Kwon, & You, 2010). In this work, we focus on the first situation: the one-size product.

It has long been demonstrated that the use of percentiles is not appropriate because, with the exception of the 50th percentile, percentile values are not additive (Moroney and Smith, 1972; Robinette and McConville, 1981; Zehner et al., 1993). Alternatives have been proposed using different statistical techniques, such as regression (Flannagan et al., 1998; Manary et al., 1998; Robinette and McConville, 1981) or cluster analysis (Kim, Kim, Lee, Lee, & Kim, 2004). However, the most common approach is based on principal component analysis (PCA) (Bittner et al., 1987; Friess and Bradtmiller, 2003; Gordon et al., 1989; Hudson et al., 1998; Robinson et al., 1992; Zehner et al., 1993). This approach consists of retaining the first principal components and selecting several extreme points on an ellipse (or on a circle if the components are standardized) that covers a certain percentage of the data (95%, for example). If a workspace is designed to enable all these cases to operate efficiently, then all other less extreme body types and sizes in the target population (within the circle) should also be well accommodated.
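The non-additivity of percentiles can be illustrated with a minimal sketch. The two "body dimensions" below are synthetic and are drawn independently, which is an assumption made only for simplicity (real anthropometric dimensions are correlated); the variable names and the nearest-rank `percentile` helper are hypothetical and do not come from the works cited above.

```python
import math
import random


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


random.seed(0)  # fixed seed so the sketch is reproducible
n = 10_000

# Two synthetic, independent dimensions in cm (hypothetical values, not survey data).
sitting_height = [random.gauss(90.0, 3.5) for _ in range(n)]
leg_length = [random.gauss(85.0, 4.0) for _ in range(n)]
total = [s + l for s, l in zip(sitting_height, leg_length)]

# Adding the two 5th percentiles describes a person who is small in both
# dimensions at once, which is rarer than 5% of the population.
sum_of_p5 = percentile(sitting_height, 5) + percentile(leg_length, 5)
p5_of_sum = percentile(total, 5)
gap = p5_of_sum - sum_of_p5

print(f"sum of the 5th percentiles : {sum_of_p5:.1f} cm")
print(f"5th percentile of the sum  : {p5_of_sum:.1f} cm")
print(f"gap                        : {gap:.1f} cm")
```

In this sketch the sum of the two 5th percentiles falls clearly below the 5th percentile of the summed dimension, so a design built from stacked 5th-percentile values would target a body that is much less common than intended.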

Friess (2005) gives an excellent analysis of the PCA approach. His comparison reveals that it has several limitations: (1) in its simplest variant it can leave out enormous portions of the population (nearly 50%); (2) an improved version requires a great number of components (if not all of them), and the contribution of octant points to the determination of multivariate boundaries remains unclear. Even this improved version did not achieve the level of accommodation it set out to reach.

Note that the PCA approach followed, for example, in Zehner et al. (1993), Hudson et al. (1998) and Robinson et al. (1992) has several drawbacks. As it keeps only the first components, part of the data variation is removed (how much depends on the variation explained by the first components), and the discarded variation may correspond to cases that are difficult to accommodate. Therefore, when the ellipse is built, the percentage actually covered is not 95%. Furthermore, with two and three components the selected cases number eight and fourteen, respectively, so the number of cases would increase if we wanted to retain more than three components in order to consider more variation. It may not be practical to select too many cases. Moreover, if we restrict ourselves to the chosen components, there might be combinations of variables that correspond to extreme data but are not captured by the principal components (even when all possible components are considered), since the goal of PCA is not the calculation of extreme data. This final consideration will be shown in Section 3.
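The growth in the number of boundary cases can be sketched as follows. The construction used here (two axis points per component plus one midpoint per quadrant or octant, all on a sphere in standardized component space) is an assumption that reproduces the eight and fourteen cases mentioned above; the function name `boundary_cases` is hypothetical and is not taken from the cited methods.

```python
import math
from itertools import product


def boundary_cases(n_components, radius=1.0):
    """Return axis points plus orthant midpoints on a sphere of the given radius."""
    axis_points = []
    for i in range(n_components):
        for sign in (-1.0, 1.0):
            point = [0.0] * n_components
            point[i] = sign * radius
            axis_points.append(tuple(point))

    # Each orthant midpoint has equal absolute coordinates and lies on the same sphere.
    scale = radius / math.sqrt(n_components)
    orthant_points = [
        tuple(sign * scale for sign in signs)
        for signs in product((-1.0, 1.0), repeat=n_components)
    ]
    return axis_points + orthant_points


# Number of cases for 2 to 5 retained components: 2*d axis points + 2**d orthant points.
counts = {d: len(boundary_cases(d)) for d in (2, 3, 4, 5)}
for d, count in counts.items():
    print(f"{d} components -> {count} boundary cases")
```

Under this assumed construction the count is 2d + 2^d for d components, so it grows exponentially with the number of components retained, which is why adding components to capture more variation quickly becomes impractical.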

Therefore, an alternative to the previous methodology is proposed: archetypal analysis (Cutler & Breiman, 1994). We propose a methodology with which we can ensure that a certain proportion of the population is covered. Archetypal analysis assumes that there are several “pure” individuals who are on the “edges” of the data, and all other individuals are considered to be mixtures of these pure types. Archetypal analysis (AA) estimates the convex hull of a data set; as such, AA favors features that constitute representative “corners” of the data, i.e. archetypes. Archetypes are almost always easy to interpret, as they represent extreme combinations of features. In the original paper on AA (Cutler & Breiman, 1994) the method was shown to be useful in the analysis of air pollution and head shape, and later also for tracking spatio-temporal dynamics. Recently, AA has found use in benchmarking and market research (Li, Wang, Louviere, & Carson, 2003), in particular for identifying typically extreme practices rather than just good practices (Porzio, Ragozini, & Vistocco, 2008), as well as in the analysis of astronomy spectra (Chan, Mitchell, & Cram, 2003) as an approach for the end-member extraction problem (Plaza, Martínez, Pérez, & Plaza, 2004). Eugster (2012) is another interesting contribution, in which archetypal athletes are determined for American basketball and European soccer according to data from their most representative leagues. AA has also been shown to be relevant for a large variety of machine learning problems and for high-dimensional data arising from video-taped images (Mørup and Hansen, 2010; Mørup and Hansen, 2012; Stone and Olson, 1999). A recent application of AA for comparing different species of bats is found in D’Esposito, Palumbo, and Ragozini (2012).
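The idea of "pure" individuals and mixtures can be written as the optimization problem of Cutler and Breiman (1994). The block below is a sketch in our own notation, which is an assumption and does not appear in the text above: x_1, ..., x_n are the observations, z_1, ..., z_k the archetypes, and alpha and beta the mixture weights.

```latex
% Sketch of the archetypal analysis problem (Cutler & Breiman, 1994).
% x_1,...,x_n: observations; z_1,...,z_k: archetypes (notation assumed here).
\begin{aligned}
\min_{\alpha,\,\beta}\quad
  & \mathrm{RSS} = \sum_{i=1}^{n} \Bigl\| x_i - \sum_{j=1}^{k} \alpha_{ij}\, z_j \Bigr\|^{2} \\
\text{with}\quad
  & z_j = \sum_{l=1}^{n} \beta_{jl}\, x_l, \qquad j = 1,\dots,k, \\
\text{subject to}\quad
  & \alpha_{ij} \ge 0,\quad \sum_{j=1}^{k} \alpha_{ij} = 1, \qquad i = 1,\dots,n, \\
  & \beta_{jl} \ge 0,\quad \sum_{l=1}^{n} \beta_{jl} = 1, \qquad j = 1,\dots,k.
\end{aligned}
```

The constraints on alpha make each observation approximately a convex combination of the archetypes, and the constraints on beta make each archetype a convex combination of the observations, which is what ties the archetypes to the data. For k = 1 the only admissible weight is alpha_{i1} = 1, so the single archetype that minimizes the RSS is the sample mean.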

Archetypes can be computed easily by means of a package for the free software R (Eugster and Leisch, 2009; R Development Core Team, 2009). The code developed to calculate them from our data is freely available and can be found in Appendix A. The outline of the paper is as follows. Section 2 describes the data set and the methodology used in this paper. The application of our procedure is given in Section 3. Section 4 closes the paper with conclusions and possible further developments.

Section snippets

Data

Our data set comes from the 1967 United States Air Force (USAF) Survey (available from http://www.dtic.mil/dtic/, and as supplemental material to be read with our code). The 1967 USAF Survey was conducted during the first three months of 1967 under the direction of the Anthropology Branch of the Aerospace Medical Research Laboratory, located in Ohio. Subjects were measured at 17 Air Force bases across the United States of America. A total of 202 variables (including body dimensions and

Archetypes for 1967 USAF

We have computed the archetypes from k = 1 to k = 10 (remember that for k = 1 the mean of each variable is obtained). Fig. 2 displays the percentile value of each variable for each archetype, from k = 2 (a) to k = 10 (j). The percentiles of each archetype are represented by each set of bars, where a bar represents a different variable, from dark gray (Thumb Tip Reach) to light gray (Shoulder Height Sitting). For example, in Fig. 2a, the first archetype is low in all variables, whereas the second

Conclusions

We have proposed an alternative for determining test cases based on archetypal analysis. This technique effectively covers a given percentage of the population for accommodation, unlike the classical PCA approach, in which the percentage of accommodation is determined without considering all the variability and which therefore does not effectively achieve the desired accommodation percentage. We have applied the technique to a classical database and we have compared it with the methodology based on

Acknowledgements

This work has been partially supported by Grants CICYT TIN2009-14392-C02-01, CICYT TIN2009-14392-C02-02, MTM2009-14500-C02-02, GV/2011/004 and Bancaixa-UJI P11A2009-02.

References (26)

  • K. Jung et al. (2010). Evaluation of the multivariate accommodation performance of the grid method. Applied Ergonomics.
  • M. Mørup et al. (2012). Archetypal analysis for machine learning and data mining. Neurocomputing.
  • Bittner, A., Glenn, F., Harris, R., Iavecchia, H., & Wherry, R. (1987). CADRE: a family of mannikins for workstation...
  • B. Chan et al. (2003). Archetypal analysis of galaxy spectra. Monthly Notices of the Royal Astronomical Society.
  • A. Cutler et al. (1994). Archetypal analysis. Technometrics.
  • M.R. D’Esposito et al. (2012). Interval archetypes: A new tool for interval data analysis. Statistical Analysis and Data Mining.
  • M.J.A. Eugster (2012). Performance profiles based on archetypal athletes. International Journal of Performance Analysis in Sport.
  • M.J. Eugster et al. (2009). From spider-man to hero – Archetypal analysis in R. Journal of Statistical Software.
  • Flannagan, C. A., Manary, M. A., Schneider, L. W., & Reed, M. P. (1998). An improved seating accommodation model with...
  • Friess, M. (2005). Multivariate accommodation models using traditional and 3D anthropometry. In SAE technical...
  • Friess, M., & Bradtmiller, B. (2003). 3D head models for protective helmet development. In Proceedings of the SAE...
  • Gordon, C. C., Churchill, T., Clauser, C. E., Bradtmiller, B., McConville, J. T., Tebbetts, I., et al. (1989). 1988...
  • J.A. Hudson et al. (1998). The USAF multivariate accommodation method. Proceedings of the Human Factors and Ergonomics Society Annual Meeting.