Latent class models for mixed variables with applications in Archaeometry

https://doi.org/10.1016/j.csda.2004.03.001

Abstract

Latent class models are used in the social sciences to classify individuals or objects into distinct groups/classes based on responses to a set of observed indicators. The latent class model for mixed binary and metric variables (Br. J. Math. Statist. Psych. 49 (1996) 313) is extended to accommodate any type of data (including ordinal and nominal), and its use in Archaeometry for classifying archaeological findings/objects into groups is discussed. The proposed models are estimated by full maximum likelihood using the EM algorithm. Two data sets from archaeological findings are used to illustrate the methodology.

Introduction

One of the main problems in Archaeology is the classification of objects found in excavations, such as ceramic sherds, artefacts, etc. The criterion for grouping in the context we are interested in is the origin of the objects. Provenance is an important issue for researchers in Archaeology, as establishing it is a first step towards conclusions about the structure of ancient communities. Such conclusions concern civilization, the level of the manufacturing techniques used, the import and export of goods, the mobility of communities and the relations between them.

The most widely used classification technique among Archaeometry scientists is hierarchical clustering. This class of procedures involves choosing a measure of (dis)similarity between pairs of cases in a sample to be clustered, and choosing an algorithm for clustering cases hierarchically on the basis of the (dis)similarity coefficient. Both choices can be made in many different ways, leading to a large number of possible ways of clustering a data set.
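The effect of these choices can be seen on a toy example. The following is a minimal sketch in plain Python, not taken from the paper: the function names, the five one-dimensional points and the naive merging loop are all assumptions made for illustration. It clusters the same points twice with the same distance but two different linkage rules (single and complete) and obtains two different three-cluster solutions.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(points, dist, linkage, n_clusters):
    """Naive agglomerative clustering (illustrative only).

    dist    : function giving the dissimilarity between two points
    linkage : function reducing all between-cluster distances to one number
              (min = single linkage, max = complete linkage)
    Returns the clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pair_d = [dist(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
            d = linkage(pair_d)
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters (a < b, so deleting b is safe).
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Hypothetical data: five objects measured on a single variable.
pts = [(0.0,), (1.0,), (2.1,), (3.3,), (6.0,)]
single = agglomerate(pts, euclidean, min, 3)    # [[0, 1, 2], [3], [4]]
complete = agglomerate(pts, euclidean, max, 3)  # [[0, 1], [2, 3], [4]]
```

Swapping `euclidean` for another dissimilarity function changes the result in the same way, which is why the number of possible clusterings of one data set grows so quickly.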

Such approaches are essentially heuristic, and have been contrasted with model-based approaches to clustering (see Fraley and Raftery, 1999). Fraley and Raftery (1999) use a mixture of multivariate normals, which can be considered a special case of the latent class model for mixed variables developed here. Heuristic methods dominate archaeometric practice. Papageorgiou et al. (2001) and Baxter (2001) have compared some model-based approaches to grouping data used in archaeometry. A model-based method is understood to be one in which explicit assumptions are made about the form of the probability density function describing the population from which the observed data are considered to be a random sample. Clustering and inferences about the number of clusters and cluster membership are based on estimation of the unknown parameters in the probability model used.
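For concreteness, the probability model behind such methods is usually written as a finite mixture density. The notation below (number of groups K, mixing proportions π_j, component densities f_j with parameters θ_j) is assumed here for illustration and is not taken from the paper.

```latex
f(\mathbf{x}) \;=\; \sum_{j=1}^{K} \pi_j \, f_j(\mathbf{x} \mid \boldsymbol{\theta}_j),
\qquad \pi_j \ge 0, \qquad \sum_{j=1}^{K} \pi_j = 1 .
```

Taking each f_j to be a multivariate normal density with mean vector μ_j and covariance matrix Σ_j gives the mixture of multivariate normals mentioned above.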

The potential merits of model-based methodologies, in contrast to distribution-free methodologies, can be summarized as follows.

  • (1) Cases are assigned to clusters based on probabilities estimated from a model. Within that process outliers can be identified. Initial assignment to a cluster can be based on archaeological rather than statistical grounds and model-based statistical methods may then be used to assess whether or not such a group is also chemically coherent. A variant of this (Glascock, 1992) is to determine initial groups statistically using an heuristic method, and then to ‘refine’ these using probabilistic calculations that assume the groups are multivariate normal.

  • (2) In many compositional studies variables may be highly correlated within groups leading to clusters that are ellipsoidal in p-dimensional space. Heuristic clustering methods typically impose spherical structure on the data and can fail to recognize the true structure. One common method of cluster analysis, Ward's method, often used in an heuristic manner, can be shown to be a special case of a model-based method that not only assumes that clusters are spherical but also of equal size (volume). This difficulty is well-known but resolving it is not easy (Harbottle, 1976). Krzanowski and Marriott (1995) observe that ‘most methods not specifically distribution-based are inefficient at finding strongly elliptical clusters’. In principle, therefore, distribution- or model-based methods provide a way of addressing a problem that has been an issue ever since multivariate methods began to be applied to the analysis of compositional data.

  • (3) The output from many heuristic methods of cluster analysis is typically presented in the form of a dendrogram. Judgements about the number of clusters are usually made on the basis of subjective interpretation of the dendrogram. Apart from the subjectivity involved there are several difficulties here. The appearance of a dendrogram is affected by the scale of the data, choice of (dis)similarity measure and clustering algorithm used, and is often not easy to interpret. Furthermore, a clear separation into distinct clusters on a dendrogram does not guarantee that they are genuinely distinct (Baxter, 1994, p. 161). A potential advantage of model-based approaches is that they allow for tests on the number of clusters represented in a sample.

  • (4) Perhaps the strongest advocates of a model-based approach in statistical archaeology have been those who adopt a Bayesian approach to statistical analysis, e.g. Litton and Buck (1995) and Buck et al. (1996). The prime difference between the Bayesian approach and non-Bayesian model-based methods is the way in which archaeological prior knowledge is incorporated into a model. The clearest demonstration of the success of this approach lies in applications to radiocarbon dating problems. In particular, much sharper bounds for calendar dates derived from radiocarbon dates can be obtained using archaeological knowledge of stratigraphy, in a way that is not possible using non-Bayesian methods, whether model-based or heuristic. From the Bayesian perspective the ability to incorporate prior knowledge into an analysis is a strong motivation for adopting a modelling approach.

In this paper, we explore the potential of using latent class models for classifying archaeological objects into homogeneous clusters. Latent class models, also known as finite mixture models, are used in the social sciences for classification purposes. For example, students can be classified into ‘masters’ and ‘non-masters’ according to their performance on a number of tests, or patients can be classified into ‘healthy’ and ‘unhealthy’ according to the symptoms they exhibit.

The aims of latent class analysis are, first, to identify the number of classes required to explain the associations among the observed variables and, second, to allocate respondents/objects to latent classes. Latent class analysis therefore has much in common with classification methods for multivariate data such as cluster analysis, multidimensional scaling and correspondence analysis. The main difference from these techniques is that latent class analysis is a model-based approach that can be used for any type of data and that allows the appropriateness of the model to be tested statistically. The other methods are mainly based on measures of distance and similarity and, as mentioned above, in some cases have limited practical use.
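The second aim, allocation, amounts to computing the posterior probability of each class for an object and assigning the object to the most probable class. The sketch below illustrates this for one binary and one metric variable. It assumes that the two variables are independent within a class and uses made-up parameter values; the names `posterior`, `classes`, `prior`, `p`, `mean` and `sd` are hypothetical and do not come from the paper.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior(binary_x, metric_x, classes):
    """Posterior class probabilities for one object, by Bayes' rule.

    Each class is a dict with a prior probability, a Bernoulli success
    probability p for the binary variable, and a normal mean and sd for
    the metric variable. Within-class independence is assumed.
    """
    joint = []
    for c in classes:
        p_bin = c["p"] if binary_x == 1 else 1.0 - c["p"]
        joint.append(c["prior"] * p_bin * normal_pdf(metric_x, c["mean"], c["sd"]))
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-class model with made-up parameter values.
classes = [
    {"prior": 0.5, "p": 0.9, "mean": 0.0, "sd": 1.0},
    {"prior": 0.5, "p": 0.2, "mean": 3.0, "sd": 1.0},
]
post = posterior(1, 0.5, classes)
# Allocate the object to the class with the largest posterior probability.
assigned = max(range(len(post)), key=post.__getitem__)
```

With these values the object (binary response 1, metric value 0.5) is allocated to the first class, since both of its responses are far more likely under that class.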

Latent class models were introduced by Lazarsfeld (1950), and since then there have been significant contributions in terms of estimation methods, types of data and complexity of the models by Goodman (1974), Haberman (1979), Hagenaars (1990), Vermunt (1997) and Vermunt and Magidson (2000). The last four researchers have put the latent class models into a log-linear framework that requires all variables to be categorical. Latent class models in their more classical form have been discussed for clustering binary, nominal or metric variables in Bartholomew and Knott (1999) and for clustering mixed-mode data in Everitt (1988), Everitt and Merette (1990) and Moustaki (1996). The work of Everitt and Merette assumes that categorical responses are manifestations of unobserved continuous variables. Estimating that model by maximum likelihood requires the evaluation of multidimensional integrals, which restricts the number of categorical variables to one or two. The work of Moustaki (1996) does not require the existence of underlying variables and is computationally very efficient. That work will be extended here to accommodate ordinal and nominal variables, and the use of the model in archaeology will be made clear through two real applications.

The aim of the paper is to show the potential of the latent class model for mixed variables for identifying groups/clusters in Archaeometry, where data of mixed type often occur.

Section snippets

Latent class model

Suppose that $(x_1, x_2, \ldots, x_p)$ denotes a vector of $p$ manifest variables, where each variable has a conditional distribution in the exponential family, such as the Bernoulli, Poisson, multinomial or normal.

There is no restriction that the $x$ variables should all be of the same type. Let $x_{ih}$ be the value of the $h$th sample element/object for the $i$th variable $(h = 1, 2, \ldots, n)$. The row vector $\mathbf{x}_h' = (x_{1h}, \ldots, x_{ph})$ is referred to as the response pattern of the $h$th object.
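To make the setting concrete, the following is a minimal sketch of EM estimation for a latent class model with one Bernoulli and one normal manifest variable. It is not the authors' implementation: it assumes that the manifest variables are independent within a class (the usual latent class assumption, which the snippet above does not show explicitly), and the function names, starting values, toy data and the floor on the standard deviation are all choices made for this illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def em_latent_class(data, init, n_iter=50):
    """EM sketch for a latent class model; each object is (binary, metric).

    init : list of dicts with keys prior, p, mean, sd (starting values).
    Returns the fitted parameters and the log-likelihood at each iteration.
    """
    params = [dict(c) for c in init]
    trace = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities for every object.
        resp = []
        loglik = 0.0
        for b, m in data:
            joint = [
                c["prior"]
                * (c["p"] if b == 1 else 1.0 - c["p"])
                * normal_pdf(m, c["mean"], c["sd"])
                for c in params
            ]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        trace.append(loglik)
        # M-step: weighted proportions, means and variances per class.
        for k, c in enumerate(params):
            w = [r[k] for r in resp]
            n_k = sum(w)
            c["prior"] = n_k / len(data)
            c["p"] = sum(wi * b for wi, (b, _) in zip(w, data)) / n_k
            c["mean"] = sum(wi * m for wi, (_, m) in zip(w, data)) / n_k
            var = sum(wi * (m - c["mean"]) ** 2 for wi, (_, m) in zip(w, data)) / n_k
            # Guard added for this sketch: keep the sd away from zero.
            c["sd"] = max(math.sqrt(var), 1e-3)
    return params, trace


# Hypothetical data: (binary indicator, metric measurement) for eight objects.
data = [(1, 0.1), (1, -0.3), (1, 0.4), (0, 0.2),
        (0, 3.1), (0, 2.8), (1, 3.3), (0, 2.6)]
init = [
    {"prior": 0.5, "p": 0.5, "mean": 0.0, "sd": 1.0},
    {"prior": 0.5, "p": 0.5, "mean": 3.0, "sd": 1.0},
]
fitted, trace = em_latent_class(data, init)
```

On this toy data the two fitted classes separate the objects with metric values near 0 from those near 3, and the log-likelihood recorded in `trace` does not decrease from one iteration to the next, as expected of EM. Ordinal, nominal or Poisson variables would need their own conditional densities and M-step updates, which this sketch does not attempt.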

The methodology presented here allows for a single

Applications

The methodology developed will be tested on two data sets from Archaeology. Both data sets consist of metric and categorical variables. More specifically, the metric variables, 25 in total, measure the chemical composition of the ceramic, obtained with the latest methodologies available, such as neutron activation analysis (NAA).

The categorical variables aim to derive information regarding the provenance of the objects. During the last two decades, the use of ceramic petrology has increased

Conclusion

The paper discusses a general methodology for analyzing mixed outcomes with a latent class model. The methodology presented applies to mixtures of categorical (nominal and ordinal) and metric variables. The aim of the analysis is to classify objects into distinct classes based on responses to a set of variables.

The model presented is applied to two data sets from Archaeological research and the results show that model-based clustering techniques for mixed variables can be very useful for

Acknowledgements

We would like to thank Dr. Miguel Angel Cau and Dr. Ioannos Iliopoulos for providing us with the two data sets and for the helpful discussions we had on the analysis. We would also like to thank the three reviewers for their comments and suggestions, which improved the clarity and presentation of the paper.

References (33)

  • Everitt, B., 1988. A finite mixture model for the clustering of mixed-mode data. Statist. Probab. Lett.
  • Rice, P., et al., 1982. Cluster analysis of mixed-level data: pottery provenience as an example. J. Archaeol. Sci.
  • Alaimo, R., Bultrini, G., Fragala, I., Giarrusso, R., Iliopoulos, I., Montana, G., 2004. Archaeometry of sicilian...
  • Bartholomew, D.J., Knott, M., 1999. Latent Variable Models and Factor Analysis, Kendall Library of Statistics, Vol. 7,...
  • Bartholomew, D.J., et al., 1999. The goodness-of-fit of latent trait models in attitude measurement. Sociol. Methods Res.
  • Baxter, M., 1994. Exploratory Multivariate Analysis in Archaeology.
  • Baxter, M., 2001. Statistical modelling of artefact compositional data. J. Amer. Statist. Assoc.
  • Beardah, C., Baxter, M., Papageorgiou, I., Cau, M.A., 2002. Approaches to petrographic data analysis using S-plus. In:...
  • Beardah, C., Baxter, M., Papageorgiou, I., Cau, M.A., 2003. ‘Mixed-mode’ approaches to the grouping of ceramic...
  • Buck, C., et al., 1996. Bayesian Approach to Interpreting Archaeological Data.
  • Cau, M.A., 1999. Importaciones de ceramicas tardorromanas de cocina en les illes balears: el caso de can sora...
  • Cau, M.A., Day, P., Iliopoulos, I., Montana, G., Nodarou, E., 1999. Standardisation of petrographic descriptions....
  • Dempster, A.P., et al., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B.
  • Everitt, B., et al., 1990. The clustering of mixed-mode data: a comparison of possible approaches. J. Appl. Stat.
  • Fraley, C., et al., 1999. MCLUST: Software for model-based cluster and discriminant analysis. J. Classification.
  • Glascock, M., 1992. Characterization of archaeological ceramics at MURR by neutron activation analysis and multivariate...