Searching for a common pooling pattern among several samples

https://doi.org/10.1016/j.csda.2013.04.015

Abstract

The grades of a Spanish university access exam involving 10 graders are analyzed. The interest focuses on finding the greatest group of graders showing similar grading patterns or, equivalently, on detecting if there are graders whose grades exhibit significant deviations from the pattern determined by the remaining graders. Due to differences in background of the involved students and graders, homogeneity is too strong to be considered as a realistic null model. Instead, the weaker similarity model, which seems to be more appropriate in this setting, is considered. To handle this problem, a statistical procedure designed to search for a hidden main pattern is developed. The procedure is based on the detection and deletion of the graders that are significantly non-similar to (the pooled mixture of) the others. This is performed through the use of a probability metric, a bootstrap approach and a stepwise search algorithm. Moreover, the procedure also allows one to identify which part of the grades of each grader makes her/him different from the others.

Introduction

We develop a methodology to deal with the following problem. In Spain, students who want to enter university must pass a nationwide access exam called Selectividad. The Selectividad consists of several exams on different subjects. Each subject at each university has a coordinator who is in charge, among other duties, of distributing the hundreds (or even thousands) of exams among several graders.

A main concern at the universities is to guarantee that students get equal treatment, the ideal situation being that a given exam would obtain the same grade regardless of the grader who handles it. In order to achieve this goal, the subject coordinator provides guidelines to the graders. Broadly speaking, those guidelines are stricter in scientific subjects and less strict in subjects in which subjectivity plays a more important role.

A university had some concerns about the grades given in a particular subject (this paper deals with real data; however, a confidentiality agreement prevents the authors from disclosing the university, the subject involved or the year of the exam), partly because of the subjectivity of the subject and partly because the graders of this subject had quite different profiles. Usually, the graders of every subject include university professors and high-school teachers, hence graders with different backgrounds. But, in this particular subject, the differences in background were greater than in others. In fact, this subject had 10 graders and their grades showed obvious differences (see, for instance, Grader 1 in Fig. 1). The application of standard two-sample homogeneity tests detected a significant lack of homogeneity between many pairs of them. The university was interested in identifying graders with a grading pattern significantly non-similar to the ‘general trend’; these graders would possibly be excluded from future grading processes.

It is important to take into account that, apart from the differences between graders, there are several other major sources of heterogeneity involved. The most relevant is that students come from different schools, each following its own syllabus. The difficulty was to rule out the possibility that the above-mentioned differences between graders were caused, for instance, by the heterogeneity of the students assigned to each grader.

In order to solve this problem, we decided to take advantage of the idea of similarity of two samples, which was introduced in Álvarez-Esteban et al. (2012) with the aim of assessing whether a fundamental part, albeit perhaps not all, of two random generators coincides (we give a more detailed description of the similarity model later). Once we had the pairwise similarities between graders, we could carry out a cluster analysis. For our data, this seemed to allow one to conclude that there were, in fact, some graders (including Grader 1) exhibiting a pattern of grades quite different from that of the majority. However, we also needed a measure of the significance of those differences. With this aim we developed a new methodology which allowed us to assess that significance and, more importantly, to discover a further non-similar grader who had remained hidden in the previous analysis.

A main difficulty when trying to assess that graders exhibit a similar grading pattern is that, in principle, there is no pattern to be taken as a reference for comparisons. Instead, we need to find a (large) group of graders giving similar grades. The grades given by these graders constitute the pattern for this subject. Then, we compare the remaining graders to this pattern. Both tasks are carried out jointly, through a sequential procedure.

Thus, the main difference between the new methodology and cluster analysis is that cluster analysis is based on pairwise differences, while our method considers the differences between each grader and the rest of the graders, after possibly excluding some previously detected anomalous graders.

In the end, we face a k-sample problem with k > 2. Such problems are a classical topic in Statistics. Usually, they focus on testing whether k samples share the same random generator (the hypothesis of homogeneity). Among the different approaches designed to handle this problem we recall, in the non-parametric setting, the classical Kolmogorov–Smirnov, Cramér–von Mises and Anderson–Darling k-sample tests (Kiefer, 1959, Scholz and Stephens, 1987), those based on rank procedures such as the Kruskal–Wallis, Fisher–Yates or Mood tests (see, e.g., Hájek et al., 1999), on the likelihood (see, e.g., Zhang and Wu, 2007), or, more recently, the data-driven k-sample tests introduced in Wyłupek (2010). We refer to the last paper for an updated list of references on the topic. In all these cases there exists a simpler two-sample version, and the formulation of the null hypothesis for the k-sample case is straightforward. As a density estimation based approach we should mention, e.g., Cao and Van Keilegom (2006) for the two-sample case and the works of Martínez Camblor et al. (2008) and Martínez Camblor and de Uña Álvarez (2009) for the k-sample problem.

As an alternative to the homogeneity hypothesis, the similarity of two samples was introduced in Álvarez-Esteban et al. (2012). We say that two probabilities, P_1 and P_2, are α-similar if they are (slightly) contaminated versions of a common pattern, namely, if

  P_1 = (1 − α)P_0 + αP_1′,   P_2 = (1 − α)P_0 + αP_2′    (1)

for some probabilities P_0, P_1′ and P_2′. The similarity problem, that is, assessing whether model (1) holds, is of interest in a variety of practical situations. In particular, in order to compare the grades given by two Selectividad graders, we could fix an acceptable value for α and try to assess whether the given grades fit model (1). The fixed value of α can be interpreted as the maximal proportion of students with different backgrounds which could be assigned to the graders; that is, speaking a bit loosely, if we delete a (suitably chosen) proportion α of the students assigned to each grader, the remaining ones can be considered homogeneous.
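As an illustration of model (1), two α-similar samples can be simulated by drawing each observation from a common core P_0 with probability 1 − α and from a sample-specific contamination with probability α. The particular distributions, sample size and value of α below are our own illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05  # contamination level: at most 5% of each sample is "different"
n = 1000

def alpha_similar_sample(n, alpha, contam_loc, rng):
    """Draw n points from (1 - alpha) * N(0, 1) + alpha * N(contam_loc, 1).

    N(0, 1) plays the role of the common core pattern P0; the shifted
    normal plays the role of the sample-specific contamination Pi'."""
    from_contam = rng.random(n) < alpha
    x = rng.normal(0.0, 1.0, n)                       # draws from the core P0
    x[from_contam] = rng.normal(contam_loc, 1.0, from_contam.sum())
    return x

x1 = alpha_similar_sample(n, alpha, 3.0, rng)   # P1 = (1 - a) P0 + a N(3, 1)
x2 = alpha_similar_sample(n, alpha, -3.0, rng)  # P2 = (1 - a) P0 + a N(-3, 1)
```

Despite the opposite contaminations, the two samples share 95% of their generating mechanism, which is exactly the situation the similarity model is meant to capture.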

Similarity was introduced to compare two samples. Here, however, we are concerned with k>2 samples, which have been obtained from k (random) generators, and we are interested in identifying those samples, if any, that exhibit a substantial deviation from a general pattern given by most of the other samples. This main pattern would be given by component samples which should exhibit some internal degree of similarity. Going back to the Selectividad problem, we would be interested in detecting whether there are graders whose grades exhibit a significant deviation from a generalized pattern.

As stated, a homogeneous common pattern is too strong to be considered as a realistic one and it seems more appropriate to define the general pattern in terms of similarity.

The similarity model (1) could suggest the following definition. For α ∈ (0, 1), we say that probabilities P_1, …, P_k share a core pattern P_0 of level 1 − α if there exist probabilities P_1′, …, P_k′ such that

  P_i = (1 − α)P_0 + αP_i′,   i = 1, …, k.

We could then refer to P_0 as a (common) core pattern. However, this definition is not optimal, because an object that is not similar to any of several others can still be similar to a pooled version of them. For instance, it may happen that P_1 and P_2 are non-similar, while the inclusion of a new probability P_3 leads to a set {P_1, P_2, P_3} in which every probability is similar to the mean of the other two.

With these cautions in mind and with the goal of detecting if one or several samples are (significantly) non-similar to the sample given by the others, that is, to the pooled sample, we introduce the following definition.

Definition 1.1

Given α ∈ [0, 1), a set of probabilities {P_1, …, P_r} such that P_l is α-similar (in the sense of (1)) to (1/(r − 1)) Σ_{j ≠ l} P_j, for l = 1, …, r, will be called α-similarly pooled. We will refer to the pooled probability P̄_{1,…,r} := (1/r) Σ_{j=1}^{r} P_j as the pooling pattern.

Note that if the probabilities P1,,Pr share a core pattern of level 1α, then they are α-similarly pooled, but the converse is not true. This phenomenon resembles the effect that the inclusion/exclusion of some auxiliary variables can produce on variable selection in regression problems. As in that setting, our ideal goal would be to select the best set, in our case a maximal similarly pooled set with the greatest possible number of probabilities.

We can introduce, in Definition 1.1, a vector of weights (w_1, …, w_r) (we assume w_i ≥ 0, w_1 + ⋯ + w_r = 1) to allow the different probabilities to have different relative importance (then {P_1, …, P_r} would be α-similarly pooled if P_l is α-similar to (1/(1 − w_l)) Σ_{j ≠ l} w_j P_j, for l = 1, …, r). This is natural and convenient in the case of empirical probabilities. For example, if X_{i,j}, j = 1, …, n_i, are independent random variables with the same law P_i = L(X_{i,j}), for every i = 1, …, k, the choice of weights (n_1, …, n_k)/n, with n = n_1 + ⋯ + n_k, means that we compare the empirical distribution of the i-th sample to the empirical distribution of the pooled sample (the combination of the other k − 1 samples).
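With the sample-size weights w_j = n_j/n, comparing sample l to the weighted leave-one-out mixture amounts simply to concatenating the other samples. A minimal sketch (the sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Unequal sample sizes, so the weights n_i / n differ across samples
samples = [rng.normal(0.0, 1.0, size=m) for m in (30, 50, 40)]

def leave_one_out_pool(samples, l):
    """Pool of all samples except sample l.

    With weights w_j = n_j / n, the empirical distribution of this pooled
    sample is exactly the weighted mixture
    (1 / (1 - w_l)) * sum over j != l of w_j * P_j,
    where P_j denotes the j-th empirical distribution."""
    return np.concatenate([s for j, s in enumerate(samples) if j != l])

pool_without_0 = leave_one_out_pool(samples, 0)  # 50 + 40 = 90 observations
```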

To solve the Selectividad problem, we develop a statistical procedure designed to search for a main pooling sample. In fact, our procedure is based on detecting samples that are significantly non-similar with respect to the pool of the others. This is achieved through the use of a probability metric (the Wasserstein distance) and a bootstrap approach developed in Álvarez-Esteban et al. (2012). The procedure is completed with a stepwise search argument, looking for a maximal set of α-similarly pooled samples, corresponding to a maximal pooled pattern for the k samples. To the best of our knowledge, this problem has not been considered before.
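The stepwise search can be sketched as follows. This is a deliberately simplified version: it uses SciPy's plain one-dimensional Wasserstein distance and a fixed, hypothetical cutoff `threshold`, whereas the actual procedure tests similarity with a trimmed Wasserstein distance and bootstrap p-values:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def stepwise_search(samples, threshold):
    """Simplified stepwise search: repeatedly remove the sample farthest,
    in plain 1-d Wasserstein distance, from the pool of the remaining
    samples, until every distance falls below `threshold`.

    `threshold` is a stand-in for the paper's bootstrap-based
    significance assessment."""
    keep = list(range(len(samples)))
    removed = []
    while len(keep) > 2:
        dists = {
            l: wasserstein_distance(
                samples[l],
                np.concatenate([samples[j] for j in keep if j != l]),
            )
            for l in keep
        }
        worst = max(dists, key=dists.get)
        if dists[worst] <= threshold:
            break  # every remaining sample is close enough to its pool
        keep.remove(worst)
        removed.append(worst)
    return keep, removed
```

On data where one group is clearly shifted away from the rest, that group is detected and removed first, after which the remaining leave-one-out distances drop sharply.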

An important feature of our method is that it allows one to identify which fraction of a given sample accounts for the possible deviation from the main pooling pattern. More precisely, if a sample does not contribute to the maximal pooled sample, the procedure allows one to identify the subsample which is closest to the main pooling pattern, providing better insight into the essential deviations between the sample and the main trend. In the Selectividad problem, it would allow one to discover that a given grader is giving grades that are too low (or too dispersed, …) when compared with the majority of the graders.
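As a rough illustration of this idea (a greedy sketch of our own, not the paper's optimal trimming): in one dimension, one can match the sorted sample against the corresponding quantiles of the pool and discard the fraction α of points with the largest discrepancy, keeping the part of the sample closest to the pooled pattern:

```python
import numpy as np

def closest_subsample(sample, pool, alpha):
    """Greedy sketch: keep the (1 - alpha) fraction of the sample whose
    order statistics best match the corresponding quantiles of the pool.

    The paper derives an optimal trimming; this quantile matching is
    only meant to illustrate the kind of output one obtains."""
    s = np.sort(np.asarray(sample, dtype=float))
    m = len(s)
    # Quantiles of the pool at the plotting positions of the sample
    q = np.quantile(pool, (np.arange(m) + 0.5) / m)
    deviation = np.abs(s - q)
    k = int(np.floor(alpha * m))                    # points we may discard
    keep = np.sort(np.argsort(deviation)[: m - k])  # smallest deviations
    return s[keep]
```

The discarded points show directly *where* the sample departs from the main trend, e.g. an isolated cluster of abnormally high grades.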

The remaining sections of this paper are organized as follows. In Section 2 we give some background on trimmed distributions, similarity and the technical tools involved in the analysis of the method, in order to make this paper self-contained. This material is extracted or easily deduced from the works of Álvarez-Esteban et al. (2008, 2011, 2012). In Section 2.2 we introduce our stepwise search methodology for the problem. Section 3 explores the performance of our procedure through a simulation study. In Section 4 we apply the procedure to the Selectividad data and explore some features of our approach for data analysis purposes.

Similarity and trimming

Our procedure for finding a (maximal) α-similarly pooled set of samples relies on the connection between the similarity model (1) and the sets of trimmings of a probability. This connection was explored in Álvarez-Esteban et al. (2012), where a test for the similarity model (1) in a two sample setup was introduced. For the sake of readability we summarize here the main facts about trimmings and the similarity model.
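For intuition (our sketch; the formal definition of a trimming is given below): in the empirical case, one valid α-trimming simply discards at most a fraction α of the observations and re-normalizes the weights of the rest:

```python
import numpy as np

def empirical_trimming(sample, alpha, discard_idx):
    """One valid alpha-trimming of the empirical distribution of `sample`:
    remove the points listed in `discard_idx` (at most a fraction alpha
    of the n points) and give the remaining ones equal weight 1/(n - k).

    The bound k <= alpha * n guarantees the new weights do not exceed
    1 / ((1 - alpha) * n), i.e. no point's probability is inflated by
    more than the factor 1 / (1 - alpha)."""
    n = len(sample)
    if len(discard_idx) > np.floor(alpha * n):
        raise ValueError("discarding more than a fraction alpha of the points")
    keep = np.setdiff1d(np.arange(n), discard_idx)
    return sample[keep]
```

Which points to discard is precisely the optimization underlying the trimmed-distance computations used by the procedure.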

Definition 2.1

Given α ∈ [0, 1) and a probability measure P, an α-trimming of P is any probability measure P* that is absolutely continuous with respect to P and satisfies dP*/dP ≤ 1/(1 − α), P-almost surely.

Simulation study

In order to illustrate the behavior of the algorithm described in Section 2.2, we have randomly generated samples of size n_i, i = 1, …, 10, for different values of n_i (= 30, 100, 300), from 10 distributions: P_1, P_2 and P_3 ∼ N(0, 1); P_4 ∼ 0.95 N(0, 1) + 0.05 N(3, 1); P_5 ∼ 0.90 N(0, 1) + 0.10 N(3, 1); P_6 ∼ 0.80 N(0, 1) + 0.20 N(3, 1); P_7 ∼ 0.60 N(0, 1) + 0.40 N(3, 1); P_8 ∼ 0.90 N(0, 1) + 0.10 N(0, 3); P_9 ∼ N(2, 1) and P_10 ∼ N(3, 1). Fig. 2 shows the corresponding densities.
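This simulation setup can be reproduced directly. The seed is arbitrary, and we read N(0, 3) as a normal with standard deviation 3, which is an assumption about the notation:

```python
import numpy as np

rng = np.random.default_rng(2013)
n = 100  # one of the sample sizes used: 30, 100 or 300

def mixture(n, eps, loc, scale=1.0):
    """Sample n points from (1 - eps) N(0, 1) + eps N(loc, scale^2)."""
    contam = rng.random(n) < eps
    x = rng.normal(0.0, 1.0, n)
    x[contam] = rng.normal(loc, scale, contam.sum())
    return x

samples = [
    rng.normal(0.0, 1.0, n),      # P1
    rng.normal(0.0, 1.0, n),      # P2
    rng.normal(0.0, 1.0, n),      # P3
    mixture(n, 0.05, 3.0),        # P4
    mixture(n, 0.10, 3.0),        # P5
    mixture(n, 0.20, 3.0),        # P6
    mixture(n, 0.40, 3.0),        # P7
    mixture(n, 0.10, 0.0, 3.0),   # P8: N(0, 3) contamination inflates spread
    rng.normal(2.0, 1.0, n),      # P9
    rng.normal(3.0, 1.0, n),      # P10
]
```

P_4–P_6 are mild contaminations of the standard normal and should join the main pooled pattern for moderate α, while P_7, P_9 and P_10 deviate substantially and are the natural candidates for removal.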

We try to find and discard those samples, if any, that are not similar to the

Analysis of the Selectividad graders

In this section we analyze the Selectividad data. Our dataset corresponds to 1550 exams on a particular subject, received by the coordinator who, in turn, distributed them among 10 graders. Each grader received roughly the same number of exams (between 152 and 156). As stated, the main aim is to determine whether there are graders whose grades deviate unreasonably from the others.

Perhaps the only conclusion which can be drawn from the box-plots in Fig. 1 is that Grader 1 assigns grades that,

Acknowledgments

The authors wish to thank the referees and the Editor for their suggestions and corrections that have considerably improved the paper. This research has been partially supported by the Spanish Ministerio de Ciencia e Innovación, Grant MTM2011-28657-C02-01 and 02.
