A predictive estimator of finite population proportion despite missing data

https://doi.org/10.1016/j.amc.2014.01.128

Abstract

This paper considers the problem of estimating a finite population proportion when there are missing values. The prediction approach is used to define a new estimator that presents desirable efficiency properties. Simulation studies are considered to evaluate the performance of the proposed estimator via empirical relative bias and empirical relative efficiency, and favourable results are achieved.

Introduction

The use of auxiliary population information, provided by one or several auxiliary variables, at the estimation stage is a commonly used technique that offers many advantages (see e.g., [6], [19], [20], [4], [17], [23]). However, in many practical situations, instead of auxiliary variables there exist certain auxiliary attributes which are correlated with the study variable. Abd-Elfattah et al. [1], Grover and Kaur [10], Koyuncu [13] and Singh and Solanki [18] proposed a set of estimators for the mean using information on a single auxiliary attribute, in simple random sampling, and this was later extended by Malik and Singh [14] to the case of two attributes. All these papers assume that there is no nonresponse.

Information on auxiliary attributes can also be used to deal with missing values, which commonly arise in survey research and often cause serious difficulties. A variety of methods have been developed to compensate for missing data in a general-purpose way so that the survey data file can be analysed irrespective of the missing data (see e.g., [19, chapter 12]).

When sample observations are missing, the simplest solution is to eliminate the incomplete observations, but this can bias the estimates and increase the sampling variance. Another solution is to employ imputation techniques to replace the missing observations (see e.g., [12], [5]). However, this practice may invalidate the inferences and can often have serious consequences. Considering that the missing observations may contain valuable information, a third option is to attempt to improve the precision of the estimators by including all cases available for their calculation. Some authors have defined indirect estimators for means and variances when the sample is drawn by simple random sampling without replacement and some observations are missing (see e.g., [27], [24], [25], [26], [21], [22], [16]).

However, the estimation of a population proportion in the presence of missing data is a problem that has received little research attention. Álvarez et al. [3] recently defined a general class of estimators of a population proportion on the basis of a random sample drawn according to any sampling design, and assuming an auxiliary attribute whose population proportion is known from a census or estimated without sampling errors.

Estimating a single proportion is a common task in many practical and research situations (biopharmaceutical experiments, clinical research, marketing research, opinion surveys, polls, etc.). These surveys often contain auxiliary information on several variables (including numeric and binary attributes). In this study, we seek to build a new estimator that makes use of the information in the sample for the study and auxiliary variables (quantitative or attribute), to estimate the population proportion, on the basis of a logistic regression superpopulation model.
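As a rough illustration of the prediction idea under a logistic regression superpopulation model, the following Python sketch fits the model on the sampled respondents and adds predicted probabilities for every other unit. It is not the estimator defined in this paper: the function names `fit_logistic` and `predictive_proportion`, the simulated population, the model coefficients and the 80% response rate are all assumptions made for the example, and the auxiliary variables x and z are assumed known for the whole population.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Fit a logistic regression by Newton-Raphson; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        # A tiny ridge term keeps the Hessian numerically invertible.
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def predictive_proportion(A_obs, X_obs, X_rest):
    """Observed indicators plus model-predicted probabilities for all unobserved units."""
    beta = fit_logistic(X_obs, A_obs)
    p_rest = 1.0 / (1.0 + np.exp(-X_rest @ beta))
    N = len(A_obs) + len(X_rest)
    return (A_obs.sum() + p_rest.sum()) / N

# Hypothetical population with one quantitative (x) and one binary (z) auxiliary variable.
rng = np.random.default_rng(0)
N, n = 2000, 200
x = rng.normal(size=N)
z = rng.binomial(1, 0.4, size=N)
prob = 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * x + 1.0 * z)))
A = rng.binomial(1, prob)                      # attribute indicator A_i
X = np.column_stack([np.ones(N), x, z])

sample = rng.choice(N, size=n, replace=False)  # simple random sampling without replacement
responds = rng.random(n) < 0.8                 # assumed: values missing completely at random
observed = sample[responds]                    # sampled units whose A_i is recorded
rest = np.setdiff1d(np.arange(N), observed)    # non-sampled units and nonrespondents

p_hat = predictive_proportion(A[observed], X[observed], X[rest])
print("true proportion:", A.mean(), "predictive estimate:", p_hat)
```

With an intercept-only design matrix the sketch reduces to the respondent sample proportion, which is a convenient sanity check.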

In Section 2, we introduce the problem of the estimation of a proportion when there are missing values. We define a new estimator of the population proportion in the case of a general sampling design, assuming that two auxiliary variables (quantitative or attribute) are available. Assuming different scenarios, the proposed point estimators are evaluated empirically in Section 3, and we report that the conclusions obtained are consistent with the theoretical properties derived in the previous sections.

Proportion estimators in the presence of missing values

Let $U = \{1, 2, \ldots, N\}$ be a population of N identifiable elements. We consider the problem of estimating the population proportion $P_A = N^{-1} \sum_{i \in U} A_i$, where $A_i$ is an attribute indicator for unit i, i.e., $A_i = 1$ if unit i has the attribute of interest A, and $A_i = 0$ otherwise. $P_A$ is the parameter of interest, which needs to be estimated. For this purpose, a random sample s, of size n, is selected from U according to a given sampling design. The first- and the second-order inclusion probabilities associated with
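To make the notation concrete, the sketch below computes the population proportion from the attribute indicators and two standard design-based estimators built from first-order inclusion probabilities: the Horvitz-Thompson estimator and a Hájek-type ratio restricted to respondents. These are textbook estimators used here only for illustration, not the estimator proposed in this paper. The function names, the simulated population and the 80% response rate are assumptions.

```python
import numpy as np

def population_proportion(A):
    """Population proportion: the mean of the indicators A_i over the whole population."""
    return np.mean(A)

def ht_proportion(A_s, pi_s, N):
    """Horvitz-Thompson estimator: N^{-1} times the sum over the sample of A_i / pi_i."""
    return np.sum(A_s / pi_s) / N

def hajek_respondent_proportion(A_s, pi_s, responded):
    """Hajek-type ratio using respondents only; units with missing A_i are dropped."""
    w = 1.0 / pi_s[responded]
    return np.sum(w * A_s[responded]) / np.sum(w)

# Hypothetical population and a simple random sample without replacement.
rng = np.random.default_rng(1)
N, n = 1000, 100
A = rng.binomial(1, 0.3, size=N)
s = rng.choice(N, size=n, replace=False)
pi_s = np.full(n, n / N)            # first-order inclusion probabilities under this design
responded = rng.random(n) < 0.8     # assumed response indicator (simulation only)

print("population proportion:", population_proportion(A))
print("Horvitz-Thompson (full response):", ht_proportion(A[s], pi_s, N))
print("Hajek, respondents only:", hajek_respondent_proportion(A[s], pi_s, responded))
```

Under simple random sampling without replacement, the first estimator equals the sample proportion and the second equals the respondent proportion. In a real survey the values of A for nonrespondents are unknown, so only the respondent-based version can be computed.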

Properties of the proposed estimator

A model-based estimator for the population proportion has been defined in Section 2. We now study several properties of this estimator, which may be important in practice.

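  • $\hat{P}_{TB}$ is linear in the Y's:
    $$\hat{P}_{TB} = \sum_{s_1 \cup s_3} \frac{(f_{s_1} + f_{s_3})}{n - c}\, Y_i + B = \sum_{s_1 \cup s_3} w_i Y_i + B,$$
    where the weights $w_i$ are independent of $Y_i$ and B depends on the sample s only through the variables x and z.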

  • $\hat{P}_{TB}$ is data intensive in the sense that we must know the values of x for all the units of the population. This assumption is usual in social surveys with information

Simulation study

The theoretical comparison of the proposed alternative estimators is not a simple issue because they rely on different principles: on the one hand, prediction (or model-based) theory, and on the other, probability sampling (or design-based) theory. Little [11] examined some aspects of the debate between design-based and model-based inference for sample surveys. Model-based estimators often have a smaller variance than design-based competitors, especially for small samples where the latter
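The empirical measures used in such simulation studies can be sketched as follows. The definitions below are one common convention and are assumptions here: relative bias is the mean estimate minus the true value, divided by the true value, and relative efficiency is the ratio of a candidate's empirical mean squared error to that of a reference estimator, so values below 1 favour the candidate. The two estimators compared (the respondent proportion and the infeasible full-sample proportion) are placeholders chosen for the example, not the estimators studied in the paper.

```python
import numpy as np

def empirical_rb(estimates, truth):
    """Empirical relative bias: (mean of the estimates - truth) / truth."""
    return (np.mean(estimates) - truth) / truth

def empirical_mse(estimates, truth):
    """Empirical mean squared error over the simulation replications."""
    return np.mean((np.asarray(estimates) - truth) ** 2)

def empirical_re(candidate, reference, truth):
    """Assumed convention: MSE(candidate) / MSE(reference); below 1 favours the candidate."""
    return empirical_mse(candidate, truth) / empirical_mse(reference, truth)

# Hypothetical Monte Carlo study under simple random sampling without replacement.
rng = np.random.default_rng(2)
N, n, reps = 1000, 100, 2000
A = rng.binomial(1, 0.3, size=N)
P_A = A.mean()

full = np.empty(reps)   # sample proportion if every sampled unit responded (benchmark)
cc = np.empty(reps)     # complete-case proportion using respondents only
for b in range(reps):
    s = rng.choice(N, size=n, replace=False)
    r = rng.random(n) < 0.8          # assumed: missing completely at random
    full[b] = A[s].mean()
    cc[b] = A[s][r].mean()

rb_cc = empirical_rb(cc, P_A)
re_cc = empirical_re(cc, full, P_A)
print("relative bias (complete case):", rb_cc)
print("relative efficiency vs full response:", re_cc)
```

In this setup the complete-case estimator discards information, so its relative efficiency against the full-response benchmark is expected to exceed 1.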

Discussion

The proportion is a commonly used statistic for summarising data. The customary proportion estimator does not involve auxiliary information at the estimation stage, and so the aim of this paper is to add this auxiliary information in the presence of missing data, and to do so in an efficient way.

The proposed estimator shows very good behaviour in simulation studies versus estimators $\hat{p}_{AH}$ and $\hat{p}_{gd}$, achieving increased efficiency. We note that the estimator $\hat{p}_{gd}$ was compared in [3]

Acknowledgements

This work is partially supported by Ministerio de Educación y Ciencia (contract Nos. MTM2009-10055 and MTM2012-35650).

References (27)

  • J.C. Deville et al., Calibration estimators in survey sampling, J. Am. Stat. Assoc. (1992)

  • P. Duchesne, Estimation of a proportion with survey data, J. Stat. Edu. (2003)

  • F.R. Fernández et al., Ejercicios y Prácticas de Muestreo en Poblaciones Finitas (1996)