Abstract
Many government agencies still rely on grouped data as the main source of information for calculating the Gini index. Previous research showed that the Gini index based on grouped data suffers from first- and second-order bias relative to the Gini index computed from individual data. Since the accuracy of the estimated bias correction depends on many underlying assumptions, we propose a new method, named D-Gini, which reduces the bias in the Gini coefficient based on grouped data. We investigate the performance of the D-Gini method on an open-ended tail interval of the income distribution. The results of our simulation study show that our method is very effective in minimizing the first- and second-order bias in the Gini index and outperforms other methods previously used for bias correction of the Gini index based on grouped data. Three data sets are used to illustrate the application of this method.
References
Abounoori E, McCloughan P (2003) A simple way to calculate the Gini coefficient for grouped as well as ungrouped data. Appl Econ Lett 10(8):505–509
Alabdulmohsin IM (2018) Summability calculus, a comprehensive theory of fractional finite sums. Springer, Cham
Alfons A, Templ M, Filzmoser P (2010) Applications of statistical simulation in the case of EU-SILC: using the R package simframe. J Stat Softw 37(3):17
Bound J, Brown C, Mathiowetz N (2001) Measurement error in survey data. In: Handbook of econometrics, vol 5. Elsevier, New York, pp 3705–3843
Chen Y, Miljkovic T (2019) From grouped to de-grouped data: a new approach in distribution fitting for grouped data. J Stat Comput Simul 89(2):272–291
Cowell F (2011) Measuring inequality. Oxford University Press, Oxford
Cowell FA (1977) Measuring inequality. Philip Allan, Oxford
Cowell FA, Mehta F (1982) The estimation and interpolation of inequality measures. Rev Econ Stud 49(2):273–290
Deltas G (2003) The small-sample bias of the Gini coefficient: results and implications for empirical research. Rev Econ Stat 85(1):226–234
Drechsler J, Kiesl H (2016) Beat the heap: an imputation strategy for valid inferences from rounded income data. J Surv Stat Methodol 4(1):22–42
European Union (2020) Income taxes abroad. https://europa.eu/youreurope/citizens/work/taxes/income-taxes-abroad/austria/index_en.htm. Accessed Apr 25, 2020
Eurostat (2018) Gini coefficient of equivalised disposable income—EU-SILC survey. https://ec.europa.eu/eurostat/, Accessed Aug 20, 2020
Fabrizi E, Trivisano C (2016) Small area estimation of the Gini concentration coefficient. Comput Stat Data Anal 99:223–234
Heitjan DF (1989) Inference from grouped continuous data: a review. Stat Sci 4(2):164–179
Heitjan DF (1994) Ignorability in general incomplete-data models. Biometrika 81(4):701–708
Kakwani N, Wagstaff A, Van Doorslaer E (1997) Socioeconomic inequalities in health: measurement, computation, and statistical inference. J Econometr 1:87–103
Kobayashi G, Kakamu K (2019) Approximate Bayesian computation for Lorenz curves from grouped data. Comput Stat 34(1):253–279
Lerman R, Yitzhaki S (1989) Improving the accuracy of estimates of Gini coefficients. J Econometr 42(1):43–47
Little RJ, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York
Lyon M, Cheung LC, Gastwirth JL (2016) The advantages of using group means in estimating the Lorenz curve and Gini index from grouped data. Am Stat 70(1):25–32
Milanovic B (1994) The Gini-type functions: an alternative derivation. Bull Econ Res 46(1):81–90
Nishino H, Kakamu K (2011) Grouped data estimation and testing of Gini coefficients using lognormal distributions. Sankhya B 73(2):193–210
Pyatt G, Chen C, Fei J (1995) The distribution of income by factor components. Q J Econ 95(3):451–473
Rubin DB (1978) Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. In: Proceedings of the survey research methods section of the American Statistical Association, vol 1. American Statistical Association, pp 20–34
Schenker N, Raghunathan TE, Chiu PL, Makuc DM, Zhang G, Cohen AJ (2006) Multiple imputation of missing income data in the national health interview survey. J Am Stat Assoc 101(475):924–933
Schneeweiß H, Komlos J, Ahmad AS (2010) Symmetric and asymmetric rounding: a review and some new results. AStA Adv Stat Anal 94(3):247–271
Stefanski LA (2000) Measurement error models. J Am Stat Assoc 95(452):1353–1358
The World Bank, Development Research Group (2017) Gini index (World Bank estimate). https://data.worldbank.org/. Accessed Aug 20, 2020
Tillé Y, Langel M (2012) Histogram-based interpolation of the Lorenz curve and Gini index for grouped data. Am Stat 66(4):225–231
US Census Bureau (2018) 2018 annual social and economic supplement. https://www.census.gov/. Accessed Apr 3, 2019
Van Ourti T, Clarke P (2011) A simple correction to remove the bias of the Gini coefficient due to grouping. Rev Econ Stat 93(3):982–994
Wodon Q, Yitzhaki S (2003) The effect of using grouped data on the estimation of the Gini income elasticity. Econ Lett 78(2):153–159
Acknowledgements
The authors are grateful to two anonymous reviewers whose comments and suggestions significantly improved the quality of this paper.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
Proof of Lemma 1
Proof
Let \(Y_1\) and \(Y_2\) be two independent copies of \(Y\). Define the random variable \(U=[f(Y_2)-f(Y_1)][h(Y_2)-h(Y_1)]\). Note that \(U \ge 0\) almost surely. Indeed, if \(Y_2\ge Y_1\), then because \(f\) and \(h\) are non-decreasing functions, \(f(Y_2)-f(Y_1)\ge 0\) and \(h(Y_2)-h(Y_1)\ge 0\), thus \(U \ge 0\). If \(Y_2 \le Y_1\), then \(f(Y_2)-f(Y_1) \le 0\) and \(h(Y_2)-h(Y_1)\le 0\), thus \(U \ge 0\). Since \(Y_1\) and \(Y_2\) have the same distribution as \(Y\), \({\mathbb {E}}[f(Y_1)h(Y_1)]= {\mathbb {E}}[f(Y_2)h(Y_2)]={\mathbb {E}}[f(Y)h(Y)]\), \({\mathbb {E}}[f(Y_1)]={\mathbb {E}}[f(Y_2)]={\mathbb {E}}[f(Y)]\), and \({\mathbb {E}}[h(Y_1)]={\mathbb {E}}[h(Y_2)]={\mathbb {E}}[h(Y)]\). Since they are also independent, \({\mathbb {E}}[f(Y_1)h(Y_2)]={\mathbb {E}}[f(Y_2)h(Y_1)]={\mathbb {E}}[f(Y)]{\mathbb {E}}[h(Y)]\). Therefore,
$$0 \le {\mathbb {E}}[U] = 2{\mathbb {E}}[f(Y)h(Y)] - 2{\mathbb {E}}[f(Y)]{\mathbb {E}}[h(Y)].$$
Dividing both sides of this inequality by 2 and adding \({\mathbb {E}}[f(Y)]{\mathbb {E}}[h(Y)]\) to both sides gives Lemma 1. \(\square \)
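The argument above can be checked numerically. The following is a minimal sketch, assuming \(Y\) is uniformly distributed on a simulated sample and taking \(f=\log (1+y)\) and \(h=\sqrt{y}\) as illustrative non-decreasing functions; the function names and the choice of distribution are hypothetical and not part of the proof.

```python
import numpy as np

def association_gap(y, f, h):
    """E[f(Y)h(Y)] - E[f(Y)]E[h(Y)] for Y uniform on the sample y."""
    fy, hy = f(y), h(y)
    return float(np.mean(fy * hy) - np.mean(fy) * np.mean(hy))

def pairwise_mean_u(y, f, h):
    """E[U] with U = [f(Y2)-f(Y1)][h(Y2)-h(Y1)], Y1 and Y2 i.i.d. uniform on y."""
    fy, hy = f(y), h(y)
    df = fy[:, None] - fy[None, :]   # all pairwise differences of f
    dh = hy[:, None] - hy[None, :]   # all pairwise differences of h
    return float(np.mean(df * dh))   # average over all n^2 ordered pairs

# Illustrative data: the lognormal sample is an assumption, not from the paper.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)

gap = association_gap(y, np.log1p, np.sqrt)
mean_u = pairwise_mean_u(y, np.log1p, np.sqrt)
print(gap, mean_u)
```

Here `mean_u` should agree with `2 * gap` up to floating-point error, and both should be non-negative, which mirrors the step of dividing by 2 in the proof.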
Proof of Lemma 2
Proof
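Let \(D \subset \{1,\ldots ,n\}\) be a non-empty set and let \(f_i\), \(h_i\) be numbers indexed by \(i\in D\). If \(f_i\le f_j\) and \(h_i \le h_j\) for all \(i,j\in D\) with \(i<j\), then Lemma 1 applies with \(f(i)=f_i\) and \(h(i)=h_i\), where the random variable \(Y\) takes the value \(i\) and is uniformly distributed on \(D\). This yields
$$\frac{1}{|D|}\sum _{i\in D} f_i h_i \ge \left( \frac{1}{|D|}\sum _{i\in D} f_i\right) \left( \frac{1}{|D|}\sum _{i\in D} h_i\right) . \qquad (13)$$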
Recall that \(P_1, \ldots , P_K\) are equal-size subsets of \(\{1,\ldots , n\}\) such that \(|P_g| =m\) for \(1\le g\le K\), \(\phi _g=\frac{1}{|P_g|}\sum _{i\in P_g}y^{*}_i\), \(R^{K}_g=\frac{1}{|P_g|}\sum _{i\in P_g}R^{*}_i\), and \({\bar{\phi }} = \frac{1}{K}\sum _{g=1}^{K}\phi _g\). Since \(y^{*}_1\le \cdots \le y^{*}_n\) and \(R^{*}_1 \le \cdots \le R^{*}_n\), applying Eq. (13) with \(D=P_g\), \(f_i=y^{*}_i\), and \(h_i=R^{*}_i\) leads to
$$\frac{1}{|P_g|}\sum _{i\in P_g} y^{*}_i R^{*}_i \ge \phi _g R^{K}_g, \qquad 1\le g\le K.$$
Therefore,
\(\square \)
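The effect of this inequality on the Gini index can be illustrated with a short simulation. This is only a sketch under stated assumptions: the groups \(P_g\) are taken as consecutive blocks of the sorted data, \(R^{*}_i\) is assumed to be the fractional rank \(i/n\), and the Gini index is computed in its covariance form, \(2\,\mathrm {cov}(y, R)/{\bar{y}}\). The function names and the lognormal sample are hypothetical.

```python
import numpy as np

def cov_gini(values, ranks):
    """Covariance-form Gini index: 2 * cov(values, ranks) / mean(values)."""
    cov = np.mean(values * ranks) - np.mean(values) * np.mean(ranks)
    return float(2.0 * cov / np.mean(values))

def individual_and_grouped_gini(y, K):
    """Return (individual Gini, grouped Gini) for K equal-size groups of sorted y."""
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n = y_sorted.size
    if n % K != 0:
        raise ValueError("n must be divisible by K so that all groups have size m")
    m = n // K
    ranks = np.arange(1, n + 1) / n             # assumed form of R*_i
    phi = y_sorted.reshape(K, m).mean(axis=1)   # group means phi_g
    r_k = ranks.reshape(K, m).mean(axis=1)      # group mean ranks R^K_g
    return cov_gini(y_sorted, ranks), cov_gini(phi, r_k)

# Illustrative incomes: the distribution and its parameters are assumptions.
rng = np.random.default_rng(1)
incomes = rng.lognormal(mean=10.0, sigma=0.8, size=1000)
g_individual, g_grouped = individual_and_grouped_gini(incomes, K=10)
print(g_individual, g_grouped)
```

With equal-size groups the overall means of the incomes and of the ranks are unchanged by grouping, so the within-group inequality carries over to the covariance and the grouped value should not exceed the individual one.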
Proof of Lemma 3
Proof
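The maximum order statistic \(Y_{(n)}\) has the following density function
$$f_{Y_{(n)}}(y) = \frac{n}{\lambda } e^{-\frac{(y-L)}{\lambda }}\left[ 1-e^{-\frac{(y-L)}{\lambda }}\right] ^{n-1}, \qquad y > L.$$
The expected value of the maximum order statistic is
$${\mathbb {E}}[Y_{(n)}] = \int _{L}^{\infty } y\, \frac{n}{\lambda } e^{-\frac{(y-L)}{\lambda }}\left[ 1-e^{-\frac{(y-L)}{\lambda }}\right] ^{n-1} dy.$$
Since \(0< e^{-\frac{(y-L)}{\lambda }} < 1\) for \(y > L\), by using the binomial series expansion we have
$${\mathbb {E}}[Y_{(n)}] = \int _{L}^{\infty } y\, \frac{n}{\lambda } \sum _{j=0}^{n-1} \binom{n-1}{j} (-1)^{j} e^{-\frac{(j+1)(y-L)}{\lambda }} dy.$$
The quantity inside the summation is absolutely integrable, and by interchanging the summation with the integration we obtain
$${\mathbb {E}}[Y_{(n)}] = L \sum _{j=0}^{n-1} \binom{n}{j+1} (-1)^{j} + \lambda \sum _{j=0}^{n-1} \binom{n}{j+1} \frac{(-1)^{j}}{j+1},$$
where the first sum is equal to 1 and the second sum is the \(n\)th harmonic number, that is, the sum of the reciprocals of the first n natural numbers, \(\sum _{j=1}^{n}\frac{1}{j}\). This sum diverges slowly, with logarithmic growth, and it is approximated by \(\ln n\) plus the Euler–Mascheroni constant \(\gamma \) and a small error term \(\epsilon _n\approx \frac{1}{2n}\) that vanishes as n goes to infinity [refer to Alabdulmohsin (2018) for more details about these terms]. Based on these well-known facts about the harmonic sum, we obtain the final result
$${\mathbb {E}}[Y_{(n)}] = L + \lambda \sum _{j=1}^{n}\frac{1}{j} = L + \lambda \left( \ln n + \gamma + \epsilon _n\right) .$$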
\(\square \)
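The closed form implied by this proof can be compared with a Monte Carlo estimate. The sketch below assumes that \(Y\) follows a shifted exponential distribution with location \(L\) and scale \(\lambda \), for which the expected maximum of n draws is \(L+\lambda \sum _{j=1}^{n}\frac{1}{j}\); the function names and the parameter values are hypothetical.

```python
import math
import numpy as np

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant

def expected_max_exact(n, L, lam):
    """L + lam * H_n, where H_n is the sum of reciprocals of 1..n."""
    h_n = sum(1.0 / j for j in range(1, n + 1))
    return L + lam * h_n

def expected_max_approx(n, L, lam):
    """Logarithmic approximation: L + lam * (ln n + gamma + 1/(2n))."""
    return L + lam * (math.log(n) + EULER_GAMMA + 1.0 / (2 * n))

def expected_max_simulated(n, L, lam, reps, seed=0):
    """Monte Carlo mean of the maximum of n shifted-exponential draws."""
    rng = np.random.default_rng(seed)
    samples = L + rng.exponential(scale=lam, size=(reps, n))
    return float(samples.max(axis=1).mean())

# Illustrative parameter values, chosen only for the demonstration.
n, L, lam = 50, 100.0, 20.0
exact = expected_max_exact(n, L, lam)
approx = expected_max_approx(n, L, lam)
simulated = expected_max_simulated(n, L, lam, reps=20000)
print(exact, approx, simulated)
```

The exact and approximate values should differ only by the remainder of the harmonic-sum approximation, while the simulated mean should fall within Monte Carlo error of both.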
Cite this article
Miljkovic, T., Chen, YJ. A new computational approach for estimation of the Gini index based on grouped data. Comput Stat 36, 2289–2311 (2021). https://doi.org/10.1007/s00180-021-01082-7