A new computational approach for estimation of the Gini index based on grouped data

Miljkovic, Tatjana; Chen, Ying-Ju

doi:10.1007/s00180-021-01082-7

Original paper
Published: 25 February 2021

Computational Statistics, Volume 36, pages 2289–2311 (2021)

Abstract

Many government agencies still rely on grouped data as the main source of information for calculating the Gini index. Previous research showed that the Gini index based on grouped data suffers from first- and second-order bias relative to the Gini index computed from individual data. Since the accuracy of the estimated bias correction depends on many underlying assumptions, we propose a new method, named D-Gini, which reduces the bias in the Gini coefficient based on grouped data. We investigate the performance of the D-Gini method on an open-ended tail interval of the income distribution. The results of our simulation study show that the method is very effective in minimizing the first- and second-order bias in the Gini index and outperforms other methods previously used for bias correction of the Gini index based on grouped data. Three data sets are used to illustrate the application of the method.

References

Abounoori E, McCloughan P (2003) A simple way to calculate the Gini coefficient for grouped as well as ungrouped data. Appl Econ Lett 10(8):505–509
Alabdulmohsin IM (2018) Summability calculus, a comprehensive theory of fractional finite sums. Springer, Cham
Alfons A, Templ M, Filzmoser P (2010) Applications of statistical simulation in the case of EU-SILC: using the R package simframe. J Stat Softw 37(3):17
Bound J, Brown C, Mathiowetz N (2001) Measurement error in survey data. In: Handbook of econometrics, vol 5. Elsevier, New York, pp 3705–3843
Chen Y, Miljkovic T (2019) From grouped to de-grouped data: a new approach in distribution fitting for grouped data. J Stat Comput Simul 89(2):272–291
Cowell F (2011) Measuring inequality. Oxford University Press, Oxford
Cowell FA (1977) Measuring inequality. Philip Allan, Oxford
Cowell FA, Mehta F (1982) The estimation and interpolation of inequality measures. Rev Econ Stud 49(2):273–290
Deltas G (2003) The small-sample bias of the Gini coefficient: results and implications for empirical research. Rev Econ Stat 85(1):226–234
Drechsler J, Kiesl H (2016) Beat the heap: an imputation strategy for valid inferences from rounded income data. J Surv Stat Methodol 4(1):22–42
European Union (2020) Income taxes abroad. https://europa.eu/youreurope/citizens/work/taxes/income-taxes-abroad/austria/index_en.htm. Accessed Apr 25, 2020
Eurostat (2018) Gini coefficient of equivalised disposable income - EU-SILC survey. https://ec.europa.eu/eurostat/. Accessed Aug 20, 2020
Fabrizi E, Trivisano C (2016) Small area estimation of the Gini concentration coefficient. Comput Stat Data Anal 99:223–234
Heitjan DF (1989) Inference from grouped continuous data: a review. Stat Sci 4(2):164–179
Heitjan DF (1994) Ignorability in general incomplete-data models. Biometrika 81(4):701–708
Kakwani N, Wagstaff A, Van Doorslaer E (1997) Socioeconomic inequalities in health: measurement, computation, and statistical inference. J Econometr 1:87–103
Kobayashi G, Kakamu K (2019) Approximate Bayesian computation for Lorenz curves from grouped data. Comput Stat 34(1):253–279
Lerman R, Yitzhaki S (1989) Improving the accuracy of estimates of Gini coefficients. J Econometr 42(1):43–47
Little RJ, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York
Lyon M, Cheung LC, Gastwirth JL (2016) The advantages of using group means in estimating the Lorenz curve and Gini index from grouped data. Am Stat 70(1):25–32
Milanovic B (1994) The Gini-type functions: an alternative derivation. Bull Econ Res 46(1):81–90
Nishino H, Kakamu K (2011) Grouped data estimation and testing of Gini coefficients using lognormal distributions. Sankhya B 73(2):193–210
Pyatt G, Chen C, Fei J (1995) The distribution of income by factor components. Q J Econ 95(3):451–473
Rubin DB (1978) Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. In: Proceedings of the survey research methods section of the American Statistical Association, vol 1. American Statistical Association, pp 20–34
Schenker N, Raghunathan TE, Chiu PL, Makuc DM, Zhang G, Cohen AJ (2006) Multiple imputation of missing income data in the national health interview survey. J Am Stat Assoc 101(475):924–933
Schneeweiß H, Komlos J, Ahmad AS (2010) Symmetric and asymmetric rounding: a review and some new results. AStA Adv Stat Anal 94(3):247–271
Stefanski LA (2000) Measurement error models. J Am Stat Assoc 95(452):1353–1358
The World Bank, Development Research Group (2017) Gini index (World Bank estimate). https://data.worldbank.org/. Accessed Aug 20, 2020
Tillé Y, Langel M (2012) Histogram-based interpolation of the Lorenz curve and Gini index for grouped data. Am Stat 66(4):225–231
US Census Bureau (2018) 2018 annual social and economic supplement. https://www.census.gov/. Accessed Apr 3, 2019
Van Ourti T, Clarke P (2011) A simple correction to remove the bias of the Gini coefficient due to grouping. Rev Econ Stat 93(3):982–994
Wodon Q, Yitzhaki S (2003) The effect of using grouped data on the estimation of the Gini income elasticity. Econ Lett 78(2):153–159

Acknowledgements

The authors are grateful to two anonymous reviewers whose comments and suggestions significantly improved the quality of this paper.

Author information

Authors and Affiliations

Miami University, 100 Bishop Circle, Oxford, OH, 45056, USA
Tatjana Miljkovic
University of Dayton, 300 College Park, Dayton, OH, 45469, USA
Ying-Ju Chen

Corresponding author

Correspondence to Tatjana Miljkovic.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Proof of Lemma 1

Proof

Let $Y_1$ and $Y_2$ be two independent copies of $Y$, and define the random variable $U=[f(Y_2)-f(Y_1)][h(Y_2)-h(Y_1)]$. Then $U \ge 0$ almost surely. Indeed, if $Y_2\ge Y_1$, then because $f$ and $h$ are non-decreasing functions, $f(Y_2)-f(Y_1)\ge 0$ and $h(Y_2)-h(Y_1)\ge 0$, thus $U \ge 0$. If $Y_2 \le Y_1$, then $f(Y_2)-f(Y_1) \le 0$ and $h(Y_2)-h(Y_1)\le 0$, thus again $U \ge 0$. Since $Y_1$ and $Y_2$ have the same distribution as $Y$, ${\mathbb {E}}[f(Y_1)h(Y_1)]= {\mathbb {E}}[f(Y_2)h(Y_2)]={\mathbb {E}}[f(Y)h(Y)]$, ${\mathbb {E}}[f(Y_1)]={\mathbb {E}}[f(Y_2)]={\mathbb {E}}[f(Y)]$, and ${\mathbb {E}}[h(Y_1)]={\mathbb {E}}[h(Y_2)]={\mathbb {E}}[h(Y)]$. Since they are independent, the cross terms factor, for example ${\mathbb {E}}[f(Y_2)h(Y_1)]={\mathbb {E}}[f(Y_2)]{\mathbb {E}}[h(Y_1)]$. Therefore,

$$\begin{aligned} {\mathbb {E}}[U]= & {} {\mathbb {E}}[(f(Y_2)-f(Y_1))(h(Y_2)-h(Y_1))]\\= & {} {\mathbb {E}}[f(Y_2)h(Y_2)] - {\mathbb {E}}[f(Y_2)]{\mathbb {E}}[h(Y_1)]-{\mathbb {E}}[f(Y_1)]{\mathbb {E}}[h(Y_2)]+ {\mathbb {E}}[f(Y_1)h(Y_1)]\\= & {} 2{\mathbb {E}}[f(Y)h(Y)]-2{\mathbb {E}}[f(Y)]{\mathbb {E}}[h(Y)]\ge 0. \end{aligned}$$

Dividing both sides of this inequality by 2 and adding ${\mathbb {E}}[f(Y)]{\mathbb {E}}[h(Y)]$ to both sides gives Lemma 1. $\square $
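The identity at the heart of this proof, ${\mathbb {E}}[U]=2\left({\mathbb {E}}[f(Y)h(Y)]-{\mathbb {E}}[f(Y)]{\mathbb {E}}[h(Y)]\right)$, can be checked exactly on a small discrete distribution. The following is a minimal sketch, not part of the paper: the function name, the support, the probabilities, and the choices of $f$ and $h$ are all illustrative assumptions. Exact rational arithmetic is used so that the equality holds without rounding error.

```python
from fractions import Fraction
from itertools import product


def covariance_and_pair_expectation(support, probs, f, h):
    """Return (E[f(Y)h(Y)] - E[f(Y)]E[h(Y)], E[U]) for a discrete Y.

    U = [f(Y2) - f(Y1)] * [h(Y2) - h(Y1)] with Y1, Y2 independent copies of Y.
    The function name and signature are hypothetical, chosen for illustration.
    """
    law = list(zip(support, probs))
    e_f = sum(p * f(y) for y, p in law)
    e_h = sum(p * h(y) for y, p in law)
    e_fh = sum(p * f(y) * h(y) for y, p in law)
    # Independence of Y1 and Y2 means the joint law is the product law,
    # so E[U] is a double sum over all pairs of support points.
    e_u = sum(
        p1 * p2 * (f(y2) - f(y1)) * (h(y2) - h(y1))
        for (y1, p1), (y2, p2) in product(law, repeat=2)
    )
    return e_fh - e_f * e_h, e_u


# Illustrative (made-up) distribution and two non-decreasing functions.
support = [1, 2, 3, 5]
probs = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
gap, e_u = covariance_and_pair_expectation(
    support, probs, lambda y: y * y, lambda y: 2 * y + 1
)
```

Here `gap` is the difference ${\mathbb {E}}[fh]-{\mathbb {E}}[f]{\mathbb {E}}[h]$ and `e_u` is ${\mathbb {E}}[U]$; for non-decreasing `f` and `h` both are non-negative, and `e_u` is exactly twice `gap`.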

Proof of Lemma 2

Proof

Let $D \subset \{1,\ldots ,n\}$ be a non-empty set and let $f_i$, $h_i$ be numbers indexed by $i\in D$. If $f_i\le f_j$ and $h_i \le h_j$ for all $i,j\in D$ with $i<j$, then Lemma 1 applies with $f(i)=f_i$ and $h(i)=h_i$, where the random variable $Y$ is uniformly distributed on $D$. This yields

$$\begin{aligned} \Big (\frac{1}{|D|}\sum _{i\in D} f_i\Big )\Big (\frac{1}{|D|}\sum _{i\in D} h_i\Big )={\mathbb {E}}[f(Y)]{\mathbb {E}}[h(Y)]\le {\mathbb {E}}[f(Y)h(Y)]= \Big (\frac{1}{|D|}\sum _{i\in D} f_ih_i\Big ). \end{aligned}$$

(13)

Recall that $P_1, \ldots , P_K$ are equal-size subsets of $\{1,\ldots , n\}$ such that $|P_g| =m$ for $1\le g\le K$, that $\phi _g=\frac{1}{|P_g|}\sum _{i\in P_g}y^{*}_i$ and $R^{K}_g=\frac{1}{|P_g|}\sum _{i\in P_g}R^{*}_i$, and that ${\bar{\phi }} = \frac{1}{K}\sum _{g=1}^{K}\phi _g$. Since $y^{*}_1\le \cdots \le y^{*}_n$ and $R^{*}_1 \le \cdots \le R^{*}_n$, applying Eq. (13) with $D=P_g$, $f_i=y^{*}_i$, and $h_i=R^{*}_i$ leads to

$$\begin{aligned} \phi _gR^{K}_g=\left( \frac{1}{|P_g|}\sum _{i\in P_g}y^{*}_i\right) \left( \frac{1}{|P_g|}\sum _{i\in P_g}R^{*}_i\right) \le \frac{1}{|P_g|}\sum _{i\in P_g}y^{*}_iR^{*}_i=\frac{1}{m}\sum _{i\in P_g}y^{*}_iR^{*}_i. \end{aligned}$$

(14)

Therefore, summing Eq. (14) over $g$ and using $n=mK$,

$$\begin{aligned} G^{*}_n= & {} \frac{2 \sum _{i=1}^{n} y^{*}_i R^{*}_i}{n \bar{\phi }} - 1 =\frac{2 \sum _{g=1}^{K}\left( \sum _{i \in P_g} y^{*}_i R^{*}_i\right) }{n \bar{\phi }} - 1\\= & {} \frac{2 \sum _{g=1}^{K}\left( \sum _{i \in P_g} y^{*}_i R^{*}_i\right) }{mK \bar{\phi }} - 1 =\frac{2 \sum _{g=1}^{K}\left( \frac{1}{m}\sum _{i\in P_g} y^{*}_i R^{*}_i\right) }{K \bar{\phi }} - 1\\\ge & {} \frac{2 \sum _{g=1}^{K}\phi _g R^{K}_g}{K \bar{\phi }} - 1= G^{K}_n. \end{aligned}$$

$\square $
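The inequality $G^{K}_n \le G^{*}_n$ can be illustrated numerically. The sketch below is not the authors' implementation and rests on several assumptions: the ranks are taken to be fractional ranks $R^{*}_i=(i-0.5)/n$ (the proof only needs them to be non-decreasing), the groups $P_g$ are taken to be contiguous blocks of $m$ sorted observations, and the function names and income values are made up for illustration.

```python
from statistics import fmean


def gini_individual(y_sorted, ranks):
    """G*_n = 2 * sum(y_i * R_i) / (n * mean(y)) - 1 for sorted data."""
    n = len(y_sorted)
    weighted = sum(y * r for y, r in zip(y_sorted, ranks))
    return 2.0 * weighted / (n * fmean(y_sorted)) - 1.0


def gini_grouped(y_sorted, ranks, m):
    """G^K_n computed from group means, using contiguous groups of size m."""
    n = len(y_sorted)
    if n % m != 0:
        raise ValueError("group size m must divide n")
    k = n // m
    # phi_g and R^K_g are the within-group means of incomes and ranks.
    phi = [fmean(y_sorted[g * m:(g + 1) * m]) for g in range(k)]
    rank_means = [fmean(ranks[g * m:(g + 1) * m]) for g in range(k)]
    phi_bar = fmean(phi)  # equals the overall mean because groups are equal-size
    weighted = sum(p * r for p, r in zip(phi, rank_means))
    return 2.0 * weighted / (k * phi_bar) - 1.0


# Made-up sorted incomes; the fractional ranks are an assumed convention.
incomes = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 100.0]
n = len(incomes)
ranks = [(i + 0.5) / n for i in range(n)]

g_star = gini_individual(incomes, ranks)
g_by_group_size = {m: gini_grouped(incomes, ranks, m) for m in (1, 2, 4, 8)}
```

With $m=1$ the grouped value coincides with the individual one, and with a single group ($m=n$) it collapses to zero, which shows the two extremes of the downward bias that Lemma 2 describes.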

Proof of Lemma 3

Proof

The maximum order statistic $Y_{(n)}$ has the following density function

$$\begin{aligned} g_{(n)}(y)= \frac{n}{\lambda }[1-e^{-\frac{(y-L)}{\lambda }}]^{n-1} e^{-\frac{(y-L)}{\lambda }}, \quad L< y < \infty . \end{aligned}$$

(15)

The expected value of the maximum order statistic is

$$\begin{aligned} {\mathbb {E}}[Y_{(n)}]=\frac{n}{\lambda } \int _L^{\infty }y [1-e^{-\frac{(y-L)}{\lambda }}]^{n-1} e^{-\frac{(y-L)}{\lambda }}dy. \end{aligned}$$

(16)

Since $0< e^{-\frac{(y-L)}{\lambda }} < 1$ for $y > L$, expanding $[1-e^{-\frac{(y-L)}{\lambda }}]^{n-1}$ by the binomial theorem gives

$$\begin{aligned} {\mathbb {E}}[Y_{(n)}]=\frac{n}{\lambda } \int _L^{\infty } y \sum _{j=0}^{n-1}(-1)^j \left( {\begin{array}{c}n-1\\ j\end{array}}\right) e^{-\frac{(y-L)}{\lambda }(j+1)}dy. \end{aligned}$$

Each term of the finite sum is absolutely integrable, so interchanging the summation with the integration gives

$$\begin{aligned} {\mathbb {E}}[Y_{(n)}]= & {} \frac{n}{\lambda }\sum _{j=0}^{n-1}(-1)^j \left( {\begin{array}{c}n-1\\ j\end{array}}\right) \int _L^{\infty }y e^{-\frac{(y-L)}{\lambda }(j+1)}dy\\= & {} L\sum _{j=0}^{n-1}(-1)^j \left( {\begin{array}{c}n-1\\ j\end{array}}\right) \frac{n}{(j+1)} + \lambda \sum _{j=0}^{n-1}(-1)^{j} \left( {\begin{array}{c}n-1\\ j\end{array}}\right) \frac{n}{(j+1)^2}, \end{aligned}$$

where the second line uses $\int _L^{\infty }y e^{-\frac{(y-L)}{\lambda }(j+1)}dy=\frac{L\lambda }{j+1}+\frac{\lambda ^2}{(j+1)^2}$. The first sum is equal to 1 and the second sum is equal to the $n$th harmonic number $H_n=\sum _{j=1}^{n}\frac{1}{j}$, the sum of the reciprocals of the first $n$ natural numbers. The harmonic number diverges slowly, with logarithmic growth: $H_n=\ln (n)+\gamma +\epsilon _n$, where $\gamma \approx 0.5772$ is the Euler–Mascheroni constant and $\epsilon _n\approx \frac{1}{2n}$ is a small error term that vanishes as $n$ goes to infinity [refer to Alabdulmohsin (2018) for more details about these terms]. Based on these well-known facts about the harmonic numbers, we obtain the final result

$$\begin{aligned} {\mathbb {E}}[Y_{(n)}]= L + \lambda [\ln (n)+\gamma +\epsilon _n]\le L + \lambda [\ln (n) + 1]. \end{aligned}$$

(17)

$\square $
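The steps of this proof can be sanity-checked numerically. The sketch below is an illustration under stated assumptions rather than anything taken from the paper: the function names, the parameter values ($L=10$, $\lambda =2$, $n=20$), the number of replications, and the seed are all hypothetical. It verifies with exact rational arithmetic that the two alternating binomial sums, with weights $\frac{n}{j+1}$ and $\frac{n}{(j+1)^2}$ as obtained from evaluating the integral, equal 1 and $H_n$ respectively, and then compares $L+\lambda H_n$ with a Monte Carlo estimate of the mean maximum of $n$ shifted exponential draws.

```python
import math
import random
from fractions import Fraction


def harmonic(n):
    """Exact n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


def binomial_sums(n):
    """Exact values of the two alternating binomial sums in the proof."""
    first = sum(
        Fraction((-1) ** j * math.comb(n - 1, j) * n, j + 1) for j in range(n)
    )
    second = sum(
        Fraction((-1) ** j * math.comb(n - 1, j) * n, (j + 1) ** 2)
        for j in range(n)
    )
    return first, second


def expected_max(n, loc, scale):
    """Closed form L + lambda * H_n for the expected maximum."""
    return loc + scale * float(harmonic(n))


def simulated_max_mean(n, loc, scale, reps, seed=0):
    """Monte Carlo mean of the maximum of n shifted exponential draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # random.expovariate takes the rate, i.e. 1 / scale.
        total += max(loc + rng.expovariate(1.0 / scale) for _ in range(n))
    return total / reps


exact = expected_max(20, 10.0, 2.0)
upper_bound = 10.0 + 2.0 * (math.log(20) + 1.0)
estimate = simulated_max_mean(20, 10.0, 2.0, reps=5000)
```

The closed-form value `exact` sits below `upper_bound`, as Eq. (17) states, and the simulated `estimate` should agree with `exact` up to Monte Carlo error.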

About this article

Cite this article

Miljkovic, T., Chen, YJ. A new computational approach for estimation of the Gini index based on grouped data. Comput Stat 36, 2289–2311 (2021). https://doi.org/10.1007/s00180-021-01082-7

Received: 29 April 2020
Accepted: 28 January 2021
Published: 25 February 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s00180-021-01082-7
