A new computational approach for estimation of the Gini index based on grouped data

  • Original paper
Computational Statistics

Abstract

Many government agencies still rely on grouped data as the main source of information for calculating the Gini index. Previous research showed that the Gini index computed from grouped data suffers from first- and second-order bias relative to the Gini index computed from individual data. Since the accuracy of the estimated bias correction depends on many underlying assumptions, we propose a new method, named D-Gini, which reduces the bias in the Gini coefficient based on grouped data. We investigate the performance of the D-Gini method on an open-ended tail interval of the income distribution. The results of our simulation study show that our method is very effective in minimizing the first- and second-order bias in the Gini index and outperforms other methods previously used for bias correction of the Gini index based on grouped data. Three data sets are used to illustrate the application of this method.

References

  • Abounoori E, McCloughan P (2003) A simple way to calculate the Gini coefficient for grouped as well as ungrouped data. Appl Econ Lett 10(8):505–509

  • Alabdulmohsin IM (2018) Summability calculus, a comprehensive theory of fractional finite sums. Springer, Cham

  • Alfons A, Templ M, Filzmoser P (2010) Applications of statistical simulation in the case of EU-SILC: using the R package simframe. J Stat Softw 37(3):17

  • Bound J, Brown C, Mathiowetz N (2001) Measurement error in survey data. In: Handbook of econometrics, vol 5. Elsevier, New York, pp 3705–3843

  • Chen Y, Miljkovic T (2019) From grouped to de-grouped data: a new approach in distribution fitting for grouped data. J Stat Comput Simul 89(2):272–291

  • Cowell F (2011) Measuring inequality. Oxford University Press, Oxford

  • Cowell FA (1977) Measuring inequality. Philip Allan, Oxford

  • Cowell FA, Mehta F (1982) The estimation and interpolation of inequality measures. Rev Econ Stud 49(2):273–290

  • Deltas G (2003) The small-sample bias of the Gini coefficient: results and implications for empirical research. Rev Econ Stat 85(1):226–234

  • Drechsler J, Kiesl H (2016) Beat the heap: an imputation strategy for valid inferences from rounded income data. J Surv Stat Methodol 4(1):22–42

  • European Union (2020) Income taxes abroad. https://europa.eu/youreurope/citizens/work/taxes/income-taxes-abroad/austria/index_en.htm. Accessed Apr 25, 2020

  • Eurostat (2018) Gini coefficient of equivalised disposable income: EU-SILC survey. https://ec.europa.eu/eurostat/. Accessed Aug 20, 2020

  • Fabrizi E, Trivisano C (2016) Small area estimation of the Gini concentration coefficient. Comput Stat Data Anal 99:223–234

  • Heitjan DF (1989) Inference from grouped continuous data: a review. Stat Sci 4(2):164–179

  • Heitjan DF (1994) Ignorability in general incomplete-data models. Biometrika 81(4):701–708

  • Kakwani N, Wagstaff A, Van Doorslaer E (1997) Socioeconomic inequalities in health: measurement, computation, and statistical inference. J Econometr 1:87–103

  • Kobayashi G, Kakamu K (2019) Approximate Bayesian computation for Lorenz curves from grouped data. Comput Stat 34(1):253–279

  • Lerman R, Yitzhaki S (1989) Improving the accuracy of estimates of Gini coefficients. J Econometr 42(1):43–47

  • Little RJ, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York

  • Lyon M, Cheung LC, Gastwirth JL (2016) The advantages of using group means in estimating the Lorenz curve and Gini index from grouped data. Am Stat 70(1):25–32

  • Milanovic B (1994) The Gini-type functions: an alternative derivation. Bull Econ Res 46(1):81–90

  • Nishino H, Kakamu K (2011) Grouped data estimation and testing of Gini coefficients using lognormal distributions. Sankhya B 73(2):193–210

  • Pyatt G, Chen C, Fei J (1995) The distribution of income by factor components. Q J Econ 95(3):451–473

  • Rubin DB (1978) Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. In: Proceedings of the survey research methods section of the American Statistical Association, vol 1. American Statistical Association, pp 20–34

  • Schenker N, Raghunathan TE, Chiu PL, Makuc DM, Zhang G, Cohen AJ (2006) Multiple imputation of missing income data in the national health interview survey. J Am Stat Assoc 101(475):924–933

  • Schneeweiß H, Komlos J, Ahmad AS (2010) Symmetric and asymmetric rounding: a review and some new results. AStA Adv Stat Anal 94(3):247–271

  • Stefanski LA (2000) Measurement error models. J Am Stat Assoc 95(452):1353–1358

  • The World Bank, Development Research Group (2017) Gini index (World Bank estimate). https://data.worldbank.org/. Accessed Aug 20, 2020

  • Tillé Y, Langel M (2012) Histogram-based interpolation of the Lorenz curve and Gini index for grouped data. Am Stat 66(4):225–231

  • US Census Bureau (2018) 2018 annual social and economic supplement. https://www.census.gov/. Accessed Apr 3, 2019

  • Van Ourti T, Clarke P (2011) A simple correction to remove the bias of the Gini coefficient due to grouping. Rev Econ Stat 93(3):982–994

  • Wodon Q, Yitzhaki S (2003) The effect of using grouped data on the estimation of the Gini income elasticity. Econ Lett 78(2):153–159

Acknowledgements

The authors are grateful to two anonymous reviewers whose comments and suggestions significantly improved the quality of this paper.

Author information

Corresponding author

Correspondence to Tatjana Miljkovic.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Appendices

Proof of Lemma 1

Proof

Let \(Y_1\), \(Y_2\) be two independent copies of Y. Define the random variable \(U=[f(Y_2)-f(Y_1)][h(Y_2)-h(Y_1)]\). Note that \(U \ge 0\) almost surely. Indeed, if \(Y_2\ge Y_1\), then because \( f\) and \( h\) are non-decreasing functions, \(f(Y_2)-f(Y_1)\ge 0\) and \(h(Y_2)-h(Y_1)\ge 0\), thus \(U \ge 0\). If \(Y_2 \le Y_1\), then \(f(Y_2)-f(Y_1) \le 0\) and \(h(Y_2)-h(Y_1)\le 0\), thus \(U \ge 0\). Since \(Y_1\) and \(Y_2\) have the same distribution, \({\mathbb {E}}[f(Y_1)h(Y_1)]= {\mathbb {E}}[f(Y_2)h(Y_2)]={\mathbb {E}}[f(Y)h(Y)]\), \({\mathbb {E}}[f(Y_1)]={\mathbb {E}}[f(Y_2)]={\mathbb {E}}[f(Y)]\), and \({\mathbb {E}}[h(Y_1)]={\mathbb {E}}[h(Y_2)]={\mathbb {E}}[h(Y)]\). Since they are also independent, the cross terms factor: \({\mathbb {E}}[f(Y_2)h(Y_1)]={\mathbb {E}}[f(Y_2)]{\mathbb {E}}[h(Y_1)]\) and \({\mathbb {E}}[f(Y_1)h(Y_2)]={\mathbb {E}}[f(Y_1)]{\mathbb {E}}[h(Y_2)]\). Therefore,

$$\begin{aligned} {\mathbb {E}}[U]= & {} {\mathbb {E}}[(f(Y_2)-f(Y_1))(h(Y_2)-h(Y_1))]\\= & {} {\mathbb {E}}[f(Y_2)h(Y_2)] - {\mathbb {E}}[f(Y_2)]{\mathbb {E}}[h(Y_1)]-{\mathbb {E}}[f(Y_1)]{\mathbb {E}}[h(Y_2)]+ {\mathbb {E}}[f(Y_1)h(Y_1)]\\= & {} 2{\mathbb {E}}[f(Y)h(Y)]-2{\mathbb {E}}[f(Y)]{\mathbb {E}}[h(Y)]\ge 0. \end{aligned}$$

Dividing both sides of this inequality by 2 and adding \({\mathbb {E}}[f(Y)]{\mathbb {E}}[h(Y)]\) to both sides gives Lemma 1. \(\square \)
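
The inequality of Lemma 1 can be checked numerically. The following is a minimal sketch, not part of the original proof: it takes Y to be uniformly distributed on a simulated sample, so the lemma applies exactly to the sample averages. The helper name `chebyshev_gap`, the exponential sample, and the particular choices of f and h are illustrative assumptions.

```python
import random


def chebyshev_gap(f, h, sample):
    """Return E[f(Y)h(Y)] - E[f(Y)]E[h(Y)] for Y uniform on `sample`."""
    n = len(sample)
    mean_f = sum(f(y) for y in sample) / n
    mean_h = sum(h(y) for y in sample) / n
    mean_fh = sum(f(y) * h(y) for y in sample) / n
    return mean_fh - mean_f * mean_h


# Hypothetical example: non-negative draws and two non-decreasing functions.
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(10_000)]


def f(y):
    return y ** 2          # non-decreasing on [0, inf)


def h(y):
    return min(y, 1.5)     # non-decreasing, capped at 1.5


gap = chebyshev_gap(f, h, sample)
print("E[fh] - E[f]E[h] =", gap)   # Lemma 1 says this is non-negative
```

If h is replaced by a constant function, the gap is zero up to floating-point rounding, which is the equality case of the lemma.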

Proof of Lemma 2

Proof

Let \(D \subset \{1,\ldots ,n\}\) be a non-empty set and let \(f_i\), \(h_i\) be numbers indexed by \(i\in D\). If \(f_i\le f_j\) and \(h_i \le h_j\) for all \(i,j\in D\) with \(i<j\), then Lemma 1 applies with \(f(i)=f_i\) and \(h(i)=h_i\), where the random variable is \(Y=i\) and it is uniformly distributed on D. This yields that

$$\begin{aligned} \Big (\frac{1}{|D|}\sum _{i\in D} f_i\Big )\Big (\frac{1}{|D|}\sum _{i\in D} h_i\Big )={\mathbb {E}}[f(Y)]{\mathbb {E}}[h(Y)]\le {\mathbb {E}}[f(Y)h(Y)]= \Big (\frac{1}{|D|}\sum _{i\in D} f_ih_i\Big ). \end{aligned}$$
(13)

Recall that \(P_1, \ldots , P_K\) are equal size subsets of \(\{1,\ldots , n\}\) such that \(|P_g| =m\) for \(1\le g\le K\), \(\phi _g=\frac{1}{|P_g|}\sum _{i\in P_g}y^{*}_i\), and \(R^{K}_g=\frac{1}{|P_g|}\sum _{i\in P_g}R^{*}_i\), and \({\bar{\phi }} = \frac{1}{K}\sum _{g=1}^{K}\phi _g\). Since \(y^{*}_1\le \cdots \le y^{*}_n\) and \(R^{*}_1 \le \cdots \le R^{*}_n\), Eq. (13) leads to

$$\begin{aligned} \phi _gR^{K}_g=\left( \frac{1}{|P_g|}\sum _{i\in P_g}y^{*}_i\right) \left( \frac{1}{|P_g|}\sum _{i\in P_g}R^{*}_i\right) \le \frac{1}{|P_g|}\sum _{i\in P_g}y^{*}_iR^{*}_i=\frac{1}{m}\sum _{i\in P_g}y^{*}_iR^{*}_i. \end{aligned}$$
(14)

Therefore,

$$\begin{aligned} G^{*}_n= & {} \frac{2 \sum _{i=1}^{n} y^{*}_i R^{*}_i}{n \bar{\phi }} - 1 =\frac{2 \sum _{g=1}^{K}\left( \sum _{i \in P_g} y^{*}_i R^{*}_i\right) }{n \bar{\phi }} - 1\\= & {} \frac{2 \sum _{g=1}^{K}\left( \sum _{i \in P_g} y^{*}_i R^{*}_i\right) }{mK \bar{\phi }} - 1 =\frac{2 \sum _{g=1}^{K}\left( \frac{1}{m}\sum _{i\in P_g} y^{*}_i R^{*}_i\right) }{K \bar{\phi }} - 1\\\ge & {} \frac{2 \sum _{g=1}^{K}\phi _g R^{K}_g}{K \bar{\phi }} - 1= G^{K}_n. \end{aligned}$$

\(\square \)
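
As an illustration of Lemma 2, the sketch below compares the Gini index computed from individual values with the one computed from equal-size group means. It rests on assumptions not stated in this appendix: the ranks are taken as \(R^{*}_i=(i-1/2)/n\), the groups \(P_g\) are consecutive blocks of m sorted observations, and the function names and lognormal test data are hypothetical.

```python
import random


def gini_individual(y_sorted, ranks):
    """G*_n = 2 * sum(y_i * R_i) / (n * mean(y)) - 1 for sorted y and ranks."""
    n = len(y_sorted)
    mean_y = sum(y_sorted) / n
    return 2 * sum(y * r for y, r in zip(y_sorted, ranks)) / (n * mean_y) - 1


def gini_grouped(y_sorted, ranks, m):
    """G^K_n from K = n / m consecutive groups of size m (assumed grouping)."""
    n = len(y_sorted)
    if n % m != 0:
        raise ValueError("group size m must divide n")
    K = n // m
    # Group means of the values (phi_g) and of the ranks (R^K_g).
    phi = [sum(y_sorted[g * m:(g + 1) * m]) / m for g in range(K)]
    rank_means = [sum(ranks[g * m:(g + 1) * m]) / m for g in range(K)]
    phi_bar = sum(phi) / K
    return 2 * sum(p * r for p, r in zip(phi, rank_means)) / (K * phi_bar) - 1


# Hypothetical data: 1000 lognormal "incomes", grouped into 10 blocks of 100.
rng = random.Random(1)
n, m = 1000, 100
y = sorted(rng.lognormvariate(0.0, 1.0) for _ in range(n))
ranks = [(i - 0.5) / n for i in range(1, n + 1)]  # assumed rank definition

g_full = gini_individual(y, ranks)
g_grouped = gini_grouped(y, ranks, m)
print("individual:", g_full, "grouped:", g_grouped)  # Lemma 2: g_full >= g_grouped
```

With m = 1 every group holds a single observation and the two functions agree up to rounding, which matches the equality case of Eq. (14).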

Proof of Lemma 3

Proof

The maximum order statistic, \(Y_{(n)}\), has the following density function

$$\begin{aligned} g_{(n)}(y)= \frac{n}{\lambda }[1-e^{-\frac{(y-L)}{\lambda }}]^{n-1} e^{-\frac{(y-L)}{\lambda }}, \quad L< y < \infty . \end{aligned}$$
(15)

The expected value of the maximum order statistic is

$$\begin{aligned} {\mathbb {E}}[Y_{(n)}]=\frac{n}{\lambda } \int _L^{\infty }y [1-e^{-\frac{(y-L)}{\lambda }}]^{n-1} e^{-\frac{(y-L)}{\lambda }}dy. \end{aligned}$$
(16)

Since \(0< e^{-\frac{(y-L)}{\lambda }} < 1\) for \(y > L\), by using the binomial series expansion we have

$$\begin{aligned} {\mathbb {E}}[Y_{(n)}]=\frac{n}{\lambda } \int _L^{\infty } y \sum _{j=0}^{n-1}(-1)^j \left( {\begin{array}{c}n-1\\ j\end{array}}\right) e^{-\frac{(y-L)}{\lambda }(j+1)}dy. \end{aligned}$$

Each term of the finite sum is absolutely integrable, so by interchanging the summation with the integration and using \(\int _L^{\infty }y e^{-\frac{(y-L)}{\lambda }(j+1)}dy=\frac{L\lambda }{j+1}+\frac{\lambda ^2}{(j+1)^2}\) we obtain

$$\begin{aligned} {\mathbb {E}}[Y_{(n)}]= & {} \frac{n}{\lambda }\sum _{j=0}^{n-1}(-1)^j \left( {\begin{array}{c}n-1\\ j\end{array}}\right) \int _L^{\infty }y e^{-\frac{(y-L)}{\lambda }(j+1)}dy\\= & {} L\sum _{j=0}^{n-1}(-1)^j \left( {\begin{array}{c}n-1\\ j\end{array}}\right) \frac{n}{(j+1)} + \lambda \sum _{j=0}^{n-1}(-1)^{j} \left( {\begin{array}{c}n-1\\ j\end{array}}\right) \frac{n}{(j+1)^2},\\ \end{aligned}$$

where the first sum is equal to 1 and the second sum is equal to the n-th harmonic number \(H_n=\sum _{j=1}^{n}\frac{1}{j}\), the sum of the reciprocals of the first n natural numbers. The harmonic numbers diverge slowly, with logarithmic growth: \(H_n=\ln (n)+\gamma +\epsilon _n\), where \(\gamma \) is the Euler–Mascheroni constant and \(\epsilon _n\approx \frac{1}{2n}\) is a small error term that vanishes as n goes to infinity [refer to Alabdulmohsin (2018) for more details about these terms]. Since \(H_n\le \ln (n)+1\) for all \(n\ge 1\), we obtain the final result

$$\begin{aligned} {\mathbb {E}}[Y_{(n)}]= L + \lambda [\ln (n)+\gamma +\epsilon _n]\le L + \lambda [\ln (n) + 1]. \end{aligned}$$
(17)

\(\square \)
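
The two alternating binomial sums that arise when the integral is evaluated term by term can be verified exactly for small n. The sketch below is an illustration under stated assumptions rather than part of the proof: it writes the sums as \(\sum _{j}(-1)^j\binom{n-1}{j}\frac{n}{j+1}\) and \(\sum _{j}(-1)^j\binom{n-1}{j}\frac{n}{(j+1)^2}\), uses exact rational arithmetic, and the function names (`first_sum`, `second_sum`, `expected_max`) are hypothetical.

```python
from fractions import Fraction
from math import comb, log


def first_sum(n):
    """sum_{j=0}^{n-1} (-1)^j C(n-1, j) n / (j+1); the coefficient of L."""
    return sum(Fraction((-1) ** j * comb(n - 1, j) * n, j + 1) for j in range(n))


def second_sum(n):
    """sum_{j=0}^{n-1} (-1)^j C(n-1, j) n / (j+1)^2; the coefficient of lambda."""
    return sum(Fraction((-1) ** j * comb(n - 1, j) * n, (j + 1) ** 2) for j in range(n))


def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


def expected_max(L, lam, n):
    """E[Y_(n)] = L + lam * H_n for n i.i.d. shifted-exponential draws."""
    return L + lam * float(harmonic(n))


for n in (1, 2, 5, 20):
    assert first_sum(n) == 1                    # coefficient of L is 1
    assert second_sum(n) == harmonic(n)         # coefficient of lambda is H_n
    assert float(harmonic(n)) <= log(n) + 1     # upper bound used in Eq. (17)

print(expected_max(L=100.0, lam=10.0, n=20))
```

Exact fractions are used for the two sums because the alternating binomial terms cancel heavily, and floating-point evaluation loses accuracy quickly as n grows.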

Cite this article

Miljkovic, T., Chen, YJ. A new computational approach for estimation of the Gini index based on grouped data. Comput Stat 36, 2289–2311 (2021). https://doi.org/10.1007/s00180-021-01082-7

  • DOI: https://doi.org/10.1007/s00180-021-01082-7
