Description of methods for estimating Gini indexes, variance estimates and confidence intervals using GiniVarInterval

The aim of this vignette is to describe the various methods for estimating the Gini index, for both infinite and finite populations, as well as the methods for estimating its variance, as implemented in the giniVarCI package. Different confidence intervals for the Gini index are also explained.

To exemplify the use of the different functions, we assume that inequality is measured for a nonnegative and continuous random variable Y. A popular formulation of the Gini index (G) is defined by (see David, 1968; Kendall and Stuart, 1977; Qin et al., 2010): $$ G = \frac{1}{2 \mu_{Y}} \int_{0}^{+\infty} \int_{0}^{+\infty} |x-y|dF_{Y}(x)dF_{Y}(y), $$ where μY = E[Y] = ∫0+∞yf(y)dy = ∫0+∞ydFY(y), is the mean of Y, and FY(y) = P(Y ≤ y) and f(y) are the cumulative distribution function and the probability density function of Y, respectively.

In practice, the value of G is estimated by means of a sample S with size n, which can be selected from either infinite or finite populations (Berger and Gedik Balay, 2020; Muñoz et al., 2023).

Infinite populations

Estimators of the Gini index

For infinite populations, {Yi : i ∈ S} denotes a sequence, with size n, of nonnegative random variables with the same distribution as the variable of interest Y. The Gini index (G) is estimated using the observation of individuals selected in the sample, which are denoted as {yi : i ∈ S}. A popular estimator of the Gini index is (see Langel and Tille, 2013; Giorgi and Gigliarano, 2017; Muñoz et al., 2023): $$\widehat{G} = \displaystyle \frac{2}{\overline{y}n^2}\sum_{i \in S}iy_{(i)} - \frac{n+1}{n}, $$ where $\overline{y}=n^{-1}\sum_{i=1}^{n}y_i$, and y(i) are the ordered values (in non-decreasing order) of the sample observations yi. This is the expression computed by the functions iginindex() (method = 5) and igini() when bias.correction = FALSE.

The estimator can be biased for small sample sizes (Deltas, 2003). The bias corrected (bc) version of is: $$\widehat{G}^{bc} = \displaystyle \frac{2}{\overline{y}n(n-1)}\sum_{i \in S}iy_{(i)} - \frac{n+1}{n-1},$$ which corresponds to the Gini index bias correction version computed by iginindex() (method = 5) and igini() when bias.correction = TRUE.

In the first example, a sample with size n=100 is generated using the gsample() function from the standard logNormal distribution (distribution = "lognormal") with true Gini index is G = 0.5 (gini = 0.5) and the Gini index is estimated using bias correction.

library(giniVarCI)
set.seed(123)
y <- gsample(n = 100, gini = 0.5, distribution = "lognormal")
igini(y)
#> [1] 0.4671929

iginindex() can be used to estimate the Gini index using various estimation methods and both R and C++ codes. See help(iginindex) for a detailed description of the various estimation methods. Efficiency comparisons between both implementations and with other functions available in other packages, such as laeken, DescTools, ineq or REAT, can be made using, for instance, the function microbenchmark():



#Comparing the computation time for the various estimation methods using R 
microbenchmark::microbenchmark(
iginindex(y, method = 1,  useRcpp = FALSE),
iginindex(y, method = 2,  useRcpp = FALSE),
iginindex(y, method = 3,  useRcpp = FALSE),
iginindex(y, method = 4,  useRcpp = FALSE),
iginindex(y, method = 5,  useRcpp = FALSE),
iginindex(y, method = 6,  useRcpp = FALSE),
iginindex(y, method = 7,  useRcpp = FALSE),
iginindex(y, method = 8,  useRcpp = FALSE),
iginindex(y, method = 9,  useRcpp = FALSE),
iginindex(y, method = 10, useRcpp = FALSE) 
)
#> Unit: microseconds
#>                                        expr      min        lq       mean
#>   iginindex(y, method = 1, useRcpp = FALSE)  147.905  153.7470  163.08382
#>   iginindex(y, method = 2, useRcpp = FALSE)   13.074   16.0850   19.48625
#>   iginindex(y, method = 3, useRcpp = FALSE)   11.101   13.1000   16.11174
#>   iginindex(y, method = 4, useRcpp = FALSE)   14.507   18.3290   21.78797
#>   iginindex(y, method = 5, useRcpp = FALSE)   13.716   16.5810   19.65793
#>   iginindex(y, method = 6, useRcpp = FALSE)   28.142   37.2500   50.39609
#>   iginindex(y, method = 7, useRcpp = FALSE)  742.776  765.5785  785.38177
#>   iginindex(y, method = 8, useRcpp = FALSE)  709.183  736.2335  884.36971
#>   iginindex(y, method = 9, useRcpp = FALSE)  527.064  550.8425  594.65901
#>  iginindex(y, method = 10, useRcpp = FALSE) 8968.572 9101.4355 9931.17492
#>     median         uq       max neval
#>   159.3675   168.6795   211.725   100
#>    17.7685    20.2230    65.752   100
#>    14.6070    16.4105    53.280   100
#>    20.2780    22.9985    69.890   100
#>    18.4595    20.9840    64.010   100
#>    44.9040    53.2295   484.073   100
#>   779.5295   800.9295   872.808   100
#>   752.5190   777.7410  3381.810   100
#>   562.9505   583.7340  3142.804   100
#>  9239.8735 10827.6005 14842.992   100


# Comparing the computation time for the various estimation methods using Rcpp
microbenchmark::microbenchmark(
iginindex(y, method = 1),
iginindex(y, method = 2),
iginindex(y, method = 3),
iginindex(y, method = 4),
iginindex(y, method = 5),
iginindex(y, method = 6),
iginindex(y, method = 7),
iginindex(y, method = 8),
iginindex(y, method = 9),
iginindex(y, method = 10) )
#> Unit: microseconds
#>                       expr      min        lq       mean    median         uq
#>   iginindex(y, method = 1)   45.014   45.7005   46.98846   46.1410    46.9275
#>   iginindex(y, method = 2)    9.177   10.3090   12.02072   10.7755    12.0580
#>   iginindex(y, method = 3)    9.177   10.0135   11.36852   10.5500    11.3960
#>   iginindex(y, method = 4)    9.548   10.3845   11.57021   10.8255    11.5970
#>   iginindex(y, method = 5)    8.195    8.7970    9.58509    9.1020     9.5675
#>   iginindex(y, method = 6)    8.326    8.7565   10.00369    9.2070    10.0185
#>   iginindex(y, method = 7)   76.202   76.9590   78.56232   77.7000    78.5520
#>   iginindex(y, method = 8)   44.764   45.5050   47.28924   45.8950    46.8725
#>   iginindex(y, method = 9)   39.153   39.8945   41.39766   40.3900    41.7580
#>  iginindex(y, method = 10) 8887.872 9000.2510 9594.67304 9111.9850 10338.4330
#>        max neval
#>     57.226   100
#>     67.616   100
#>     22.372   100
#>     22.021   100
#>     22.131   100
#>     22.883   100
#>     98.494   100
#>     67.927   100
#>     51.676   100
#>  11997.664   100


# Comparing the computation time for estimates of the Gini index in various R packages.

microbenchmark::microbenchmark(
igini(y),
laeken::gini(y),
DescTools::Gini(y),
ineq::Gini(y),
REAT::gini(y))
#> Registered S3 methods overwritten by 'DescTools':
#>   method   from
#>   lines.Lc ineq
#>   plot.Lc  ineq
#> Unit: microseconds
#>                expr    min      lq       mean  median      uq        max neval
#>            igini(y)  8.746 11.8320   13.42974 12.9095 13.8955     41.017   100
#>     laeken::gini(y) 31.809 35.9420  151.47034 37.4700 39.1880  11368.220   100
#>  DescTools::Gini(y) 52.147 55.3885 6453.94998 56.9060 59.0450 597008.995   100
#>       ineq::Gini(y) 39.273 43.8370   71.51889 44.9535 46.5415   2552.803   100
#>       REAT::gini(y) 86.332 90.0880  111.39978 91.8865 95.0120   1876.992   100

Variance estimation and confidence intervals

Variance estimators and confidence intervals are described using different methods for the estimator of the non-bias corrected version of Gini index , since as $$\widehat{G}^{bc} = \frac{n}{n-1}\widehat{G},$$ the variance estimators and confidence intervals based on bc can be straightforwardly derived. In particular, $$\widehat{V}(\widehat{G}^{bc})=\frac{n^2}{(n-1)^2}\widehat{V}(\widehat{G}).$$ Let [L, U] the lower and upper limits of a confidence interval for G based on . The confidence interval based on bc can be computed as: $$ \left[ \frac{n}{n-1}L, \frac{n}{n-1}U\right].$$

Bootstrap

The argument interval = pbootstrap in the function igini() returns the confidence interval for the Gini index using the percentile bootstrap method. Let {y1*(b), …, yn*(b)} be the bth bootstrap sample taken from the original sample {y1, …, yn} by simple random sampling with replacement, and *(b) denotes the estimator computed from the bth bootstrap sample, with b = {1, …, B}, being B the total number of bootstrap samples. For a confidence level 1 − α, the percentile bootstrap confidence interval is defined as (see Qin et al., 2010): [(α/2)*, (1 − α/2)*], where (a)* is the ath quantile of the bootstrapped coefficients *(b). A variance estimator of the Gini index based on bootstrap is defined as $$\widehat{V}_{B}(\widehat{G})= \displaystyle \frac{1}{B-1}\sum_{b=1}^{B}\left(\widehat{G}^{*}(b) - \overline{G}^{*} \right)^2,$$ where $$\overline{G}^{*}=\frac{1}{B}\sum_{b=1}^{B}\widehat{G}^{*}(b).$$


# Gini index estimation and confidence interval using 'pbootstrap',

igini(y, interval = "pbootstrap")
#> $Gini
#> [1] 0.4671929
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.4004204 0.5135315
#> 
#> $Variance
#> [1] 0.0008333577

interval = "BCa" computes the bias-corrected and accelerated bootstrap interval (Davison and Hinkley, 1997). The idea of this confidence interval is to correct for bias due to the skewness in the distribution of bootstrap estimates. The "BCa" confidence interval is defined as: [(α1)*, (α2)*], where $$\alpha_{1}=\phi\left( \widehat{Z}_{0} + \frac{\widehat{Z}_{0} + Z_{\alpha}}{1-\widehat{a} (\widehat{Z}_{0} + \widehat{Z}_{\alpha}) } \right),$$

$$\alpha_{2}=\phi\left( \widehat{Z}_{0} + \frac{\widehat{Z}_{0} + Z_{1-\alpha}}{1-\widehat{a} (\widehat{Z}_{0} + \widehat{Z}_{1-\alpha}) } \right),$$ ϕ(⋅) is the cumulative distribution function of the standard Normal distribution, and Za is the ath quantile of the standard Normal distribution. The bias-correction factor is defined as $$\widehat{Z}_{0}=\phi^{-1}\left(\#\frac{\widehat{G}^{*}(b) - \widehat{G}}{B}\right),$$ and the acceleration factor is given by $$\widehat{a}=\frac{\sum_{i \in S}\left(\overline{G} -\widehat{G}_{-i} \right)^3}{6\left\{\sum_{i \in S}\left(\overline{G} -\widehat{G}_{-i} \right)^2\right\}^{3/2}},$$ where i are the jackknife estimates defined in the following section, and $$\overline{G} = \frac{1}{n}\sum_{i \in S}\widehat{G}_{-i}.$$


# Gini index estimation and confidence interval using 'Bca'.

igini(y, interval = "BCa")
#> $Gini
#> [1] 0.4671929
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.4178437 0.5280127
#> 
#> $Variance
#> [1] 0.0008051247

Jackknife

The "zjackknife" and "tjackknife" methods compute the variance of the Gini index using the Ogwang Jackknife procedure (Ogwang, 2000; Langel and Tille, 2013). This variance si given by $$\widehat{V}_{J}(\widehat{G})= \displaystyle \frac{n-1}{n}\sum_{i \in S}\left(\widehat{G}_{-i}- \overline{G} \right)^2,$$ where $$\widehat{G}_{-i}=\widehat{G}+\frac{2}{\sum_{j \in S}y_j - y_{(i)} }\left[ \frac{y_{(i)} \sum_{j \in S}jy_{(j)}}{n\sum_{j \in S}y_j}+\frac{\sum_{j \in S}jy_{(j)}}{n(n-1)} - \frac{\sum_{j \in S}y_j-\sum_{j=1}^{i}y_{(j)} +iy_{(i)} }{n-1} \right]-\frac{1}{n(n-1)}, $$ with i = {1, …, n} being the jackknife estimates, i.e., i is the estimation of the Gini index when the unit i is removed from the sample. For a confidence level 1 − α, the "zjackknife" confidence interval is defined as $$\left[\widehat{G} - Z_{1-\alpha/2}\sqrt{\widehat{V}_{J}(\widehat{G})}, \widehat{G} + Z_{1-\alpha/2}\sqrt{\widehat{V}_{J}(\widehat{G})} \right],$$ where Z1 − α/2 is the (1 − α/2)th quantile of the standard Normal distribution.


# Gini index estimation and confidence interval using 'zjackknife'.

igini(y, interval = "zjackknife")
#> $Gini
#> [1] 0.4671929
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.4103563 0.5240296
#> 
#> $Variance
#> [1] 0.0008409313

"tjackknife" sustitutes the critical value Z1 − α/2 by critical values computed from the studentized bootstrap. This confidence interval is given by
$$\left[\widehat{G} - t_{J;1-\alpha/2}^{*}\sqrt{\widehat{V}_{J}(\widehat{G})}, \widehat{G} - t_{J;\alpha/2}^{*}\sqrt{\widehat{V}_{J}(\widehat{G})} \right],$$ where tJ; a* is the ath quantile of the values $$ t^{*}_{J}(b)=\frac{\widehat{G}^{*}(b) - \widehat{G}}{\sqrt{\widehat{V}_{J}\left[\widehat{G}^{*}(b)\right]}}$$ computed using the bootstrap technique, where J[*(b)] is the estimated Ogwang Jackknife variance of *(b) for the bth bootstrap sample.


# Gini index estimation and confidence interval using 'tjackknife'.

igini(y, interval = "tjackknife")
#> $Gini
#> [1] 0.4671929
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.4108575 0.5349601
#> 
#> $Variance
#> [1] 0.0008409313

Linearization

The linearization technique for variance estimation (Deville, 1999) has been applied to the following estimators of the Gini index: $$\widehat{G}^{a} = \displaystyle \frac{1}{2\overline{y}n^{2}}\sum_{i \in S}\sum_{j\in S} |y_i-y_j|$$ and $$\widehat{G}^{b} = \displaystyle \frac{2}{\overline{y}n}\sum_{i \in S}y_{i}\widehat{F}_{n}(y_{i}) - 1,$$ where $$\widehat{F}_{n}(y_i)=\frac{1}{n}\sum_{j \in S}\delta(y_j \leq y_i)$$ and δ(⋅) is the indicator variable that takes the value 1 when its argument is true and 0 otherwise. For a given estimator and a linearizated variable z, the confidence interval, with confidence level 1 − α, is defined as:
$$\left[\widehat{G} - Z_{1-\alpha/2}\sqrt{\widehat{V}_{L}(\widehat{G})}, \widehat{G} + Z_{1-\alpha/2}\sqrt{\widehat{V}_{L}(\widehat{G})} \right],$$

where the variance estimator of the Gini index is given by: $$\widehat{V}_{L}(\widehat{G})= \displaystyle \frac{1}{n(n-1)}\sum_{i \in S}\left(z_{i} - \overline{z}\right)^2,$$ and $$\overline{z}=\frac{1}{n}\sum_{i \in S}z_{i}.$$

On the one hand, interval = "zalinearization" linearizates the estimator a, and the corresponding pseudo-values are (see Langel anf Tillé 2013):

$$z_{(i)}^{a}=\frac{1}{\overline{y}}\left[ \frac{2i}{n}\left( y_{(i)} - \widehat{\overline{Y}}_{(i)}\right) + \overline{y} - y_{(i)} - \widehat{G}^{a}\left(\overline{y} + y_{(i)}\right) \right],$$ where $$\widehat{\overline{Y}}_{(i)} = \displaystyle \frac{1}{i}\sum_{j = 1}^{i}y_{(j)}.$$

On the other hand, interval = "zblinearization" linearizates the estimator b, and the corresponding pseudo values are (see Berger, 2008):

$$z_i^{b}=\frac{1}{\overline{y}}\left[ 2y_i\widehat{F}_{n}(y_i) - (\widehat{G}^{b}+1)(y_i+\overline{y})+2\frac{\sum_{j \in S}y_j\delta(y_j \geq y_i)}{n} \right].$$


# Gini index estimation and confidence interval using 'zalinearization'.

igini(y, interval = "zalinearization")
#> $Gini
#> [1] 0.4671929
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.4125876 0.5217982
#> 
#> $Variance
#> [1] 0.0007762



# Gini index estimation and confidence interval using 'zblinearization'.

igini(y, interval = "zblinearization")
#> $Gini
#> [1] 0.4671929
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.4107537 0.5236321
#> 
#> $Variance
#> [1] 0.0008292117

Intervals "talinearization" and "tblinearization" substitute the critical value Z1 − α/2 by critical values computed from the Studentized bootstrap. This confidence interval is given by
$$\left[\widehat{G} - t_{L;1-\alpha/2}^{*}\sqrt{\widehat{V}_{L}(\widehat{G})}, \widehat{G} - t_{L;\alpha/2}^{*}\sqrt{\widehat{V}_{L}(\widehat{G})} \right],$$ where tL; a* is the ath quantile of the values $$ t^{*}_{L}(b)=\frac{\widehat{G}^{*}(b) - \widehat{G}}{\sqrt{\widehat{V}_{L}\left[\widehat{G}^{*}(b)\right]}}.$$ L(⋅) is computed using the pseudo-values z(i)a when interval = "zalinearization", and using the pseudo-values zib when interval = "zblinearization".


# Gini index estimation and confidence interval using 'talinearization'.

igini(y, interval = "talinearization")
#> $Gini
#> [1] 0.4671929
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.4195142 0.5253662
#> 
#> $Variance
#> [1] 0.0007762


# Gini index estimation and confidence interval using 'tblinearization'.

igini(y, interval = "tblinearization")
#> $Gini
#> [1] 0.4671929
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.4224278 0.5329734
#> 
#> $Variance
#> [1] 0.0008292117

Empirical likelihood

Intervals "ELchisq" and "ELboot" compute the empirical likelihood (EL) method, a nonparametric technique that provides desirable inferences under skewed distributions. The shape of the EL confidence intervals are determined by the data-driven likelihood ratio function (Owen, 2001). interval = "ELchisq" obtains the EL confidence interval, with confidence level 1 − α, for the Gini index as defined by Qin et al. (2010): $$\left\{ \theta|-2R(\theta) \leq \frac{\chi^2_{1;1-\alpha}}{k}\right\}$$

where R(θ) = −∑i ∈ Slog{1 + λZ(yi, θ)} is the log-EL ratio statistic for θ = G, Z(yi, θ) = {2n(yi) − 1}yi − θyi, λ is the solution to $$ \frac{1}{n}\sum_{i \in S}\frac{Z(y_i,\theta)}{1+Z(y_i,\theta)}=0,$$ k = σ̂22/σ̂12 is the scaling factor, $$ \widehat{\sigma}_{j}^{2}=\frac{1}{n-1}\sum_{i \in S}\left(u_{ji} - \overline{u}_{j} \right)^2,$$ with j = {1, 2}, $$ \overline{u}_{j} = \frac{1}{n}\sum_{i \in S}u_{ji},$$ and χ1; 1 − α2 is the (1 − α)th quantile of Chi-Squared distribution with one degree of freedom.


# Gini index estimation and confidence interval using 'ELchisq'.

igini(y, interval = "ELchisq")
#> $Gini
#> [1] 0.4671929
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.4216374 0.5319404
#> 
#> $Variance
#> [1] 0.0008292117

interval = "ELboot" substitutes the critical value based on the Chi-Squared distribution by an empirical critical value based on bootstrap. "ELboot" computes the EL confidence interval (Qin et al., 2010): {θ|−2R(θ) ≤ C1 − α}, where C1 − α is the (1 − α)th quantile of the values {−R1*(), …, −RB*()}, and where Rb*() denotes the value of R(θ) computed from the bth bootstrap sample.


# Gini index estimation and confidence interval using 'ELboot'.

igini(y, interval = "ELboot")
#> $Gini
#> [1] 0.4671929
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.4118394 0.5413343
#> 
#> $Variance
#> [1] 0.0008343201

The function icompareCI() compares the various confidence intervals for the scenario of a sample derived from an infinite population. The argument plotCI = TRUE plots the results derived from the various available methods for constructing confidence intervals.


# Comparisons of variance estimators and confidence intervals.

icompareCI(y, plotCI = FALSE)
#>           interval    bc gini lowerlimit upperlimit var.gini
#> 1       zjackknife FALSE 0.46       0.41       0.52    8e-04
#> 2       zjackknife  TRUE 0.47       0.41       0.52    8e-04
#> 3       tjackknife FALSE 0.46       0.41       0.53    8e-04
#> 4       tjackknife  TRUE 0.47       0.42       0.54    8e-04
#> 5  zalinearization FALSE 0.46       0.41       0.52    8e-04
#> 6  zalinearization  TRUE 0.47       0.41       0.52    8e-04
#> 7  talinearization FALSE 0.46       0.41       0.52    8e-04
#> 8  talinearization  TRUE 0.47       0.42       0.52    8e-04
#> 9  zblinearization FALSE 0.46       0.41       0.52    8e-04
#> 10 zblinearization  TRUE 0.47       0.41       0.52    8e-04
#> 11 tblinearization FALSE 0.46       0.42       0.53    8e-04
#> 12 tblinearization  TRUE 0.47       0.42       0.53    8e-04
#> 13      pbootstrap FALSE 0.46       0.40       0.51    8e-04
#> 14      pbootstrap  TRUE 0.47       0.40       0.51    8e-04
#> 15             BCa FALSE 0.46       0.42       0.53    7e-04
#> 16             BCa  TRUE 0.47       0.42       0.53    8e-04
#> 17         ELchisq FALSE 0.46       0.42       0.53    8e-04
#> 18         ELchisq  TRUE 0.47       0.42       0.53    8e-04
#> 19          ELboot FALSE 0.46       0.41       0.53    7e-04
#> 20          ELboot  TRUE 0.47       0.42       0.54    7e-04

Finite populations

Estimators of the Gini index

For a finite population U, {Yi : i ∈ U} denotes a sequence, with size N, of nonnegative random variables with the same distribution as the variable of interest Y, and {yi : i ∈ U} are the population values of the variable of interest. A sample S is selected from U by using a sampling design with survey weights wi, with i ∈ S. For example, the survey weights can be wi = πi−1, where πi = P(i ∈ S) are the inclusion probabilities (Muñoz et al., 2023). The Gini index (G) is estimated using the observations of individuals selected in the sample {yi : i ∈ S}, and the corresponding survey weights {wi : i ∈ S}. The different methods for estimating the Gini index are (see also Muñoz et al., 2023):

  • method = 1 (Langel and Tillé, 2013).

$$\widehat{G}_{w1}= \displaystyle \frac{1}{2\widehat{N}^{2}\overline{y}_{w}}\sum_{i \in S}\sum_{j \in S}w_{i}w_{j}|y_{i}-y_{j}|,$$ where  = ∑i ∈ Swi and $$\overline{y}_{w}= \frac{1}{\widehat{N}}\sum_{i \in S}w_{i}y_{i}.$$

  • method = 2 (Alfons and Templ, 2012; Langel and Tillé, 2013).

$$\widehat{G}_{w2} =\displaystyle \frac{2\sum_{i \in S}w_{(i)}^{+}\widehat{N}_{(i)}y_{(i)} -\sum_{i \in S}w_{i}^{2}y_{i} }{\widehat{N}^{2}\overline{y}_{w}}-1,$$ where y(i) are the values yi sorted in increasing order, w(i)+ are the values wi sorted according to the increasing order of the values yi, and $\widehat{N}_{(i)}=\sum_{j=1}^{i}w_{(j)}^{+}$. Note that Langel and Tillé (2013) show that w1 = w2.

  • method = 3 (Berger, 2008).

$$\widehat{G}_{w3} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}}\sum_{i \in S}w_{i}y_{i}\widehat{F}_{w}^{\ast}(y_{i})-1, $$ where $$\widehat{F}_{w}^{\ast}(t) = \displaystyle \frac{1}{\widehat{N}}\sum_{i \in S}w_{i}[\delta(y_i < t) + 0.5\delta(y_i = t)]$$ is the smooth (mid-point) distribution function.

  • method = 4 (Berger and Gedik-Balay, 2020).

$$\widehat{G}_{w4} = 1 - \displaystyle \frac{\overline{v}_{w}}{\overline{y}_{w}},$$ where $\overline{v}_{w}=\widehat{N}^{-1}\sum_{i \in S}w_{i}v_{i}$ and $$v_{i} = \displaystyle \frac{1}{\widehat{N} - w_{i}}\sum_{ \substack{j \in S\\ j\neq i}}\min(y_{i},y_{j}).$$

  • method = 5 (Lerman and Yitzhaki, 1989).

$$\widehat{G}_{w5} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}} \sum_{i \in S} w_{(i)}^{+}[y_{(i)} - \overline{y}_{w}]\left[ \widehat{F}_{w}^{LY}(y_{(i)}) - \overline{F}_{w}^{LY} \right], $$ where $$\widehat{F}_{w}^{LY}(y_{(i)}) = \displaystyle \frac{1}{\widehat{N}}\left(\widehat{N}_{(i-1)} + \frac{w_{(i)}^{+}}{2} \right)$$ and $$\overline{F}_{w}^{LY}=\frac{1}{\widehat{N}}\sum_{i \in S}w_{(i)}^{+}\widehat{F}_{w}^{LY}(y_{(i)}).$$

In the finite population example, income and weights from the 2006 Austrian EU-SILC data set (laeken package) are used to estimate the Gini index in the Austrian region of Burgenland. The Gini index is estimated using fgini() and method = 2 (the default method).

data(eusilc, package="laeken")
y <- eusilc$eqIncome[eusilc$db040 == "Burgenland"]
w <- eusilc$rb050[eusilc$db040 == "Burgenland"]
fgini(y, w)
#> [1] 0.3205489

fginindex() can be used to estimate the Gini index using various estimation methods and both R and C++ codes. Efficiency comparisons between both implementations and with other functions available in other packages, such as laeken, DescTools, ineq or REAT, can be made using, for example, the function microbenchmark():


#Comparing the computation time for the various estimation methods and using R
microbenchmark::microbenchmark(
fginindex(y, w, method = 1,  useRcpp = FALSE),
fginindex(y, w, method = 2,  useRcpp = FALSE),
fginindex(y, w, method = 3,  useRcpp = FALSE),
fginindex(y, w, method = 4,  useRcpp = FALSE),
fginindex(y, w, method = 5,  useRcpp = FALSE)
)
#> Unit: microseconds
#>                                          expr      min        lq       mean
#>  fginindex(y, w, method = 1, useRcpp = FALSE) 1810.628 1845.0520 1961.34120
#>  fginindex(y, w, method = 2, useRcpp = FALSE)   50.684   56.7860   64.37533
#>  fginindex(y, w, method = 3, useRcpp = FALSE) 3269.090 3315.3655 4081.59382
#>  fginindex(y, w, method = 4, useRcpp = FALSE) 6974.813 7110.8965 8061.91984
#>  fginindex(y, w, method = 5, useRcpp = FALSE)   62.296   70.9275   82.86917
#>     median        uq       max neval
#>  1854.9155 1873.5055  4575.116   100
#>    61.8950   70.2205   109.835   100
#>  3346.0880 5803.8480  6745.895   100
#>  7194.2875 9733.1090 10871.743   100
#>    81.3215   91.5760   157.805   100


# Comparing the computation time for the various estimation methods and using Rcpp
microbenchmark::microbenchmark(
fginindex(y, w, method = 1),
fginindex(y, w, method = 2),
fginindex(y, w, method = 3),
fginindex(y, w, method = 4),
fginindex(y, w, method = 5)
)
#> Unit: microseconds
#>                         expr      min        lq       mean    median        uq
#>  fginindex(y, w, method = 1) 1127.804 1129.0615 1135.04679 1130.2740 1137.4675
#>  fginindex(y, w, method = 2)   44.994   46.4770   49.81497   47.4835   49.8185
#>  fginindex(y, w, method = 3) 1132.243 1133.8260 1146.05683 1136.8315 1145.3770
#>  fginindex(y, w, method = 4) 1140.959 1142.8775 1150.87898 1146.2235 1152.3155
#>  fginindex(y, w, method = 5)   48.691   50.3135   53.26006   52.0875   54.0755
#>       max neval
#>  1318.741   100
#>   144.820   100
#>  1356.150   100
#>  1332.215   100
#>    90.078   100

# Comparing the computation time for estimates of the Gini index in various R packages.

# Comparing 'method = 2', used also by the laeken package. 

microbenchmark::microbenchmark(
fgini(y,w),
laeken::gini(y,w)
)
#> Unit: microseconds
#>                expr    min      lq     mean  median     uq     max neval
#>         fgini(y, w) 32.540 33.7335 36.18876 34.8805 35.797  98.344   100
#>  laeken::gini(y, w) 47.529 49.2870 52.56107 51.2405 52.618 121.406   100


# Comparing 'method = 5', used also by the DescTools and REAT packages. 

microbenchmark::microbenchmark(
fgini(y,w, method = 5),
DescTools::Gini(y,w),
REAT::gini(y, weighting = w)
)
#> Unit: microseconds
#>                          expr     min       lq      mean   median       uq
#>       fgini(y, w, method = 5)  36.889  40.0650  43.63459  42.5540  44.1825
#>         DescTools::Gini(y, w)  80.901  85.4745  89.55896  87.8890  90.5390
#>  REAT::gini(y, weighting = w) 187.139 190.9460 196.29949 193.8865 196.8420
#>      max neval
#>   98.103   100
#>  137.016   100
#>  287.226   100

Variance estimation and confidence intervals

Jackknife and linearization tecniques compute pseudo-values (named as zi, with i ∈ S) that require the use of an expression for the variance estimation. The function fgini() can compute the following type variance estimators using the argument varformula:

  1. The Horvitz-Thompson ("HT") type variance estimator (Hortvitz and Thompson, 1952).

HT(w) = ∑i ∈ Sj ∈ SΔ̆ijwiwjzizj, which is computed when varformula = "HT", where $$\breve{\Delta}_{ij}=\displaystyle \frac{\pi_{ij}-\pi_{i}\pi_{j}}{\pi_{ij}}.$$

  1. The Sen-Yates-Grundy ("SYG") type variance estimator (Sen, 1953; Yates and Grundy, 1953).

$$\widehat{V}_{SYG}(\widehat{G}_{w}) = - \displaystyle \frac{1}{2}\sum_{i\in S}\sum_{j\in S}\breve{\Delta}_{ij}(w_{i}z_i-w_{j}z_{j})^{2},$$ which is computed when varformula = "SYG".

  1. The Hartley-Rao ("HR") type variance estimator (Hartley and Rao, 1962).

$$\widehat{V}_{HR}(\widehat{G}_{w}) = \displaystyle \frac{1}{n-1}\sum_{i\in S}\sum_{\substack{j \in S\\ j < i}}\left(1-\pi_i-\pi_j + \frac{1}{n}\sum_{k\in U}\pi_{k}^{2} \right)(w_{i}z_i-w_{j}z_{j})^{2},$$ which is computed when varformula = "HR".

Note that the "HT" variance estimator may give negative values, and the "SYG" variance estimator is suitable for fixed-size sampling designs. This implies that "SYG" should not be used under Poisson sampling. Fortunately, "HT" always give positive values under this sampling design. We observe that both Horvitz-Thompson and Sen-Yates-Grundy variance estimators depend on second (joint) inclusion probabilities (argument Pij). The Hàjek (1964) approximation $$\pi_{ij}\cong \pi_{i}\pi_{j}\left[1- \displaystyle \frac{(1-\pi_{i})(1-\pi_{j})}{\sum_{i \in S}(1-\pi_{i})} \right]$$ is used when the second (joint) inclusion probabilities are not available (Pij = NULL). Note that the Hàjek approximation is suggested for large-entropy sampling designs, large samples, and large populations (see Tille 2006; Berger and Tille, 2009; Haziza et al., 2008; Berger, 2011). For instance, this approximation is not recomended for highly-stratified samples (Berger, 2005). The Hartley-Rao variance estimator requires the first inclusion probabilities at the population level (argument PiU).

Bootstrap

For complex sampling designs, the rescaled bootstrap (Rao el al., 1992; Rust and Rao, 1996) can be used for variance estimation and construction of confidence intervals. interval = "pbootstrap" returns the confidence interval for the Gini index using the rescaled bootstrap with confidence limits obtained by the percentile method. For a given estimator w and a confidence level 1 − α, this confidence interval is given by [w; α/2*, w; 1 − α/2*],

where w; a* is the ath quantile of the bootstrapped coefficients w*(b), with b = {1, …, B}, and which are obtained by using the expression w after substituting the original survey weights wi by the bootstrap weights $$w_{i}^{*}(b)=w_{i}\frac{r_{i}n}{n-1},$$ where ri is the number of times that iyh unit is selected by the bootstrap procedure. A variance estimator of the Gini index based on the rescaled bootstrap is defined as: $$\widehat{V}_{B}(\widehat{G}_{w})= \displaystyle \frac{1}{B-1}\sum_{b=1}^{B}\left(\widehat{G}^{*}_{w}(b) - \overline{G}^{*}_{w} \right)^2,$$ where $$\overline{G}^{*}_{w}=\frac{1}{B}\sum_{b=1}^{B}\widehat{G}^{*}_{w}(b).$$


# Gini index estimation and confidence interval using 'pbootstrap'.

fgini(y, w, interval = "pbootstrap")
#> $Gini
#> [1] 0.3205489
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.2935952 0.3453333
#> 
#> $Variance
#> [1] 0.0001664895

Jackknife

The "zjackknife" method computes the variance of the Gini index using the jackknife technique. For a given estimator w, the pseudo-values for variance estimation are defined as (see Berger, 2008): $$z_{i}=\displaystyle \frac{1}{w_{i}}\left(1-\frac{w_{i}}{\widehat{N}}\right)\left(\widehat{G}_{w} - \widehat{G}_{w;-i}\right),$$
where w; −i denotes the estimator w computed from S \ {i}, i.e., from the sample S after removing the ith unit. For a confidence level 1 − α, the "zjackknife" confidence interval is defined as $$\left[\widehat{G}_{w} - Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_{w})}, \widehat{G}_{w} + Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_{w})} \right],$$ where the variance (w) is computed using the pseudo-values zi and any of the aforementioned type variance estimators (Horvitz-Thompson; Sen-Yates-Grundy; or Harley-Rao).


# Gini index estimation and confidence interval using 'zjackknife'.

fgini(y, w, interval = "zjackknife")
#> $Gini
#> [1] 0.3205489
#> 
#> $Interval
#>          lower    upper
#> [1,] 0.2945728 0.346525
#> 
#> $Variance
#> [1] 0.0001756514

Linearization

The linearization technique for variance estimation (Deville, 1999) has been applied to the following estimators of the Gini index: $$\widehat{G}_{w}^{a}= \displaystyle \frac{1}{2\widehat{N}^{2}\overline{y}_{w}}\sum_{i \in S}\sum_{j \in S}w_{i}w_{j}|y_{i}-y_{j}|,$$

and $$\widehat{G}_{w}^{b} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}}\sum_{i \in S}w_{i}y_{i}\widehat{F}_{w}(y_{i})-1, $$

where $$\widehat{F}_{w}(t)=\frac{1}{\widehat{N}}\sum_{i \in S}w_i\delta(y_i \leq t)$$ For a given estimator w and a linearizated variable z, the confidence interval, with confidence level 1 − α, is defined as
$$\left[\widehat{G}_w - Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_w)}, \widehat{G}_w + Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_w)} \right],$$

where the variance (w) is computed using the corresponding pseudo-values and any of the aforementioned type variance estimators (Horvitz-Thompson; Sen-Yates-Grundy, or Harley-Rao).

On the one hand, interval = "zalinearization" linearizates the estimator wa, and the corresponding pseudo-values are defined as (Langel anf Tillé 2013):

$$z_{(i)}^{a}=\frac{1}{\widehat{N}^{2}\overline{y}_w}\left[ 2\widehat{N}_{(i)}\left( y_{(i)} - \widehat{\overline{Y}}_{(i)}\right) + \widehat{N}\left\{ \overline{y}_{w} - y_{(i)} - \widehat{G}_{w}^{a}\left(\overline{y}_{w} + y_{(i)} \right) \right\} \right],$$ where $$\widehat{\overline{Y}}_{(i)} = \displaystyle \frac{1}{\widehat{N}_{(i)}}\sum_{j=1}^{i}w_{(j)}^{+}y_{(j)}.$$

On the other hand, interval = "zblinearization" linearizates the estimator wb, and the corresponding pseudo values are (see Berger, 2008):

$$z_i^{b}=\frac{1}{\hat{N}\overline{y}_{w}}\left[ 2y_i\widehat{F}_{w}(y_i) - (\widehat{G}_{w}^{b}+1)(y_i+\overline{y}_{w})+\frac{2}{\hat{N}}\sum_{j \in S}w_jy_j\delta(y_j \geq y_i) \right],$$ where $$\widehat{F}_{w}(t) = \displaystyle \frac{1}{\widehat{N}}\sum_{i \in S}w_{i}\delta(y_i \leq t).$$


# Gini index estimation and confidence interval using:
 ## a: The method 2 for point estimation. 
 ## b: The method 'zalinearization' for variance estimation. 
 ## c: The Sen-Yates-Grundy type variance estimator. 
 ## d: The Hàjek approximation for the joint inclusion probabilities. 
fgini(y, w, interval = "zalinearization")
#> $Gini
#> [1] 0.3205489
#> 
#> $Interval
#>          lower    upper
#> [1,] 0.2946057 0.346492
#> 
#> $Variance
#> [1] 0.0001752056

# Gini index estimation and confidence interval using:
 ## a: The method 3 for point estimation. 
 ## b: The method 'zblinearization' for variance estimation. 
 ## c: The Sen-Yates-Grundy type variance estimator. 
 ## d: The Hàjek approximation for the joint inclusion probabilities. 
fgini(y, w, method = 3, interval = "zblinearization")
#> $Gini
#> [1] 0.3205489
#> 
#> $Interval
#>          lower     upper
#> [1,] 0.2944802 0.3466175
#> 
#> $Variance
#> [1] 0.0001769051

References

Alfons, A., and Templ, M. (2012). Estimation of social exclusion indicators from complex surveys: The R package laeken. KU Leuven, Faculty of Business and Economics Working Paper.

Berger, Y. G. (2005). Variance estimation with highly stratified sampling designs with unequal probabilities. Australian & New Zealand Journal of Statistics, 47, 365–373.

Berger, Y. G. (2008). A note on the asymptotic equivalence of jackknife and linearization variance estimation for the Gini Coefficient. Journal of Official Statistics, 24(4), 541-555.

Berger, Y. G. (2011). Asymptotic consistency under large entropy sampling designs with unequal probabilities. Pakistan Journal of Statistics, 27, 407–426.

Berger, Y. G. and Tille, Y. (2009). Sampling with unequal probabilities. In Sample Surveys: Design, Methods and Applications (eds. D. Pfeffermann and C. R. Rao), 39–54. Elsevier, Amsterdam

Berger, Y., and Gedik Balay, İ. (2020). Confidence intervals of Gini coefficient under unequal probability sampling. Journal of Official Statistics, 36(2), 237-249.

David, H.A. (1968). Gini’s mean difference rediscovered. Biometrika, 55, 573–575.

Davison, A. C., and Hinkley, D. V. (1997). Bootstrap Methods and Their Application (Cambridge Series in Statistical and Probabilistic Mathematics, No 1)–Cambridge University Press.

Deville, J.C. (1999). Variance Estimation for Complex Statistics and Estimators: Linearization and Residual Techniques. Survey Methodology, 25, 193–203.

Deltas, G. (2003). The small-sample bias of the Gini coefficient: results and implications for empirical research. Review of Economics and Statistics, 85(1), 226-234.

Giorgi, G. M., and Gigliarano, C. (2017). The Gini concentration index: a review of the inference literature. Journal of Economic Surveys, 31(4), 1130-1148.

Hàjek, J. (1964). Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35, 4, 1491–1523.

Hartley, H. O., and Rao, J. N. K. (1962). Sampling with unequal probabilities and without replacement. The Annals of Mathematical Statistics, 350-374.

Haziza, D., Mecatti, F. and Rao, J. N. K. (2008). Evaluation of some approximate variance estimators under the Rao-Sampford unequal probability sampling design. Metron, LXVI, 91–108.

Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.

Kendall, M., and Stuart, A. (1977). The advanced theory of statistics. Vol. 1: Distribution Theory. London: Griffin.

Langel, M., and Tillé, Y. (2013). Variance estimation of the Gini index: revisiting a result several times published. Journal of the Royal Statistical Society: Series A (Statistics in Society), 176(2), 521-540.

Lerman, R. I., and Yitzhaki, S. (1989). Improving the accuracy of estimates of Gini coefficients. Journal of econometrics, 42(1), 43-47.

Muñoz, J. F., Moya-Fernández, P. J., and Álvarez-Verdejo, E. (2023). Exploring and Correcting the Bias in the Estimation of the Gini Measure of Inequality. Sociological Methods & Research. https://doi.org/10.1177/00491241231176847

Ogwang, T. (2000). A convenient method of computing the Gini index and its standard error. Oxford Bulletin of Economics and Statistics, 62(1), 123-123.

Owen, A. B. (2001). Empirical likelihood. CRC press.

Qin, Y., Rao, J. N. K., and Wu, C. (2010). Empirical likelihood confidence intervals for the Gini measure of income inequality. Economic Modelling, 27(6), 1429-1435.

Rao, J. N. K., Wu, C. F. J., and Yue, K. (1992). Some recent work on resampling methods for complex surveys. Survey methodology, 18(2), 209-217.

Rust, K. F., and Rao, J. N. K. (1996). Variance estimation for complex surveys using replication techniques. Statistical methods in medical research, 5(3), 283-310.

Sen, A. R. (1953). On the estimate of the variance in sampling with varying probabilities. Journal of the Indian Society of Agricultural Statistics, 5, 119–127.

Tillé, Y. (2006). Sampling Algorithms. Springer, New York.

Yates, F., and Grundy, P. M. (1953). Selection without replacement from within strata with probability proportional to size. Journal of the Royal Statistical Society B, 15, 253–261.