The aim of this vignette is to describe the various methods for estimating the Gini index, for both infinite and finite populations, as well as the methods for estimating its variance, as implemented in the giniVarCI package. Different confidence intervals for the Gini index are also explained.
To exemplify the use of the different functions, we assume that inequality is measured for a nonnegative and continuous random variable Y. A popular formulation of the Gini index (G) is defined by (see David, 1968; Kendall and Stuart, 1977; Qin et al., 2010): $$ G = \frac{1}{2 \mu_{Y}} \int_{0}^{+\infty} \int_{0}^{+\infty} |x-y|dF_{Y}(x)dF_{Y}(y), $$ where μY = E[Y] = ∫0+∞yf(y)dy = ∫0+∞ydFY(y), is the mean of Y, and FY(y) = P(Y ≤ y) and f(y) are the cumulative distribution function and the probability density function of Y, respectively.
In practice, the value of G is estimated by means of a sample S with size n, which can be selected from either infinite or finite populations (Berger and Gedik Balay, 2020; Muñoz et al., 2023).
For infinite populations, {Yi : i ∈ S}
denotes a sequence, with size n, of nonnegative random variables
with the same distribution as the variable of interest Y. The Gini index (G) is estimated using the
observation of individuals selected in the sample, which are denoted as
{yi : i ∈ S}.
A popular estimator of the Gini index is (see Langel and Tille, 2013;
Giorgi and Gigliarano, 2017; Muñoz et al., 2023): $$\widehat{G} = \displaystyle
\frac{2}{\overline{y}n^2}\sum_{i \in S}iy_{(i)} - \frac{n+1}{n},
$$ where $\overline{y}=n^{-1}\sum_{i=1}^{n}y_i$, and
y(i) are
the ordered values (in non-decreasing order) of the sample observations
yi. This
is the expression computed by the functions iginindex()
(method = 5
) and igini() when
bias.correction = FALSE
.
The estimator Ĝ can be
biased for small sample sizes (Deltas, 2003). The bias corrected
(bc) version of Ĝ is:
$$\widehat{G}^{bc} = \displaystyle
\frac{2}{\overline{y}n(n-1)}\sum_{i \in S}iy_{(i)} -
\frac{n+1}{n-1},$$ which corresponds to the Gini index bias
correction version computed by iginindex()
(method = 5
) and igini() when
bias.correction = TRUE
.
In the first example, a sample with size n=100
is
generated using the gsample() function from the standard
logNormal distribution (distribution = "lognormal"
) with
true Gini index is G = 0.5
(gini = 0.5
) and the Gini index is estimated using bias
correction.
library(giniVarCI)
set.seed(123)
y <- gsample(n = 100, gini = 0.5, distribution = "lognormal")
igini(y)
#> [1] 0.4671929
iginindex() can be used to estimate the Gini index using
various estimation methods and both R and
C++ codes. See help(iginindex)
for a
detailed description of the various estimation methods. Efficiency
comparisons between both implementations and with other functions
available in other packages, such as laeken,
DescTools, ineq or
REAT, can be made using, for instance, the function
microbenchmark():
#Comparing the computation time for the various estimation methods using R
microbenchmark::microbenchmark(
iginindex(y, method = 1, useRcpp = FALSE),
iginindex(y, method = 2, useRcpp = FALSE),
iginindex(y, method = 3, useRcpp = FALSE),
iginindex(y, method = 4, useRcpp = FALSE),
iginindex(y, method = 5, useRcpp = FALSE),
iginindex(y, method = 6, useRcpp = FALSE),
iginindex(y, method = 7, useRcpp = FALSE),
iginindex(y, method = 8, useRcpp = FALSE),
iginindex(y, method = 9, useRcpp = FALSE),
iginindex(y, method = 10, useRcpp = FALSE)
)
#> Unit: microseconds
#> expr min lq mean
#> iginindex(y, method = 1, useRcpp = FALSE) 147.905 153.7470 163.08382
#> iginindex(y, method = 2, useRcpp = FALSE) 13.074 16.0850 19.48625
#> iginindex(y, method = 3, useRcpp = FALSE) 11.101 13.1000 16.11174
#> iginindex(y, method = 4, useRcpp = FALSE) 14.507 18.3290 21.78797
#> iginindex(y, method = 5, useRcpp = FALSE) 13.716 16.5810 19.65793
#> iginindex(y, method = 6, useRcpp = FALSE) 28.142 37.2500 50.39609
#> iginindex(y, method = 7, useRcpp = FALSE) 742.776 765.5785 785.38177
#> iginindex(y, method = 8, useRcpp = FALSE) 709.183 736.2335 884.36971
#> iginindex(y, method = 9, useRcpp = FALSE) 527.064 550.8425 594.65901
#> iginindex(y, method = 10, useRcpp = FALSE) 8968.572 9101.4355 9931.17492
#> median uq max neval
#> 159.3675 168.6795 211.725 100
#> 17.7685 20.2230 65.752 100
#> 14.6070 16.4105 53.280 100
#> 20.2780 22.9985 69.890 100
#> 18.4595 20.9840 64.010 100
#> 44.9040 53.2295 484.073 100
#> 779.5295 800.9295 872.808 100
#> 752.5190 777.7410 3381.810 100
#> 562.9505 583.7340 3142.804 100
#> 9239.8735 10827.6005 14842.992 100
# Comparing the computation time for the various estimation methods using Rcpp
microbenchmark::microbenchmark(
iginindex(y, method = 1),
iginindex(y, method = 2),
iginindex(y, method = 3),
iginindex(y, method = 4),
iginindex(y, method = 5),
iginindex(y, method = 6),
iginindex(y, method = 7),
iginindex(y, method = 8),
iginindex(y, method = 9),
iginindex(y, method = 10) )
#> Unit: microseconds
#> expr min lq mean median uq
#> iginindex(y, method = 1) 45.014 45.7005 46.98846 46.1410 46.9275
#> iginindex(y, method = 2) 9.177 10.3090 12.02072 10.7755 12.0580
#> iginindex(y, method = 3) 9.177 10.0135 11.36852 10.5500 11.3960
#> iginindex(y, method = 4) 9.548 10.3845 11.57021 10.8255 11.5970
#> iginindex(y, method = 5) 8.195 8.7970 9.58509 9.1020 9.5675
#> iginindex(y, method = 6) 8.326 8.7565 10.00369 9.2070 10.0185
#> iginindex(y, method = 7) 76.202 76.9590 78.56232 77.7000 78.5520
#> iginindex(y, method = 8) 44.764 45.5050 47.28924 45.8950 46.8725
#> iginindex(y, method = 9) 39.153 39.8945 41.39766 40.3900 41.7580
#> iginindex(y, method = 10) 8887.872 9000.2510 9594.67304 9111.9850 10338.4330
#> max neval
#> 57.226 100
#> 67.616 100
#> 22.372 100
#> 22.021 100
#> 22.131 100
#> 22.883 100
#> 98.494 100
#> 67.927 100
#> 51.676 100
#> 11997.664 100
# Comparing the computation time for estimates of the Gini index in various R packages.
microbenchmark::microbenchmark(
igini(y),
laeken::gini(y),
DescTools::Gini(y),
ineq::Gini(y),
REAT::gini(y))
#> Registered S3 methods overwritten by 'DescTools':
#> method from
#> lines.Lc ineq
#> plot.Lc ineq
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> igini(y) 8.746 11.8320 13.42974 12.9095 13.8955 41.017 100
#> laeken::gini(y) 31.809 35.9420 151.47034 37.4700 39.1880 11368.220 100
#> DescTools::Gini(y) 52.147 55.3885 6453.94998 56.9060 59.0450 597008.995 100
#> ineq::Gini(y) 39.273 43.8370 71.51889 44.9535 46.5415 2552.803 100
#> REAT::gini(y) 86.332 90.0880 111.39978 91.8865 95.0120 1876.992 100
Variance estimators and confidence intervals are described using different methods for the estimator of the non-bias corrected version of Gini index Ĝ, since as $$\widehat{G}^{bc} = \frac{n}{n-1}\widehat{G},$$ the variance estimators and confidence intervals based on Ĝbc can be straightforwardly derived. In particular, $$\widehat{V}(\widehat{G}^{bc})=\frac{n^2}{(n-1)^2}\widehat{V}(\widehat{G}).$$ Let [L, U] the lower and upper limits of a confidence interval for G based on Ĝ. The confidence interval based on Ĝbc can be computed as: $$ \left[ \frac{n}{n-1}L, \frac{n}{n-1}U\right].$$
The argument interval = pbootstrap
in the function
igini() returns the confidence interval for the Gini index
using the percentile bootstrap method. Let {y1*(b), …, yn*(b)}
be the bth bootstrap sample
taken from the original sample {y1, …, yn}
by simple random sampling with replacement, and Ĝ*(b) denotes
the estimator Ĝ computed from
the bth bootstrap sample, with
b = {1, …, B}, being
B the total number of
bootstrap samples. For a confidence level 1 − α, the percentile bootstrap
confidence interval is defined as (see Qin et al., 2010): [Ĝ(α/2)*, Ĝ(1 − α/2)*],
where Ĝ(a)*
is the ath quantile of the
bootstrapped coefficients Ĝ*(b). A
variance estimator of the Gini index based on bootstrap is defined as
$$\widehat{V}_{B}(\widehat{G})= \displaystyle
\frac{1}{B-1}\sum_{b=1}^{B}\left(\widehat{G}^{*}(b) - \overline{G}^{*}
\right)^2,$$ where $$\overline{G}^{*}=\frac{1}{B}\sum_{b=1}^{B}\widehat{G}^{*}(b).$$
# Gini index estimation and confidence interval using 'pbootstrap',
igini(y, interval = "pbootstrap")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4004204 0.5135315
#>
#> $Variance
#> [1] 0.0008333577
interval = "BCa"
computes the bias-corrected and
accelerated bootstrap interval (Davison and Hinkley, 1997). The idea of
this confidence interval is to correct for bias due to the skewness in
the distribution of bootstrap estimates. The "BCa"
confidence interval is defined as: [Ĝ(α1)*, Ĝ(α2)*],
where $$\alpha_{1}=\phi\left( \widehat{Z}_{0}
+ \frac{\widehat{Z}_{0} + Z_{\alpha}}{1-\widehat{a} (\widehat{Z}_{0} +
\widehat{Z}_{\alpha}) } \right),$$
$$\alpha_{2}=\phi\left( \widehat{Z}_{0} + \frac{\widehat{Z}_{0} + Z_{1-\alpha}}{1-\widehat{a} (\widehat{Z}_{0} + \widehat{Z}_{1-\alpha}) } \right),$$ ϕ(⋅) is the cumulative distribution function of the standard Normal distribution, and Za is the ath quantile of the standard Normal distribution. The bias-correction factor is defined as $$\widehat{Z}_{0}=\phi^{-1}\left(\#\frac{\widehat{G}^{*}(b) - \widehat{G}}{B}\right),$$ and the acceleration factor is given by $$\widehat{a}=\frac{\sum_{i \in S}\left(\overline{G} -\widehat{G}_{-i} \right)^3}{6\left\{\sum_{i \in S}\left(\overline{G} -\widehat{G}_{-i} \right)^2\right\}^{3/2}},$$ where Ĝ−i are the jackknife estimates defined in the following section, and $$\overline{G} = \frac{1}{n}\sum_{i \in S}\widehat{G}_{-i}.$$
The "zjackknife"
and "tjackknife"
methods
compute the variance of the Gini index using the Ogwang Jackknife
procedure (Ogwang, 2000; Langel and Tille, 2013). This variance si given
by $$\widehat{V}_{J}(\widehat{G})=
\displaystyle \frac{n-1}{n}\sum_{i \in S}\left(\widehat{G}_{-i}-
\overline{G} \right)^2,$$ where $$\widehat{G}_{-i}=\widehat{G}+\frac{2}{\sum_{j \in
S}y_j - y_{(i)} }\left[ \frac{y_{(i)} \sum_{j \in S}jy_{(j)}}{n\sum_{j
\in S}y_j}+\frac{\sum_{j \in S}jy_{(j)}}{n(n-1)} - \frac{\sum_{j \in
S}y_j-\sum_{j=1}^{i}y_{(j)} +iy_{(i)}
}{n-1} \right]-\frac{1}{n(n-1)}, $$ with i = {1, …, n} being the
jackknife estimates, i.e., Ĝ−i is the
estimation of the Gini index when the unit i is removed from the sample. For a
confidence level 1 − α, the
"zjackknife"
confidence interval is defined as $$\left[\widehat{G} -
Z_{1-\alpha/2}\sqrt{\widehat{V}_{J}(\widehat{G})}, \widehat{G} +
Z_{1-\alpha/2}\sqrt{\widehat{V}_{J}(\widehat{G})} \right],$$
where Z1 − α/2 is the
(1 − α/2)th quantile of the
standard Normal distribution.
# Gini index estimation and confidence interval using 'zjackknife'.
igini(y, interval = "zjackknife")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4103563 0.5240296
#>
#> $Variance
#> [1] 0.0008409313
"tjackknife"
sustitutes the critical value Z1 − α/2 by
critical values computed from the studentized bootstrap. This confidence
interval is given by
$$\left[\widehat{G} -
t_{J;1-\alpha/2}^{*}\sqrt{\widehat{V}_{J}(\widehat{G})}, \widehat{G} -
t_{J;\alpha/2}^{*}\sqrt{\widehat{V}_{J}(\widehat{G})} \right],$$
where tJ; a*
is the ath quantile of the
values $$
t^{*}_{J}(b)=\frac{\widehat{G}^{*}(b) -
\widehat{G}}{\sqrt{\widehat{V}_{J}\left[\widehat{G}^{*}(b)\right]}}$$
computed using the bootstrap technique, where V̂J[Ĝ*(b)]
is the estimated Ogwang Jackknife variance of Ĝ*(b) for the
bth bootstrap sample.
The linearization technique for variance estimation (Deville, 1999)
has been applied to the following estimators of the Gini index: $$\widehat{G}^{a} = \displaystyle
\frac{1}{2\overline{y}n^{2}}\sum_{i \in S}\sum_{j\in S}
|y_i-y_j|$$ and $$\widehat{G}^{b} = \displaystyle
\frac{2}{\overline{y}n}\sum_{i \in S}y_{i}\widehat{F}_{n}(y_{i}) -
1,$$ where $$\widehat{F}_{n}(y_i)=\frac{1}{n}\sum_{j \in
S}\delta(y_j \leq y_i)$$ and δ(⋅) is the indicator variable that
takes the value 1 when its argument is true and 0 otherwise. For a given
estimator Ĝ and a linearizated
variable z, the confidence
interval, with confidence level 1 − α, is defined as:
$$\left[\widehat{G} -
Z_{1-\alpha/2}\sqrt{\widehat{V}_{L}(\widehat{G})}, \widehat{G} +
Z_{1-\alpha/2}\sqrt{\widehat{V}_{L}(\widehat{G})} \right],$$
where the variance estimator of the Gini index is given by: $$\widehat{V}_{L}(\widehat{G})= \displaystyle \frac{1}{n(n-1)}\sum_{i \in S}\left(z_{i} - \overline{z}\right)^2,$$ and $$\overline{z}=\frac{1}{n}\sum_{i \in S}z_{i}.$$
On the one hand, interval = "zalinearization"
linearizates the estimator Ĝa, and the
corresponding pseudo-values are (see Langel anf Tillé 2013):
$$z_{(i)}^{a}=\frac{1}{\overline{y}}\left[ \frac{2i}{n}\left( y_{(i)} - \widehat{\overline{Y}}_{(i)}\right) + \overline{y} - y_{(i)} - \widehat{G}^{a}\left(\overline{y} + y_{(i)}\right) \right],$$ where $$\widehat{\overline{Y}}_{(i)} = \displaystyle \frac{1}{i}\sum_{j = 1}^{i}y_{(j)}.$$
On the other hand, interval = "zblinearization"
linearizates the estimator Ĝb, and the
corresponding pseudo values are (see Berger, 2008):
$$z_i^{b}=\frac{1}{\overline{y}}\left[ 2y_i\widehat{F}_{n}(y_i) - (\widehat{G}^{b}+1)(y_i+\overline{y})+2\frac{\sum_{j \in S}y_j\delta(y_j \geq y_i)}{n} \right].$$
# Gini index estimation and confidence interval using 'zalinearization'.
igini(y, interval = "zalinearization")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4125876 0.5217982
#>
#> $Variance
#> [1] 0.0007762
# Gini index estimation and confidence interval using 'zblinearization'.
igini(y, interval = "zblinearization")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4107537 0.5236321
#>
#> $Variance
#> [1] 0.0008292117
Intervals "talinearization"
and
"tblinearization"
substitute the critical value Z1 − α/2 by
critical values computed from the Studentized bootstrap. This confidence
interval is given by
$$\left[\widehat{G} -
t_{L;1-\alpha/2}^{*}\sqrt{\widehat{V}_{L}(\widehat{G})}, \widehat{G} -
t_{L;\alpha/2}^{*}\sqrt{\widehat{V}_{L}(\widehat{G})} \right],$$
where tL; a*
is the ath quantile of the
values $$
t^{*}_{L}(b)=\frac{\widehat{G}^{*}(b) -
\widehat{G}}{\sqrt{\widehat{V}_{L}\left[\widehat{G}^{*}(b)\right]}}.$$
V̂L(⋅) is
computed using the pseudo-values z(i)a
when interval = "zalinearization"
, and using the
pseudo-values zib
when interval = "zblinearization"
.
# Gini index estimation and confidence interval using 'talinearization'.
igini(y, interval = "talinearization")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4195142 0.5253662
#>
#> $Variance
#> [1] 0.0007762
# Gini index estimation and confidence interval using 'tblinearization'.
igini(y, interval = "tblinearization")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4224278 0.5329734
#>
#> $Variance
#> [1] 0.0008292117
Intervals "ELchisq"
and "ELboot"
compute
the empirical likelihood (EL) method, a nonparametric
technique that provides desirable inferences under skewed distributions.
The shape of the EL
confidence intervals are determined by the data-driven likelihood ratio
function (Owen, 2001). interval = "ELchisq"
obtains the
EL confidence
interval, with confidence level 1 − α, for the Gini index as defined
by Qin et al. (2010): $$\left\{
\theta|-2R(\theta) \leq
\frac{\chi^2_{1;1-\alpha}}{k}\right\}$$
where R(θ) = −∑i ∈ Slog{1 + λZ(yi, θ)} is the log-EL ratio statistic for θ = G, Z(yi, θ) = {2F̂n(yi) − 1}yi − θyi, λ is the solution to $$ \frac{1}{n}\sum_{i \in S}\frac{Z(y_i,\theta)}{1+Z(y_i,\theta)}=0,$$ k = σ̂22/σ̂12 is the scaling factor, $$ \widehat{\sigma}_{j}^{2}=\frac{1}{n-1}\sum_{i \in S}\left(u_{ji} - \overline{u}_{j} \right)^2,$$ with j = {1, 2}, $$ \overline{u}_{j} = \frac{1}{n}\sum_{i \in S}u_{ji},$$ and χ1; 1 − α2 is the (1 − α)th quantile of Chi-Squared distribution with one degree of freedom.
# Gini index estimation and confidence interval using 'ELchisq'.
igini(y, interval = "ELchisq")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4216374 0.5319404
#>
#> $Variance
#> [1] 0.0008292117
interval = "ELboot"
substitutes the critical value based
on the Chi-Squared distribution by an empirical critical value based on
bootstrap. "ELboot"
computes the EL confidence interval (Qin
et al., 2010): {θ|−2R(θ) ≤ C1 − α},
where C1 − α is the
(1 − α)th quantile of the
values {−R1*(Ĝ), …, −RB*(Ĝ)},
and where Rb*(Ĝ)
denotes the value of R(θ) computed from the
bth bootstrap sample.
# Gini index estimation and confidence interval using 'ELboot'.
igini(y, interval = "ELboot")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4118394 0.5413343
#>
#> $Variance
#> [1] 0.0008343201
The function icompareCI() compares the various confidence
intervals for the scenario of a sample derived from an infinite
population. The argument plotCI = TRUE
plots the results
derived from the various available methods for constructing confidence
intervals.
# Comparisons of variance estimators and confidence intervals.
icompareCI(y, plotCI = FALSE)
#> interval bc gini lowerlimit upperlimit var.gini
#> 1 zjackknife FALSE 0.46 0.41 0.52 8e-04
#> 2 zjackknife TRUE 0.47 0.41 0.52 8e-04
#> 3 tjackknife FALSE 0.46 0.41 0.53 8e-04
#> 4 tjackknife TRUE 0.47 0.42 0.54 8e-04
#> 5 zalinearization FALSE 0.46 0.41 0.52 8e-04
#> 6 zalinearization TRUE 0.47 0.41 0.52 8e-04
#> 7 talinearization FALSE 0.46 0.41 0.52 8e-04
#> 8 talinearization TRUE 0.47 0.42 0.52 8e-04
#> 9 zblinearization FALSE 0.46 0.41 0.52 8e-04
#> 10 zblinearization TRUE 0.47 0.41 0.52 8e-04
#> 11 tblinearization FALSE 0.46 0.42 0.53 8e-04
#> 12 tblinearization TRUE 0.47 0.42 0.53 8e-04
#> 13 pbootstrap FALSE 0.46 0.40 0.51 8e-04
#> 14 pbootstrap TRUE 0.47 0.40 0.51 8e-04
#> 15 BCa FALSE 0.46 0.42 0.53 7e-04
#> 16 BCa TRUE 0.47 0.42 0.53 8e-04
#> 17 ELchisq FALSE 0.46 0.42 0.53 8e-04
#> 18 ELchisq TRUE 0.47 0.42 0.53 8e-04
#> 19 ELboot FALSE 0.46 0.41 0.53 7e-04
#> 20 ELboot TRUE 0.47 0.42 0.54 7e-04
For a finite population U, {Yi : i ∈ U} denotes a sequence, with size N, of nonnegative random variables with the same distribution as the variable of interest Y, and {yi : i ∈ U} are the population values of the variable of interest. A sample S is selected from U by using a sampling design with survey weights wi, with i ∈ S. For example, the survey weights can be wi = πi−1, where πi = P(i ∈ S) are the inclusion probabilities (Muñoz et al., 2023). The Gini index (G) is estimated using the observations of individuals selected in the sample {yi : i ∈ S}, and the corresponding survey weights {wi : i ∈ S}. The different methods for estimating the Gini index are (see also Muñoz et al., 2023):
method = 1
(Langel and Tillé, 2013).$$\widehat{G}_{w1}= \displaystyle \frac{1}{2\widehat{N}^{2}\overline{y}_{w}}\sum_{i \in S}\sum_{j \in S}w_{i}w_{j}|y_{i}-y_{j}|,$$ where N̂ = ∑i ∈ Swi and $$\overline{y}_{w}= \frac{1}{\widehat{N}}\sum_{i \in S}w_{i}y_{i}.$$
method = 2
(Alfons and Templ, 2012; Langel and Tillé,
2013).$$\widehat{G}_{w2} =\displaystyle \frac{2\sum_{i \in S}w_{(i)}^{+}\widehat{N}_{(i)}y_{(i)} -\sum_{i \in S}w_{i}^{2}y_{i} }{\widehat{N}^{2}\overline{y}_{w}}-1,$$ where y(i) are the values yi sorted in increasing order, w(i)+ are the values wi sorted according to the increasing order of the values yi, and $\widehat{N}_{(i)}=\sum_{j=1}^{i}w_{(j)}^{+}$. Note that Langel and Tillé (2013) show that Ĝw1 = Ĝw2.
method = 3
(Berger, 2008).$$\widehat{G}_{w3} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}}\sum_{i \in S}w_{i}y_{i}\widehat{F}_{w}^{\ast}(y_{i})-1, $$ where $$\widehat{F}_{w}^{\ast}(t) = \displaystyle \frac{1}{\widehat{N}}\sum_{i \in S}w_{i}[\delta(y_i < t) + 0.5\delta(y_i = t)]$$ is the smooth (mid-point) distribution function.
method = 4
(Berger and Gedik-Balay, 2020).$$\widehat{G}_{w4} = 1 - \displaystyle \frac{\overline{v}_{w}}{\overline{y}_{w}},$$ where $\overline{v}_{w}=\widehat{N}^{-1}\sum_{i \in S}w_{i}v_{i}$ and $$v_{i} = \displaystyle \frac{1}{\widehat{N} - w_{i}}\sum_{ \substack{j \in S\\ j\neq i}}\min(y_{i},y_{j}).$$
method = 5
(Lerman and Yitzhaki, 1989).$$\widehat{G}_{w5} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}} \sum_{i \in S} w_{(i)}^{+}[y_{(i)} - \overline{y}_{w}]\left[ \widehat{F}_{w}^{LY}(y_{(i)}) - \overline{F}_{w}^{LY} \right], $$ where $$\widehat{F}_{w}^{LY}(y_{(i)}) = \displaystyle \frac{1}{\widehat{N}}\left(\widehat{N}_{(i-1)} + \frac{w_{(i)}^{+}}{2} \right)$$ and $$\overline{F}_{w}^{LY}=\frac{1}{\widehat{N}}\sum_{i \in S}w_{(i)}^{+}\widehat{F}_{w}^{LY}(y_{(i)}).$$
In the finite population example, income and weights from the 2006
Austrian EU-SILC data set (laeken package) are used to
estimate the Gini index in the Austrian region of Burgenland. The Gini
index is estimated using fgini() and method = 2
(the default method).
data(eusilc, package="laeken")
y <- eusilc$eqIncome[eusilc$db040 == "Burgenland"]
w <- eusilc$rb050[eusilc$db040 == "Burgenland"]
fgini(y, w)
#> [1] 0.3205489
fginindex() can be used to estimate the Gini index using various estimation methods and both R and C++ codes. Efficiency comparisons between both implementations and with other functions available in other packages, such as laeken, DescTools, ineq or REAT, can be made using, for example, the function microbenchmark():
#Comparing the computation time for the various estimation methods and using R
microbenchmark::microbenchmark(
fginindex(y, w, method = 1, useRcpp = FALSE),
fginindex(y, w, method = 2, useRcpp = FALSE),
fginindex(y, w, method = 3, useRcpp = FALSE),
fginindex(y, w, method = 4, useRcpp = FALSE),
fginindex(y, w, method = 5, useRcpp = FALSE)
)
#> Unit: microseconds
#> expr min lq mean
#> fginindex(y, w, method = 1, useRcpp = FALSE) 1810.628 1845.0520 1961.34120
#> fginindex(y, w, method = 2, useRcpp = FALSE) 50.684 56.7860 64.37533
#> fginindex(y, w, method = 3, useRcpp = FALSE) 3269.090 3315.3655 4081.59382
#> fginindex(y, w, method = 4, useRcpp = FALSE) 6974.813 7110.8965 8061.91984
#> fginindex(y, w, method = 5, useRcpp = FALSE) 62.296 70.9275 82.86917
#> median uq max neval
#> 1854.9155 1873.5055 4575.116 100
#> 61.8950 70.2205 109.835 100
#> 3346.0880 5803.8480 6745.895 100
#> 7194.2875 9733.1090 10871.743 100
#> 81.3215 91.5760 157.805 100
# Comparing the computation time for the various estimation methods and using Rcpp
microbenchmark::microbenchmark(
fginindex(y, w, method = 1),
fginindex(y, w, method = 2),
fginindex(y, w, method = 3),
fginindex(y, w, method = 4),
fginindex(y, w, method = 5)
)
#> Unit: microseconds
#> expr min lq mean median uq
#> fginindex(y, w, method = 1) 1127.804 1129.0615 1135.04679 1130.2740 1137.4675
#> fginindex(y, w, method = 2) 44.994 46.4770 49.81497 47.4835 49.8185
#> fginindex(y, w, method = 3) 1132.243 1133.8260 1146.05683 1136.8315 1145.3770
#> fginindex(y, w, method = 4) 1140.959 1142.8775 1150.87898 1146.2235 1152.3155
#> fginindex(y, w, method = 5) 48.691 50.3135 53.26006 52.0875 54.0755
#> max neval
#> 1318.741 100
#> 144.820 100
#> 1356.150 100
#> 1332.215 100
#> 90.078 100
# Comparing the computation time for estimates of the Gini index in various R packages.
# Comparing 'method = 2', used also by the laeken package.
microbenchmark::microbenchmark(
fgini(y,w),
laeken::gini(y,w)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> fgini(y, w) 32.540 33.7335 36.18876 34.8805 35.797 98.344 100
#> laeken::gini(y, w) 47.529 49.2870 52.56107 51.2405 52.618 121.406 100
# Comparing 'method = 5', used also by the DescTools and REAT packages.
microbenchmark::microbenchmark(
fgini(y,w, method = 5),
DescTools::Gini(y,w),
REAT::gini(y, weighting = w)
)
#> Unit: microseconds
#> expr min lq mean median uq
#> fgini(y, w, method = 5) 36.889 40.0650 43.63459 42.5540 44.1825
#> DescTools::Gini(y, w) 80.901 85.4745 89.55896 87.8890 90.5390
#> REAT::gini(y, weighting = w) 187.139 190.9460 196.29949 193.8865 196.8420
#> max neval
#> 98.103 100
#> 137.016 100
#> 287.226 100
Jackknife and linearization tecniques compute pseudo-values
(named as zi, with i ∈ S) that require the use
of an expression for the variance estimation. The function
fgini() can compute the following type variance estimators
using the argument varformula
:
"HT"
) type variance estimator
(Hortvitz and Thompson, 1952).V̂HT(Ĝw) = ∑i ∈ S∑j ∈ SΔ̆ijwiwjzizj,
which is computed when varformula = "HT"
, where $$\breve{\Delta}_{ij}=\displaystyle
\frac{\pi_{ij}-\pi_{i}\pi_{j}}{\pi_{ij}}.$$
"SYG"
) type variance estimator
(Sen, 1953; Yates and Grundy, 1953).$$\widehat{V}_{SYG}(\widehat{G}_{w}) = -
\displaystyle \frac{1}{2}\sum_{i\in S}\sum_{j\in
S}\breve{\Delta}_{ij}(w_{i}z_i-w_{j}z_{j})^{2},$$ which is
computed when varformula = "SYG"
.
"HR"
) type variance estimator (Hartley
and Rao, 1962).$$\widehat{V}_{HR}(\widehat{G}_{w}) =
\displaystyle \frac{1}{n-1}\sum_{i\in S}\sum_{\substack{j \in S\\ j <
i}}\left(1-\pi_i-\pi_j + \frac{1}{n}\sum_{k\in U}\pi_{k}^{2}
\right)(w_{i}z_i-w_{j}z_{j})^{2},$$ which is computed when
varformula = "HR"
.
Note that the "HT"
variance estimator may give negative
values, and the "SYG"
variance estimator is suitable for
fixed-size sampling designs. This implies that "SYG"
should
not be used under Poisson sampling. Fortunately, "HT"
always give positive values under this sampling design. We observe that
both Horvitz-Thompson and Sen-Yates-Grundy variance estimators depend on
second (joint) inclusion probabilities (argument Pij
). The
Hàjek (1964) approximation $$\pi_{ij}\cong
\pi_{i}\pi_{j}\left[1- \displaystyle
\frac{(1-\pi_{i})(1-\pi_{j})}{\sum_{i \in S}(1-\pi_{i})}
\right]$$ is used when the second (joint) inclusion probabilities
are not available (Pij = NULL
). Note that the Hàjek
approximation is suggested for large-entropy sampling designs, large
samples, and large populations (see Tille 2006; Berger and Tille, 2009;
Haziza et al., 2008; Berger, 2011). For instance, this approximation is
not recomended for highly-stratified samples (Berger, 2005). The
Hartley-Rao variance estimator requires the first inclusion
probabilities at the population level (argument PiU
).
For complex sampling designs, the rescaled bootstrap (Rao el al.,
1992; Rust and Rao, 1996) can be used for variance estimation and
construction of confidence intervals.
interval = "pbootstrap"
returns the confidence interval for
the Gini index using the rescaled bootstrap with confidence limits
obtained by the percentile method. For a given estimator Ĝw and a
confidence level 1 − α, this
confidence interval is given by [Ĝw; α/2*, Ĝw; 1 − α/2*],
where Ĝw; a* is the ath quantile of the bootstrapped coefficients Ĝw*(b), with b = {1, …, B}, and which are obtained by using the expression Ĝw after substituting the original survey weights wi by the bootstrap weights $$w_{i}^{*}(b)=w_{i}\frac{r_{i}n}{n-1},$$ where ri is the number of times that iyh unit is selected by the bootstrap procedure. A variance estimator of the Gini index based on the rescaled bootstrap is defined as: $$\widehat{V}_{B}(\widehat{G}_{w})= \displaystyle \frac{1}{B-1}\sum_{b=1}^{B}\left(\widehat{G}^{*}_{w}(b) - \overline{G}^{*}_{w} \right)^2,$$ where $$\overline{G}^{*}_{w}=\frac{1}{B}\sum_{b=1}^{B}\widehat{G}^{*}_{w}(b).$$
The "zjackknife"
method computes the variance of the
Gini index using the jackknife technique. For a given estimator Ĝw, the
pseudo-values for variance estimation are defined as (see Berger, 2008):
$$z_{i}=\displaystyle
\frac{1}{w_{i}}\left(1-\frac{w_{i}}{\widehat{N}}\right)\left(\widehat{G}_{w}
- \widehat{G}_{w;-i}\right),$$
where Ĝw; −i
denotes the estimator Ĝw computed from
S \ {i}, i.e., from
the sample S after removing
the ith unit. For a confidence
level 1 − α, the
"zjackknife"
confidence interval is defined as $$\left[\widehat{G}_{w} -
Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_{w})}, \widehat{G}_{w} +
Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_{w})} \right],$$
where the variance V̂(Ĝw)
is computed using the pseudo-values zi and any of
the aforementioned type variance estimators (Horvitz-Thompson;
Sen-Yates-Grundy; or Harley-Rao).
The linearization technique for variance estimation (Deville, 1999) has been applied to the following estimators of the Gini index: $$\widehat{G}_{w}^{a}= \displaystyle \frac{1}{2\widehat{N}^{2}\overline{y}_{w}}\sum_{i \in S}\sum_{j \in S}w_{i}w_{j}|y_{i}-y_{j}|,$$
and $$\widehat{G}_{w}^{b} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}}\sum_{i \in S}w_{i}y_{i}\widehat{F}_{w}(y_{i})-1, $$
where $$\widehat{F}_{w}(t)=\frac{1}{\widehat{N}}\sum_{i
\in S}w_i\delta(y_i \leq t)$$ For a given estimator Ĝw and a
linearizated variable z, the
confidence interval, with confidence level 1 − α, is defined as
$$\left[\widehat{G}_w -
Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_w)}, \widehat{G}_w +
Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_w)} \right],$$
where the variance V̂(Ĝw) is computed using the corresponding pseudo-values and any of the aforementioned type variance estimators (Horvitz-Thompson; Sen-Yates-Grundy, or Harley-Rao).
On the one hand, interval = "zalinearization"
linearizates the estimator Ĝwa,
and the corresponding pseudo-values are defined as (Langel anf Tillé
2013):
$$z_{(i)}^{a}=\frac{1}{\widehat{N}^{2}\overline{y}_w}\left[ 2\widehat{N}_{(i)}\left( y_{(i)} - \widehat{\overline{Y}}_{(i)}\right) + \widehat{N}\left\{ \overline{y}_{w} - y_{(i)} - \widehat{G}_{w}^{a}\left(\overline{y}_{w} + y_{(i)} \right) \right\} \right],$$ where $$\widehat{\overline{Y}}_{(i)} = \displaystyle \frac{1}{\widehat{N}_{(i)}}\sum_{j=1}^{i}w_{(j)}^{+}y_{(j)}.$$
On the other hand, interval = "zblinearization"
linearizates the estimator Ĝwb,
and the corresponding pseudo values are (see Berger, 2008):
$$z_i^{b}=\frac{1}{\hat{N}\overline{y}_{w}}\left[ 2y_i\widehat{F}_{w}(y_i) - (\widehat{G}_{w}^{b}+1)(y_i+\overline{y}_{w})+\frac{2}{\hat{N}}\sum_{j \in S}w_jy_j\delta(y_j \geq y_i) \right],$$ where $$\widehat{F}_{w}(t) = \displaystyle \frac{1}{\widehat{N}}\sum_{i \in S}w_{i}\delta(y_i \leq t).$$
# Gini index estimation and confidence interval using:
## a: The method 2 for point estimation.
## b: The method 'zalinearization' for variance estimation.
## c: The Sen-Yates-Grundy type variance estimator.
## d: The Hàjek approximation for the joint inclusion probabilities.
fgini(y, w, interval = "zalinearization")
#> $Gini
#> [1] 0.3205489
#>
#> $Interval
#> lower upper
#> [1,] 0.2946057 0.346492
#>
#> $Variance
#> [1] 0.0001752056
# Gini index estimation and confidence interval using:
## a: The method 3 for point estimation.
## b: The method 'zblinearization' for variance estimation.
## c: The Sen-Yates-Grundy type variance estimator.
## d: The Hàjek approximation for the joint inclusion probabilities.
fgini(y, w, method = 3, interval = "zblinearization")
#> $Gini
#> [1] 0.3205489
#>
#> $Interval
#> lower upper
#> [1,] 0.2944802 0.3466175
#>
#> $Variance
#> [1] 0.0001769051
Alfons, A., and Templ, M. (2012). Estimation of social exclusion indicators from complex surveys: The R package laeken. KU Leuven, Faculty of Business and Economics Working Paper.
Berger, Y. G. (2005). Variance estimation with highly stratified sampling designs with unequal probabilities. Australian & New Zealand Journal of Statistics, 47, 365–373.
Berger, Y. G. (2008). A note on the asymptotic equivalence of jackknife and linearization variance estimation for the Gini Coefficient. Journal of Official Statistics, 24(4), 541-555.
Berger, Y. G. (2011). Asymptotic consistency under large entropy sampling designs with unequal probabilities. Pakistan Journal of Statistics, 27, 407–426.
Berger, Y. G. and Tille, Y. (2009). Sampling with unequal probabilities. In Sample Surveys: Design, Methods and Applications (eds. D. Pfeffermann and C. R. Rao), 39–54. Elsevier, Amsterdam
Berger, Y., and Gedik Balay, İ. (2020). Confidence intervals of Gini coefficient under unequal probability sampling. Journal of Official Statistics, 36(2), 237-249.
David, H.A. (1968). Gini’s mean difference rediscovered. Biometrika, 55, 573–575.
Davison, A. C., and Hinkley, D. V. (1997). Bootstrap Methods and Their Application (Cambridge Series in Statistical and Probabilistic Mathematics, No 1)–Cambridge University Press.
Deville, J.C. (1999). Variance Estimation for Complex Statistics and Estimators: Linearization and Residual Techniques. Survey Methodology, 25, 193–203.
Deltas, G. (2003). The small-sample bias of the Gini coefficient: results and implications for empirical research. Review of Economics and Statistics, 85(1), 226-234.
Giorgi, G. M., and Gigliarano, C. (2017). The Gini concentration index: a review of the inference literature. Journal of Economic Surveys, 31(4), 1130-1148.
Hàjek, J. (1964). Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35, 4, 1491–1523.
Hartley, H. O., and Rao, J. N. K. (1962). Sampling with unequal probabilities and without replacement. The Annals of Mathematical Statistics, 350-374.
Haziza, D., Mecatti, F. and Rao, J. N. K. (2008). Evaluation of some approximate variance estimators under the Rao-Sampford unequal probability sampling design. Metron, LXVI, 91–108.
Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.
Kendall, M., and Stuart, A. (1977). The advanced theory of statistics. Vol. 1: Distribution Theory. London: Griffin.
Langel, M., and Tillé, Y. (2013). Variance estimation of the Gini index: revisiting a result several times published. Journal of the Royal Statistical Society: Series A (Statistics in Society), 176(2), 521-540.
Lerman, R. I., and Yitzhaki, S. (1989). Improving the accuracy of estimates of Gini coefficients. Journal of econometrics, 42(1), 43-47.
Muñoz, J. F., Moya-Fernández, P. J., and Álvarez-Verdejo, E. (2023). Exploring and Correcting the Bias in the Estimation of the Gini Measure of Inequality. Sociological Methods & Research. https://doi.org/10.1177/00491241231176847
Ogwang, T. (2000). A convenient method of computing the Gini index and its standard error. Oxford Bulletin of Economics and Statistics, 62(1), 123-123.
Owen, A. B. (2001). Empirical likelihood. CRC press.
Qin, Y., Rao, J. N. K., and Wu, C. (2010). Empirical likelihood confidence intervals for the Gini measure of income inequality. Economic Modelling, 27(6), 1429-1435.
Rao, J. N. K., Wu, C. F. J., and Yue, K. (1992). Some recent work on resampling methods for complex surveys. Survey methodology, 18(2), 209-217.
Rust, K. F., and Rao, J. N. K. (1996). Variance estimation for complex surveys using replication techniques. Statistical methods in medical research, 5(3), 283-310.
Sen, A. R. (1953). On the estimate of the variance in sampling with varying probabilities. Journal of the Indian Society of Agricultural Statistics, 5, 119–127.
Tillé, Y. (2006). Sampling Algorithms. Springer, New York.
Yates, F., and Grundy, P. M. (1953). Selection without replacement from within strata with probability proportional to size. Journal of the Royal Statistical Society B, 15, 253–261.