Publications  Income Distribution






Our methodology is based on a sequence of ordinary least-squares regressions nested in the following general model:


$$Y_{it} = a + \sum_j b_j D_{jit} + g\,t + (d + j\,t)\,\mathrm{LIN}_{it} + (y + l\,t)\,\mathrm{LIN}_{it}^2 + \sum_s m_s X_{sit} + u_{it} \qquad (1)$$


where
$Y_{it}$ is a measure of income distribution, i.e. the Gini coefficient or the share of the poorest 40 percent of the population, for country i in year t;
$D_{jit}$ ($j = 1, 2, \ldots, 6$) are corrective dummy variables for differences in the definitions and coverage of the left-hand variables;
$t$ is time;
$\mathrm{LIN}_{it}$ is the logarithm of per capita GNP in 1964 U.S. dollars; and
$X_{sit}$ is the value of the s-th additional explanatory variable.



Two submodels for testing different groups of hypotheses can be derived from Equation (1).



(a) First, we shall estimate a submodel obtained from Equation (1) by setting g = j = l = 0 and choosing the country-specific dummy variables $CD_{iit}$ as the additional explanatory variables $X_{sit}$:

$$Y_{it} = a + \sum_j b_j D_{jit} + d\,\mathrm{LIN}_{it} + y\,\mathrm{LIN}_{it}^2 + \sum_s m_s CD_{iit} + u_{it} \qquad (2)$$


In this model the only genuine (i.e. non-dummy) variables are the "Kuznets Curve" variable LIN and its square, LIN². The coefficient d represents the slope of the Kuznets Curve at the $1 level of GNP per capita (where LIN = 0), while the coefficient y indicates the degree of curvature of the Kuznets Curve. The Kuznets hypothesis implies d > 0, y < 0 for the Gini coefficient (inverted U) and d < 0, y > 0 for the share of the poorest 40 percent (regular U curve). The smaller the absolute values of both d and y, the flatter the Kuznets Curve.
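The role of d and y can be illustrated with a small sketch. Because the curve is quadratic in LIN, its slope at any income level is d + 2·y·LIN and its turning point lies at LIN = -d / (2y). The coefficient values below are hypothetical and chosen only for illustration, and the conversion back to dollars assumes that LIN is a natural logarithm, which the text does not state.

```python
import math


def kuznets_slope(d, y, lin):
    """Slope dY/dLIN of the quadratic d*LIN + y*LIN**2 at log income `lin`."""
    return d + 2.0 * y * lin


def turning_point(d, y):
    """Log income at which the quadratic in LIN peaks (y < 0) or bottoms out (y > 0)."""
    if y == 0:
        raise ValueError("no turning point when the curvature coefficient y is zero")
    return -d / (2.0 * y)


# Hypothetical coefficients for a Gini regression (inverted U: d > 0, y < 0).
d, y = 0.30, -0.025

lin_star = turning_point(d, y)   # log of per capita GNP at the peak
gnp_star = math.exp(lin_star)    # assumes LIN is a natural logarithm

print("slope at GNP = $1 (LIN = 0):", kuznets_slope(d, y, 0.0))
print("turning point in logs:", lin_star)
print("turning point in dollars:", gnp_star)
```

At LIN = 0 the slope reduces to d, which is the sense in which d is the slope at the $1 level of GNP per capita.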

Because the coefficients d and y in Equation (2) are neither country- nor time-specific, this model assumes that all countries lie on a family of parallel (i.e. identically sloped) Kuznets Curves that are constant over time but have distinct, country-specific levels of income inequality. The model does not try to explain differences in the levels of income inequality across countries. It only tests whether the hypothesis is tenable that all countries evolve over time along a stable U-shaped curve, and whether the shape of that curve corresponds to the Kuznets hypothesis. Although the model will be estimated from pooled time-series and cross-section data, it neither assumes nor tests that a U-shaped curve exists across countries, or that the cross-country and intertemporal curves are the same. This model therefore tests a weaker version of the Kuznets hypothesis.

Notice that in model (2) there is only one country-specific coefficient to be estimated for each country; all the other coefficients are shared by all countries or at least by groups of them. This means that in estimating model (2) we can use data for all countries for which we have observations for at least two distinct points in time. On the other hand, all countries with a single observation have to be discarded, considerably reducing our degrees of freedom. One can think of an even weaker hypothesis according to which not only the levels but also the slopes of the U curves would be country-specific, but this would require the estimation of at least three country-specific parameters and would therefore necessitate discarding all countries with fewer than four data points. There is not enough information in our data set for such a model.
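The reason single-observation countries drop out can be seen from the "within" transformation, which yields the same estimates of d and y as including one dummy per country: each variable is expressed as a deviation from its country mean, and a country with one observation deviates from its own mean by exactly zero. The sketch below is only an illustration of that idea, not the procedure used for the reported estimates. The tuple layout and the sample values are assumptions, and the definitional dummies are omitted for brevity.

```python
from collections import defaultdict


def within_transform(rows):
    """Drop single-observation countries and subtract country means.

    `rows` is a list of (country, y, lin) tuples; this layout is an
    assumption made for the sketch. Returns (country, y, lin, lin**2)
    tuples in deviations from country means.
    """
    by_country = defaultdict(list)
    for country, y, lin in rows:
        by_country[country].append((y, lin, lin ** 2))

    demeaned = []
    for country, obs in by_country.items():
        if len(obs) < 2:
            continue  # a lone observation is absorbed entirely by its dummy
        n = len(obs)
        means = [sum(col) / n for col in zip(*obs)]
        for values in obs:
            demeaned.append((country,) + tuple(v - m for v, m in zip(values, means)))
    return demeaned


# Made-up observations: (country, inequality measure, log per capita GNP).
sample = [
    ("A", 0.45, 5.0), ("A", 0.47, 5.4),
    ("B", 0.38, 6.1),                      # single observation, dropped
    ("C", 0.52, 4.8), ("C", 0.50, 5.6),
]

for row in within_transform(sample):
    print(row)
```

Regressing the demeaned inequality measure on the demeaned LIN and LIN² columns would then use only the variation over time within each country.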



(b) Second, we shall estimate a submodel obtained from Equation (1) by setting g = j = l = 0 and using various sets of additional explanatory variables, as discussed above:


$$Y_{it} = a + \sum_j b_j D_{jit} + d\,\mathrm{LIN}_{it} + y\,\mathrm{LIN}_{it}^2 + \sum_s m_s X_{sit} + u_{it} \qquad (3)$$


 The submodel (3) tests a stronger version of the Kuznets hypothesis, namely that the cross-country Kuznets Curve is the same as the intertemporal one and that it is the same for all the countries of our sample. Because there are no country-specific dummy variables in this submodel, both the cross-sectional and time-wise variation will be used to estimate coefficients d and y. The addition of other explanatory variables, on the other hand, implies that variations of income distribution over time and across the countries do not depend just on the level of income; they depend on other social, political and economic factors as well. In other words, this submodel presumes that cross-country and intertemporal Kuznets Curves are identical but they explain only part of the variation in income distribution. One of the advantages of submodel (3) is that we can use our entire data set, including the countries with a single observation.

Within submodel (3) we shall estimate several partial submodels nested in it. We shall do that by sequentially adding groups of variables X and performing a series of F-tests to determine which of the variables, including the two "Kuznets variables", appear to be significant in explaining variation in income distribution across countries and over time.
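For reference, the F-statistic for comparing two nested OLS regressions can be computed from their residual sums of squares alone. The sketch below uses the standard textbook formula; the function name and the numbers in the example are hypothetical and are not taken from the tables.

```python
def nested_f_statistic(rss_restricted, rss_unrestricted, q, n, k):
    """F-statistic for q exclusion restrictions in a nested OLS comparison.

    rss_restricted   -- residual sum of squares without the tested group
    rss_unrestricted -- residual sum of squares with the tested group
    q                -- number of coefficients in the tested group
    n                -- number of observations
    k                -- number of parameters in the unrestricted model,
                        including the constant
    """
    if q <= 0 or n <= k:
        raise ValueError("need q > 0 and n > k")
    numerator = (rss_restricted - rss_unrestricted) / q
    denominator = rss_unrestricted / (n - k)
    return numerator / denominator


# Hypothetical example: testing the two Kuznets variables jointly (q = 2)
# in a regression with 100 observations and 10 estimated parameters.
f_value = nested_f_statistic(rss_restricted=1.20, rss_unrestricted=1.00,
                             q=2, n=100, k=10)
print(f_value)
```

Under the null hypothesis the statistic follows an F distribution with q and n - k degrees of freedom.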


(c) Finally, we shall estimate the full model (1), which contains three additional coefficients (g, j and l) representing the time shift in the Kuznets Curve. This model is estimated to test the hypothesis that the Kuznets Curve is not stable over time. The three coefficients may cause the curve to move up or down and to change its curvature. Any joint significance of these coefficients would weaken the Kuznets hypothesis because it would imply that the cross-sectional curve and the temporal curve are not identical. It would also mean that different countries may evolve along distinct paths during their development. In particular, it will be interesting to see whether the Kuznets Curve flattens over time. On the other hand, if the three time-shift coefficients are not statistically significant, that would support the hypothesis of a stable Kuznets Curve.
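To make the time-shift terms concrete: expanding Equation (1) shows that in year t the curve has slope coefficient d + j·t and curvature coefficient y + l·t, while g·t only moves its level. The following sketch builds the corresponding regressors and tracks the turning point over time. The function names and coefficient values are hypothetical.

```python
def full_model_regressors(lin, t):
    """Regressors multiplying g, d, y, j and l in Equation (1), in that order."""
    return (t, lin, lin ** 2, t * lin, t * lin ** 2)


def curve_coefficients(d, j, y, l, t):
    """Slope and curvature coefficients of the Kuznets Curve in year t."""
    return d + j * t, y + l * t


def turning_point_at(d, j, y, l, t):
    """Log income at the turning point of the year-t curve."""
    slope_t, curvature_t = curve_coefficients(d, j, y, l, t)
    if curvature_t == 0:
        raise ValueError("curve is linear in LIN in this year")
    return -slope_t / (2.0 * curvature_t)


# Hypothetical coefficients: an inverted U that flattens as t grows,
# since both d + j*t and y + l*t shrink in absolute value.
d, j, y, l = 0.30, -0.004, -0.025, 0.0003

for t in (0, 10, 20):
    print(t, curve_coefficients(d, j, y, l, t), turning_point_at(d, j, y, l, t))
```

With j = l = 0 the slope and curvature, and hence the turning point, no longer depend on t, which is the stable-curve case described above.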


Four other methodological points need to be mentioned here.

(i) Missing Observations. For several countries, data were missing for one or more right-hand variables other than GNP. Where we had data for more than one point in time, we usually used interpolation or extrapolation to estimate the missing variable. For countries with a single time-point, the missing observation was replaced by the mean for a group of countries with similar characteristics. Although this procedure may cause a bias in estimates of parameters, such a bias is likely to be small and well compensated for by the benefit of reduced variance due to an increased number of observations.
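The two imputation rules can be sketched as follows. The text does not say which interpolation scheme was used; the straight line through two known points below is an assumption made for illustration, as are the function names and the sample figures.

```python
def interpolate(t, t0, x0, t1, x1):
    """Value at time t on the straight line through (t0, x0) and (t1, x1).

    For t between t0 and t1 this interpolates; outside that range it
    extrapolates along the same line. Linearity is an assumption here.
    """
    if t0 == t1:
        raise ValueError("need two distinct points in time")
    return x0 + (x1 - x0) * (t - t0) / (t1 - t0)


def group_mean(values):
    """Mean of the known values in a group of similar countries (None = missing)."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values in the group")
    return sum(known) / len(known)


# Hypothetical example: a variable observed in 1960 and 1970.
print(interpolate(1965, 1960, 10.0, 1970, 20.0))   # interpolation
print(interpolate(1975, 1960, 10.0, 1970, 20.0))   # extrapolation

# Hypothetical example: a single-observation country takes its group's mean.
print(group_mean([12.0, None, 18.0, 15.0]))
```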


(ii) F-tests. In several cases, we calculate F-statistics for the null hypothesis that the two "Kuznets variables" jointly have no explanatory power. Similarly, we calculate F-statistics for the joint explanatory power of various groups of additional variables (see the diagonal in the section of F-statistics at the bottom of Tables 2, 3, 5 and 6). It is well known that the result of any F-test may depend a great deal on which and how many other variables are present in the regression (see the rows of the F-statistics sections). Because we report the results of several regressions in which groups of variables are sequentially added to the Kuznets Curve variables, for each group we also report a sequence of F-statistics showing the change in its significance as the other groups are added to the regression. It should be kept in mind, however, that the result of such sequential testing is not independent of the particular sequence in which groups enter the regression - those groups that are added earlier are likely to show greater significance than those added later.


(iii) Outliers. To be sure that the results of our regression analysis are not distorted by influential outliers, we calculate influence statistics (see Belsley et al.  [5]) for most of our regressions. The reported influence statistics and their meaning are as follows:


RSTUDENT are 'Studentized' residuals, i.e. OLS residuals divided by their standard errors obtained from the regression in which the respective observation was dropped. Dividing residuals by standard errors scales them so that they do not depend on units of measurement and makes them t-distributed, provided that the original errors were normal. This allows us to use t-tables for judging whether the given observation is or is not an outlier. Usually we would suspect any observation with RSTUDENT larger (in absolute value) than 2.


 COVRATIO measures the influence of a given observation on standard errors. If it is less than one, then removal of that observation would reduce standard errors; if it is larger than one, then removal of that observation would increase standard errors. Therefore, outliers with a small COVRATIO are suspected of having an undesirable influence on a regression.


DFITS show how much and in what direction the fitted value at a given observation would change when that observation is added to the sample from which parameters are being estimated. Like RSTUDENT, DFITS are scaled by standard errors of fitted values to make them independent of units of measurement. DFITS have the same signs as RSTUDENT, but their absolute value depends not only on residuals but also on the leverage that the given observation exerts over estimated parameters.


 DFBETAS are statistics calculated for each estimated parameter and each observation showing the degree of change in the estimated parameter due to the addition of that observation to the sample. Again, DFBETAS are scaled by standard errors of estimated parameters so that they come close to measuring the change in the t-statistic of the estimated parameter due to the addition of a given observation.

These statistics are useful in identifying outliers and the influence they have on estimated parameters, standard errors and fitted values. But, of course, they do not indicate whether a particular influential outlier exerts a "good" or a "bad" influence. If it represents a correct observation, it is "good" because it helps to determine firmly the direction of the regression line and to reduce standard errors. But if it is an incorrect observation and it is influential, it is "bad" because it pushes the regression in the wrong direction. Finally, if the outlier is not influential, it does not matter much whether it is correct or not, since it does not influence the results to any degree.
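As a rough illustration of how three of these statistics are computed, the sketch below evaluates RSTUDENT, DFITS (called DFFITS in Belsley et al.) and COVRATIO for a regression with a constant and a single regressor, using the standard closed-form deletion formulas. It is a simplified stand-in for the multi-variable regressions of this paper, DFBETAS are omitted, and the data are made up so that the last observation is an obvious outlier.

```python
import math


def influence_statistics(x, y):
    """RSTUDENT, DFITS and COVRATIO for the simple regression y = a + b*x.

    Returns one (rstudent, dfits, covratio) tuple per observation.
    """
    n, p = len(x), 2                      # p = number of estimated parameters
    if n <= p + 1:
        raise ValueError("need more than p + 1 observations")
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    rss = sum(e * e for e in resid)
    s2 = rss / (n - p)                    # residual variance, full sample

    results = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx          # leverage of this point
        # Residual variance from the regression with this observation dropped.
        s2_del = (rss - e * e / (1.0 - h)) / (n - p - 1)
        rstudent = e / math.sqrt(s2_del * (1.0 - h))
        dfits = rstudent * math.sqrt(h / (1.0 - h))
        covratio = (s2_del / s2) ** p / (1.0 - h)
        results.append((rstudent, dfits, covratio))
    return results


# Made-up data: the first five points are nearly linear, the last is not.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]

for i, (r, d, c) in enumerate(influence_statistics(x, y)):
    flag = "suspect" if abs(r) > 2 else ""
    print(i, round(r, 3), round(d, 3), round(c, 3), flag)
```

In this example the last observation combines a large RSTUDENT with a COVRATIO well below one, the pattern that the text above treats as suspicious.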

The calculation of influence statistics is especially important in any analysis of income distribution data, because they are particularly subject to error and because much past work has been based on a limited number of observations. One or two influential incorrect outliers can determine results and the sample of countries used can explain to a substantial extent the differences in the results of different analysts. For further analysis, see the section on "Unusual Cases" below.


(iv) Unreported Results and Probability Values. To have manageable tables we have not reported the following results:

  •  regressions excluding the Kuznets Curve variables. Excluding these variables does change coefficients and tests of significance, but not sufficiently to affect the conclusions;

  •  regression constants, which do not appear to be of particular interest;

  •  the coefficients for the definitional dummy variables, again of little or no importance for the major conclusions.

All these data are available from the authors.

In addition to the usual t-statistics, we have also reported probability values for each estimated parameter and most F-statistics. These are another indication of the individual or joint statistical significance of estimated parameters (e.g., .021 means that the coefficient is different from zero at the 2.1 percent level of significance).





OK Economics was designed and it is maintained by Oldrich Kyn.