Thursday, September 13, 2012

Imputation by mean?

Today, I was briefed that when computing regional aggregates, such as those defined by the United Nations M49 country standard (http://unstats.un.org/unsd/methods/m49/m49regin.htm), I should use the regional mean to replace missing values.

I was sceptical about this approach, based on the little knowledge I have about missing values, since the assumptions required by the method are extremely strong.

(1) The missing values have to be MCAR (missing completely at random). This assumption is clearly violated, since missing values are more likely to come from less developed countries or from countries where the statistical office is not well established.

(2) The method also requires the data to be relatively symmetric, otherwise the mean will not be an unbiased estimate of the missing value.
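To make the concern concrete, here is a minimal sketch on simulated data (not the World Bank series). The log-normal indicator and the rule that the smallest 30% of values go unreported are both assumptions I made up for illustration; the real missingness mechanism is unknown.

```r
## Simulated, hypothetical indicator: right-skewed like many monetary series
set.seed(1)
x = rlnorm(200, meanlog = 0, sdlog = 1.5)

## Assumed missingness mechanism (not MCAR): the smallest 30% are unreported
isMissing = x < quantile(x, 0.3)
xObs = ifelse(isMissing, NA, x)

## Mean imputation: replace each missing value with the observed mean
xImp = ifelse(is.na(xObs), mean(xObs, na.rm = TRUE), xObs)

## Compare the true aggregate with the aggregate after imputation
trueTotal = sum(x)
imputedTotal = sum(xImp)
print(c(true = trueTotal, imputed = imputedTotal))
```

Under this assumed mechanism every missing value is smaller than every observed one, so the imputed total has to overshoot the true total. How large the gap is depends on the skewness and on the share of missing values.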

So I decided to do some data checking: I downloaded some data from the World Bank (http://data.worldbank.org/) to see what the data look like.




## Read the indicator name file; let's just work with the first 100 variables
library(WDI)
indicators = read.csv(file = "http://dl.dropbox.com/u/18161931/WorldBankIndicators.csv",
  stringsAsFactors = FALSE, nrows = 100)
indicators = indicators[-c(1:10), ]

## Download and merge the data. Some variables are not collected in 2010
## and thus they are discarded
keys = c("iso2c", "country", "year")
WDI.df = WDI(indicator = indicators$WDI_NAME[1], start = 2010, end = 2010)
for(i in 2:NROW(indicators)){
  name = indicators$WDI_NAME[i]
  ## try() keeps the loop going when an indicator fails to download
  tmp = try(WDI(indicator = name, start = 2010, end = 2010), silent = TRUE)
  ## Keep the indicator only if the download worked and it is not all NA
  if(!inherits(tmp, "try-error") && !all(is.na(tmp[, name])))
    WDI.df = merge(WDI.df, tmp[, c(keys, name)], by = keys)
}

## Produce histograms to examine the suitability of mean imputation
pdf(file = "dataDist.pdf")
## Skip the key columns (iso2c, country, year) and any non-numeric columns
vars = setdiff(colnames(WDI.df), c("iso2c", "country", "year"))
vars = vars[sapply(WDI.df[vars], is.numeric)]
for(name in vars){
  x = na.omit(WDI.df[, name])
  if(length(x) == 0) next
  hist(x, breaks = 100, main = name, xlab = NULL)
  abline(v = mean(x), col = "red")
  ## Percentage of observed values lying below the mean
  pctBelowMean = round(100 * mean(x < mean(x)), 2)
  legend("topright", legend = paste(pctBelowMean,
                       "% of data are below the mean", sep = ""))
}
graphics.off()


From the saved plots we can clearly see that a large number of variables are heavily skewed (typical for monetary and population-related data). In addition, for these variables the majority of the data lie far below the mean, so if mean imputation were used to compute the aggregates, we would end up with an estimate biased significantly upwards.
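As a quick sanity check on how to read the "% of data are below the mean" figure, the sketch below applies the same calculation to two simulated samples. Both samples and the helper name `pctBelow` are hypothetical, chosen only to contrast a symmetric distribution with a skewed one.

```r
## Hypothetical helper: percentage of non-missing values below the mean
pctBelow = function(x){
  x = na.omit(x)
  round(100 * mean(x < mean(x)), 2)
}

set.seed(42)
symmetric = rnorm(1000)             # roughly half should fall below the mean
skewed = rlnorm(1000, sdlog = 1.5)  # most values should fall below the mean

print(c(symmetric = pctBelow(symmetric), skewed = pctBelow(skewed)))
```

A value close to 50% suggests the mean is a reasonable stand-in for a typical country, while a value well above 50% means the mean sits in the right tail, which is what most of the histograms show.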

