Saturday, January 25, 2014

Accurate imputation and valid statistical inference with ensembles

Imputation is predictive inference and not causal inference!

I have met many people who consider the two to be equivalent. Their reasoning rests on the belief that a model which replicates the data generating mechanism will also give the best prediction. This may or may not be true depending on your philosophy of mathematics (i.e. whether you believe mathematics is created (non-platonist) or discovered (platonist)), the latter being the assumption behind this reasoning.

Nevertheless, most applied statisticians know that the two differ in practice, since one can almost never build the correct model (particularly under resource and time constraints), and thus one has to choose between minimizing prediction error and obtaining unbiased coefficients.

Ensemble Prediction

Ensemble methods are a general class of techniques which combine multiple models to obtain better predictive performance. The idea is based on the observation that all models are wrong but some are useful; each model captures a certain aspect of the total information set, and by combining the models you exploit a larger share of the available information.
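To make the idea concrete, here is a minimal sketch (unrelated to the wheat example below) in which two simple linear models, each using a different subset of predictors, are combined by averaging their predictions; the data set and variables are just built-in examples used for illustration.

## Two simple models, each using a different subset of the predictors
fit1 = lm(Sepal.Length ~ Sepal.Width, data = iris)
fit2 = lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris)

## The ensemble prediction is the average of the component predictions
ensemblePrediction = (predict(fit1, iris) + predict(fit2, iris))/2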

In his posthumous book, "Probability Theory: The Logic of Science", Jaynes states that all variables that have predictive power should be included.

The idea of combining predictors instead of selecting the single best one is well known in statistics and has a long and honourable theoretical background. The literature shows that an ensemble typically performs at least as well as the best of its component predictors.

Many variations of this concept exist and are already widely applied. Random forests are an application of bagging to CART; boosting, Bayesian model combination and stacking are all widely seen in predictive applications due to their outstanding performance.

Multiple Imputation

Multiple imputation is an idea introduced by Donald Rubin in 1987 to account for the under-estimation of variance in single imputation.

As opposed to providing a single point estimate of each missing value as in single imputation, multiple imputation draws several imputations in order to account for the noise due to imputation, allowing valid statistical inference.

The data are imputed multiple times, each imputation representing a possible state of reality. The analysis is then conducted on each imputed data set, and the results are combined to produce estimates and confidence intervals that incorporate the missing-data uncertainty.
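As a rough sketch of the combination step, the standard pooling rules of Rubin can be written in a few lines; the estimates and variances below are made-up numbers purely for illustration.

## Point estimates and their variances from m = 3 imputed data sets
est = c(1.02, 0.95, 1.10)
v = c(0.040, 0.050, 0.045)
m = length(est)

## Rubin's rules: pooled estimate, within- and between-imputation variance
pooledEst = mean(est)
withinVar = mean(v)
betweenVar = var(est)
totalVar = withinVar + (1 + 1/m) * betweenVar

## Approximate 95% interval (ignoring the degrees-of-freedom correction)
pooledEst + c(-1.96, 1.96) * sqrt(totalVar)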

Illustration

In this small exercise I will illustrate the use of bagging (bootstrap aggregating) and stacked regression to generate imputations of wheat yield for France from 1960 to 2012. Each imputation is generated by a simple linear regression on just a subset of the explanatory variables, estimated with the EM algorithm. The predictors are simply the wheat yields of the other countries with fully observed series.

Another advantage of bagging and stacking is that you are able to include all relevant variables in your imputation. In this exercise I have only 52 observations but more than 200 explanatory variables (p >> n)! By using subsets of the variables to build predictors and then combining the predictors, I can utilize all the information I consider relevant.

The following plot shows the original data in black and red, where the red points are observed values which have been removed to represent missing values.

I have withheld 80 percent of the data, leaving only 20 percent observed for estimation. The imputation seems extremely satisfactory, with an error rate of 6.4% for bagging and 6.2% for stacking. The grey lines are the single imputations generated from each bootstrap sample, representing the uncertainty of the imputation; these can be used for drawing valid statistical inference.
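The error rate reported here is the mean absolute percentage error (MAPE) computed on the withheld observations, as in the code at the end of the post; a small helper along these lines (the function name is my own) makes the definition explicit.

## Mean absolute percentage error between imputed and true values
mape = function(imputed, actual){
    mean(abs(imputed - actual)/actual) * 100
}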

Remark

This small example shows how ensemble methods can be used to produce accurate imputations. I believe the performance can be further improved if we include additional climate variables such as temperature and precipitation.

This is a rather simple exercise in which linear interpolation would also yield a satisfactory result. However, its shortfall is that other explanatory variables which may have predictive power cannot be included, and extrapolation is impossible.

In my simulation, it appears that bagging works better than stacking. I believe this is due to the small sample available for estimating the stacking coefficients, and the fact that the stacking weights were estimated on in-sample fitted values rather than trained on held-out predictions.
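If one wanted to train the stacking weights properly, they could be estimated on leave-one-out predictions rather than on in-sample fitted values; the following is only a rough, self-contained sketch of that idea using a built-in data set and made-up component models, not the wheat data.

library(nnls)

## Leave-one-out predictions for two component models
n = nrow(mtcars)
looPred = matrix(NA, nrow = n, ncol = 2)
for(i in 1:n){
    fit1 = lm(mpg ~ wt, data = mtcars[-i, ])
    fit2 = lm(mpg ~ hp + disp, data = mtcars[-i, ])
    looPred[i, 1] = predict(fit1, mtcars[i, ])
    looPred[i, 2] = predict(fit2, mtcars[i, ])
}

## Estimate non-negative stacking weights on the out-of-sample predictions
w = coef(nnls(looPred, mtcars$mpg))
w = w/sum(w)
stackedPrediction = looPred %*% w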

Codes


## Load required packages
library(FAOSTAT)   # getFAOtoSYB() for downloading the data
library(plyr)
library(reshape2)  # dcast() for reshaping the data
library(zoo)       # na.approx() for initialising the missing values
library(nnls)      # non-negative least squares for the stacking weights


## Simple function to estimate a linear regression with missing responses
## via the EM algorithm: impute, re-fit, and iterate until the
## log-likelihood stops improving.
lmImpute = function(formula, data, n.iter = 100, tol = 1e-3){
    response = as.character(formula[[2]])
    missIndex = which(is.na(data[, response]))
    ## Initialize missing values by linear interpolation, then by an
    ## initial regression fit for any values still missing.
    data[, response] = na.approx(data[, response], na.rm = FALSE)
    init = lm(formula = formula, data = data)
    data[is.na(data[, response]), response] =
        predict(init, data[is.na(data[, response]), ])
    ll = -Inf
    for(i in 1:n.iter){
        ## Re-estimate the regression on the completed data
        fit = lm(formula = formula, data = data)
        l1 = logLik(fit)
        if((l1 - ll) < tol){
            break
        } else {
            ## Replace the missing responses with their fitted values
            data[missIndex, response] = fitted(fit)[missIndex]
            ll = l1
        }
    }
    list(impute = data[, response], fitted = fitted(fit))
}


## Download the wheat yield data from FAOSTAT
wheatYield.df = getFAOtoSYB(name = "wheatYield", domainCode = "QC",
    itemCode = 15, elementCode = 5419)$entity

## For simplicity we only work with countries that have a complete series
cwheatYield.df = dcast(wheatYield.df, Year ~ FAOST_CODE,
    value.var = "wheatYield")
colnames(cwheatYield.df)[-1] =
    paste("wheatYield", colnames(cwheatYield.df)[-1], sep = "_")
fwheatYield.df =
    cwheatYield.df[, which(colSums(is.na(cwheatYield.df)) == 0)]


## Artificially remove a proportion of the response to act as missing values
missProp = 0.8
responseVar = "wheatYield_68"   # the response: wheat yield for France
tmp = fwheatYield.df
missing = sample(1:NROW(tmp), size = round(missProp * NROW(tmp)))
tmp[missing, responseVar] = NA

## Multiple imputation with the ensemble: each iteration fits the EM
## regression on a random subset of 15 explanatory variables.
n.iter = 300
mi = matrix(NA, ncol = n.iter, nrow = NROW(tmp))   # imputed series
mf = matrix(NA, ncol = n.iter, nrow = NROW(tmp))   # fitted values
explainVars = colnames(tmp)[colnames(tmp) != responseVar]
for(i in 1:n.iter){
    ## Sample a subset of the explanatory variables (p >> n, so we
    ## never use all of them at once).
    new.df = tmp[, c(responseVar, sample(explainVars, size = 15))]
    tmp.imp = lmImpute(formula(paste(responseVar, " ~ .")),
        data = new.df)
    mi[, i] = tmp.imp$impute
    mf[, i] = tmp.imp$fitted
}

## Bagging prediction: simple average of the component imputations
baggingPrediction = rowMeans(mi)

## Stacking prediction: weight the component predictors by non-negative
## least squares on the observed part of the series, then normalise the
## weights to sum to one.
stackingCoef = coef(nnls(mf[-missing, ], tmp[-missing, responseVar]))
scaleStackingCoef = stackingCoef/sum(stackingCoef)
stackingPrediction = mf %*% scaleStackingCoef

## Plot the result
with(fwheatYield.df, plot(Year, eval(parse(text = responseVar)),
                          type = "n",
                          ylim = c(0, max(mi, mf, tmp[, responseVar],
                              na.rm = TRUE)),
                          ylab = "Wheat Yield"))
for(i in 1:n.iter){
    lines(fwheatYield.df$Year, mi[, i],
          col = rgb(0.5, 0.5, 0.5, alpha = 0.2))
}
with(fwheatYield.df, lines(Year, eval(parse(text = responseVar))))
with(fwheatYield.df,
     points(Year, eval(parse(text = responseVar)), pch = 19,
            col = is.na(tmp[, responseVar]) + 1,
            cex = ifelse(is.na(tmp[, responseVar]), 1, 2)))
lines(fwheatYield.df$Year, baggingPrediction, col = "green")
points(fwheatYield.df$Year[missing], baggingPrediction[missing],
       pch = 19, col = "green")
lines(fwheatYield.df$Year, stackingPrediction, col = "steelblue")
points(fwheatYield.df$Year[missing], stackingPrediction[missing],
       pch = 19, col = "steelblue")
legend("topleft",
       legend = c("Observed", "Missing", "Bagging", "Stacking"),
       col = c("black", "red", "green", "steelblue"),
       lty = c(1, 0, 1, 1), pch = 19, bty = "n")

## Compute MAPE error rate
mean(abs(baggingPrediction[missing] -
         fwheatYield.df[missing, responseVar])/
     fwheatYield.df[missing, responseVar]) * 100
mean(abs(stackingPrediction[missing] -
         fwheatYield.df[missing, responseVar])/
     fwheatYield.df[missing, responseVar]) * 100