StaTEAstics.: 2014

Tuesday, September 16, 2014

3D Sine Wave

Had a headache last night, so decided to take things easy and just read posts Google+. Then I came across this post which seems interesting so I thought I would play around before I head to bed.

First of all, I thought generating a square base would be much easier in R compare to hexagons. Starting with 2 numeric vectors and then expand them by expand.grid() would do the job. The next step is to provide the rotation, this is also rather simple since it can be achieved by multiplying the following matrix.

\[ R = \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \\ \end{bmatrix} \]

Finally, the last step is to define the wave. This is done by a sine function with the phase being the distance away from the center.

\[ d = \sqrt{x^2 + y^2}\\ y_t = y_t + 10 \times \sin(t + d) \]

And here is the code, enjoy....



library(animation)

## Create the square to start with
x = seq(-5, 5, length = 50)
y = seq(-5, 5, length = 50)
square = as.matrix(expand.grid(x, y))

## Create the rotation matrix
angle = pi/180
rotation =
    matrix(c(cos(angle), -sin(angle), sin(angle), cos(angle)), ncol = 2)

## Plot
saveGIF(
    {
        init = square
        for(i in seq(0, 2 * pi, length = 360)){
            tmp = init
            distFromCenter = sqrt(tmp[, 1]^2 + tmp[, 2]^2)
            tmp[, 2] = tmp[, 2] + 10 * sin(i - distFromCenter)
            colIntensity = (tmp[, 2] + abs(min(tmp[, 2])))/
                max((tmp[, 2] + abs(min(tmp[, 2]))))
            plot(tmp[, c(1, 2)], xlim = c(-10, 10), ylim = c(-20, 20),
                 pch = ".", cex = 3, axes = FALSE, xlab = "", ylab = "",
                 col = rgb(colIntensity, 0, 0))            
            init = init %*% rotation
        }
    },
    movie.name = "./wave.gif", interval = 0.005,
    nmax = 30, ani.width = 800,  ani.height = 800
    )

Monday, August 25, 2014

Spline interpolation of temporal resolution for satellite images.

This week, I had a discussion with a few of my colleagues on the possibility of utilizing remote sensing data or satellite images to improve our statistical estimation such as imputation.

One source of interest is the Normalized Difference Vegetation Index which quantify the concentrations of green leaf vegetation around the globe. More details about the index can be accessed here (Measuring NDVI), how it is measured and what it measures. In brief, it measures the visible (VIS) and near-infrared (NIR) sunlight reflected by the plants and the index can be computed as follow

\[\text{NDVI} = \frac{NIR - VIS}{NIR + VIS}\]

Our colleague was interested where it could be mapped to the crop calender, and potentially identify the cultivation area of selected crops. Nevertheless, the first problem we faced was that the satellite image has a collection interval of 16 days which does not have the same temporal resolution as our crop calender. Thus, we thought maybe we can interpolated the time series then build images which correspond to the crop calender.

Shown below is a single grid (there are 595 grid for the whole world) of the satellite image for Central Western Africa captured on the 1st of January 2013, there are a total of 23 images for the year 2013.

After some basic research, we decided to start of with something simple, splines. Splines are widely used for spatial interpolation, and it appears to be a method we can implement quickly to see the results. To see how the splines work, we have plotted the evolution of a single pixel over time and super-imposed the spline interpolation.

The result seems to be satisfactory, but we realized that the NDVI seems to follow a cyclical pattern from year-to-year and thus our next step is to implement a periodic spline but with adjustment to the end points.

The next graph shows that the utilization of the periodic interpolation improved the interpolation slightly. In particularly, at the start of the year and towards the end of the year, however year-to-year adjustments will be required since the NDVI will return to a similar level but not exact.

Finally, we show the result of the interpolation on the satellite images. The following graph illustrates two interpolation on the 6th and the 11th of January between two observed image.

Summary

This exercise proves that the interpolation can be performed without making heroic assumption. Which lead us to optimistically believe that we can construct satellite images of the NDVI, and with combination of the crop calender to further improve methodology and data quality of our work.

One shortfall of this approach is that the interpolation does not capture spatial correlation. I hope to make an update shortly describing how we will deal with it.

Codes and data can be obtained from my Github repository.

WARNING: Do not attempt to run the script unless you have more than 8GB of RAM on your computer, the interpolated dataset has more than half a billion data points.

Thursday, March 20, 2014

Why multiple imputation?

Background

In the forth coming week, I will be giving a presentation on the fundamentals of imputation to my colleagues. One of the most important idea I would like to present is multiple imputation.

In my last post, I have given a small example of multiple imputation, but it does not provide the evidence why we should use it. This is the aim of this post in which I have taken a small example to illustrate how inference may change under single and multiple imputation.

Like most international organization and statistical institutes, we use the least-squares growth rate whenever possible to estimate the growth of a certain indicator. In brevity, a linear regression on the logged variable with respect to time is fitted, then the exponential of the time coefficient minus one will be the growth rate. \[ LSGR = (e^{\hat{\beta}} - 1) \times 100 \]

Since a linear regression is fitted, we can conveniently test whether a growth exist. In this example, I have chosen the production of wheat of Romania to illustrate my point of why multiple imputation should be used for drawing inference and to avoid being over confident.

Method

Lets load the data and simulate some missing value.



## Load the data, the production domain of Romania
romaniaMaize.df = structure(list(
    year = 1961:2011,
    production = c(5739600, 4932400, 6022700, 6691700,
        5877000, 8021600, 6857879, 7105287, 7675808, 6535512,
        7850300, 9816700, 7397200, 7439600, 9240661,
        11583200, 10113559, 9671200, 11768100, 10563300,
        8321500, 10550000, 10296000, 9891000, 11903200,
        10900900, 7526900, 7182200, 6761800, 6809604,
        10497338, 6828270, 7987450, 9343000, 9923132,
        9607944, 12686700, 8623370, 10934800, 4897600,
        9119200, 8399800, 9576985, 14541564, 10388499,
        8984729, 3853920, 7849080, 7973258, 9042032,
        11717591),
    areaHarvested = c(3428400, 3106766, 3379300, 3319100,
        3305800, 3287700, 3221075, 3344013, 3293068,
        3084036, 3131400, 3196500, 2956800, 2963000,
        3304691, 3377600, 3317671, 3178798, 3311291,
        3287560, 3077900, 2764387, 2935120, 3090530,
        3090100, 2858100, 2787100, 2579400, 2733400, 2466735,
        2574999, 3335920, 3065682, 2983400, 3109236, 3277041,
        3037700, 3128900, 3013400, 3049400, 2974000, 2761223,
        3119104, 3196130, 2609110, 2512944, 2263080, 2432210,
        2333501, 2094249, 2587102),
    seed = c(124000, 135000, 133000, 132000, 139000, 137000,
        140000, 140000, 135000, 135000, 138000, 135000,
        133000, 140000, 120000, 107000, 134000, 139000,
        133000, 135000, 124000, 144000, 148000, 145000,
        135000, 130000, 122000, 130000, 139000, 145000,
        180000, 180000, 142500, 125600, 130900, 90587,
        124753, 77000, 76000, 69907, 70692, 72785, 79065,
        67000, 64600, 63829, 65123, 62851, 59985, 53880,
        64744)),
    .Names = c("year", "production", "areaHarvested", "seed"),
    row.names = 6110:6160, class = "data.frame")

## Simulate missing value in production
n = NROW(romaniaMaize.df)
pctMiss = 0.5
myseed = 587
set.seed(myseed)
missIndex = sample(n, n * pctMiss)
romaniaMaize.df$productionSim = romaniaMaize.df$production
romaniaMaize.df[missIndex, "productionSim"] = NA

Then I will impute the data using the mi package assuming that production depends on the seed utilized and area harvested. I have imputed the data 50 times for the plot, but according to Rubin 3 to 10 imputation would be sufficient.



## Create missing information matrix
info = mi.info(romaniaMaize.df)
info = mi.info.update.include(info,
    list(production = FALSE, year = TRUE))

## I assume that the production has a trend and is determined by the
## area harvested and seed utilized.
info = mi.info.update.imp.formula(info,
    list(productionSim = "productionSim ~ year + areaHarvested + seed"))


## multiple imputation via mi
romaniaMaize.mi = mi(romaniaMaize.df, info = info, n.imp = 50,
    n.iter = 100, seed = myseed, run.past.convergence = TRUE)

Then we compute the single imputation by taking the average value of all the multiple imputation.


## Compute the single imputation
romaniaMaize.df$imputed = romaniaMaize.df$productionSim
romaniaMaize.df[is.na(romaniaMaize.df$imputed), "imputed"] =
    rowMeans(sapply(romaniaMaize.mi@imp,
                    FUN = function(chain) chain$productionSim@random))

The following plot illustrate the difference between multiple and single imputation. For single imputation we impute only once, while in multiple imputation we impute multiple times to reflect the uncertainty, each set of imputation can be interpreted as a potentially observed realization.



## Plot and compare the imputations
with(romaniaMaize.df, plot(year, production, type = "n",
                          ylim = c(0, max(production, imputed))))
sapply(mi.completed(romaniaMaize.mi), FUN = function(chain)
       lines(romaniaMaize.df$year, chain$productionSim,
             col = rgb(0.5, 0.5, 0.5, alpha = 0.3)))
with(romaniaMaize.df, lines(year, production, 
                          ylim = c(0, max(production, imputed))))
with(romaniaMaize.df, points(year, production, pch = 19, 
                          ylim = c(0, max(production, imputed)),
                          col = is.na(productionSim) + 1))
with(romaniaMaize.df, lines(year, imputed, col = "steelblue", lwd = 2))

Compute the growth rate on single and multiple imputation and test using the frequentists approach.


> ## Fit the regression on single/multiple imputed data set
> single.fit = lm(log(romaniaMaize.df$imputed) ~ year,
+     data = romaniaMaize.df)
> multiple.fit = lm.mi(log(productionSim) ~ year, romaniaMaize.mi)
> 
> ## Compare the growth rate
> (exp(coef(single.fit)[2]) - 1) * 100
   year 
0.59327 
> ((exp(coef(multiple.fit)[2]) - 1) * 100)
     year 
0.6079975 
> ## Compare the P-value
> 2 * pt(single.fit$coef[2]/sqrt(diag(vcov(single.fit)))[2],
+        df = single.fit$df.residual, lower.tail = FALSE)
        year 
4.187208e-05 
> 2 * pt(multiple.fit@mi.pooled$coef[2]/multiple.fit@mi.pooled$se[2],
+        df = single.fit$df.residual, lower.tail = FALSE)
      year 
0.05538677

From the result we can see that the point estimates are very similar. However, if we perform the hypothesis testing on the singularly imputed dataset, we would have came to the conclusion that the production of Romania has been growing at a rate which is statistical significant over the past half a century. On the other hand, the test on the multiplied dataset has a P-value greater than the typical threshold of 5 percent.

Why are the results different? Lets not get into all the issues of the hypothesis testing under the Frequentist paradigm. The main disparity results from the fact that we take into account of the uncertainty of imputation and correctly estimate the standard error. Rather than providing a single point we provide a distribution of possible points.

Saturday, January 25, 2014

Accurate imputation and valid statistical inference with ensemble

Imputation is predictive inference and not causal inference!

I have met many people, who consider the two are equivalent. Their reasoning is based on the belief that if you can produce a model which replicate the data generating mechanism, it will give you the best prediction. Which may or may not be true depending on your philosophy of mathematics (i.e. whether you believe mathematics is created (non-platonist) or mathematics is discovered (platonist), the latter being the assumption of the reasoning.

Nevertheless, most applied statisticians know that the two are different in application since one can almost never build the correct model (in particularly under resource and time constraints) and thus the need to choose between minimizing prediction error or unbiased coefficients.

Ensemble Prediction

Ensemble is a general class of methods which combines multiple models to obtain better predictive performance. The idea is based on the observation that all models are wrong but some are useful; each model captures a certain aspect of the total information set and by combining all subsets you result in a larger utilized set of information.

In his posthumous book, "Probability Theory: The Logic of Science", Jaynes states that all variables that has predictive power should be included.

The idea of combining predictors instead of selecting the single best is well-known in statistics and has a long and honourable theoretical background. Literature has shown that ensemble always produces a predictor which is better than any of its component predictor.

Many variation of this concept exist and are already widely applied. Random forest is an application of bagging in CART; boosting, Bayesian model combination and stacking are all widely seen in predictive application due to their outstanding performance.

Multiple Imputation

Multiple Imputation is an idea originated in 1987 by Donald Rubin to account for the under-estimate of variance in single imputation.

Oppose to providing a single point estimate of the missing value as in single imputation, multiple imputation aims to draw various imputation to account for the noise due to imputation allowing valid statistical inference.

The data are imputed multiple times each representing the possible state of reality, analysis can be conducted on each imputation then later combined to produce estimates and confidence intervals that incorporate missing-data uncertainty.

Illustration

In this small exercise I will illustrate the use of bagging/bootstrap aggregating and Stacked regression to generate imputation for wheat yield from 1960 to 2012 for France. Each imputation is generated using simple linear regression of just a subset of the explanatory variables estimated with the EM-algorithm. The predictors are simply the yield of wheat from other observed countries.

Another advantage of bagging and stacking is the fact that you are able to include all relevant variables in your imputation. In this exercise, I have 52 observations but more than 200 explanatory variables and even more (p >> n)! By using subsets of the variable to build predictors then combining the predictors I can utilize all the information I consider relevant.

The following plot shows the original data in black and red, where the red points are observed value which has been removed to represent missing values.

I have withdrawn 80 percent of the data, leaving only 20 percent for imputation. The imputation seems extremely satisfactory with an error rate of 6.4% for bagging and 6.2% for stacking. The grey lines are single imputation generate from each bootstrap representing the uncertainty of imputation, and these can be used for drawing valid statistical inference.

Remark

This small example shows how ensemble methods can be used to produce accurate imputation, I believe the performance can be further improved if we include additional climate variables such as temperature and precipitation.

This is a rather simple exercise in which a linear interpolation will also yield satisfactory result. However, the shortfall is that other explanatory variables which may have predictive power may not be included and extrapolation is impossible.

In my simulation, it appears that bagging works better than stacking. I believe this is due to the small sample I have for estimating the stacking coefficients, and no training was carried out.

Codes


## Download data
library(FAOSTAT)
library(plyr)
library(zoo)
library(nnls)


## Simple function for estimate regression with EM
lmImpute = function(formula, data, n.iter = 100, tol = 1e-3){
    response = as.character(formula[[2]])
    missIndex = which(is.na(data[, response]))
    ## Initialize missing values
    data[, response] = na.approx(data[, response], na.rm = FALSE)
    init = lm(formula = formula, data = data)
    data[is.na(data[, response]), response] =
        predict(init, data[is.na(data[, response]), ])
    ll = -Inf
    for(i in 1:n.iter){
        fit = lm(formula = formula, data = data)
        l1 = logLik(fit)
        if((l1 - ll) < tol){
            break
        } else {
            data[missIndex, response] = fitted(fit)[missIndex]
            ll = l1
        }
    }
    list(impute = data[, response], fitted = fitted(fit))
}


wheatYield.df = getFAOtoSYB(name = "wheatYield", domainCode = "QC",
    itemCode = 15, elementCode = 5419)$entity

## For simplicity we only work with countries that has full data
cwheatYield.df = dcast(wheatYield.df, Year ~ FAOST_CODE,
    value.var = "wheatYield")
colnames(cwheatYield.df)[-1] =
    paste("wheatYield", colnames(cwheatYield.df)[-1], sep = "_")
fwheatYield.df =
    cwheatYield.df[, which(colSums(is.na(cwheatYield.df)) == 0)]


## Generate missing value
missProp = 0.8
responseVar = "wheatYield_68"
tmp = fwheatYield.df
missing = sample(1:NROW(fwheatYield.df),
    size = round(missProp * NROW(tmp)))
tmp[missing, responseVar] = NA

## Multiple imputation with Ensemble
n.iter = 300
mi = matrix(NA, nc = n.iter, nr = NROW(tmp))
mf = matrix(NA, nc = n.iter, nr = NROW(tmp))
for(i in 1:n.iter){
    bootstrapExplainVar = colnames(tmp[, colnames(tmp) != responseVar])
    new.df = tmp[, c(responseVar,
        sample(bootstrapExplainVar, size = 15))]
    tmp.imp = lmImpute(formula(paste(responseVar, " ~ .")),
        data = new.df)    
    mi[, i] = tmp.imp$impute
    mf[, i] = tmp.imp$fitted
}

## Compute Bagging prediction
baggingPrediction = rowMeans(mi)

## Compute Stacking prediction
stackingCoef = coef(nnls(mf[-missing, ], tmp[-missing, responseVar]))
scaleStackingCoef = stackingCoef/sum(stackingCoef)
stackingPrediction = mf %*% scaleStackingCoef

## Plot the result
with(fwheatYield.df, plot(Year, eval(parse(text = responseVar)),
                          type = "n",
                          ylim = c(0, max(mi, mf, tmp[, responseVar],
                              na.rm = TRUE)),
                          ylab = "Wheat Yield"))
for(i in 1:n.iter){
    lines(fwheatYield.df$Year, mi[, i],
          col = rgb(0.5, 0.5, 0.5, alpha = 0.2))
}
with(fwheatYield.df, lines(Year, eval(parse(text = responseVar))))
with(fwheatYield.df,
     points(Year, eval(parse(text = responseVar)), pch = 19,
            col = is.na(tmp[, responseVar]) + 1,
            cex = ifelse(is.na(tmp[, responseVar]), 1, 2)))
lines(fwheatYield.df$Year, baggingPrediction, col = "green")
points(fwheatYield.df$Year[missing], baggingPrediction[missing],
       pch = 19, col = "green")
lines(fwheatYield.df$Year, stackingPrediction, col = "steelblue")
points(fwheatYield.df$Year[missing], stackingPrediction[missing],
       pch = 19, col = "steelblue")
legend("topleft",
       legend = c("Observed", "Missing", "Bagging", "Stacking"),
       col = c("black", "red", "green", "steelblue"),
       lty = c(1, 0, 1, 1), pch = 19, bty = "n")

## Compute MAPE error rate
mean(abs(baggingPrediction[missing] -
         fwheatYield.df[missing, responseVar])/
     fwheatYield.df[missing, responseVar]) * 100
mean(abs(stackingPrediction[missing] -
         fwheatYield.df[missing, responseVar])/
     fwheatYield.df[missing, responseVar]) * 100