Monday, October 8, 2012

Cooking with Statistics

I received my copy of "La Cucina - The regional cooking of Italy" from Accademia Italiana Della Cucina today and I can't wait to dig into all 2000 Italian regional recipes in the book!

I have always believed that everything has a mathematical form, and that these forms are everywhere in our lives.

Take cooking for example: a recipe is a model, ingredients are like data, and the chef turns the ingredients into magnificent dishes just as a statistician turns data into useful insight.

Furthermore, vegetables, meat and all other inputs, just like data, are not always the same. Time-series data are subject to sampling frequency, and surveys are prone to non-response; likewise, ingredients can be good or bad and follow a distribution of their own.

A bad cook like me would blindly follow the recipe and make a terrible dish. A good chef, on the other hand, would assess the ingredients by looking, smelling and even tasting them to determine their state, just as a good analyst would plot the data to get a feel for it.

The only difference is that a statistician analyses all this information with a computer, while the chef analyses it with his own personal computer (brain and senses).

I wonder whether, one day when we can measure every aspect of taste, a terrible cook like me could turn into a master chef!

Wednesday, October 3, 2012

Peculiar behaviour of the sum function

The sum function in R is special in contrast to other summary statistics functions such as mean and median. The first distinction is that it is a primitive function while the others are not (although you can call mean via .Internal). This causes several inconsistencies and unexpected behaviours.
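A quick check in the console shows the difference (is.primitive and args are base R functions; the argument lists already hint at the inconsistency discussed below):

is.primitive(sum)     ## TRUE
is.primitive(mean)    ## FALSE
args(sum)             ## function (..., na.rm = FALSE)
args(mean.default)    ## function (x, trim = 0, na.rm = FALSE, ...)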

(1) Inconsistency in argument

For example, the arguments are inconsistent. Both mean and median take the argument x, while sum operates on whatever arguments are not matched by name. This becomes a problem when you want to write a function which switches between all the summary functions, such as:

do.call(myFUN, list(x = x))

where myFUN can be any statistical summary function. The issue first arose when I wanted to write a function which encompasses several different summary statistics so that I could switch between them when required. The real trouble starts when I have to pass additional arguments, such as the weights w in the weighted.mean function. I wrote the following call and naively hoped it would work:

do.call(myFUN, list(x = x, w = w))

It turns out that this line of code works fine for all the summary statistics except sum, where the weights are also summed. So my current solution is simply to use the switch function, which is not my favourite, as sketched below.
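For reference, here is a rough sketch of the kind of switch-based wrapper I mean (mySummary is my own hypothetical name, not code from any package); the point is simply that sum never gets to see the weights:

mySummary = function(x, type = c("mean", "median", "weighted.mean", "sum"),
                     w = NULL, na.rm = TRUE){
  type = match.arg(type)
  switch(type,
         mean = mean(x, na.rm = na.rm),
         median = median(x, na.rm = na.rm),
         weighted.mean = weighted.mean(x, w = w, na.rm = na.rm),
         sum = sum(x, na.rm = na.rm))   ## the weights are never passed to sum
}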

(2) Inconsistency in output

Another inconsistency arises in how NAs are treated. For mean, median and weighted.mean, if all the observations are NA then either NA or NaN is returned.

mean(rep(NA, 10), na.rm = TRUE)
median(rep(NA, 10), na.rm = TRUE)

The sum function, on the other hand, returns zero. It puzzles me how you can get zero when NA stands for "not available"; it feels like creating something out of nothing. This is a problem for me: when I sum up multiple time series with missing values, I want the function to remove the NAs and compute the sum where there are partial data, but to return NA rather than zero when there are no data at all.
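To put it alongside the mean and median examples above:

sum(rep(NA, 10), na.rm = TRUE)   ## returns 0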

Nevertheless, a simple solution exists, thanks to the active R community. This post on R-help addresses the problem and solves it in an elegant manner:

sum(x, na.rm = any(!is.na(x)))
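A quick illustration of the idiom (both calls below are just examples of the line above):

x = c(1, 2, NA)
sum(x, na.rm = any(!is.na(x)))   ## 3, partial data are summed
x = c(NA, NA)
sum(x, na.rm = any(!is.na(x)))   ## NA, no data at all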


"The computations and the software for data analysis should be trustworthy" - John Chamber, Software for Data Analysis

I am not sure about the reasoning underlying the behaviour of sum, but it should be consistent so that people can trust it and use it as they expect.

Saturday, September 29, 2012

My Facebook social network

I got very excited about making a network diagram of my Facebook network using Gephi (https://gephi.org/) and submitted my first assignment for the Social Network Analysis course on https://www.coursera.org/. It's the middle of the night, so I will keep this post short and add more details tomorrow.

This is my Facebook social network. After spending 12 years in New Zealand, you can see how small my network is and how tightly knit it is! The larger and darker a node is, the more mutual friends that person shares with me.

You can see the major clusters in the diagram: the top-left cluster contains my statistics geeks, the middle cluster the people I hang out with (drinking, university friends and so on), and the lower cluster my college friends.

There are three small clusters as well: at the top far right are my family and relatives, and close to them are my intermediate-school friends from Taiwan, who are loosely connected to my family since some of them are my brother's friends. Finally, the small cluster in the lower right contains my colleagues and friends from Rome!

Enrol in the course and you too can create and understand your Facebook friendships.

Saturday, September 22, 2012

Network of trade

This week, I got my hands on some agricultural trade data. Trade data are typically extremely dirty, so treat them with care when you get your hands on them: laboratory-grade equipment is required.

So I decided to look at how countries trade by plotting the network (the data are confidential, so I will not disclose the countries nor the commodity).


## Load the library
library(XML)
library(reshape)
library(igraph)

## Read the data

t1001.df = read.csv("http://dl.dropbox.com/u/18161931/trade.csv", header = TRUE)

## Create the graph
net.mat = as.matrix(t1001.df)
net.g = graph(t(unique(net.mat[, 1:2])))

## Delete vertices with no edges
full.g = delete.vertices(net.g, which(degree(net.g) == 0))

## Set the edge width proportional to the share of total trade
E(full.g)$width = net.mat[, 3]/sum(net.mat[, 3])
E(full.g)$width[E(full.g)$width < 0.05] = 0.05
E(full.g)$width = E(full.g)$width * 20

## Compute the size of exporting vertex
sum.df = with(t1001.df, aggregate(TradeValue, list(rtCode), sum))

## Change size and color of exporting country
V(full.g)$size = 8
V(full.g)$size[c(6, 10, 13, 28, 43)] =
  ((sum.df[, "x"]/sum(sum.df[, "x"])) - min(sum.df[, "x"]/sum(sum.df[, "x"]))) *
  15 + 8
V(full.g)$color = "lightblue"
V(full.g)$color[c(6, 10, 13, 28, 43)] = "steelblue"

## Plot the network

set.seed(587)
plot(full.g, edge.arrow.size = 0.3, edge.curved = TRUE,
     vertex.label.color = "black")




The exporters are coloured in dark blue and the importers in light blue. The width of each connection is proportional to the amount of trade between the two countries.

Looking at the plot, one can easily identify that countries 43 and 13 are major exporters while countries 7 and 37 are major importers. This information can also be extracted with some simple analysis (see the sketch below), but there are some subtler points which are harder to identify without a network diagram.
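A rough sketch of that simple analysis (graph.strength is an igraph function; using the width attribute as edge weights is my own assumption, since it was set proportional to the trade shares above):

## Rank countries by their weighted out-going and in-coming trade
exportStrength = graph.strength(full.g, mode = "out", weights = E(full.g)$width)
importStrength = graph.strength(full.g, mode = "in", weights = E(full.g)$width)
head(sort(exportStrength, decreasing = TRUE))   ## major exporters
head(sort(importStrength, decreasing = TRUE))   ## major importers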

(1) There are clear cluster relationships: certain countries import only from 43, 13 or 28, while some import from more than one. There could be cost, logistic, trade-agreement or geographical reasons for this kind of pattern.

(2) Country 10 is isolated, meaning there is no trading between it and the rest of the world!

The network reveals this subtle information very quickly and is a very good exploratory tool for trade data.

Saturday, September 15, 2012

Preferential attachment for networks

I am currently taking the Networked Life course on Coursera.org offered by Professor Michael Kearns from the University of Pennsylvania. I have taken several courses, including machine learning and natural language processing, since the platform was launched late last year. I have to give my biggest praise to Andrew Ng and his team for creating such a platform and enabling knowledge sharing for those who would otherwise have no access to it.

Here, I want to talk about and demonstrate a type of simulated network model introduced in the course: preferential attachment.

The idea of this model is that individuals are more likely to connect to others who already have many connections (a high degree); this is typical of social networks and many real-life situations. The more connections one has, the more exposed one is to the rest of the world.

This model is able to mimic the small diameter, high clustering and heavy-tailed degree distribution of typical large networks.

For brevity, we will focus only on the heavy-tail property and the power-law decay of the degree distribution, where a large number of individuals have a small number of connections while a small number of individuals have a vastly greater number of connections. If the distribution follows a power law, with P(k) proportional to k^(-alpha), then log P(k) is linear in log k, which is why a roughly straight line in a log-log plot is taken as evidence of power-law decay.

The following simulation generates a preferential attachment process and plots the histogram and the log-log relationship. It is based on the same example used in the course: a number of individuals each choose one of 1000 night clubs, with the probability of choosing a club proportional to how many have already chosen it.



library(animation)
saveHTML({
  N = 1000                        ## number of night clubs
  for(i in 1:100){
    m = 100 * i                   ## number of individuals in this frame
    rlt = rep(1, N)               ## start every club with weight one
    for(j in 1:m){
      ## Each individual picks a club with probability proportional to
      ## its current popularity (preferential attachment)
      selection = sample(x = 1:N, size = 1, prob = rlt)
      rlt[selection] = rlt[selection] + 1
    }
    rlt = rlt - 1                 ## remove the initial weight
    par(mfrow = c(1, 2))
    bin = hist(rlt, breaks = m/10,
      main = paste("Histogram of preferential network\n with ", m,
        " individuals choosing\nbetween 1000 clubs", sep = ""),
      xlab = "Number of Individuals", ylab = "Frequency")
    ## Log-log plot of the binned counts with a fitted straight line
    plot(log(bin$breaks[-1]), log(bin$counts),
         main = "log-log histogram", xlab = "", ylab = "")
    lm.df = data.frame(lbin = log(bin$breaks[-1]), lcounts = log(bin$counts))
    lm.df = lm.df[lm.df$lcounts != -Inf & lm.df$lcounts != 0, ]
    fit = lm(lcounts ~ lbin, data = lm.df)
    abline(coef = coef(fit))
  }
})



From the simulation we can see that the histogram becomes more and more heavy-tailed as we increase the number of individuals in the network, but I remain sceptical about how linear the relationship in the log-log plot really is, which is the indicator of power-law decay.
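One way to go beyond eyeballing the log-log plot would be to fit a power law to the simulated counts directly. The sketch below is my own rough check rather than part of the course material; it assumes a recent igraph where power.law.fit defaults to the plfit implementation and returns the estimated exponent alpha and a Kolmogorov-Smirnov p-value.

library(igraph)

## One realisation of the club counts, using the same scheme as above
set.seed(587)
N = 1000
m = 10000
rlt = rep(1, N)
for(j in 1:m){
  selection = sample(x = 1:N, size = 1, prob = rlt)
  rlt[selection] = rlt[selection] + 1
}
rlt = rlt - 1

## Fit a power law to the positive counts
fit = power.law.fit(rlt[rlt > 0])
fit$alpha   ## estimated exponent of the power law
fit$KS.p    ## a large p-value means a power law cannot be rejected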

Thursday, September 13, 2012

Imputation by mean?

Today I was briefed that, when computing regional aggregates such as those defined by the United Nations M49 country standard (http://unstats.un.org/unsd/methods/m49/m49regin.htm), I should use the regional mean to replace missing values.

I was sceptical about this approach, based on the little knowledge I have about missing values, since the assumptions required by the method are extremely strong.

(1) The missing values have to be MCAR (missing completely at random), which is almost certainly violated since missing values are more likely to come from countries whose statistical offices are less well established or less developed.

(2) The method also requires the data to be relatively symmetric, otherwise the mean will not be an unbiased estimate of the missing value. (A small simulation below illustrates how both points bias an aggregate upwards.)
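To make the intuition concrete, here is a tiny simulation with made-up, skewed data (the numbers are hypothetical, not any real country series). The smallest values are made missing, mimicking less developed countries failing to report, and the missing values are replaced by the mean of the observed values before aggregating:

set.seed(123)
gdp = rlnorm(100, meanlog = 8, sdlog = 1.5)   ## skewed, GDP-like variable
missing = order(gdp)[1:20]                    ## the 20 smallest values go missing

trueTotal = sum(gdp)
imputed = gdp
imputed[missing] = mean(gdp[-missing])        ## mean imputation

## The mean-imputed aggregate overshoots the true total considerably
c(true = trueTotal, meanImputed = sum(imputed))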

So I decided to do some data checking: download some data from the World Bank (http://data.worldbank.org/) and see what the data look like.




## Load the WDI package to access the World Bank indicators
library(WDI)

## Read the name file; let's just work with the first 100 variables
WDI = read.csv(file = "http://dl.dropbox.com/u/18161931/WorldBankIndicators.csv",
  stringsAsFactors = FALSE, nrows = 100)
WDI = WDI[-c(1:10), ]

## Download and merge the data. Variables not collected in 2010 are discarded
WDI.df = WDI(indicator = WDI$WDI_NAME[1], start = 2010, end = 2010)
for(i in 2:NROW(WDI)){
  tmp = try(WDI(indicator = WDI$WDI_NAME[i], start = 2010, end = 2010))
  if(!inherits(tmp, "try-error") &&
     (sum(is.na(tmp[, WDI$WDI_NAME[i]])) != NROW(tmp)))
    WDI.df = merge(WDI.df, tmp, by = c("iso2c", "country", "year"))
}

## Produce histogram to examine the suitability of mean imputation
pdf(file = "dataDist.pdf")
for(i in 3:NCOL(WDI.df)){
  hist(WDI.df[, i], breaks = 100, main = colnames(WDI.df)[i], xlab = NULL)
  abline(v = mean(WDI.df[, i], na.rm = TRUE), col = "red")
  pctBelowMean = round(100 * sum(na.omit(WDI.df[, i]) <
    mean(WDI.df[, i], na.rm = TRUE))/length(na.omit(WDI.df[, i])), 2)
  legend("topright", legend = paste(pctBelowMean,
                       "% of data are below the mean", sep = ""))
}
graphics.off()


From the saved plots we can clearly see that a large proportion of the variables are heavily skewed (typical of monetary and population-related data). In addition, the majority of the data lie far below the mean, so if mean imputation were used to compute the aggregates we would end up with estimates biased significantly upwards.


Thursday, August 2, 2012

Units and metadata

Handling metadata is not natural in R, or in any traditional rectangular data storage system.

There are several tricks and packages which attempt to solve this problem: Hmisc uses the attribute mechanism, while the IRanges package has its own DataFrame class.

The Hmisc package allows one to store metadata such as units, labels and comments:

library(Hmisc)

## Create a test data frame
test.df <- data.frame(x = ts(1:12, start = c(2000, 1), frequency = 12),
                      y = ts(1:12, start = c(2001, 1), frequency = 12))

## Assign the units and comment
units(test.df$x) = "cm"
units(test.df$y) = "m"
comment(test.df) <- "this is a test data set"

## Summary of the data
describe(test.df)
contents(test.df)

The disadvantage of this approach is that the metadata are lost when functions such as subset are used:

str(subset(test.df, select = x, drop = FALSE))

This restricts its use to storage rather than manipulation.

The second approach, taken by the IRanges package, creates a whole new S4 class for handling data with metadata; with the corresponding accessor functions the attributes can be retained.



library(IRanges)
test2.df <- DataFrame(x = 1:10, y = letters[1:10])
metadata(test2.df) <- list(units = list(x = "cm", y = "m"))



str(subset(test2.df, select = x))


In this case the units are still preserved; nevertheless, the subset function does not subset the metadata accordingly, which can cause problems.

In short, there is definitely room for improvement. Writing a new class is more natural and gives the developer and user more control, as the sketch below suggests.
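As a rough illustration of that last point, here is a minimal sketch of such a class (the class name UnitFrame and the accessor units2 are hypothetical, not from any existing package): a data.frame plus a units slot, with a subsetting method that keeps the two in sync.

## A minimal data.frame-with-units class
setClass("UnitFrame",
         representation(data = "data.frame", units = "character"))

## Accessor for the units (named units2 to avoid clashing with base::units)
setGeneric("units2", function(x) standardGeneric("units2"))
setMethod("units2", "UnitFrame", function(x) x@units)

## Column subsetting that carries the matching units along
setMethod("[", "UnitFrame", function(x, i, j, ..., drop = FALSE){
  new("UnitFrame", data = x@data[, j, drop = FALSE], units = x@units[j])
})

uf = new("UnitFrame",
         data = data.frame(x = 1:10, y = rnorm(10)),
         units = c(x = "cm", y = "m"))

units2(uf[, "x"])   ## only the unit of the retained column survives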