Saturday, September 29, 2012

my Facebook social network

I got very excited making a network diagram of my Facebook network using Gephi (https://gephi.org/) and submitted my first assignment for the Social Network Analysis course on https://www.coursera.org/. It's the middle of the night, so I will keep this post short and add more details tomorrow.

This is my Facebook social network. After spending 12 years in New Zealand you can see how small my network is and how tightly knit it is! The larger and darker a node, the more mutual friends that person shares with me.

You can see the major clusters in the diagram: the top-left cluster are my statistics geeks, the middle cluster are the people I hang out with, whether for drinks or from our university days, and the lower cluster are my college friends.

Then there are three small clusters as well: at the far top right are my family and relatives, and close to them are my intermediate school friends from Taiwan; since some of them are my brother's friends, they are loosely connected to my family. Finally, there is the small cluster in the lower right, which are my colleagues and friends from Rome!

Enrol in the course, and you can create and understand your own Facebook friendship network.

Saturday, September 22, 2012

Network of trade

This week, I got my hands on some agricultural trade data. Trade data are typically extremely dirty, so treat them with care when you get your hands on them; lab-standard equipment is required.

So I decided to look at how countries trade by plotting the network. (The data are confidential, so I will not disclose the countries nor the commodity.)


## Load the library
library(igraph)

## Read the data

t1001.df = read.csv("http://dl.dropbox.com/u/18161931/trade.csv", header = TRUE)

## Create the graph from the exporter/importer columns
net.mat = as.matrix(t1001.df)
net.g = graph(t(unique(net.mat[, 1:2])))

## Delete vertices with no edges
full.g = delete.vertices(net.g, which(degree(net.g) == 0))

## Set edge width proportional to the size of the trade flow,
## with a minimum width so small flows remain visible
E(full.g)$width = net.mat[, 3]/sum(net.mat[, 3])
E(full.g)$width[E(full.g)$width < 0.05] = 0.05
E(full.g)$width = E(full.g)$width * 20

## Compute total exports for each exporting country
sum.df = with(t1001.df, aggregate(TradeValue, list(rtCode), sum))

## Change size and color of the exporting countries
V(full.g)$size = 8
V(full.g)$size[c(6, 10, 13, 28, 43)] =
  ((sum.df[, "x"]/sum(sum.df[, "x"])) - min(sum.df[, "x"]/sum(sum.df[, "x"]))) *
  15 + 8
V(full.g)$color = "lightblue"
V(full.g)$color[c(6, 10, 13, 28, 43)] = "steelblue"

## Plot the network

set.seed(587)
plot(full.g, edge.arrow.size = 0.3, edge.curved = TRUE,
     vertex.label.color = "black")




The exporters are coloured dark blue while the importers are light blue. The width of a connection is proportional to the volume of trade between the two countries.

Looking at the plot, one can easily identify that countries 43 and 13 are major exporters while countries 7 and 37 are major importers. This information can be extracted with some simple analysis (see the sketch below); however, there are some subtle points which are a little harder to identify without a network diagram.
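Here is a minimal sketch of that simple analysis, assuming net.mat from above holds the exporter, importer and trade value in its first three columns:

## Total trade by country
exports = tapply(net.mat[, 3], net.mat[, 1], sum)
imports = tapply(net.mat[, 3], net.mat[, 2], sum)

## The largest exporters and importers
head(sort(exports, decreasing = TRUE))
head(sort(imports, decreasing = TRUE))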

(1) There are clear cluster relationships: certain countries only import from one of 43, 13 or 28, while some import from more than one. There could be cost, logistic, trade or geographical reasons for this kind of pattern.

(2) Country 10 is isolated, meaning that there is no trade between it and the rest of the world!

The network reveals some subtle information very quickly and is a very good exploratory tool for trade data.

Saturday, September 15, 2012

Preferential attachment for network

I am currently taking the Networked Life course on Coursera.org offered by Professor Michael Kearns from the University of Pennsylvania. I have taken several courses, including machine learning and natural language processing, since the platform was launched late last year. I have to give my biggest praise to Andrew Ng and his team for creating such a platform and enabling knowledge sharing for those who would otherwise have no access to it.

Nevertheless, I want to talk about and demonstrate a type of simulated network model introduced in the course, called preferential attachment.

The idea of this particular network model is that new individuals are more likely to connect to others who already have many connections (a high degree); this is typical of social networks and many real-life situations. The more connections one has, the more exposed one is to the rest of the world, as the formula below makes precise.
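Formally (my notation, not the course's), a newcomer attaches to an existing individual i with probability proportional to i's current degree d_i:

P(attach to i) = d_i / sum_j d_j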

This model is able to mimic the small diameter, high clustering and heavy-tailed degree distribution of typical large networks.
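As a quick cross-check (my own addition, not part of the course), igraph ships a generator for exactly this model, so one can eyeball these properties without writing the simulation by hand:

## Generate a preferential attachment (Barabasi-Albert) network and
## inspect its degree distribution
library(igraph)
ba.g = barabasi.game(1000)
hist(degree(ba.g), breaks = 50,
     main = "Degree distribution of a preferential attachment graph")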

For brevity, we will just focus on the heavy tail and the power-law decay of the degree distribution, where a large number of individuals have a small number of connections while a small number of individuals have a far greater number of connections. Under a power law the frequency of degree k decays like k^(-gamma), which shows up as a straight line on a log-log plot.

The following simulation generates a preferential attachment network and plots the histogram and the log-log relationship. It is based on the example from the course in which m individuals choose between 1000 night clubs, each club being picked with probability proportional to its current popularity.



library(animation)
saveHTML({
  N = 1000  # number of night clubs
  for(i in 1:100){
    m = 100 * i  # number of individuals
    rlt = rep(1, N)
    ## Each individual picks a club with probability proportional to
    ## the club's current popularity
    for(j in 1:m){
      selection = sample(x = 1:N, size = 1, prob = rlt)
      rlt[selection] = rlt[selection] + 1
    }
    rlt = rlt - 1
    par(mfrow = c(1, 2))
    bin = hist(rlt, breaks = m/10,
      main = paste("Histogram of preferential network\n with ", m,
        " individuals choosing\nbetween 1000 clubs", sep = ""),
      xlab = "Number of Individuals", ylab = "Frequency")
    plot(log(bin$breaks[-1]), log(bin$counts),
         main = "log-log histogram", xlab = "", ylab = "")
    ## Fit a straight line through the non-empty, non-unit bins
    lm.df = data.frame(lbin = log(bin$breaks[-1]), lcounts = log(bin$counts))
    lm.df = lm.df[lm.df$lcounts != -Inf & lm.df$lcounts != 0, ]
    fit = lm(lcounts ~ lbin, data = lm.df)
    abline(coef = coef(fit))
  }
})



From the simulation, we can see that the histogram becomes more and more heavy-tailed as we increase the number of individuals/vertices in the network, but I remain sceptical about the linear relationship in the log-log plot, which is the usual indicator of power-law decay.
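One way to probe that scepticism (my own addition) is to fit on the empirical complementary CDF instead of binned counts, since the result then does not depend on the choice of histogram bins:

## Log-log plot of the empirical complementary CDF (a sketch; assumes
## rlt still holds the club popularities from the last iteration)
deg = sort(rlt[rlt > 0])
ccdf = 1 - (seq_along(deg) - 1)/length(deg)
plot(log(deg), log(ccdf), xlab = "log(degree)", ylab = "log(CCDF)")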

Thursday, September 13, 2012

Imputation by mean?

Today, I was briefed that when computing regional aggregates, such as those defined by the United Nations M49 country standard (http://unstats.un.org/unsd/methods/m49/m49regin.htm), I should use the regional mean to replace missing values.
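In other words, something like the following sketch (with entirely made-up numbers, purely to illustrate the procedure):

## Hypothetical data: two regions, with some countries missing
region = c("A", "A", "A", "B", "B")
value = c(10, NA, 30, 5, NA)

## Replace each missing value with its regional mean, then aggregate
regMean = ave(value, region, FUN = function(x) mean(x, na.rm = TRUE))
imputed = ifelse(is.na(value), regMean, value)
tapply(imputed, region, sum)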

I was sceptical about this approach based on the little knowledge I have about missing values, since the assumptions required by the method are extremely strong.

(1) The missing values have to be MCAR (missing completely at random), which is almost certainly violated here, since missing values are more likely to come from countries where the statistical office is not well established or less developed.

(2) The method also requires the data to be relatively symmetric; otherwise the mean will be a biased estimate of the missing values, as the toy example below illustrates.
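A minimal sketch with simulated right-skewed data (log-normal, chosen purely for illustration) shows the problem:

## With right-skewed data the mean sits well above the typical value,
## so mean imputation inflates the aggregate
set.seed(1)
x = rlnorm(1000)
mean(x)            # around exp(1/2), roughly 1.65
median(x)          # around 1
mean(x < mean(x))  # well over half of the data lie below the mean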

So I decided to do some checking: I downloaded some data from the World Bank (http://data.worldbank.org/) to see what the data look like.




## Load the WDI package, which provides the WDI() function for
## downloading World Bank data
library(WDI)

## Read the name file; let's just work with the first 100 variables
WDI = read.csv(file = "http://dl.dropbox.com/u/18161931/WorldBankIndicators.csv",
  stringsAsFactors = FALSE, nrows = 100)
WDI = WDI[-c(1:10), ]

## Download and merge the data. Some variables are not collected in 2010
## and thus they are discarded. (R still finds the WDI() function even
## though the data frame is also named WDI.)
WDI.df = WDI(indicator = WDI$WDI_NAME[1], start = 2010, end = 2010)
for(i in 2:NROW(WDI)){
  tmp = try(WDI(indicator = WDI$WDI_NAME[i], start = 2010, end = 2010))
  if(!inherits(tmp, "try-error") &&
     (sum(is.na(tmp[, WDI$WDI_NAME[i]])) != NROW(tmp)))
    WDI.df = merge(WDI.df, tmp, by = c("iso2c", "country", "year"))
}

## Produce histograms to examine the suitability of mean imputation
pdf(file = "dataDist.pdf")
for(i in 3:NCOL(WDI.df)){
  hist(WDI.df[, i], breaks = 100, main = colnames(WDI.df)[i], xlab = NULL)
  abline(v = mean(WDI.df[, i], na.rm = TRUE), col = "red")
  pctBelowMean = round(100 * sum(na.omit(WDI.df[, i]) <
    mean(WDI.df[, i], na.rm = TRUE))/length(na.omit(WDI.df[, i])), 2)
  legend("topright", legend = paste(pctBelowMean,
                       "% of data are below the mean", sep = ""))
}
graphics.off()


From the saved plots we can clearly see that a large number of variables are heavily skewed (typical of monetary and population-related data). In addition, we can see that the majority of the data lie far below the mean, and thus if the mean imputation method were used to compute the aggregates, we would end up with an estimate biased significantly upwards.