## Tuesday, February 5, 2013

### Relearn boxplot and label the outliers

Despite the fact that box plot is used almost every where and taught at undergraduate statistic classes, I recently had to re-learn the box plot in order to know how to label the outliers.

This stackoverflow post was where I found how the outliers and whiskers of the Tukey box plots are defined in R and ggplot2:
In ggplot2, what do the end of the boxplot lines represent?

and this post on how to label the outliers using base graphics.
How to label all the outliers in a boxplot

Since the use of ggplot2 is required for this task, I have written some basic hack code to label the outliers for ggplot2.

Here are the codes:

## Install the FAOSTAT package to obtain the data
if(!is.element("FAOSTAT", .packages()))
install.packages("FAOSTAT")
library(FAOSTAT)

cp.lst = getFAOtoSYB(name = "cassava_production", domainCode = "QC",
itemCode = 125, elementCode = 5510)

## Use the country level data, and take only data for 2011 and remove the NA's
cp.df = cp.lst$entity[!is.na(cp.lst$entity$cassava_production) & cp.lst$entity\$Year == 2011, ]

## Merge with the country profile to obtain the country names for labelling
ccp.df = merge(cp.df, FAOcountryProfile[, c("FAOST_CODE", "ABBR_FAO_NAME")],
all.x = TRUE)

## Merge with the regional pofile to obtain the UNSD M49 macro region
## composition for multiple boxplot.
rcp.df = merge(ccp.df, FAOregionProfile[, c("FAOST_CODE", "UNSD_MACRO_REG")],
all.x = TRUE)

## Compute the quantile
qrcp.df = ddply(.data = rcp.df, .variables = .(UNSD_MACRO_REG), transform,
lQntl = quantile(cassava_production, probs = 0.25, na.rm = TRUE),
uQntl = quantile(cassava_production, probs = 0.75, na.rm = TRUE))

## Compute the lower and upper bound which defines the outlier
brcp.df = ddply(.data = qrcp.df, .variables = .(UNSD_MACRO_REG), transform,
lBound = lQntl - 1.5 * (uQntl - lQntl),
uBound = uQntl + 1.5 * (uQntl - lQntl))

## Remove the country names if it is within the bounds
with(brcp.df, {
brcp.df[cassava_production <= uBound &
cassava_production >= lBound, "ABBR_FAO_NAME"] <<- NA
})

## Plot the data
set.seed(587)
ggplot(data = brcp.df, aes(x = UNSD_MACRO_REG, y = cassava_production)) +
geom_boxplot(outlier.colour = NA) +
geom_text(aes(label = ABBR_FAO_NAME), size = 2,
position = position_jitter(width = 0.1)) +
labs(x = NULL, y = NULL, title = "Production of Cassava by region")


Here is the final product, to avoid over-plotting of texts I have used position_jitter. Which is not an elegant solution but I just can not find any algorithm that works well in general.