Thursday, August 2, 2012

Units and metadata

Handling metadata is not natural in R, or in any traditional rectangular data storage system.

Several tricks and packages attempt to solve this problem: Hmisc uses R's attribute feature, while the IRanges package has its own DataFrame class.

Hmisc allows one to store metadata such as units, labels and comments:

library(Hmisc)

## Create a test data frame
test.df <- data.frame(x = ts(1:12, start = c(2000, 1), frequency = 12),
                      y = ts(1:12, start = c(2001, 1), frequency = 12))

## Assign the units and comment
units(test.df$x) <- "cm"
units(test.df$y) <- "m"
comment(test.df) <- "this is a test data set"

## Summary of the data
describe(test.df)
contents(test.df)

The disadvantage of this approach is that the metadata is lost when functions such as subset are used.

## Select an existing column (the data frame has x and y, no column a)
str(subset(test.df, select = x, drop = FALSE))

This restricts the approach to storage rather than manipulation.
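The underlying behaviour can be sketched in base R alone. The snippet below is only an illustration: it uses a plain attribute named "units" set with attr() rather than Hmisc's own machinery, and keep_units is a hypothetical helper, not something either package provides. It shows that `[` drops the attribute and that it has to be copied back by hand.

```r
## A plain vector carrying a "units" attribute (set with base R's attr)
x <- 1:12
attr(x, "units") <- "cm"

## Subsetting with `[` keeps names but drops other attributes
x.sub <- x[1:3]
is.null(attr(x.sub, "units"))

## Hypothetical helper: subset, then copy the attribute back by hand
keep_units <- function(v, i) {
  out <- v[i]
  attr(out, "units") <- attr(v, "units")
  out
}
x.kept <- keep_units(x, 1:3)
attr(x.kept, "units")
```

Every function that manipulates the data would need this kind of wrapper, which is why attribute-based metadata tends to work for storage only.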

The second approach, from the IRanges package, creates a whole new S4 class for handling data with metadata; with the corresponding accessor functions, the attributes can be retained.



library(IRanges)
test2.df <- DataFrame(x = 1:10, y = letters[1:10])
## Key the units by the actual column names (x and y)
metadata(test2.df) <- list(units = list(x = "cm", y = "m"))



str(subset(test2.df, select = x))


In this case the units are still preserved; however, the subset function does not subset the metadata along with the columns, so the metadata can end up describing columns that are no longer present, which can cause problems.

In short, there is definitely room for improvement. Writing a new class is more natural and gives both the developer and the user more control.
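As a rough idea of what such a class could look like, here is a minimal S3 sketch in base R. Everything in it is hypothetical and made up for illustration (the class name unit_df, the constructor, the accessor units_of and the function select_cols); it is not the design of Hmisc or IRanges. The point is simply that the units are stored per column and are subset together with the columns.

```r
## Hypothetical constructor: a data frame plus one unit per column
unit_df <- function(df, units) {
  stopifnot(is.data.frame(df), is.character(units),
            setequal(names(units), names(df)))
  attr(df, "units") <- units[names(df)]   # store in column order
  class(df) <- c("unit_df", class(df))
  df
}

## Hypothetical accessor for the stored units
units_of <- function(x) attr(x, "units")

## Hypothetical column selection that subsets the units as well
select_cols <- function(x, cols) {
  u <- units_of(x)
  class(x) <- setdiff(class(x), "unit_df")   # fall back to a plain data frame
  out <- x[, cols, drop = FALSE]
  attr(out, "units") <- u[names(out)]        # keep units of kept columns only
  class(out) <- c("unit_df", class(out))
  out
}

## Usage
d <- unit_df(data.frame(x = 1:3, y = c(0.1, 0.2, 0.3)),
             units = c(x = "cm", y = "m"))
s <- select_cols(d, "x")
units_of(s)
```

A fuller version would presumably define methods for `[` and subset so that existing code keeps working, but even this small sketch keeps the data and its units in step.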