I received my copy of "La Cucina - The regional cooking of Italy" from Accademia Italiana Della Cucina today and I can't wait to dig into all 2000 Italian regional recipes in the book!

I have always believed that everything has a mathematical form, and they are every where in our life.

Take cooking for example, a recipe is a model, ingredients are like data, and the chef turns the ingredients into magnificent dishes just like how a statistician would turn data into useful insight.

Furthermore vegetables, meat and all other inputs just like data are not always the same. Time-series data are subject to sampling frequency, and surveys are prone to non-response. Ingredients can also be good or bad and have a distribution they follow.

A bad cook like me would have blindly follow the recipe and make a terrible dish. On the other hand, a good chef would analyse the ingredients by looking, smelling and even taste it to determine the state of the ingredients just like a good analyst would plot the data to grasp the feeling of the data.

The only difference is that a statistician would analyse all these information with a computer, while the chef analyse all these in his personal computer (brain and senses).

I wonder would it be possible one day that when we can measure all the aspects of taste, a terrible cook like me can turn into a Master chef!

## Monday, October 8, 2012

## Wednesday, October 3, 2012

### Perculiar behaviour of the sum function

The

do.call(myFUN, list(x = x))

Where

do.call(myFUN, list(x = x, w = w))

What turns out is that this line of code works find for all the summary statistics except the

mean(rep(NA, 10), na.rm = TRUE)

median(rep(NA, 10), na.rm = TRUE)

While the sum function returns zero. It puzzles me how you get zero when NA stands for not available and this is like creating something out of nothing. This is a problem for me since if I want to sum up multiple time series with missing values, I want the function to remove NA and compute where there are partial data while returning NA instead of zero when there are no data at all.

Nevertheless, a simple solution exists and thanks to the active R community. This post on R help addresses this problem and solve in an elegant manner.

sum(x, na.rm = any(!is.na(x)))

"The computations and the software for data analysis should be trustworthy" - John Chamber, Software for Data Analysis

I am not sure about the reasoning underlay the behaviour of sum, but it should be consistent so people can trust it and use it as what they expect.

*sum*function in R is a special one in contrast to other summary statistics functions such as*mean*and*median*. The first distinguish is that it is a**Primitive**function where the others are not (Although you can call*mean*using*.Internal*). This causes many inconsistency and unexpected behaviours.__(1) Inconsistency in argument__**For example, the arguments are inconsistent. Both**

*mean*and*median*takes the argument x, while the*sum*operates on whatever argument that is not matched. This can be a problem in the case when you want to write a function which switches between all the summary functions such as:do.call(myFUN, list(x = x))

Where

**myFun**can be any statistical summary function. The problem first arises when I wanted to write a function which encompasses several different summary statistics and so I can switch between them when required. The main problem arises when I have to pass additional arguments such as the "weight" in the*weighted.mean*function. I wrote the following call and naively hope it would workdo.call(myFUN, list(x = x, w = w))

What turns out is that this line of code works find for all the summary statistics except the

*sum*function where the "weight" is also summed. So my current solution is just to use the*switch*function which is not my favourite function.__(2) Inconsistency in output__**Another inconsistency arises in how the NA's are treated. In the**

*mean*,*median*and*weighted.mean*summaries; if all the observations are NA then either NA or NaN are returned.mean(rep(NA, 10), na.rm = TRUE)

median(rep(NA, 10), na.rm = TRUE)

While the sum function returns zero. It puzzles me how you get zero when NA stands for not available and this is like creating something out of nothing. This is a problem for me since if I want to sum up multiple time series with missing values, I want the function to remove NA and compute where there are partial data while returning NA instead of zero when there are no data at all.

Nevertheless, a simple solution exists and thanks to the active R community. This post on R help addresses this problem and solve in an elegant manner.

sum(x, na.rm = any(!is.na(x)))

"The computations and the software for data analysis should be trustworthy" - John Chamber, Software for Data Analysis

I am not sure about the reasoning underlay the behaviour of sum, but it should be consistent so people can trust it and use it as what they expect.

Subscribe to:
Posts (Atom)