The *sum* function in R is a special one compared with other summary statistics functions such as *mean* and *median*. The first difference is that it is a **Primitive** function whereas the others are not (although you can call *mean* using *.Internal*). This causes many inconsistencies and unexpected behaviours.
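The difference can be checked directly in base R. The following is only a minimal sketch; the variable names are illustrative and not part of any API.

```r
# sum is a primitive; mean and median are ordinary R closures.
is_prim <- c(
  sum    = is.primitive(sum),
  mean   = is.primitive(mean),
  median = is.primitive(median)
)
print(is_prim)

# formals() returns NULL for a primitive, so go through args() instead.
sum_args  <- names(formals(args(sum)))   # "..."  "na.rm"
mean_args <- names(formals(mean))        # "x"    "..."
print(sum_args)
print(mean_args)
```

Note that *sum* has no formal argument called x at all, which is the root of the first inconsistency described next.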

__(1) Inconsistency in argument__

For example, the arguments are inconsistent. Both *mean* and *median* take the argument x, while *sum* operates on every argument that is not matched. This can be a problem when you want to write a function which switches between all the summary functions, such as:

do.call(myFUN, list(x = x))

Here **myFUN** can be any statistical summary function. The problem first arose when I wanted to write a function which encompasses several different summary statistics, so that I could switch between them when required. The main problem arises when I have to pass additional arguments, such as the "weight" in the *weighted.mean* function. I wrote the following call and naively hoped it would work:

do.call(myFUN, list(x = x, w = w))

It turns out that this line of code works fine for all the summary statistics except the *sum* function, where the "weight" is also summed. So my current solution is just to use the *switch* function, which is not my favourite function.
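To make the problem concrete, here is a small sketch with made-up data. The helper `summarise_with` is a hypothetical name used only for illustration; it is one possible shape of a *switch*-based workaround, not necessarily the one used in the original code.

```r
x <- c(1, 2, 3)
w <- c(0.2, 0.3, 0.5)

# sum() has no formals named x or w, so both vectors fall into `...`
# and are added together: (1 + 2 + 3) + (0.2 + 0.3 + 0.5) = 7.
naive_sum <- do.call(sum, list(x = x, w = w))
print(naive_sum)

# Hypothetical helper: dispatch on the function name so that each
# summary function only receives the arguments it understands.
summarise_with <- function(fun_name, x, w = NULL) {
  switch(fun_name,
    sum           = sum(x),
    mean          = mean(x),
    median        = median(x),
    weighted.mean = weighted.mean(x, w),
    stop("unknown summary function: ", fun_name)
  )
}

print(summarise_with("sum", x, w))            # weights are ignored here
print(summarise_with("weighted.mean", x, w))  # weights are used here
```

The cost of this approach is that every supported function has to be listed by hand, which is presumably why *switch* feels unsatisfying here.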

__(2) Inconsistency in output__

Another inconsistency arises in how NAs are treated. In the *mean*, *median* and *weighted.mean* summaries, if all the observations are NA then either NA or NaN is returned.

mean(rep(NA, 10), na.rm = TRUE)

median(rep(NA, 10), na.rm = TRUE)

The sum function, however, returns zero in the same situation. It puzzles me how you get zero when NA stands for not available; this is like creating something out of nothing. This is a problem for me because when I sum multiple time series with missing values, I want the function to remove the NAs and compute wherever there is partial data, while returning NA instead of zero when there is no data at all.

Nevertheless, a simple solution exists, thanks to the active R community. This post on R-help addresses the problem and solves it in an elegant manner.

sum(x, na.rm = any(!is.na(x)))
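As a sketch of how this idiom might be applied to the time series case described above, the snippet below wraps it in a helper and applies it row by row. The name `sum_or_na` and the toy matrix are assumptions made for illustration only.

```r
# Hypothetical wrapper around the R-help idiom: drop NAs only when
# at least one observation is present, otherwise let NA propagate.
sum_or_na <- function(x) sum(x, na.rm = any(!is.na(x)))

# Two toy series observed at three time points (rows).
series <- cbind(a = c(1, NA, NA),
                b = c(2, 3,  NA))

partial_aware <- apply(series, 1, sum_or_na)      # 3 3 NA
plain_na_rm   <- rowSums(series, na.rm = TRUE)    # 3 3 0

print(partial_aware)
print(plain_na_rm)
```

The last row has no data at all, so the wrapper returns NA there, whereas the plain na.rm = TRUE version reports zero.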

"The computations and the software for data analysis should be trustworthy" - John Chambers, Software for Data Analysis

I am not sure about the reasoning underlying the behaviour of sum, but it should be consistent so that people can trust it and use it as they expect.
