Tuesday, April 16, 2013

Pre-allocating a DataFrame in R

Anyone who has ever tried to load a few thousand rows of data into an R dataframe of a couple hundred columns will have learned the hard way that the storage space should be allocated in advance.

Normally this is not a problem. The columns are initialized with empty vectors sized to the number of rows expected:

n <- 100
df <- data.frame( x=numeric(n), y=character(n) )
for ( i in 1:n ) {
  df[i,] = list(...)
}

R dataframes act a little funny with time series, though. When storing a time series in a dataset, the rows represent the data points in a time series (or attributes), while the column represents the time series itself (or entity). Thus, the two time series
  1 3 5 7 9
  8 2 5 1 4
should be stored in an R data frame as
  1 8
  3 2
  5 5
  7 1
  9 4
...i.e. the transpose of how data is normally stored in R dataframes (rows being the entity, columns being the attributes). This is mostly due to an assumption in tools like ggplot: the analysis or visualization is performed on the values of an attribute (column) in a set of entities (rows).

This poses a problem when dynamically allocating a dataframe for time series: the number of columns is not known in advance, while the number of rows often is (e.g. in DSP samples).

The solution is to create a list of columns, then pass the list to the data.frame() constructor:

ts.allocate.dataframe <- function(num_ts, ts_size=) {
        # create a list of numeric vectors
        cols = lapply(1:num_ts, function(x) numeric(ts_size))
        data.frame(index=1:ts_size, 
                   # initialize a column of timestamps to now()
                   timestamp=as.POSIXct(1:ts_size, origin=Sys.time()),
                   # add the columns for the time series
                   as.data.frame(cols))


When filling the dataframe, be sure to set the column name when inserting the data:

# ... build lists ts_data and ts_names ...

df.ts <- ts.allocate.dataframe(length(ts_data[[1]]), 
                                  length(ts_data) )

for ( i in 1:length(ts_data) ) {
    # set column i+2 to ts_data[i] contents
    # note that the first two columns in the dataframe 
    #      are 'index and 'timestamp'
    df.ts[,i+2] <- ts_data[[i]]
    # set column name to ts_names[i]
    names(df.ts)[[i+2]] <- ts_names[[i]]
}