Create a vector in R by summing rows based on multiple criteria -


i have financial data in 15 minute intervals, want convert intervals 15 minutes 30 minutes before conduct rest of analysis. such, sum traded volumes 2 adjacent 15 minute intervals , take closing price of second 15 minute sub-interval (ie end of 30 minute period).

i have shown below example of data (df) , desired output (df.30min) using sapply function. works fine example below, given analysing 10 years of daily data 50 companies , 27 intervals per day processing time excessive, 1 year of data. have similar issues if try loop.

i new r hoping there easy solution using 1 of built in functions.

in actual dataset there 27 x 15 minute intervals (10:00-16:45). final "30 minute" dataset have 1 15 minute interval 13:30-13:45. also, there may other anomalies stock exchange opened late / closed or stock put on trading halt partway through day. (i have managed map times in data correct interval using lookup table match function.) given imperfect structure of data after solution not reliant on complete set , number of 15 minute intervals. in excel use sumifs function.

set.seed(1) df <- data.frame(   company = rep(c("co a", "co b", "co c"), each = 8),   date = as.date(rep(c("2005-01-01", "2005-01-02"), times = 3, each = 4)),   time = as.factor(c("10:00:00", "10:15:00", "10:30:00", "10:45:00")),   interval = as.factor(c(1,1,2,2)),   interval.end = as.factor(c(0,1)),   close = abs(round(rnorm(24),1))*10+100,   volume = abs(round(rnorm(24),1))*10)  > df    company       date     time interval interval.end close volume 1     co 2005-01-01 10:00:00        1            0   106      6 2     co 2005-01-01 10:15:00        1            1   102      1 3     co 2005-01-01 10:30:00        2            0   108      2 4     co 2005-01-01 10:45:00        2            1   116     15 5     co 2005-01-02 10:00:00        1            0   103      5 6     co 2005-01-02 10:15:00        1            1   108      4 7     co 2005-01-02 10:30:00        2            0   105     14 8     co 2005-01-02 10:45:00        2            1   107      1 9     co b 2005-01-01 10:00:00        1            0   106      4 10    co b 2005-01-01 10:15:00        1            1   103      1 11    co b 2005-01-01 10:30:00        2            0   115     14 12    co b 2005-01-01 10:45:00        2            1   104      4 13    co b 2005-01-02 10:00:00        1            0   106      4 14    co b 2005-01-02 10:15:00        1            1   122      1 15    co b 2005-01-02 10:30:00        2            0   111     11 16    co b 2005-01-02 10:45:00        2            1   100      8 17    co c 2005-01-01 10:00:00        1            0   100      2 18    co c 2005-01-01 10:15:00        1            1   109      3 19    co c 2005-01-01 10:30:00        2            0   108      7 20    co c 2005-01-01 10:45:00        2            1   106      6 21    co c 2005-01-02 10:00:00        1            0   109      7 22    co c 2005-01-02 10:15:00        1            1   108      7 23    co c 2005-01-02 10:30:00        2            0   101      4 24    co c 2005-01-02 10:45:00        2            1   120      8  df.30min <- df[-which(df$interval.end == 0),]  df.30min$volume <-sapply(seq_len(nrow(df.30min)),             function(i) sum(df$volume[df$company == df.30min$company[i] &                                         df$date == df.30min$date[i] &                                         df$interval == df.30min$interval[i]]))  > df.30min    company       date     time interval interval.end close volume 2     co 2005-01-01 10:15:00        1            1   102      7 4     co 2005-01-01 10:45:00        2            1   116     17 6     co 2005-01-02 10:15:00        1            1   108      9 8     co 2005-01-02 10:45:00        2            1   107     15 10    co b 2005-01-01 10:15:00        1            1   103      5 12    co b 2005-01-01 10:45:00        2            1   104     18 14    co b 2005-01-02 10:15:00        1            1   122      5 16    co b 2005-01-02 10:45:00        2            1   100     19 18    co c 2005-01-01 10:15:00        1            1   109      5 20    co c 2005-01-01 10:45:00        2            1   106     13 22    co c 2005-01-02 10:15:00        1            1   108     14 24    co c 2005-01-02 10:45:00        2            1   120     12 

using library dplyr, can try this:

library(dplyr) df %>% arrange(company, date, time, interval, interval.end) %>% group_by(company, date, interval) %>% summarise(time = time[2], interval.end = interval.end[2], close = close[2], volume = sum(volume)) source: local data frame [12 x 7] groups: company, date [?]     company       date interval     time interval.end close volume     (fctr)     (date)   (fctr)   (fctr)       (fctr) (dbl)  (dbl) 1     co 2005-01-01        1 10:15:00            1   102      7 2     co 2005-01-01        2 10:45:00            1   116     17 3     co 2005-01-02        1 10:15:00            1   108      9 4     co 2005-01-02        2 10:45:00            1   107     15 5     co b 2005-01-01        1 10:15:00            1   103      5 6     co b 2005-01-01        2 10:45:00            1   104     18 7     co b 2005-01-02        1 10:15:00            1   122      5 8     co b 2005-01-02        2 10:45:00            1   100     19 9     co c 2005-01-01        1 10:15:00            1   109      5 10    co c 2005-01-01        2 10:45:00            1   106     13 11    co c 2005-01-02        1 10:15:00            1   108     14 12    co c 2005-01-02        2 10:45:00            1   120     12 

if data frame arranged properly, can rid of arrange part above.

note: assuming there 2 intervals (0, 1) , therefore using hardcoded value of 2. if not case, can use proper subsetting.


Comments

Popular posts from this blog

get url and add instance to a model with prefilled foreign key :django admin -

css - Make div keyboard-scrollable in jQuery Mobile? -

ruby on rails - Seeing duplicate requests handled with Unicorn -