Create a vector in R by summing rows based on multiple criteria -
i have financial data in 15 minute intervals, want convert intervals 15 minutes 30 minutes before conduct rest of analysis. such, sum traded volumes 2 adjacent 15 minute intervals , take closing price of second 15 minute sub-interval (ie end of 30 minute period).
i have shown below example of data (df) , desired output (df.30min) using sapply function. works fine example below, given analysing 10 years of daily data 50 companies , 27 intervals per day processing time excessive, 1 year of data. have similar issues if try loop.
i new r hoping there easy solution using 1 of built in functions.
in actual dataset there 27 x 15 minute intervals (10:00-16:45). final "30 minute" dataset have 1 15 minute interval 13:30-13:45. also, there may other anomalies stock exchange opened late / closed or stock put on trading halt partway through day. (i have managed map times in data correct interval using lookup table match function.) given imperfect structure of data after solution not reliant on complete set , number of 15 minute intervals. in excel use sumifs function.
set.seed(1) df <- data.frame( company = rep(c("co a", "co b", "co c"), each = 8), date = as.date(rep(c("2005-01-01", "2005-01-02"), times = 3, each = 4)), time = as.factor(c("10:00:00", "10:15:00", "10:30:00", "10:45:00")), interval = as.factor(c(1,1,2,2)), interval.end = as.factor(c(0,1)), close = abs(round(rnorm(24),1))*10+100, volume = abs(round(rnorm(24),1))*10) > df company date time interval interval.end close volume 1 co 2005-01-01 10:00:00 1 0 106 6 2 co 2005-01-01 10:15:00 1 1 102 1 3 co 2005-01-01 10:30:00 2 0 108 2 4 co 2005-01-01 10:45:00 2 1 116 15 5 co 2005-01-02 10:00:00 1 0 103 5 6 co 2005-01-02 10:15:00 1 1 108 4 7 co 2005-01-02 10:30:00 2 0 105 14 8 co 2005-01-02 10:45:00 2 1 107 1 9 co b 2005-01-01 10:00:00 1 0 106 4 10 co b 2005-01-01 10:15:00 1 1 103 1 11 co b 2005-01-01 10:30:00 2 0 115 14 12 co b 2005-01-01 10:45:00 2 1 104 4 13 co b 2005-01-02 10:00:00 1 0 106 4 14 co b 2005-01-02 10:15:00 1 1 122 1 15 co b 2005-01-02 10:30:00 2 0 111 11 16 co b 2005-01-02 10:45:00 2 1 100 8 17 co c 2005-01-01 10:00:00 1 0 100 2 18 co c 2005-01-01 10:15:00 1 1 109 3 19 co c 2005-01-01 10:30:00 2 0 108 7 20 co c 2005-01-01 10:45:00 2 1 106 6 21 co c 2005-01-02 10:00:00 1 0 109 7 22 co c 2005-01-02 10:15:00 1 1 108 7 23 co c 2005-01-02 10:30:00 2 0 101 4 24 co c 2005-01-02 10:45:00 2 1 120 8 df.30min <- df[-which(df$interval.end == 0),] df.30min$volume <-sapply(seq_len(nrow(df.30min)), function(i) sum(df$volume[df$company == df.30min$company[i] & df$date == df.30min$date[i] & df$interval == df.30min$interval[i]])) > df.30min company date time interval interval.end close volume 2 co 2005-01-01 10:15:00 1 1 102 7 4 co 2005-01-01 10:45:00 2 1 116 17 6 co 2005-01-02 10:15:00 1 1 108 9 8 co 2005-01-02 10:45:00 2 1 107 15 10 co b 2005-01-01 10:15:00 1 1 103 5 12 co b 2005-01-01 10:45:00 2 1 104 18 14 co b 2005-01-02 10:15:00 1 1 122 5 16 co b 2005-01-02 10:45:00 2 1 100 19 18 co c 2005-01-01 10:15:00 1 1 109 5 20 co c 2005-01-01 10:45:00 2 1 106 13 22 co c 2005-01-02 10:15:00 1 1 108 14 24 co c 2005-01-02 10:45:00 2 1 120 12
using library dplyr
, can try this:
library(dplyr) df %>% arrange(company, date, time, interval, interval.end) %>% group_by(company, date, interval) %>% summarise(time = time[2], interval.end = interval.end[2], close = close[2], volume = sum(volume)) source: local data frame [12 x 7] groups: company, date [?] company date interval time interval.end close volume (fctr) (date) (fctr) (fctr) (fctr) (dbl) (dbl) 1 co 2005-01-01 1 10:15:00 1 102 7 2 co 2005-01-01 2 10:45:00 1 116 17 3 co 2005-01-02 1 10:15:00 1 108 9 4 co 2005-01-02 2 10:45:00 1 107 15 5 co b 2005-01-01 1 10:15:00 1 103 5 6 co b 2005-01-01 2 10:45:00 1 104 18 7 co b 2005-01-02 1 10:15:00 1 122 5 8 co b 2005-01-02 2 10:45:00 1 100 19 9 co c 2005-01-01 1 10:15:00 1 109 5 10 co c 2005-01-01 2 10:45:00 1 106 13 11 co c 2005-01-02 1 10:15:00 1 108 14 12 co c 2005-01-02 2 10:45:00 1 120 12
if data frame arranged properly, can rid of arrange
part above.
note: assuming there 2 intervals (0, 1) , therefore using hardcoded value of 2
. if not case, can use proper subsetting.
Comments
Post a Comment