Fast count of number of Events for each Object with R data.tables


I have a number of objects that can be within a number of locations (where the number of locations is considerably smaller than the number of objects), and each object has a start and an end date. I also have a number of events, which have a location and a day on which they occur. For each object, I want to know the number of events that occurred at the same location during its stay (that is, events that occur between the object's start and end date).
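To make the counting rule concrete, here is a minimal brute-force sketch on made-up toy data. The two small tables and their values are hypothetical; only the column names follow the code used later in this post. It treats both the start and the end date as inclusive, as the comparison in the function below does. It is meant as a reference definition, not as a fast solution.

```r
# Hypothetical toy data: two objects and three events
objects <- data.frame(
  objectnumber = 1:2,
  location     = c("A", "B"),
  daystart     = as.Date(c("2010-01-01", "2010-03-01")),
  dayend       = as.Date(c("2010-01-31", "2010-03-31")),
  stringsAsFactors = FALSE
)
events <- data.frame(
  eventnumber = 1:3,
  location    = c("A", "A", "B"),
  dayevent    = as.Date(c("2010-01-15", "2010-02-15", "2010-03-01")),
  stringsAsFactors = FALSE
)

# For each object, count the events at the same location
# whose day lies within [daystart, dayend]
objects$numberevents <- sapply(seq_len(nrow(objects)), function(k) {
  sum(events$location == objects$location[k] &
      events$dayevent >= objects$daystart[k] &
      events$dayevent <= objects$dayend[k])
})

print(objects$numberevents)  # 1 1
```

Object 1 matches only the event on 2010-01-15; object 2 matches the event on its own start day.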

As I have several sets, and the number of objects ranges from 450,000 to 6 million, this task takes considerable time. So far, the fastest method I have found uses data.table. The function below shows an example in which the sizes can be varied.

    coupleeventobject <- function(sizeo=100, sizee=100){
      require(data.table)
      require(zoo)

      # create events
      events <- data.table(eventnumber = c(1:sizee),
                           location    = as.character(sample(c(1:floor(sizeo/10)), size=sizee, replace=TRUE)),
                           dayevent    = rand.day(day.start="2007-01-01",
                                                  day.end  ="2015-12-31",
                                                  size=sizee))

      # create objects
      objects <- data.table(objectnumber = c(1:sizeo),
                            location     = as.character(sample(c(1:floor(sizeo/10)), size=sizeo, replace=TRUE)),
                            day1 = rand.day(day.start="2007-01-01",
                                            day.end  ="2015-12-31",
                                            size=sizeo),
                            day2 = rand.day(day.start="2007-01-01",
                                            day.end  ="2015-12-31",
                                            size=sizeo))

      # order the two random days so that daystart <= dayend;
      # ifelse() drops the Date class, so convert back explicitly
      objects[, daystart := as.Date(ifelse(day1>day2, day2, day1), origin="1970-01-01")]
      objects[, dayend   := as.Date(ifelse(day1<day2, day2, day1), origin="1970-01-01")]
      objects[, c("day1","day2") := NULL]

      # set keys for the right coupling/counting
      setkey(objects, location, daystart, dayend)
      setkey(events, location, dayevent)

      # count the number of events per (daystart, dayend, location) group
      system.time(
        objects[, numberevents := events[location,][dayevent >= daystart & dayevent <= dayend, .N],
                by=list(daystart, dayend, location)]
      )
    }

    # draw `size` random days between day.start and day.end (with replacement)
    rand.day <- function(day.start, day.end, size) {
      dayseq <- seq.Date(as.Date(day.start), as.Date(day.end), by="day")
      dayselect <- sample(dayseq, size, replace=TRUE)
      return(dayselect)
    }

For 100 objects and 100 events, this code runs on my laptop within 0.3 seconds:

    > coupleeventobject()
       user  system elapsed
       0.30    0.00    0.29

But if I increase the number of objects, the processing time scales roughly linearly with it:

    > coupleeventobject(sizee=200,sizeo=6000)
       user  system elapsed
      15.11    0.00   15.26

So counting the number of events for 6 million objects costs about 4 hours, and I have to do this several times (for different kinds of location levels). Is there a way to speed this up? Any ideas are welcome!
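The 4-hour figure can be reproduced as a back-of-the-envelope extrapolation from the timing above. This is only a sketch: it assumes the run time keeps growing linearly with the number of objects and ignores the effect of the number of events, which was fixed at 200 in that run.

```r
# Timing reported above: 6,000 objects took 15.26 s elapsed
secs_per_object <- 15.26 / 6000

# Assumption: time stays proportional to the number of objects
est_hours <- secs_per_object * 6e6 / 3600

print(round(est_hours, 1))  # 4.2
```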

Here is another option. The main idea is to first join events and objects in such a way that every combination within a location exists, and then count the valid ones. As you will see, the new way runs far faster and gives the same results. By the way, you might have to change the random date generator, because I didn't have access to your original function.

    coupleeventobject1 <- function(sizeo=100, sizee=100){
      require(data.table)
      require(zoo)

      # create events
      events <- data.table(eventnumber = c(1:sizee),
                           location    = as.character(sample(c(1:floor(sizeo/10)), size=sizee, replace=TRUE)),
                           dayevent    = as.Date(as.integer(runif(sizee)*1000), origin="1970-01-01"))

      # create objects (day1/day2 need one value per object, so use sizeo here)
      objects <- data.table(objectnumber = c(1:sizeo),
                            location     = as.character(sample(c(1:floor(sizeo/10)), size=sizeo, replace=TRUE)),
                            day1 = as.Date(as.integer(runif(sizeo)*1000), origin="1970-01-01"),
                            day2 = as.Date(as.integer(runif(sizeo)*1000), origin="1970-01-01"))

      # order the two random days so that daystart <= dayend
      objects[, daystart := as.Date(ifelse(day1>day2, day2, day1), origin="1970-01-01")]
      objects[, dayend   := as.Date(ifelse(day1<day2, day2, day1), origin="1970-01-01")]
      objects[, c("day1","day2") := NULL]

      ## first method
      # set keys for the right coupling/counting
      setkey(objects, location, daystart, dayend)
      setkey(events, location, dayevent)

      # count the number of events
      cat("first method:")
      cat(system.time(
        res1 <- objects[, numberevents := events[location,][dayevent >= daystart & dayevent <= dayend, .N],
                        by=list(daystart, dayend, location)]
      ))
      cat("\n")

      ## second method
      # set keys for the right coupling/counting
      setkey(objects, location)
      setkey(events, location)

      # join every event to every object at the same location, then count the valid pairs
      cat("second method:")
      cat(system.time({
        oe <- objects[events, allow.cartesian=TRUE]
        res2 <- oe[, sum(dayevent >= daystart & dayevent <= dayend),
                   by=list(objectnumber, daystart, dayend, location)]
      }))
      cat("\n")

      # compare both results (V1 is the default name of the unnamed sum column)
      setkey(res1, objectnumber, location, daystart, dayend)
      setkey(res2, objectnumber, location, daystart, dayend)
      cat("compare values: ", nrow(res1[res2][numberevents != V1,]), " mismatches\n")

      return(list(res1=res1, res2=res2))
    }

And here are the results:

    xx <- coupleeventobject1(200,6000)
    first method:8.151 0.041 8.15 0 0
    second method:0.614 0.017 0.625 0 0
    compare values:  0  mismatches
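To see the join-then-count idea in isolation, here is a minimal sketch on hypothetical toy data. The three objects and three events are made up; the column names and the two steps mirror the second method above.

```r
library(data.table)

# Hypothetical toy data, using the column names from the functions above
objects <- data.table(
  objectnumber = 1:3,
  location     = c("A", "A", "B"),
  daystart     = as.Date(c("2010-01-01", "2010-02-01", "2010-03-01")),
  dayend       = as.Date(c("2010-01-31", "2010-02-28", "2010-03-31"))
)
events <- data.table(
  eventnumber = 1:3,
  location    = c("A", "A", "B"),
  dayevent    = as.Date(c("2010-01-15", "2010-02-15", "2010-04-10"))
)

setkey(objects, location)
setkey(events, location)

# Step 1: pair every event with every object at the same location
oe <- objects[events, allow.cartesian = TRUE]

# Step 2: per object, count the pairs whose event day falls inside the stay
res <- oe[, .(numberevents = sum(dayevent >= daystart & dayevent <= dayend)),
          by = .(objectnumber, location, daystart, dayend)]

print(res)
```

One thing to keep in mind with this approach: an object whose location has no events at all never appears in the joined table, so it is missing from the result rather than listed with a count of 0. The intermediate table also has one row per object/event pair within a location, so its size can grow quickly when many objects and events share a location.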
