Fast count of number of Events for each Object with R data.tables
I have a number of objects that can be within a number of locations (where the number of locations is considerably smaller than the number of objects), and each object has a start and end date. I also have a number of events, each of which has a location and a day on which it occurs. I want to know, for each object, the number of events that occurred at the same location during its stay (that is, events that occur between the object's start and end date).
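To make the counting rule concrete, here is a minimal sketch in base R on made-up data. The values and the helper name `count_events` are hypothetical, and the loop only spells out the definition; it is not meant as a fast method.

```r
# Hypothetical toy data: two objects and three events
objects <- data.frame(
  objectnumber = 1:2,
  location     = c("A", "B"),
  daystart     = as.Date(c("2010-01-01", "2010-02-01")),
  dayend       = as.Date(c("2010-01-31", "2010-02-28")),
  stringsAsFactors = FALSE
)
events <- data.frame(
  eventnumber = 1:3,
  location    = c("A", "A", "B"),
  dayevent    = as.Date(c("2010-01-15", "2010-02-15", "2010-01-15")),
  stringsAsFactors = FALSE
)

# An event counts for object i only if it is at the same location
# AND its day lies within [daystart, dayend] of that object
count_events <- function(i) {
  sum(events$location == objects$location[i] &
      events$dayevent >= objects$daystart[i] &
      events$dayevent <= objects$dayend[i])
}

objects$numberevents <- vapply(seq_len(nrow(objects)), count_events, integer(1))
print(objects)
```

Object 1 gets one event (the second event at A falls outside its stay), and object 2 gets none (the only event at B happens before its stay begins).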
As I have several sets, and the number of objects ranges from 450,000 to 6 million, the task takes considerable time. So far, the fastest method I have found uses data.table. The function below shows an example in which the sizes can be varied.
    coupleeventobject <- function(sizeo = 100, sizee = 100) {
      require(data.table)
      require(zoo)  # zoo's as.Date() accepts plain numbers without an origin

      # create events
      events <- data.table(
        eventnumber = 1:sizee,
        location    = as.character(sample(1:floor(sizeo / 10), size = sizee, replace = TRUE)),
        dayevent    = rand.day(day.start = "2007-01-01", day.end = "2015-12-31", size = sizee))

      # create objects
      objects <- data.table(
        objectnumber = 1:sizeo,
        location     = as.character(sample(1:floor(sizeo / 10), size = sizeo, replace = TRUE)),
        day1         = rand.day(day.start = "2007-01-01", day.end = "2015-12-31", size = sizeo),
        day2         = rand.day(day.start = "2007-01-01", day.end = "2015-12-31", size = sizeo))
      # the earlier of the two days becomes the start, the later one the end
      objects[, daystart := as.Date(ifelse(day1 > day2, day2, day1))]
      objects[, dayend := as.Date(ifelse(day1 < day2, day2, day1))]
      objects[, c("day1", "day2") := NULL]

      # set keys for the coupling/counting
      setkey(objects, location, daystart, dayend)
      setkey(events, location, dayevent)

      # count the number of events per (daystart, dayend, location) group
      system.time(
        objects[, numberevents := events[location, ][dayevent >= daystart & dayevent <= dayend, .N],
                by = list(daystart, dayend, location)]
      )
    }

    # draw `size` random days between day.start and day.end (with replacement)
    rand.day <- function(day.start, day.end, size) {
      dayseq <- seq.Date(as.Date(day.start), as.Date(day.end), by = "day")
      dayselect <- sample(dayseq, size, replace = TRUE)
      return(dayselect)
    }
For 100 objects and 100 events, this code runs on my laptop within 0.3 seconds:

    > coupleeventobject()
       user  system elapsed
       0.30    0.00    0.29

But if I increase the number of objects, the processing time scales linearly:

    > coupleeventobject(sizee=200,sizeo=6000)
       user  system elapsed
      15.11    0.00   15.26

So counting the number of events for 6 million objects takes 4 hours, and I have to do this several times (for different kinds of location levels, among others). Is there a way to speed this up? Any ideas are welcome!
Here is an option. The main idea is to first join events and objects in such a way that every combination at the same location exists, and then count the valid ones. As you will see, the new way runs far faster and gives the same results. By the way, you might have to change the random date generator, because I didn't have access to the original function.
    coupleeventobject1 <- function(sizeo = 100, sizee = 100) {
      require(data.table)
      require(zoo)  # zoo's as.Date() accepts plain numbers without an origin

      # create events
      events <- data.table(
        eventnumber = 1:sizee,
        location    = as.character(sample(1:floor(sizeo / 10), size = sizee, replace = TRUE)),
        dayevent    = as.Date(as.integer(runif(sizee) * 1000)))

      # create objects (day1/day2 need one value per object, hence sizeo)
      objects <- data.table(
        objectnumber = 1:sizeo,
        location     = as.character(sample(1:floor(sizeo / 10), size = sizeo, replace = TRUE)),
        day1         = as.Date(as.integer(runif(sizeo) * 1000)),
        day2         = as.Date(as.integer(runif(sizeo) * 1000)))
      objects[, daystart := as.Date(ifelse(day1 > day2, day2, day1))]
      objects[, dayend := as.Date(ifelse(day1 < day2, day2, day1))]
      objects[, c("day1", "day2") := NULL]

      # first method: set keys for the coupling/counting
      setkey(objects, location, daystart, dayend)
      setkey(events, location, dayevent)

      cat("first method:")
      cat(system.time(
        res1 <- objects[, numberevents := events[location, ][dayevent >= daystart & dayevent <= dayend, .N],
                        by = list(daystart, dayend, location)]
      ))
      cat("\n")

      # second method: key both tables on location only
      setkey(objects, location)
      setkey(events, location)

      cat("second method:")
      cat(system.time({
        # pair every event with every object at the same location ...
        oe <- objects[events, allow.cartesian = TRUE]
        # ... then count the events that fall inside each object's stay
        res2 <- oe[, sum(dayevent >= daystart & dayevent <= dayend),
                   by = list(objectnumber, daystart, dayend, location)]
      }))
      cat("\n")

      # compare the two results (V1 is the default name of the summed column)
      setkey(res1, objectnumber, location, daystart, dayend)
      setkey(res2, objectnumber, location, daystart, dayend)
      cat("compare values: ", nrow(res1[res2][numberevents != V1, ]), " mismatches\n")

      return(list(res1 = res1, res2 = res2))
    }
And here are the results:
    xx <- coupleeventobject1(200,6000)
    first method:8.151 0.041 8.15 0 0
    second method:0.614 0.017 0.625 0 0
    compare values: 0 mismatches
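The join-then-count idea can also be followed step by step on a tiny hand-made data set. This is only a sketch: the toy values are hypothetical, and the last step (filling in zeros for objects whose location has no events at all) is an addition that is not part of the function above.

```r
library(data.table)

# Hypothetical toy data: three objects and four events
objects <- data.table(
  objectnumber = 1:3,
  location     = c("A", "A", "B"),
  daystart     = as.Date(c("2010-01-01", "2010-02-01", "2010-01-01")),
  dayend       = as.Date(c("2010-01-31", "2010-02-28", "2010-12-31"))
)
events <- data.table(
  eventnumber = 1:4,
  location    = c("A", "A", "A", "C"),
  dayevent    = as.Date(c("2010-01-10", "2010-01-20", "2010-02-15", "2010-03-01"))
)

setkey(objects, location)
setkey(events, location)

# Step 1: pair every event with every object at the same location.
# nomatch = NULL drops events whose location has no object (here: "C").
oe <- objects[events, allow.cartesian = TRUE, nomatch = NULL]

# Step 2: per object, count the paired events that fall inside its stay
counts <- oe[, .(numberevents = sum(dayevent >= daystart & dayevent <= dayend)),
             by = objectnumber]

# Step 3 (an addition): object 3 sits at "B", where no event happens, so it
# never appears in `oe`; merge back onto all objects and fill in zero
result <- merge(objects, counts, by = "objectnumber", all.x = TRUE)
result[is.na(numberevents), numberevents := 0L]
print(result)
```

The intermediate table `oe` has one row per (object, event) pair at a shared location, so it grows with the number of events per location times the number of objects per location. That is the memory cost paid for replacing the per-group lookups with a single join.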