r - Subsetting a data.table by range making use of binary search
How do I go about subsetting a data.table by a numeric range, with the intention of using binary search?
For example:
    require(data.table)
    set.seed(1)
    x <- runif(10000000, min=0, max=10)
    y <- runif(10000000, min=0, max=10)
    df <- data.frame(x, y)
    dt <- data.table(x, y)

    # subset df
    system.time(dfsub <- df[df$x>5 & df$y<7,])
    #    user  system elapsed
    #   1.529   0.250   1.821

    # subset dt
    system.time(dtsub <- dt[x>5 & y<7])
    #    user  system elapsed
    #   0.716   0.119   0.841

The above doesn't use a key (it is a vector scan), and the speed-up isn't dramatic. What is the syntax for subsetting a numeric range of a data.table, making use of binary search? I can't find an example in the documentation; it would be helpful if you could provide an example using the toy data.table above.
Edit: this question is similar, but still doesn't demonstrate how to subset by a range: data.table: vector scan v binary search with numeric columns - super-slow setkey
Interesting question. First let's look at the example data:
    > print(dt)
                         x        y
           1: 2.607703e-07 5.748127
           2: 8.894131e-07 5.233994
           3: 1.098961e-06 9.834267
           4: 1.548324e-06 2.016585
           5: 1.569279e-06 7.957730
          ---
     9999996: 9.999996e+00 9.977782
     9999997: 9.999998e+00 2.666575
     9999998: 9.999999e+00 6.869967
     9999999: 9.999999e+00 1.953145
    10000000: 1.000000e+01 4.001616
    > length(dt$x)
    [1] 10000000
    > length(unique(dt$x))
    [1] 9988478
    > length(dt$y)
    [1] 10000000
    > length(unique(dt$y))
    [1] 9988225
    > dt[,.N,by=x][,table(N)]
    N
          1       2       3
    9976965   11504       9
    > dt[,.N,by="x,y"][,table(N)]
    N
           1
    10000000
    >

So there are almost 10 million unique floating point values in the first column: a few groups of size 2 and 3 rows, but mostly 1-row groups. Once the second column is included, there are 10 million unique groups of size 1 row. This is quite a tough problem, since data.table is designed more with grouped data in mind; e.g., (id, date), (id1, id2, date, time) etc.
However, data.table and setkey do support floating point data in keys, so let's give it a go.
On my slow netbook:
    > system.time(setkey(dt,x,y))
       user  system elapsed
      7.097   0.520   7.650
    > system.time(dt[x>5 & y<7])
       user  system elapsed
      2.820   0.292   3.122

So the vector scanning approach is faster than setting the key (and we haven't even used the key yet). Given that the data is floating point and almost unique, this isn't too surprising, but I think that's a pretty fast time for setkey to sort 10 million thoroughly random and almost unique doubles.
Compare to base, for example, sorting just x and not even y as well:
    > system.time(base::order(x))
       user  system elapsed
     72.445   0.292  73.072

So, assuming this data is representative of your real data, and you want to do this not just once but several times, and so are willing to pay the price of setkey, the first step is pretty clear:
    > system.time(w <- dt[.(5),which=TRUE,roll=TRUE])
       user  system elapsed
      0.004   0.000   0.003
    > w
    [1] 4999902

But here we're stuck. The next step, dt[(w+1):nrow(dt)], is ugly and copies. I can't think of a decent way to use the key from here to do the y<7 part as well. In other example data we might do something like dt[.(unique(x), 7), which=TRUE, roll=TRUE], but in this case the data is so unique and floating point that that's going to be slow.
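To make those two steps concrete, here is a minimal sketch on a smaller table. It assumes the table is keyed on x only (rather than on x and y as above), that 5 lies strictly inside the range of x, and that no x is exactly 5. The row count n is chosen only so the example runs quickly.

```r
library(data.table)

set.seed(1)
n  <- 1e5                                  # much smaller than the 1e7 rows above
dt <- data.table(x = runif(n, min = 0, max = 10),
                 y = runif(n, min = 0, max = 10))
setkey(dt, x)                              # one-off sort by x

# Binary search on the key: row number of the last x that is <= 5
w <- dt[.(5), which = TRUE, roll = TRUE]

# Every row after w has x > 5. The y < 7 condition is still a vector scan,
# but only over the rows that passed the x condition.
res <- dt[(w + 1L):nrow(dt)][y < 7]

# Check against the plain vector scan
ref <- dt[x > 5 & y < 7]
stopifnot(identical(res$x, ref$x), identical(res$y, ref$y))
```

The subset by row range still copies, as noted above, so this only shows that the x half of the condition can come from the key.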
Ideally, this task needs range joins (FR#203) to be implemented. The syntax in this example might be:
    dt[.( c(5,Inf), c(-Inf,7) )]

or, to make it easier, dt[x>5 & y<7] could be optimized to do that under the hood. Allowing a two-column range in i that joins to the corresponding x columns could be quite useful and has come up several times.
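For readers on a newer data.table: as far as I'm aware, releases from 1.9.8 onward support non-equi joins through the on= argument, which covers this kind of two-column range condition. The sketch below is not the syntax proposed above. The bounds table and its column names lo and hi are made up for illustration, and whether it beats the vector scan on data like this would need to be timed.

```r
library(data.table)

set.seed(1)
n  <- 1e5                                  # kept small so the example runs quickly
dt <- data.table(x = runif(n, min = 0, max = 10),
                 y = runif(n, min = 0, max = 10))

# One-row table of bounds; lo and hi are hypothetical names
bounds <- data.table(lo = 5, hi = 7)

# Non-equi join: rows of dt where x > lo and y < hi.
# which = TRUE returns row numbers of dt instead of the joined columns.
idx <- dt[bounds, on = .(x > lo, y < hi), which = TRUE]
res <- dt[sort(idx)]

# Same rows as the plain vector scan
stopifnot(identical(sort(idx), which(dt$x > 5 & dt$y < 7)))
```

Asking for row numbers and then subsetting sidesteps the way the join output labels the x and y columns.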
The speedups in v1.9.2 needed to be done first, before we could move on to things like that. If you try setkey on this data in v1.8.10, you'll find that v1.9.2 is faster.
see :