r - Subsetting a data.table by range making use of binary search

July 15, 2010

How do I go about subsetting a data.table by a numeric range, with the intention of using binary search?

For example:

require(data.table)
set.seed(1)

x <- runif(10000000, min=0, max=10)
y <- runif(10000000, min=0, max=10)

df <- data.frame(x, y)
dt <- data.table(x, y)

# subset df
system.time(dfsub <- df[df$x>5 & df$y<7,])
#  user  system elapsed
# 1.529   0.250   1.821

# subset dt
system.time(dtsub <- dt[x>5 & y<7])
#  user  system elapsed
# 0.716   0.119   0.841

The above doesn't use a key (it is a vector scan), and the speed-up isn't dramatic. What is the syntax for subsetting a numeric range of a data.table, making use of binary search? I can't find an example in the documentation; it would be helpful if you could provide one using the toy data.table above.

Edit: this question is similar, but it still doesn't demonstrate how to subset by a range: "data.table: vector scan v binary search numeric columns - super-slow setkey"

Interesting question. First, let's look at the example data:

> print(dt)
                     x        y
       1: 2.607703e-07 5.748127
       2: 8.894131e-07 5.233994
       3: 1.098961e-06 9.834267
       4: 1.548324e-06 2.016585
       5: 1.569279e-06 7.957730
      ---
 9999996: 9.999996e+00 9.977782
 9999997: 9.999998e+00 2.666575
 9999998: 9.999999e+00 6.869967
 9999999: 9.999999e+00 1.953145
10000000: 1.000000e+01 4.001616
> length(dt$x)
[1] 10000000
> length(unique(dt$x))
[1] 9988478
> length(dt$y)
[1] 10000000
> length(unique(dt$y))
[1] 9988225
> dt[,.N,by=x][,table(N)]
N
      1       2       3
9976965   11504       9
> dt[,.N,by="x,y"][,table(N)]
N
       1
10000000

So there are almost 10 million unique floating point values in the first column: a few groups of 2 and 3 rows, but mostly 1-row groups. Once the second column is included, there are 10 million unique groups of 1 row each. This is quite a tough problem, since data.table is designed more with grouped data in mind; e.g. (id, date), (id1, id2, date, time) etc.
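For contrast, here is a minimal sketch of the grouped case, where a keyed lookup lands on a whole block of rows at once. The table `grouped` and its columns `id`, `day` and `value` are made up for illustration and are not part of the question's data.

```r
library(data.table)
set.seed(1)

# Hypothetical grouped data: 1,000 ids with 100 day indices each
grouped <- data.table(
  id    = rep(1:1000, each = 100),
  day   = rep(1:100, times = 1000),
  value = runif(100000)
)
setkey(grouped, id, day)

# Binary search on the first key column: all 100 rows of id 42
one_id <- grouped[.(42L)]

# Binary search on both key columns: the single row for id 42, day 10
one_row <- grouped[.(42L, 10L)]
```

With 100 rows per id, one binary search finds a whole group. In the question's data almost every group is a single row, so that advantage largely disappears.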

However, data.table and setkey do support floating point data in keys, so let's give it a go.

On my slow netbook:

> system.time(setkey(dt,x,y))
   user  system elapsed
  7.097   0.520   7.650
> system.time(dt[x>5 & y<7])
   user  system elapsed
  2.820   0.292   3.122

So the vector scanning approach is faster than setting the key (and we haven't even used the key yet). Given that the data is floating point and almost unique, this isn't surprising, but I think that's a pretty fast time for setkey to sort 10 million thoroughly random and almost unique doubles.

Compare to base, for example, sorting x only and not y:

> system.time(base::order(x))
   user  system elapsed
 72.445   0.292  73.072

Assuming this data is representative of your real data, and you want to do this several times rather than just once, so you are willing to pay the price of setkey, the first step is pretty clear:

> system.time(w <- dt[.(5),which=TRUE,roll=TRUE])
   user  system elapsed
  0.004   0.000   0.003
> w
[1] 4999902

But here we're stuck. A next step such as dt[(w+1):nrow(dt)] is ugly and copies. I can't think of a decent way to use the key from here for the y<7 part as well. With other example data we would do something like dt[.(unique(x), 7), which=TRUE, roll=TRUE], but in this case the data is so unique, and floating point, that that's going to be slow.
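As a rough sketch of that two-step idea, the snippet below uses a much smaller table than the 10 million rows in the question (the size `n` is an assumption for illustration). It also assumes that no x value equals 5 exactly and that at least one row has x above 5, so that `w` is a single row number below `nrow(dt)`. Only the x>5 part uses binary search; the y<7 part is still a vector scan.

```r
library(data.table)
set.seed(1)

n  <- 1e5  # assumption: far fewer rows than the 1e7 in the question
dt <- data.table(x = runif(n, min = 0, max = 10),
                 y = runif(n, min = 0, max = 10))
setkey(dt, x, y)

# Binary search: row number of the last x at or below 5
w <- dt[.(5), which = TRUE, roll = TRUE]

# Rows after w have x > 5 because the table is sorted by x;
# the y < 7 filter is still a vector scan over those rows (and this copies)
keyed <- dt[(w + 1):nrow(dt)][y < 7]

# Plain vector scan over the whole table, for comparison
scanned <- dt[x > 5 & y < 7]
```

Both results should hold the same rows. Whether the keyed version is worth it depends on how often the subset is repeated, since setkey has to be paid for first.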

Ideally, this task needs range joins (FR#203) to be implemented. The syntax in this example might be:

dt[.( c(5,Inf), c(-Inf,7) )]

Or, to make it easier, dt[x>5 & y<7] could be optimized to do that under the hood. Allowing a two-column range in i that joins to the corresponding x columns could be quite useful, and it has come up several times.

The speedups in v1.9.2 needed to be done first, before we could move on to things like that. If you try setkey on this data in v1.8.10, you'll find that v1.9.2 is faster.

See also:

How to self join a data.table on a condition

Remove a range in data.table
