R - How to drop groups when there are not enough observations?
How can I drop groups when there are not enough observations? In the following reproducible example, each person (identified by name) has 10 observations:
    install.packages('randomNames')  # install package if required
    install.packages('data.table')   # install package if required
    lapply(c('data.table', 'randomNames'), require, character.only = TRUE)  # load packages

    set.seed(1)
    testdt <- data.table(
      date = rep(seq(as.Date("2010/1/1"), as.Date("2019/1/1"), "years"), 10),
      name = rep(randomNames(10, which.names = 'first'), times = 1, each = 10),
      y    = runif(100, 5, 15),
      x    = rnorm(100, 2, 9)
    )
    testdt <- testdt[x > 0]  # keep only rows with positive x, so group sizes now differ
Now I want to keep only the persons with at least 6 observations, so Gracelline, Anna, Aesha and Michael must be removed, because they have 3, 2, 4 and 5 observations respectively.
    testdt[, length(x), by = name]
              name V1
     1:      Blake  6
     2:  Alexander  6
     3:     Leigha  8
     4: Gracelline  3
     5:   Epifanio  7
     6:     Keasha  6
     7:      Robyn  6
     8:       Anna  2
     9:      Aesha  4
    10:    Michael  5
How can I do this in an automatic way (the real dataset is much larger)?
Edit:
Yes, it's a duplicate. :( The last proposed method is the fastest one.
    > system.time(testdt[, .SD[.N >= 6], by = name])
       user  system elapsed
      0.293   0.227   0.517
    > system.time(testdt[testdt[, .I[.N >= 6], by = name]$V1])
       user  system elapsed
      0.163   0.243   0.415
    > system.time(testdt[, if (.N >= 6) .SD, by = name])
       user  system elapsed
      0.073   0.323   0.399
We group by 'name', get the number of rows in each group (.N), and if that is at least 6, we return the Subset of Data.table (.SD) for that group.
    testdt[, if (.N >= 6) .SD, by = name]
    #         name       date         y           x
    # 1:     Blake 2010-01-01  9.820801  3.69913070
    # 2:     Blake 2012-01-01  9.935413 15.18999375
    # 3:     Blake 2013-01-01  6.862176  3.37928004
    # 4:     Blake 2014-01-01 13.273733 21.55350503
    # 5:     Blake 2015-01-01 11.684667  6.27958576
    # 6:     Blake 2017-01-01  6.079436  7.49653718
    # 7: Alexander 2010-01-01 13.209463  4.62301612
    # 8: Alexander 2012-01-01 12.829328  2.00994816
    # 9: Alexander 2013-01-01 10.530363  2.66907192
    #10: Alexander 2016-01-01  5.233312  0.78339246
    #11: Alexander 2017-01-01  9.772301 12.60278297
    #12: Alexander 2019-01-01 11.927316  7.34551569
    #13:    Leigha 2010-01-01  9.776196  4.99655334
    #14:    Leigha 2011-01-01 13.612095 11.56789854
    #15:    Leigha 2013-01-01  7.447973  5.33016929
    #16:    Leigha 2014-01-01  5.706790  4.40388912
    #17:    Leigha 2016-01-01  8.162717 12.87081025
    #18:    Leigha 2017-01-01 10.186343 12.44362354
    #19:    Leigha 2018-01-01 11.620051  8.30192285
    #20:    Leigha 2019-01-01  9.068302 16.28150109
    #21:  Epifanio 2010-01-01  8.390729 17.90558542
    #22:  Epifanio 2011-01-01 13.394404  8.45036728
    #23:  Epifanio 2012-01-01  8.466835 10.19156807
    #24:  Epifanio 2013-01-01  8.337749  5.45766822
    #25:  Epifanio 2014-01-01  9.763512 17.13958472
    #26:  Epifanio 2017-01-01  8.899895 14.89054015
    #27:  Epifanio 2019-01-01 14.606180  0.13357331
    #28:    Keasha 2013-01-01  8.253522  6.44769498
    #29:    Keasha 2014-01-01 12.570871  0.40402566
    #30:    Keasha 2016-01-01 12.111212 14.08734943
    #31:    Keasha 2017-01-01  6.216919  0.06878532
    #32:    Keasha 2018-01-01  7.454885  0.38399123
    #33:    Keasha 2019-01-01  6.433044  1.09828333
    #34:     Robyn 2010-01-01  7.396294  8.41399676
    #35:     Robyn 2011-01-01  5.589344  1.33792036
    #36:     Robyn 2012-01-01 11.422883  1.66129246
    #37:     Robyn 2015-01-01 12.973088  2.54144396
    #38:     Robyn 2017-01-01  9.100841  6.78346573
    #39:     Robyn 2019-01-01 11.049333  4.75902075
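To see why groups that fail the test disappear, here is a minimal sketch on a small hand-made table, so it runs without the randomNames package. The table toy and the threshold min_obs are made up for illustration and are not part of the original example. When the condition is false, the if returns NULL for that group, and data.table simply leaves that group out of the result.

```r
library(data.table)

# Hypothetical toy data: group "a" has 3 rows, "b" has 1 row, "c" has 2 rows.
toy <- data.table(name = c("a", "a", "a", "b", "c", "c"),
                  x    = 1:6)

min_obs <- 2  # assumed threshold for this sketch (the question uses 6)

# Per group: return the group's rows (.SD) only if it has enough of them.
# A group that fails the test yields NULL and is dropped from the result.
kept <- toy[, if (.N >= min_obs) .SD, by = name]
print(kept)
```

Here group "b" has a single row, so only the rows of "a" and "c" should remain.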
Or, instead of the if, we can directly use the condition .N >= 6 as a logical index and wrap it with .SD:
    testdt[, .SD[.N >= 6], by = name]
That is a little slow, so another option is to use .I to get the row index of the groups to keep and subset with it:
    testdt[testdt[, .I[.N >= 6], by = name]$V1]
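The .I version works in two steps, which is easier to follow when the steps are split up. This is only a sketch on a made-up table (toy and the threshold of 2 are assumptions for illustration): the inner call collects the row numbers of every group that is large enough, and the outer call subsets the original table with them.

```r
library(data.table)

# Hypothetical toy data: group "a" has 3 rows, "b" has 1 row, "c" has 2 rows.
toy <- data.table(name = c("a", "a", "a", "b", "c", "c"),
                  x    = 1:6)

# Step 1: .I holds the row numbers of the current group within toy.
# Indexing it with a single TRUE keeps all of them, with FALSE none.
idx <- toy[, .I[.N >= 2], by = name]$V1
print(idx)

# Step 2: subset the original table by those row numbers.
kept_by_index <- toy[idx]
print(kept_by_index)
```

Because the result is taken from the original table by row number, the column order of the input is preserved, whereas the grouped versions move the by column to the front.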