scala - Spark: save an output HashSet to a file
I have the following code:
val mySet: HashSet[String] = HashSet[String]()
val mySetBroadcastVar = sc.broadcast(mySet)
val output = input.map { t =>
  if (t.getA() != null) {
    mySetBroadcastVar.value.add(t.getA())
  }
}.count()
sc.parallelize(mySetBroadcastVar.value.toList, 1).saveAsTextFile("mySetValues")
The file mySetValues ends up empty, though it shouldn't be. Is that because mySetValues is saved before output is computed? How can I fix this problem? Thanks!
- Broadcast variables are meant to share read-only data across tasks and stages in an efficient manner.
- Tasks are not supposed to modify broadcast variables; updates are neither reflected on other nodes nor transported back to the driver.
- You need accumulators for this purpose.
Example (from spark-shell):
scala> val acc = sc.accumulableCollection(scala.collection.mutable.HashSet[String]())
acc: org.apache.spark.Accumulable[scala.collection.mutable.HashSet[String],String] = Set()

scala> val names = sc.parallelize(Seq("aravind", "sam", "kenny", "apple"))
names: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[86] at parallelize at <console>:22

scala> names.foreach( x => if(x.startsWith("a")) acc += x )

scala> acc
res27: org.apache.spark.Accumulable[scala.collection.mutable.HashSet[String],String] = Set(apple, aravind)

scala>
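For comparison, here is a minimal sketch of the same accumulator approach applied to the code from the question. The `input` RDD and `getA()` method are taken from the question, and `accumulableCollection` is the pre-Spark-2.0 API used above; adjust names and types to your setup.

import scala.collection.mutable.HashSet

// Accumulable collection that each task can add to; results are merged on the driver
val accSet = sc.accumulableCollection(HashSet[String]())

// foreach is an action, so it forces the computation and merges the
// per-task updates back into accSet on the driver
input.foreach { t =>
  if (t.getA() != null) accSet += t.getA()
}

// accSet.value is only reliable on the driver, after the action has run
sc.parallelize(accSet.value.toList, 1).saveAsTextFile("mySetValues")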