scala - Spark: save an output HashSet to a file


I have the following code:

val mySet: HashSet[String] = HashSet[String]()
val mySetBroadcastVar = sc.broadcast(mySet)

val output = input.map { t =>
  if (t.getA() != null) {
    mySetBroadcastVar.value.add(t.getA())
  }
}.count()

sc.parallelize(mySetBroadcastVar.value.toList, 1).saveAsTextFile("mySetValues")

The file mySetValues ends up empty, though it shouldn't be. Is that because mySetValues is saved before output is computed? How can I fix this problem? Thanks!

  1. Broadcast variables are meant to share read-only data across tasks and stages in an efficient manner.
  2. Tasks are not supposed to modify broadcast variables; their updates are neither reflected on other nodes nor transported back to the driver.
  3. You need accumulators for this purpose (see the example below).

Example (from spark-shell):

scala> val acc = sc.accumulableCollection(scala.collection.mutable.HashSet[String]())
acc: org.apache.spark.Accumulable[scala.collection.mutable.HashSet[String],String] = Set()

scala> val names = sc.parallelize(Seq("aravind", "sam", "kenny", "apple"))
names: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[86] at parallelize at <console>:22

scala> names.foreach( x => if (x.startsWith("a")) acc += x )

scala> acc
res27: org.apache.spark.Accumulable[scala.collection.mutable.HashSet[String],String] = Set(apple, aravind)

scala>
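Applied to the original snippet, the same idea looks roughly like the sketch below. It assumes sc is the SparkContext and that input and getA() are as in the question; the broadcast variable is replaced by an accumulable collection, and the foreach action forces the updates to reach the driver before the result is saved.

import scala.collection.mutable.HashSet

// Accumulable collection instead of a broadcast variable (sketch, not the poster's exact code)
val accSet = sc.accumulableCollection(HashSet[String]())

input.foreach { t =>
  // Side-effecting updates belong in an action such as foreach, not in a lazy map
  if (t.getA() != null) {
    accSet += t.getA()
  }
}

// accSet.value is only meaningful on the driver, after the action has run
sc.parallelize(accSet.value.toList, 1).saveAsTextFile("mySetValues")

Note that accumulableCollection and Accumulable were deprecated in Spark 2.0; on newer versions you would implement the same idea with the AccumulatorV2 API instead.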

