scala - Spark: save an output HashSet to a file
I have the following code:
val mySet: HashSet[String] = HashSet[String]()
val mySetBroadcastVar = sc.broadcast(mySet)
val output = input.map { t =>
  if (t.getA() != null) {
    mySetBroadcastVar.value.add(t.getA())
  }
}.count()
sc.parallelize(mySetBroadcastVar.value.toList, 1).saveAsTextFile("mySetValues")
The file mySetValues ends up empty, though it shouldn't be. Is that because mySetValues is saved before output is computed? How can I fix this problem? Thanks!
- Broadcast variables are meant to share read-only data across tasks and stages in an efficient manner.
- Tasks are not supposed to modify broadcast variables; updates are neither reflected on other nodes nor transported back to the driver.
- You need accumulators for this purpose.
Example (from spark-shell):
scala> val acc = sc.accumulableCollection(scala.collection.mutable.HashSet[String]())
acc: org.apache.spark.Accumulable[scala.collection.mutable.HashSet[String],String] = Set()

scala> val names = sc.parallelize(Seq("aravind", "sam", "kenny", "apple"))
names: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[86] at parallelize at <console>:22

scala> names.foreach( x => if(x.startsWith("a")) acc += x )

scala> acc
res27: org.apache.spark.Accumulable[scala.collection.mutable.HashSet[String],String] = Set(apple, aravind)

scala>
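For comparison, here is a minimal sketch of the same accumulator approach applied to the code from the question. The `input` RDD and `getA()` method are taken from the question, and `accumulableCollection` is the pre-Spark-2.0 API used above; adjust names and types to your setup.

import scala.collection.mutable.HashSet

// Accumulable collection that each task can add to; results are merged on the driver
val accSet = sc.accumulableCollection(HashSet[String]())

// foreach is an action, so it forces the computation and merges the
// per-task updates back into accSet on the driver
input.foreach { t =>
  if (t.getA() != null) accSet += t.getA()
}

// accSet.value is only reliable on the driver, after the action has run
sc.parallelize(accSet.value.toList, 1).saveAsTextFile("mySetValues")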