mapreduce - Hadoop reduce side: what really happens?
I'm reading "Hadoop: The Definitive Guide" and have some questions.
In chapter 7, "How MapReduce Works", on page 201, the author says about the reduce side:
When [a] the in-memory buffer reaches a threshold size (controlled by mapreduce.reduce.shuffle.merge.percent) or [b] it reaches a threshold number of map outputs (mapreduce.reduce.merge.inmem.threshold), it is merged and spilled to disk.
My questions (4 questions) are about conditions [a] and [b].
For condition [a], with the default values of the Hadoop 2 configuration:
- Will the merge and spill begin when 0.462 (0.66 * 0.70) of the reducer's 1 GB of memory is full (~462 MB)?
For condition [b], with the default values:
- Will the merge and spill begin when 1000 key-value pairs of map output from 1 mapper have been collected in the buffer?
In the question above, is "1 mapper" correct, or can it be more than 1 mapper on different machines?
In the following paragraph, the author says:
"As the copies accumulate on disk, a background thread merges them into larger, sorted files."
- Is the intended meaning that, when a spill file is about to be written to disk, the spills that already exist on disk are merged with the current spill?
Please help me better understand what is happening in Hadoop.
Thanks in advance.
Please see the image: each mapper runs on a different machine.
Can someone explain the precise meaning of what is specified?
- Will the merge and spill begin when 0.462 (0.66 * 0.70) of the reducer's 1 GB of memory is full (~462 MB)?
Yes. That triggers the merge and spill to disk.
By default each reducer gets 1 GB of heap space, of which 70% (mapreduce.reduce.shuffle.input.buffer.percent) is used as the merge and shuffle buffer; the remaining 30% may be needed by the reducer code and other task-specific requirements. When 66% (mapreduce.reduce.shuffle.merge.percent) of the merge and shuffle buffer is full, merging (and spilling) is triggered. This makes sure that space is left for the sorting process, and that during the sort other incoming files are not left waiting for memory.
Refer to http://www.bigsynapse.com/mapreduce-internals
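As a rough illustration of where the 0.462 figure comes from, here is a minimal sketch in plain Java (not Hadoop internals); the 1 GB reducer heap is an assumption taken from the defaults discussed above:

    // A minimal sketch of the arithmetic behind condition [a],
    // assuming a 1 GB reducer heap and the Hadoop 2 defaults quoted above.
    public class MergeThresholdEstimate {
        public static void main(String[] args) {
            double reducerHeapMb = 1024.0;        // assumed reducer heap, ~1 GB
            double inputBufferPercent = 0.70;     // mapreduce.reduce.shuffle.input.buffer.percent (default)
            double mergePercent = 0.66;           // mapreduce.reduce.shuffle.merge.percent (default)

            double shuffleBufferMb = reducerHeapMb * inputBufferPercent;   // ~717 MB shuffle buffer
            double mergeTriggerMb = shuffleBufferMb * mergePercent;        // ~473 MB triggers merge/spill

            System.out.printf("Shuffle buffer: %.0f MB%n", shuffleBufferMb);
            System.out.printf("Merge/spill triggers at about %.0f MB (%.3f of the heap)%n",
                    mergeTriggerMb, inputBufferPercent * mergePercent);
        }
    }

With 1 GB read as 1024 MB this comes out at about 473 MB; the ~462 MB figure in the question takes 1 GB as 1000 MB.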
- Will the merge and spill begin when 1000 key-value pairs of map output from 1 mapper have been collected in the buffer?
No. Here "map outputs" most likely refers to partitions of map output files (I think); 1000 key-value pairs would be only a tiny amount of memory.
In slide 24 of a Cloudera presentation the same thing is referred to as segments: mapreduce.reduce.merge.inmem.threshold segments accumulated (default 1000).
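If you want to confirm which values are actually in effect on your cluster, a small sketch with Hadoop's Configuration API (property names as in the mapred-default.xml linked under Sources; the fallback defaults passed as second arguments are the Hadoop 2 values mentioned above) could look like this:

    import org.apache.hadoop.conf.Configuration;

    public class ShuffleDefaults {
        public static void main(String[] args) {
            // Loads whatever *-site.xml files are on the classpath;
            // otherwise the fallback values given below are returned.
            Configuration conf = new Configuration();

            float inputBufferPercent =
                conf.getFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
            float mergePercent =
                conf.getFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
            int inMemThreshold =
                conf.getInt("mapreduce.reduce.merge.inmem.threshold", 1000);

            System.out.println("shuffle.input.buffer.percent = " + inputBufferPercent);
            System.out.println("shuffle.merge.percent        = " + mergePercent);
            System.out.println("merge.inmem.threshold        = " + inMemThreshold + " segments");
        }
    }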
- In the question above, is "1 mapper" correct, or can it be more than 1 mapper on different machines?
In my understanding, the 1000 segments can come from different mappers, since partitions of map output files are being copied in parallel from different nodes.
- Is the intended meaning that, when a spill file is about to be written to disk, the spills that already exist on disk are merged with the current spill?
No. A (map output) partition file in memory is merged with other (map output) partition files in memory to create a spill file, which is then written to disk. This may be repeated several times as the buffer fills up, so many files end up on disk. To increase efficiency, these on-disk files are merged into a larger file by a background process.
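To make the two levels of merging concrete, here is a small conceptual toy in Java (not Hadoop code; the thresholds are made up for the illustration): in-memory segments are merged into a spill once a threshold is reached, and a background-style step later merges the on-disk spills into a larger file.

    import java.util.ArrayList;
    import java.util.List;

    public class SpillAndMergeSketch {
        static final int IN_MEMORY_THRESHOLD = 3;   // merge/spill after 3 in-memory segments (made up)
        static final int ON_DISK_MERGE_FACTOR = 2;  // merge on-disk spills once 2 of them exist (made up)

        public static void main(String[] args) {
            List<String> inMemorySegments = new ArrayList<>();
            List<String> onDiskSpills = new ArrayList<>();

            // Pretend 7 map-output segments arrive during the copy phase.
            for (int i = 1; i <= 7; i++) {
                inMemorySegments.add("segment-" + i);

                if (inMemorySegments.size() >= IN_MEMORY_THRESHOLD) {
                    // Merge the in-memory segments into one spill file written to disk.
                    String spill = "spill(" + String.join("+", inMemorySegments) + ")";
                    onDiskSpills.add(spill);
                    inMemorySegments.clear();
                    System.out.println("Spilled to disk: " + spill);
                }

                if (onDiskSpills.size() >= ON_DISK_MERGE_FACTOR) {
                    // Background merge: combine on-disk spills into one larger, sorted file.
                    String merged = "merged(" + String.join("+", onDiskSpills) + ")";
                    onDiskSpills.clear();
                    onDiskSpills.add(merged);
                    System.out.println("Background merge produced: " + merged);
                }
            }
        }
    }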
Sources:
Hadoop: The Definitive Guide, 4th edition
http://answers.mapr.com/questions/3952/reduce-merge-performance-issues.html
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://www.bigsynapse.com/mapreduce-internals