mapreduce - Hadoop reduce side: what really happens?


I'm reading "Hadoop: The Definitive Guide" and have some questions.

In Chapter 7, "How MapReduce Works", on page 201, the author says about the reduce side:

When [a] the in-memory buffer reaches a threshold size (controlled by mapreduce.reduce.shuffle.merge.percent) or [b] it reaches a threshold number of map outputs (mapreduce.reduce.merge.inmem.threshold), it is merged and spilled to disk.

My questions (4 in total) are about conditions [a] and [b].

For condition [a], with the default Hadoop 2 configuration values:

  1. Will the merge and spill begin when 0.462 (0.66 * 0.70) of the reducer's 1 GB of memory is full (462 MB)?

For condition [b], with the default configuration values:

  1. When 1000 key-value pairs of map output from one mapper have been collected in the buffer, will the merge and spill begin?

  2. In the above question, is "one mapper" correct, or could the outputs come from more than one mapper on different machines?

In the following paragraph, the author says:

As the copies accumulate on disk, a background thread merges them into larger, sorted files.

  1. Is it correct that the intended purpose is this: when a spill file is about to be written to disk, the spills that already exist on disk are merged with the current spill?

Please help me better understand what is happening in Hadoop.

Thanks in advance.


Please see the image (each mapper runs on a different machine). Can anyone explain the precise meaning of what is specified there?

  1. Will the merge and spill begin when 0.462 (0.66 * 0.70) of the reducer's 1 GB of memory is full (462 MB)?
    Yes. That will trigger the merge and spill to disk.
    By default each reducer has 1 GB of heap space, of which 70% (mapreduce.reduce.shuffle.input.buffer.percent) is used as the merge-and-shuffle buffer; the remaining 30% may be needed by the reducer code and other task-specific requirements. When 66% (mapreduce.reduce.shuffle.merge.percent) of that merge-and-shuffle buffer is full, the merge (and spill) is triggered, so that space is left for sorting and so that, during the sort, other incoming files are not left waiting for memory. The arithmetic is checked in the sketch below.
    Refer to http://www.bigsynapse.com/mapreduce-internals
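
To sanity-check that arithmetic, here is a small Java sketch. The 1 GB reducer heap and the two percentages are assumptions taken from the defaults quoted above, not values read from a live cluster:

```java
public class ShuffleMergeThreshold {
    public static void main(String[] args) {
        // Assumed reducer heap size (default discussed above): 1 GiB
        long reducerHeapBytes = 1024L * 1024 * 1024;

        // mapreduce.reduce.shuffle.input.buffer.percent (default 0.70):
        // fraction of the heap reserved for the shuffle buffer
        double inputBufferPercent = 0.70;

        // mapreduce.reduce.shuffle.merge.percent (default 0.66):
        // fill level of the shuffle buffer that triggers an in-memory merge/spill
        double mergePercent = 0.66;

        long shuffleBufferBytes = (long) (reducerHeapBytes * inputBufferPercent);
        long mergeThresholdBytes = (long) (shuffleBufferBytes * mergePercent);

        System.out.printf("Shuffle buffer:  %d MB%n", shuffleBufferBytes / (1024 * 1024));
        System.out.printf("Merge threshold: %d MB%n", mergeThresholdBytes / (1024 * 1024));
        // Prints roughly 716 MB and 473 MB with 1 GiB = 1024 MB;
        // taking 1 GB as 1000 MB gives the ~462 MB figure from the question.
    }
}
```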


  1. When 1000 key-value pairs of map output from one mapper have been collected in the buffer, will the merge and spill begin?
    No. Here "map outputs" more likely refers to partitions of map output files (I think). 1000 key-value pairs would be a very small amount of memory.
    Slide 24 of a Cloudera presentation refers to the same things as segments:

    mapreduce.reduce.merge.inmem.threshold: segments accumulated (default 1000)
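
    If you want to experiment with these two triggers, the properties can be set per job through the standard Configuration API. A minimal sketch, assuming a trivial driver class (the values shown are just the defaults, not tuning advice):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Fraction of reducer heap used as the shuffle buffer (default 0.70)
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);

        // Fill level of the shuffle buffer that triggers an in-memory merge/spill (default 0.66)
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);

        // Number of accumulated map-output segments that triggers a merge/spill (default 1000)
        conf.setInt("mapreduce.reduce.merge.inmem.threshold", 1000);

        Job job = Job.getInstance(conf, "shuffle-tuning-example");
        // ... set mapper, reducer, input and output paths as usual,
        // then submit with job.waitForCompletion(true).
    }
}
```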


  1. In the above question, is "one mapper" correct, or could the outputs come from more than one mapper on different machines?
    To my understanding, the 1000 segments can come from different map output files, i.e. from more than one mapper. Partitions of the map output files are being copied in parallel from different nodes.


  1. Is it correct that the intended purpose is this: when a spill file is about to be written to disk, the spills that already exist on disk are merged with the current spill?
    No. A (mapper output) partition file in memory is merged with other (mapper output) partition files in memory to create a spill file, which is then written to disk. This may be repeated several times as the buffer fills up, until many files have been written to disk. To increase efficiency, these files on disk are merged into larger files by a background process.
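
    To tie the two triggers and the background disk merge together, here is a deliberately simplified model of the behaviour described above. It is not Hadoop's actual implementation; the class, the 10-spill trigger for the disk merge, and the helper methods are illustrative assumptions only:

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the reduce-side shuffle described above (not Hadoop code). */
public class ReduceShuffleModel {
    static final long MERGE_THRESHOLD_BYTES = 473L * 1024 * 1024; // ~0.66 * 0.70 of a 1 GiB heap
    static final int  INMEM_SEGMENT_THRESHOLD = 1000;             // mapreduce.reduce.merge.inmem.threshold

    List<byte[]> inMemorySegments = new ArrayList<>(); // copied map-output partitions
    List<String> onDiskSpills = new ArrayList<>();     // spill files already written
    long inMemoryBytes = 0;

    /** Called each time a map-output partition is copied from a mapper. */
    void onSegmentCopied(byte[] segment) {
        inMemorySegments.add(segment);
        inMemoryBytes += segment.length;

        // Condition [a] or [b]: merge the in-memory segments and spill them to disk.
        if (inMemoryBytes >= MERGE_THRESHOLD_BYTES
                || inMemorySegments.size() >= INMEM_SEGMENT_THRESHOLD) {
            String spillFile = mergeAndSpill(inMemorySegments);
            onDiskSpills.add(spillFile);
            inMemorySegments.clear();
            inMemoryBytes = 0;
        }

        // Separately, a background thread merges accumulated spill files
        // into larger sorted files (sketched here as a simple size check).
        if (onDiskSpills.size() >= 10) {
            mergeOnDisk(onDiskSpills);
        }
    }

    String mergeAndSpill(List<byte[]> segments) { /* sort-merge and write one spill file */ return "spill"; }
    void mergeOnDisk(List<String> spills)       { /* merge several spills into one larger file */ }
}
```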

Sources:
Hadoop: The Definitive Guide, 4th edition
http://answers.mapr.com/questions/3952/reduce-merge-performance-issues.html
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://www.bigsynapse.com/mapreduce-internals

