mapreduce - Hadoop reduce side: what really happens?


I'm reading "Hadoop: The Definitive Guide" and have some questions.

In Chapter 7, "How MapReduce Works", on page 201, the author says about the reduce side:

When [a] the in-memory buffer reaches a threshold size (controlled by mapreduce.reduce.shuffle.merge.percent) or [b] it reaches a threshold number of map outputs (mapreduce.reduce.merge.inmem.threshold), it is merged and spilled to disk.

My questions (4 in total) are about conditions [a] and [b].

For condition [a], with the default Hadoop 2 configuration values:

  1. Will the merge and spill begin when 0.462 (0.66 * 0.70) of the reducer's 1 GB of memory is full (462 MB)?

For condition [b], with the default configuration values:

  1. When 1000 key-value pairs of map output from one mapper have been collected in the buffer, will the merge and spill begin?

  2. In the above question, is "one mapper" correct, or could the outputs come from more than one mapper on different machines?

In the following paragraph, the author says:

As the copies accumulate on disk, a background thread merges them into larger, sorted files.

  1. Is it correct that the intended purpose is this: when a spill file is about to be written to disk, the spills that already exist on disk are merged with the current spill?

Please help me better understand what is happening in Hadoop.

Thanks in advance.


Please see the image (each mapper runs on a different machine). Can anyone explain the precise meaning of what is specified there?

  1. Will the merge and spill begin when 0.462 (0.66 * 0.70) of the reducer's 1 GB of memory is full (462 MB)?
    Yes. That will trigger the merge and spill to disk.
    By default each reducer has 1 GB of heap space, of which 70% (mapreduce.reduce.shuffle.input.buffer.percent) is used as the merge-and-shuffle buffer; the remaining 30% may be needed by the reducer code and other task-specific requirements. When 66% (mapreduce.reduce.shuffle.merge.percent) of that merge-and-shuffle buffer is full, the merge (and spill) is triggered, so that space is left for sorting and so that, during the sort, other incoming files are not left waiting for memory. The arithmetic is checked in the sketch below.
    Refer to http://www.bigsynapse.com/mapreduce-internals
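
To sanity-check that arithmetic, here is a small Java sketch. The 1 GB reducer heap and the two percentages are assumptions taken from the defaults quoted above, not values read from a live cluster:

```java
public class ShuffleMergeThreshold {
    public static void main(String[] args) {
        // Assumed reducer heap size (default discussed above): 1 GiB
        long reducerHeapBytes = 1024L * 1024 * 1024;

        // mapreduce.reduce.shuffle.input.buffer.percent (default 0.70):
        // fraction of the heap reserved for the shuffle buffer
        double inputBufferPercent = 0.70;

        // mapreduce.reduce.shuffle.merge.percent (default 0.66):
        // fill level of the shuffle buffer that triggers an in-memory merge/spill
        double mergePercent = 0.66;

        long shuffleBufferBytes = (long) (reducerHeapBytes * inputBufferPercent);
        long mergeThresholdBytes = (long) (shuffleBufferBytes * mergePercent);

        System.out.printf("Shuffle buffer:  %d MB%n", shuffleBufferBytes / (1024 * 1024));
        System.out.printf("Merge threshold: %d MB%n", mergeThresholdBytes / (1024 * 1024));
        // Prints roughly 716 MB and 473 MB with 1 GiB = 1024 MB;
        // taking 1 GB as 1000 MB gives the ~462 MB figure from the question.
    }
}
```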


  1. When 1000 key-value pairs of map output from one mapper have been collected in the buffer, will the merge and spill begin?
    No. Here "map outputs" more likely refers to partitions of map output files (I think). 1000 key-value pairs would be a very small amount of memory.
    Slide 24 of a Cloudera presentation refers to the same things as segments:

    mapreduce.reduce.merge.inmem.threshold: segments accumulated (default 1000)
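
    If you want to experiment with these two triggers, the properties can be set per job through the standard Configuration API. A minimal sketch, assuming a trivial driver class (the values shown are just the defaults, not tuning advice):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Fraction of reducer heap used as the shuffle buffer (default 0.70)
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);

        // Fill level of the shuffle buffer that triggers an in-memory merge/spill (default 0.66)
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);

        // Number of accumulated map-output segments that triggers a merge/spill (default 1000)
        conf.setInt("mapreduce.reduce.merge.inmem.threshold", 1000);

        Job job = Job.getInstance(conf, "shuffle-tuning-example");
        // ... set mapper, reducer, input and output paths as usual,
        // then submit with job.waitForCompletion(true).
    }
}
```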


  1. In the above question, is "one mapper" correct, or could the outputs come from more than one mapper on different machines?
    To my understanding, the 1000 segments can come from different map output files, i.e. from more than one mapper. Partitions of the map output files are being copied in parallel from different nodes.


  1. Is it correct that the intended purpose is this: when a spill file is about to be written to disk, the spills that already exist on disk are merged with the current spill?
    No. A (mapper output) partition file in memory is merged with other (mapper output) partition files in memory to create a spill file, which is then written to disk. This may be repeated several times as the buffer fills up, until many files have been written to disk. To increase efficiency, these files on disk are merged into larger files by a background process.
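
    To tie the two triggers and the background disk merge together, here is a deliberately simplified model of the behaviour described above. It is not Hadoop's actual implementation; the class, the 10-spill trigger for the disk merge, and the helper methods are illustrative assumptions only:

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the reduce-side shuffle described above (not Hadoop code). */
public class ReduceShuffleModel {
    static final long MERGE_THRESHOLD_BYTES = 473L * 1024 * 1024; // ~0.66 * 0.70 of a 1 GiB heap
    static final int  INMEM_SEGMENT_THRESHOLD = 1000;             // mapreduce.reduce.merge.inmem.threshold

    List<byte[]> inMemorySegments = new ArrayList<>(); // copied map-output partitions
    List<String> onDiskSpills = new ArrayList<>();     // spill files already written
    long inMemoryBytes = 0;

    /** Called each time a map-output partition is copied from a mapper. */
    void onSegmentCopied(byte[] segment) {
        inMemorySegments.add(segment);
        inMemoryBytes += segment.length;

        // Condition [a] or [b]: merge the in-memory segments and spill them to disk.
        if (inMemoryBytes >= MERGE_THRESHOLD_BYTES
                || inMemorySegments.size() >= INMEM_SEGMENT_THRESHOLD) {
            String spillFile = mergeAndSpill(inMemorySegments);
            onDiskSpills.add(spillFile);
            inMemorySegments.clear();
            inMemoryBytes = 0;
        }

        // Separately, a background thread merges accumulated spill files
        // into larger sorted files (sketched here as a simple size check).
        if (onDiskSpills.size() >= 10) {
            mergeOnDisk(onDiskSpills);
        }
    }

    String mergeAndSpill(List<byte[]> segments) { /* sort-merge and write one spill file */ return "spill"; }
    void mergeOnDisk(List<String> spills)       { /* merge several spills into one larger file */ }
}
```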

Sources:
Hadoop: The Definitive Guide, 4th edition
http://answers.mapr.com/questions/3952/reduce-merge-performance-issues.html
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://www.bigsynapse.com/mapreduce-internals

