Moderator: Concepts and Technologies for DS and BDP
I'm a little bit confused when the reduce task is starting relatively to the map task.
Is it correct that the reduce calculation starts only if every map task has finished, but the reduce workers are already reading the results of the map tasks while the map tasks are still in process?
Please take a look at page 15 of slide 4 where a barrier aggregating intermediate values exists between map tasks and reduce tasks
"When a map task is executed first by worker A and then later executed by worker B (because A failed), all workers executing reduce tasks are notified of the reexecution. Any reduce task that has not already read the data from worker A will read the data from worker B."
However, "Any reduce task that has not already read the data from worker A will read the data from worker B" means that there could be reduce tasks that have already read the data from worker A and this would imply that there are reading reduce tasks while mapping is not finished.
In the paper section 3.1 step 4 it says the reducer has started reading when the output of writers are written on the local disk but until step 5 when the reducers have finished reading and sorting all intermediate (key, value) pairs, reducers do not compute. The barrier stands between reducers reading/sorting operations and reduce computation. And shuffles are not handled by reducers so they do not need to wait for all shuffles done.
Thanks a lot for pointing this subtle detail out