Moderator: Concepts and Technologies for DS and BDP
1) Whats the purpose of the partioning phase?
Because of the partitioning there is data sent to other nodes (as far as I understood). Why aren't they reduced or processed on the same node (data locality?)?
2) What's the purpose of the sort phase?
Also, in general, partitioning of the data is needed because you do not know for certain how much time a task needs. If one part of the job is surprisingly hard, chances are high that the node will still finish in a reasonable time, because the work junks are small and while this node is occupied the other nodes can receive additional junks to work on.
I maybe found another reason: Hadoop does not know the distribution of the intermediate keys/value pairs. So a singlenode can't determine where to send its processed (=mapped) data to. It would have to query all other nodes to get the distribution of the intermediate keys. If you know something about the distribution of the intermediate keys, you can pass a more efficient partitioner, which splits the intermediate keys more uniformly (= work load is better balanced at the reducers).
Is this right?
Sort phase: I read the "Shuffle and Sort" Part of "Hadoop: The Definitve Guide" (Tom White). The "sort phase" - as far as I understood it - is because during the map phase there is data spilled on the disk (= too much data to hold it all in-memory). The data is spilled in multiple files, which get merged later on. To optimize the later merging the data gets sorted (merging two sorted files into a new sorted file is easier than merging two unsorted files into one sorted file). So the sort phase is "just" an optimization/detail and not essential for MR.
The book is quite easy to find: "oreilly hadoop the definitive guide type:pdf"
The second reason of keeping data in multiple files is because as you have already said that hadoop keep data in sorted order, so this is another reason of keeping files in multiple parts.
for example splitting a name list based on indexes will give us extra efficiency when we have to add some more data, by doing this data remains sorted and there is no need of additional sort step.
I hope you understand my point of view.