MapReduce: Partitioner/Sort

Posted: 25 Jun 2016 11:48
by paprikawuerzung
What are the partitioner and sort phases in MR needed for? Optimization? As far as I understand it, partitioning causes data to be sent to other nodes. Why isn't the mapper output reduced directly on the same node (data locality) instead of being sent over the network again? What is the sort phase good for? Is it an optimization heuristic?

Re: MapReduce: Partitioner/Sort

Posted: 28 Jun 2016 23:28
by paprikawuerzung
Ah, sorry. I forgot that the course language is English.

1) What's the purpose of the partitioning phase?
As far as I understand it, partitioning causes data to be sent to other nodes. Why isn't the data reduced or processed on the same node (data locality)?

2) What's the purpose of the sort phase?

Re: MapReduce: Partitioner/Sort

Posted: 29 Jun 2016 21:49
by robertH
As far as I have understood it, data locality is taken into account when partitioning the data / the work. Sending data to other nodes is needed whenever too much data is stored on only a few nodes. In that case it is faster to send it (slowly) to other nodes and get a more uniform workload than to have many nodes waiting for a few nodes to finish their work.

Also, in general, partitioning of the data is needed because you do not know for certain how much time a task needs. If one part of the job is surprisingly hard, chances are high that the node will still finish in a reasonable time, because the work chunks are small, and while this node is occupied the other nodes can receive additional chunks to work on.
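
The effect of small work chunks can be illustrated with a toy scheduler. This is only a sketch under invented assumptions: the function name `makespan` and the chunk durations are made up for illustration, and real Hadoop scheduling is considerably more involved. Each free worker simply pulls the next chunk from the queue.

```python
import heapq

def makespan(chunk_durations, num_workers):
    """Return the time at which the last worker finishes.

    Toy model: chunks are handed out in order, each one to the
    worker that becomes free first (greedy dynamic assignment).
    """
    finish_times = [0] * num_workers
    heapq.heapify(finish_times)
    for duration in chunk_durations:
        earliest_free = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest_free + duration)
    return max(finish_times)

# Hypothetical job with 10 units of work, 6 of which are "surprisingly hard".
coarse = [8, 2]          # two big chunks, the hard part sits in the first one
fine = [6, 1, 1, 1, 1]   # same total work, cut into smaller chunks

coarse_time = makespan(coarse, num_workers=2)
fine_time = makespan(fine, num_workers=2)
```

With the coarse split one worker is busy for 8 time units while the other idles after 2. With the fine split the second worker keeps picking up small chunks while the first one handles the hard chunk, so the job finishes after 6 time units.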

Re: MapReduce: Partitioner/Sort

Posted: 30 Jun 2016 01:34
by paprikawuerzung
Thanks for the answer!

I may have found another reason: Hadoop does not know the distribution of the intermediate key/value pairs. So a single node can't determine by itself where its processed (= mapped) data should go so that the load ends up balanced. It would have to query all other nodes to get the distribution of the intermediate keys. If you know something about the distribution of the intermediate keys, you can pass a more efficient partitioner, which splits the intermediate keys more uniformly (= the workload is better balanced across the reducers).
Is this right?
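
A minimal sketch of what a partitioner does, assuming a simple hash-based scheme similar in spirit to Hadoop's default `HashPartitioner` (hash of the key modulo the number of reducers). The names `hash_partition` and `shuffle` are hypothetical, and `zlib.crc32` is only a deterministic stand-in for Java's `hashCode()`. Every mapper applies the same function locally, so no node has to ask the others, and pairs with the same key meet at the same reducer even when they were produced on different nodes.

```python
import zlib
from collections import defaultdict

def hash_partition(key, num_reducers):
    """Map a key to a reducer index; the same key always gives the same index."""
    # crc32 is a stand-in hash, chosen because it is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair from every mapper to its reducer."""
    reducers = defaultdict(list)
    for node_output in mapper_outputs:
        for key, value in node_output:
            reducers[hash_partition(key, num_reducers)].append((key, value))
    return reducers

# Hypothetical word-count output of two mappers running on different nodes.
node_a = [("apple", 1), ("banana", 1)]
node_b = [("apple", 1), ("cherry", 1)]
routed = shuffle([node_a, node_b], num_reducers=3)
```

Both `("apple", 1)` pairs end up in the same reducer list, although they come from different nodes. A custom partitioner would replace `hash_partition` with a function that uses knowledge about the key distribution to balance the reducers better.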

Sort phase: I read the "Shuffle and Sort" part of "Hadoop: The Definitive Guide" (Tom White). As far as I understood it, the "sort phase" exists because data is spilled to disk during the map phase (= too much data to hold it all in memory). The data is spilled into multiple files, which get merged later on. To make the later merging cheaper, the data gets sorted first (merging two sorted files into a new sorted file is easier than merging two unsorted files into one sorted file). So the sort phase is "just" an optimization/detail and not essential for MR.
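
The merge argument can be sketched as follows, assuming each spill file is already sorted by key and is modelled here as a plain Python list. The helper names `merge_spills` and `group_by_key` are hypothetical. `heapq.merge` only ever looks at the current head of each input, so the sorted runs can be combined in one streaming pass without loading everything into memory.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_spills(*spills):
    """Lazily merge several key-sorted runs into one key-sorted stream."""
    return heapq.merge(*spills, key=itemgetter(0))

def group_by_key(sorted_pairs):
    """Collect the values of adjacent pairs that share a key."""
    for key, pairs in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, [value for _, value in pairs]

# Two hypothetical spill files, each sorted by key.
spill_1 = [("apple", 1), ("cherry", 1)]
spill_2 = [("apple", 1), ("banana", 1)]

merged = list(merge_spills(spill_1, spill_2))
grouped = list(group_by_key(merged))
```

A side effect of the sorted order is that equal keys sit next to each other in the merged stream, so collecting all values of one key also needs only a single pass.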

The book is quite easy to find: "oreilly hadoop the definitive guide type:pdf"

Re: MapReduce: Partitioner/Sort

Posted: 1 Jul 2016 12:00
by AizazZaidee
Yes, I think you are right. If you know something about your keys, you can also handle this earlier (in the map step) and pass the data to the respective partitioner; both approaches work.
A second reason for keeping the data in multiple files is that, as you already said, Hadoop keeps the data in sorted order.
For example, splitting a list of names by index gives extra efficiency when more data has to be added: the data stays sorted and no additional sort step is needed.
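
One possible reading of the name-list example, as a sketch only: the list is split into buckets by first letter (the "index"), and each bucket is kept sorted. The bucket scheme and the helper names `add_name` and `all_names` are assumptions made for illustration, not something taken from Hadoop.

```python
import bisect

def add_name(buckets, name):
    """Insert a name into the sorted bucket for its first letter."""
    bucket = buckets.setdefault(name[0].upper(), [])
    bisect.insort(bucket, name)  # only this one bucket is touched

def all_names(buckets):
    """Concatenate the buckets in index order; the result is sorted overall."""
    return [name for letter in sorted(buckets) for name in buckets[letter]]

buckets = {}
for name in ["Meier", "Bauer", "Schmidt", "Becker"]:
    add_name(buckets, name)

# Adding more data later only changes the "B" bucket.
add_name(buckets, "Berg")
names = all_names(buckets)
```

Because every bucket covers a disjoint key range, reading the buckets in index order yields the complete list in sorted order without a separate global sort.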

I hope you understand my point of view.