MapReduce: Partitioner/Sort

Moderator: Concepts and Technologies for DS and BDP

paprikawuerzung
Mausschubser
Posts: 80
Joined: 23 Mar 2014 23:33

MapReduce: Partitioner/Sort

Post by paprikawuerzung »

What are the partitioner and sort phases in MR needed for? Optimization? As far as I understand, partitioning causes data to be sent to other nodes. Why isn't the mapper's output reduced directly on the same node (data locality) instead of being sent across the network again? What is the sort phase good for? Is it an optimization heuristic?

paprikawuerzung
Mausschubser
Posts: 80
Joined: 23 Mar 2014 23:33

Re: MapReduce: Partitioner/Sort

Post by paprikawuerzung »

Ah, sorry. I forgot that the course language is English.

1) What's the purpose of the partitioning phase?
As far as I understand, partitioning causes data to be sent to other nodes. Why isn't it reduced or processed on the same node (data locality)?

2) What's the purpose of the sort phase?

robertH
Mausschubser
Posts: 58
Joined: 29 Apr 2013 13:11

Re: MapReduce: Partitioner/Sort

Post by robertH »

As far as I understand it, data locality is taken into account when partitioning the data / the work. Sending data to other nodes is needed whenever too much data is stored on only a small number of nodes. It is then faster to send it (slowly) to other nodes and get a more uniform workload than to have many nodes waiting for a few nodes to finish their work.

Also, in general, partitioning the data is needed because you do not know for certain how much time a task will take. If one part of the job is surprisingly hard, chances are high that the node will still finish in a reasonable time, because the work chunks are small, and while this node is occupied the other nodes can receive additional chunks to work on.
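
The effect of small work chunks can be illustrated with a toy simulation. This is only a sketch under simple assumptions, not how Hadoop schedules tasks: the function name `makespan`, the chunk costs, and the rule "an idle worker pulls the next chunk" are all made up for illustration.

```python
import heapq

def makespan(chunk_costs, num_workers):
    """Time until the last worker is done, if every idle worker
    simply pulls the next chunk from the queue (toy model)."""
    free_at = [0.0] * num_workers          # time at which each worker becomes idle
    heapq.heapify(free_at)
    for cost in chunk_costs:
        start = heapq.heappop(free_at)     # earliest idle worker takes the chunk
        heapq.heappush(free_at, start + cost)
    return max(free_at)

# Same total work (16 units) on 4 workers, split two ways (hypothetical numbers).
coarse = [10, 2, 2, 2]                     # one big chunk per worker, one is "surprisingly hard"
fine = [5, 5, 1, 1, 1, 1, 1, 1]            # every chunk cut in half

coarse_time = makespan(coarse, 4)
fine_time = makespan(fine, 4)
```

With these numbers the coarse split finishes after 10 time units, because three workers sit idle while one works on the hard chunk, whereas the fine split finishes after 5.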

paprikawuerzung
Mausschubser
Posts: 80
Joined: 23 Mar 2014 23:33

Re: MapReduce: Partitioner/Sort

Post by paprikawuerzung »

Thanks for the answer!

I may have found another reason: Hadoop does not know the distribution of the intermediate key/value pairs, so a single node cannot determine on its own where to send its processed (= mapped) data. It would have to query all other nodes to learn the distribution of the intermediate keys. If you know something about that distribution, you can supply a more efficient partitioner that splits the intermediate keys more uniformly (= the workload is better balanced across the reducers).
Is this right?
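
To make the partitioner idea concrete, here is a small Python sketch (not Hadoop code). It relies on one fact about MapReduce: all values for a given key must end up at the same reducer, which is why mapper output has to travel across the network at all. The function names, the use of CRC32 as hash, and the split points are assumptions chosen for illustration. Hadoop's default `HashPartitioner` works on the same principle, using the key's `hashCode()` modulo the number of reduce tasks.

```python
import bisect
import zlib

def hash_partition(key, num_reducers):
    """Needs no knowledge of the key distribution: every mapper can
    compute the target reducer locally, and equal keys always agree."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def range_partition(key, split_points):
    """Uses prior knowledge of the key distribution: split_points are
    sorted boundaries chosen so that each range holds similar load."""
    return bisect.bisect_right(split_points, key)

# Hypothetical boundaries for 3 reducers: keys < "h", "h" <= keys < "p", keys >= "p".
split_points = ["h", "p"]

targets = {k: range_partition(k, split_points) for k in ["apple", "kiwi", "zebra"]}
```

Two mappers on different nodes that both see the key "apple" compute the same reducer index without talking to each other. The range variant only balances well if the split points match the real key distribution.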

Sort phase: I read the "Shuffle and Sort" part of "Hadoop: The Definitive Guide" (Tom White). As far as I understood it, the "sort phase" exists because data is spilled to disk during the map phase (= too much data to hold it all in memory). The data is spilled into multiple files, which are merged later. To make that merge cheaper, the data is sorted first (merging two sorted files into a new sorted file is easier than merging two unsorted files into one sorted file). So the sort phase is "just" an optimization/detail and not essential for MR.
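
The merge of sorted spill files described above can be sketched in a few lines of Python. This is a toy model with in-memory lists standing in for spill files; the data and the names `merge_spills` and `reduce_counts` are invented for illustration. It also shows a side effect of the sorted order: equal keys end up next to each other, so a reducer can process one key at a time in a single pass.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Three "spill files", each already sorted by key (hypothetical word-count data).
spill_1 = [("apple", 1), ("cherry", 1)]
spill_2 = [("apple", 1), ("banana", 1)]
spill_3 = [("banana", 1), ("cherry", 1), ("cherry", 1)]

def merge_spills(*spills):
    """Lazy k-way merge: only the current head of each input is compared,
    so the inputs never have to be loaded or re-sorted as a whole."""
    return heapq.merge(*spills, key=itemgetter(0))

def reduce_counts(sorted_pairs):
    """Sums the values per key; relies on equal keys being adjacent."""
    return [(key, sum(value for _, value in group))
            for key, group in groupby(sorted_pairs, key=itemgetter(0))]

result = reduce_counts(merge_spills(spill_1, spill_2, spill_3))
```

Here `result` is `[("apple", 2), ("banana", 2), ("cherry", 3)]`. With unsorted inputs, `groupby` would split the same key into several groups, so the grouping step would need a hash table or a full sort instead.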

The book is quite easy to find: "oreilly hadoop the definitive guide type:pdf"

AizazZaidee
BASIC-Programmierer
Posts: 106
Joined: 20 Apr 2016 22:49

Re: MapReduce: Partitioner/Sort

Post by AizazZaidee »

Yes, I think you are correct. If you know something about your keys, you can also act on that earlier (in your map step) and pass the data to the respective partitioner. Both approaches are valid.
A second reason for keeping the data in multiple files is, as you already said, that Hadoop keeps data in sorted order.
For example, splitting a name list by index makes it more efficient to add more data later: the data stays sorted and no additional sort step is needed.

I hope you understand my point of view.
Thanks,
AZ
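
One way to read the name-list example is the following Python sketch. It is an interpretation, not something stated in the post: the "index" is assumed to be the first letter of the name, and `add_name` and `all_names` are hypothetical helpers.

```python
import bisect
from collections import defaultdict

def add_name(buckets, name):
    """Inserts a name into its bucket at the position that keeps the
    bucket sorted, so no separate sort step is needed afterwards."""
    bucket = buckets[name[0].upper()]      # assumed index: first letter
    bisect.insort(bucket, name)

def all_names(buckets):
    """Concatenating the buckets in index order yields the full sorted list."""
    return [name for key in sorted(buckets) for name in buckets[key]]

buckets = defaultdict(list)
for name in ["Miller", "Adams", "Moore", "Allen"]:
    add_name(buckets, name)

names = all_names(buckets)
```

Each insertion only touches one small bucket, and `names` comes out as `["Adams", "Allen", "Miller", "Moore"]` without ever sorting the whole list.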
