recommended number of map task Size and copies?

Moderator: Concepts and Technologies for DS and BDP

riyad786
Freshman
Posts: 16
Joined: 31 Mar 2016, 21:36

Post by riyad786 »

Hello everyone,

In GFS, why are data files divided into 64 MB blocks, with 3 copies of each stored on different machines? Why are 3 copies recommended? Would more than 3 copies cause any problems? Can anyone explain this, please?

Thanks in advance,
riyad.

frankh
Newcomer
Posts: 2
Joined: 27 Oct 2015, 17:18
Location: Darmstadt

Re: recommended number of map task Size and copies?

Post by frankh »

As far as I understand, both values (64 MB and 3 replicas) are default settings that work well for most cases, but that you can change depending on the requirements of your application. Both settings involve a trade-off that you need to decide on:

For the block size, the trade-off is between the overhead for small files and the performance of processing huge amounts of data. Even a 100 kB file gets a chunk of its own (GFS allocates the space lazily, so it does not physically occupy the full 64 MB), and a file that consists of a single chunk can become a hot spot when many clients access it. On the other hand, the master has to keep metadata for every chunk, and clients have to request and cache the locations of all the chunks they work on, which leads to a huge management overhead if the size is set too small. You might want to check out the GFS paper; the authors explain their decisions on the chunk size in section 2.5.
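To get a feeling for the management overhead, here is a small Python sketch that counts how many chunks a data set splits into for different chunk sizes. The helper names are made up for this post, and the 64 bytes of metadata per chunk is only an assumed round figure for illustration, not a measured value.

```python
MB = 1024 ** 2
TB = 1024 ** 4

def chunk_count(data_bytes, chunk_bytes):
    """Number of chunks needed for data_bytes (the last chunk may be partial)."""
    # Integer ceiling division, so even a tiny file needs one chunk.
    return (data_bytes + chunk_bytes - 1) // chunk_bytes

def metadata_bytes(data_bytes, chunk_bytes, per_chunk_bytes=64):
    """Rough metadata footprint; per_chunk_bytes is an assumed value."""
    return chunk_count(data_bytes, chunk_bytes) * per_chunk_bytes

if __name__ == "__main__":
    for size_mb in (1, 64):
        chunks = chunk_count(1 * TB, size_mb * MB)
        meta = metadata_bytes(1 * TB, size_mb * MB)
        print(f"{size_mb:>2} MB chunks: {chunks} chunks, ~{meta // MB} MB metadata")
```

With 1 MB chunks, the same terabyte needs 64 times as many chunk entries as with 64 MB chunks, and every one of them has to be tracked and looked up.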

And for the number of replicas, the trade-off is between reliability and the impact on performance and storage.
As an example: suppose your application requires a 99.999% probability that your data does not get lost, and the reliability of each node is 99.0% (giving a failure probability of 1%). Assuming nodes fail independently, you can check how many replicas you need by calculating the probability that all replicas fail simultaneously, which would be equivalent to data loss, since there would be no replica left to restore from.
1-0.01¹ = 0.99 (99%)
1-0.01² = 0.9999 (99.99%) ← 2 replicas are not enough (99.99% < 99.999%)
1-0.01³ = 0.999999 (99.9999%) ← 3 replicas will suffice (99.9999% > 99.999%)
1-0.01⁴ = ...
Of course, you can add more replicas, but you would exceed the required reliability, and that has a negative impact on performance and cost (overhead to manage the replicas, additional storage that makes it more expensive, ...).
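If you want to play with the numbers, the calculation above can be sketched in a few lines of Python. This is only a toy model: it assumes that nodes fail independently with the same probability, and the function names are made up for this post.

```python
def survival_probability(node_failure_prob, replicas):
    """Probability that at least one of the replicas survives."""
    # Data is lost only if all replicas fail at the same time.
    return 1.0 - node_failure_prob ** replicas

def min_replicas(node_failure_prob, required_reliability, max_replicas=20):
    """Smallest replica count that reaches the required reliability."""
    for n in range(1, max_replicas + 1):
        if survival_probability(node_failure_prob, n) >= required_reliability:
            return n
    raise ValueError("required reliability not reachable within max_replicas")

if __name__ == "__main__":
    # 1% failure probability per node, 99.999% required reliability
    print(min_replicas(0.01, 0.99999))  # prints 3
```

With less reliable nodes or a stricter requirement, the same loop gives you a higher replica count, which is where the extra storage and management cost comes in.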

riyad786
Freshman
Posts: 16
Joined: 31 Mar 2016, 21:36

Re: recommended number of map task Size and copies?

Post by riyad786 »

Hi Frankh,

thank you... that was a really awesome explanation.. :D

Have a nice weekend!

Riyad
