Moderator: Concepts and Technologies for DS and BDP
In GFS, why are data files divided into 64 MB chunks, with 3 copies of each stored on different machines? Why are 3 copies recommended? Would having more than 3 copies cause any problems? Can anyone explain this, please?
Thanks in advance!
For the block size, you have to weigh the overhead for small files (even a 100 kB file occupies a whole 64 MB chunk) against the performance when processing huge amounts of data. Keep in mind that the master keeps metadata for every chunk, and every client working on the data caches the locations of the chunks involved, which leads to huge management overhead if the chunk size is set too small. You might want to check out the GFS paper; the authors explain their decision on the chunk size in section 2.5.
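To make the metadata argument concrete, here is a minimal back-of-the-envelope sketch in Python. The figure of roughly 64 bytes of metadata per chunk is an assumption for illustration (it is in the right ballpark for GFS, but the exact value is not the point); the sketch just shows how the chunk count and the metadata footprint explode when the chunk size shrinks.

import math

def metadata_footprint(dataset_bytes, chunk_bytes, bytes_per_chunk_meta=64):
    """Return (number of chunks, approximate metadata size in bytes).
    bytes_per_chunk_meta is an illustrative assumption, not a GFS constant."""
    chunks = math.ceil(dataset_bytes / chunk_bytes)
    return chunks, chunks * bytes_per_chunk_meta

PB = 10**15
MB = 10**6

for chunk_mb in (1, 16, 64):
    chunks, meta = metadata_footprint(1 * PB, chunk_mb * MB)
    print(f"{chunk_mb:>3} MB chunks: {chunks:>13,} chunks, "
          f"~{meta / MB:,.0f} MB of chunk metadata")

For a 1 PB dataset this prints about 15.6 million chunks (roughly 1 GB of metadata) at 64 MB, versus a billion chunks (tens of GB of metadata) at 1 MB, which is why the chunk size should not be set too small.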
And for the number of replicas, you have to decide between reliability and the impact on performance and storage.
As an example: if your application requires a reliability of 99.999% that your data does not get lost, and the reliability of each node is 99.0% (giving a failure probability of 1%), you can work out how many replicas you need by calculating the probability that all replicas fail simultaneously, which would be equivalent to data loss, since there would be no replica left to restore from.
1 − 0.01¹ = 0.99 (99%) ← 1 replica is not enough (99% < 99.999%)
1 − 0.01² = 0.9999 (99.99%) ← 2 replicas are not enough (99.99% < 99.999%)
1 − 0.01³ = 0.999999 (99.9999%) ← 3 replicas will suffice (99.9999% > 99.999%)
1 − 0.01⁴ = ...
Of course, you can add more replicas, but you would overshoot the required reliability, and that has a negative impact on performance and cost (overhead to manage the replicas, additional storage, ...).
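If you want to play with the numbers, here is a minimal Python sketch of the same calculation. The target reliability and per-node failure probability are just the example values from above, not GFS defaults: it prints the combined reliability for 1 to 4 replicas and the smallest replica count that meets the target.

def replicas_needed(target_reliability, node_failure_prob):
    """Smallest replica count r with 1 - p_fail**r >= target_reliability."""
    replicas = 1
    while 1 - node_failure_prob ** replicas < target_reliability:
        replicas += 1
    return replicas

target = 0.99999   # required reliability: 99.999% (example value)
p_fail = 0.01      # each node fails with probability 1%, i.e. 99% reliable

for r in range(1, 5):
    print(f"{r} replica(s): reliability = {1 - p_fail ** r:.6f}")

print("minimum replicas needed:", replicas_needed(target, p_fail))

With these inputs the minimum comes out as 3 replicas, matching the calculation above; note that this simple model assumes node failures are independent, which real deployments only approximate.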
Thank you... that was a really great explanation!