Discussion:
How to estimate hardware needed for a Hadoop cluster
Amine Tengilimoglu
2018-10-21 08:25:51 UTC
Hi all;

I want to learn how I can estimate the hardware needed for a Hadoop
cluster. Is there a standard or any other guideline for this?

For example, I have 10 TB of data that I will analyze. My replication
factor will be 2.

How much RAM do I need for one node, and how can I estimate it?
How much disk do I need for one node, and how can I estimate it?
How many CPU cores do I need for one node?


Thanks in advance.
r***@post.bgu.ac.il
2018-10-21 09:17:39 UTC
That's a tricky question, and the answer depends mostly on how you plan to use Hadoop, more specifically on what your use case is (for example, word count).
The answer should be split into storage (disk) and computation limits (disk, CPU, and memory).
1. Disk - if you are using the default file system (HDFS) with the default block size of 128 MB, you will need 20 TB of space just to store the input, and the input will occupy 10,000,000 MB / 128 MB = 78,125 blocks, i.e. 2 * 78,125 = 156,250 block replicas with replication factor 2.
Beyond that, it depends on the size of your map output (which is deleted at the end of the shuffle) and of your reduce output (which probably won't be the bottleneck).
If you expect your map output to be no larger than the input (in both the number and the size of the tuples), then around 40 TB across the cluster should be enough; see the sizing sketch after this list.
2. Memory - if you want the computation to run as concurrently as possible, it depends on the amount of memory you assign to the containers (the ApplicationMaster, the mappers, and the reducers) in the cluster configuration (yarn-site.xml and mapred-site.xml), on the number of containers each node can host, and on what the use case demands (perhaps each mapper should get at least 2048 MB); see the container sketch after this list. Otherwise, some of the containers will have to wait for free memory (by default, task assignment is driven by memory only).
3. CPU - the same reasoning as for memory, but it can be irrelevant if CPU is not what limits your container computation and assignment.
* When you configure your cluster, please also pay attention to the JVM heap size of each container; it should be somewhat smaller than the container's memory allocation.
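
To make the disk arithmetic in point 1 concrete, here is a minimal sizing sketch in plain Python, using the 10 TB input, 128 MB block size, and replication factor 2 from this thread; the factor-of-2 headroom for intermediate and final output is an assumption you should tune to your own job:

# Rough HDFS storage estimate -- a sketch, not a precise capacity plan.

TB_IN_MB = 1_000_000            # work in MB (1 TB = 1,000,000 MB)

input_mb      = 10 * TB_IN_MB   # 10 TB of raw input data
block_size_mb = 128             # HDFS default block size
replication   = 2               # replication factor from the question

# Logical HDFS blocks for the input, and physical block replicas on disk.
logical_blocks = input_mb // block_size_mb      # 78,125
block_replicas = logical_blocks * replication   # 156,250

# Raw disk needed just to hold the replicated input.
input_storage_mb = input_mb * replication       # 20 TB

# Headroom for intermediate map output (deleted after the shuffle) and for
# the final job output; the factor of 2 reproduces the ~40 TB figure above
# and assumes the map output is no larger than the input.
headroom_factor = 2.0
total_mb = input_storage_mb * headroom_factor

print(f"blocks: {logical_blocks:,}  block replicas: {block_replicas:,}")
print(f"disk estimate: {total_mb / TB_IN_MB:.0f} TB across the cluster")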
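
For the memory and CPU points, here is a minimal sketch of the per-node container arithmetic. The node sizes and the 2048 MB per container are illustrative assumptions; the property names in the comments are the standard ones set in yarn-site.xml and mapred-site.xml:

# Rough estimate of how many containers one worker node can run at once --
# a sketch under assumed node sizes, not a recommendation.

node_ram_mb = 64 * 1024   # assumption: 64 GB of physical RAM per worker node
node_cores  = 16          # assumption: 16 physical cores per worker node

# Leave room for the OS and the DataNode / NodeManager daemons.
reserved_mb    = 8 * 1024
reserved_cores = 2

# What YARN may hand out on this node
# (yarn.nodemanager.resource.memory-mb / yarn.nodemanager.resource.cpu-vcores).
yarn_memory_mb = node_ram_mb - reserved_mb
yarn_vcores    = node_cores - reserved_cores

# Per-container request (mapreduce.map.memory.mb / mapreduce.reduce.memory.mb),
# e.g. 2048 MB and one vcore per container.
container_mb     = 2048
container_vcores = 1

# The JVM heap (-Xmx in mapreduce.map.java.opts) should stay below the
# container size, e.g. around 80% of it, or YARN may kill the container.
heap_mb = int(container_mb * 0.8)

by_memory = yarn_memory_mb // container_mb
by_cpu    = yarn_vcores // container_vcores
print(f"concurrent containers per node: {min(by_memory, by_cpu)} "
      f"(memory allows {by_memory}, vcores allow {by_cpu}; heap ~{heap_mb} MB)")

Note that out of the box YARN's scheduler counts only memory when placing containers, which is why point 2 usually matters more than point 3 unless you change the resource calculator.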

Good luck