HADOOP CAPACITY PLANNING
Daily Data:-
Historical data, always present = 400 TB, call it (A)
XML data = 100 GB per day, call it (B)
Data from other sources = 50 GB per day, call it (C)
Replication factor (let us assume 3) = 3, call it (D)
Space for intermediate MR output (non-HDFS) = 30% of (B + C), call it (E)
Space for OS and other admin activities (non-HDFS) = 30% of (B + C), call it (F)
Daily Data = (D * (B + C)) + E + F = 3 * 150 + 30% of 150 + 30% of 150
Daily Data = 450 + 45 + 45 = 540 GB per day as the absolute minimum.
Add a 10% buffer = 540 + 54 = 594 GB per day.
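The daily arithmetic above can be checked with a short Python sketch; the variable names mirror the labels (B)-(F) from the text, and the buffer is taken as the 54 GB (10% of 540 GB) added above.

```python
# Daily storage estimate, using the labels from the text (all sizes in GB).
B = 100                       # XML data per day
C = 50                        # data from other sources per day
D = 3                         # replication factor
E = (B + C) * 30 // 100       # intermediate MR output, 30% of (B + C) = 45
F = (B + C) * 30 // 100       # OS / admin space, 30% of (B + C) = 45

daily = D * (B + C) + E + F   # 450 + 45 + 45 = 540 GB
buffer = daily * 10 // 100    # 10% buffer = 54 GB
daily_total = daily + buffer  # 594 GB per day

print(daily, daily_total)     # 540 594
```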
Monthly Data = 30 * 594 = 17,820 GB, which is approximately 18 TB per month. (The 400 TB of historical data (A) is a one-time addition to total capacity, not a monthly cost.)
Yearly Data = 18 TB * 12 = 216 TB
Now that we have an approximate idea of the yearly data, let us calculate the other parameters:-
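Extending the same arithmetic, a sketch of the monthly and yearly totals (using decimal TB, as the text does; the 400 TB historical figure is carried over from (A) as a one-time amount):

```python
daily_total_gb = 594                    # daily volume including buffer, from above
monthly_gb = 30 * daily_total_gb        # 17,820 GB per month
monthly_tb = round(monthly_gb / 1000)   # roughly 18 TB per month (decimal TB)
yearly_tb = monthly_tb * 12             # 216 TB of new data per year
historical_tb = 400                     # (A): one-time addition to total capacity

print(monthly_gb, monthly_tb, yearly_tb)   # 17820 18 216
```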
Number of Node:-
As a recommendation, a group of around 12 nodes, each with 2-4 disks (JBOD) of 1 to 4 TB capacity, is a good starting point.
216 TB / 12 nodes = 18 TB per node in a cluster of 12 nodes.
If we keep a JBOD of 4 disks of 5 TB each, then each node in the cluster will have 5 TB * 4 = 20 TB.
So we get 12 nodes, each with a 20 TB JBOD of HDDs.
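The node sizing can be sketched the same way; the 12-node count and the 4 x 5 TB JBOD layout are the values chosen in the text:

```python
yearly_tb = 216                          # yearly data from the previous step
nodes = 12                               # recommended starting cluster size
needed_per_node = yearly_tb / nodes      # 18 TB required per node

disk_tb, disks_per_node = 5, 4           # JBOD of 4 x 5 TB disks
node_capacity = disk_tb * disks_per_node # 20 TB per node

# Each node's 20 TB comfortably covers the required 18 TB.
assert node_capacity >= needed_per_node
print(needed_per_node, node_capacity)    # 18.0 20
```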
Number of Core in each node:-
A rule of thumb is one core per task. If the tasks are not that heavy, then we can allocate 0.75 core per task.
Say the machine has 12 cores; then we can run at most 12 + (0.25 * 12) = 15 tasks, where the extra 25% is added on the assumption that each task uses only 0.75 of a core. So we can run 15 tasks in parallel, divided as 8 mappers and 7 reducers on each node.
So far, we have figured out 12 nodes, each with 12 cores and 20 TB of capacity.
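The task-slot arithmetic above, using the text's approximation of 0.75 core per task (i.e. 25% extra slots on top of the core count), can be sketched as:

```python
cores = 12                        # cores per node
# The text approximates 12 / 0.75 as 12 + 25% of 12 = 15 task slots.
tasks = cores + cores * 25 // 100 # 12 + 3 = 15 parallel tasks per node

mappers, reducers = 8, 7          # the split suggested in the text
assert mappers + reducers == tasks

print(tasks)                      # 15
```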
Memory (RAM) size:-
This one is straightforward. We should reserve 1 GB per task on the node, so 15 tasks means 15 GB, plus some memory for the OS and other related activities – around 2-3 GB.
So each node will have 15 GB + 3 GB = 18 GB of RAM.
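The RAM sizing follows directly; the 3 GB of overhead is the upper end of the 2-3 GB range mentioned above:

```python
tasks = 15              # parallel tasks per node, from the core calculation
ram_per_task_gb = 1     # 1 GB reserved per task
os_overhead_gb = 3      # OS and other admin activities (upper end of 2-3 GB)

ram_gb = tasks * ram_per_task_gb + os_overhead_gb  # 18 GB per node
print(ram_gb)                                      # 18
```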
Network Configuration:-
Data transfer plays a key role in Hadoop throughput, so we should connect the nodes at a speed of at least 10 Gbit/s (10 Gigabit Ethernet).
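As a rough sanity check (a back-of-envelope sketch, assuming 10 Gigabit Ethernet links, i.e. about 1.25 GB/s per link):

```python
daily_gb = 594                        # daily write volume incl. replication, from above
link_gbit_per_s = 10                  # assumed link speed: 10 Gbit/s
link_gb_per_s = link_gbit_per_s / 8   # = 1.25 GB/s

seconds = daily_gb / link_gb_per_s    # time to push one day's data over one link
print(round(seconds))                 # 475
```

Under eight minutes for a full day's volume over a single link, so the raw daily ingest is not the bottleneck; the link speed matters mainly for shuffle and replication traffic during jobs.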