HADOOP CAPACITY PLANNING
Daily Data:-
Historical data, always present ................ 400 TB .... say it (A)
XML data ....................................... 100 GB .... say it (B)
Data from other sources ........................ 50 GB ..... say it (C)
Replication factor (let us assume 3) ........... 3 ......... say it (D)
Space for intermediate MR output (30% non-HDFS)  30% of (B + C) ... say it (E)
Space for OS and other admin activities (30% non-HDFS)  30% of (B + C) ... say it (F)

Daily Data = D * (B + C) + E + F
           = 3 * 150 + 30% of 150 + 30% of 150
           = 450 + 45 + 45
           = 540 GB per day, which is the absolute minimum.

Add a 10% buffer: 540 + 54 = 594 GB per day.

Monthly Data = 30 * 594 = 17,820 GB, which is nearly 18 TB of new data per month (the 400 TB of historical data in (A) sits on top of this growth).

Yearly Data = 18 TB * 12 = 216 TB of new data.

Now that we have an approximate idea of the yearly data, let us calculate the other things:-

Number of Nodes:-
As a recommendation, a group of around 12 nodes, each with 2-4 JBOD disks of 1 to 4 TB capacity, is a good starting point.

216 TB / 12 nodes = 18 TB per node in a cluster of 12 nodes.

So if we keep a JBOD of 4 disks of 5 TB each, then each node provides 20 TB of raw capacity, which comfortably covers the 18 TB requirement.
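The arithmetic above can be captured in a short script, which makes it easy to re-run the plan with different inputs. This is a minimal sketch using the assumed values from this article (replication 3, 30% MR and OS overheads, 10% buffer, 30-day month, decimal GB-to-TB conversion); all variable names are my own.

```python
# Hadoop capacity-planning sketch using the figures assumed above.

historical_tb = 400      # (A) historical data, TB (sits on top of new growth)
xml_gb = 100             # (B) daily XML data, GB
other_gb = 50            # (C) daily data from other sources, GB
replication = 3          # (D) HDFS replication factor
mr_overhead = 0.30       # (E) intermediate MR output, fraction of (B + C)
os_overhead = 0.30       # (F) OS / admin space, fraction of (B + C)
buffer = 0.10            # safety buffer on the daily figure

raw_daily = xml_gb + other_gb                      # 150 GB of new data per day
daily = (replication * raw_daily
         + mr_overhead * raw_daily
         + os_overhead * raw_daily)                # 540 GB per day, minimum
daily_buffered = daily * (1 + buffer)              # 594 GB per day with buffer

monthly_gb = 30 * daily_buffered                   # ~17,820 GB, roughly 18 TB
yearly_tb = round(monthly_gb / 1000) * 12          # ~216 TB of new data a year

nodes = 12
per_node_tb = yearly_tb / nodes                    # storage needed per node

print(f"Daily (buffered):  {daily_buffered:.0f} GB")
print(f"Monthly new data:  {monthly_gb:.0f} GB")
print(f"Yearly new data:   {yearly_tb} TB")
print(f"Per node ({nodes} nodes): {per_node_tb:.0f} TB")
```

With 4 x 5 TB JBOD disks per node (20 TB raw), each node covers its 18 TB share with some headroom.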