Monday, June 12, 2017

Prerequisites for Hadoop

  • Introduction
    • This page describes the information you need when building a Hadoop cluster.
    • Hadoop and its surrounding ecosystem technologies evolve quickly, so there is no permanent right answer here; this material needs continual review.

    • Dual top-of-the-rack architecture

    • Network availability and sufficient network bandwidth are critical for cluster operation. To help avoid cluster failure, avoid single points of network failure. This means using dual, bonded network ports, dual top-of-the-rack switches, and dual core switches.
      Network bandwidth is the most challenging parameter to estimate because Hadoop workloads vary greatly from cluster to cluster, and even within the same cluster at different times. It has been typical to see dual 1 Gb Ethernet ports on the worker nodes, but you might need more. Using 10 Gb Ethernet ports helps ensure that your network bandwidth remains sufficient well into the future, but is more expensive to purchase. In any case, to help ensure that your cluster receives all available network bandwidth, you should dedicate the network switches to the cluster.
      You should also consider the effect of a worker node failure. HDFS maintains three copies of all data blocks and ensures that not all copies reside on the same worker node. If a worker node fails, HDFS automatically makes additional copies of all data blocks that resided on the failed machine. This can produce significant additional network traffic, since many of these data blocks must be copied across the network. For example, if a worker node holding 10 terabytes of data fails, the cluster will generate approximately 10 terabytes of network traffic to recover, as the sketch below illustrates.
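      To put rough numbers on that recovery scenario, here is a minimal Python sketch. The node size and link speeds are illustrative assumptions, and it optimistically assumes the rebuild can use the full link rate, which real clusters rarely achieve:

          # Estimate re-replication traffic and time after a worker node failure.
          # Assumes each lost block replica is copied across the network exactly once.
          def rereplication_hours(lost_tb, usable_gbps):
              """Hours to re-copy lost_tb terabytes at usable_gbps gigabits/second."""
              lost_bits = lost_tb * 1e12 * 8              # terabytes -> bits
              return lost_bits / (usable_gbps * 1e9) / 3600

          # Example from the text: a worker node holding 10 TB of block data fails.
          for gbps in (1, 2, 10):                         # 1 GbE, bonded 2x1 GbE, 10 GbE
              print(f"{gbps:>2} Gb/s -> {rereplication_hours(10, gbps):5.1f} hours")

      At dual bonded 1 GbE the rebuild takes roughly half a day of saturated traffic, which is one reason 10 Gb Ethernet becomes attractive as nodes grow denser.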

  • Servers
    • Minimum specifications
    • Masters 2+
      • Dual quad-core CPUs or better
      • 64+ GB RAM
      • 1+ TB drives
      • RAID 10
      • 2x Gigabit Ethernet
      • Redundant power supply
    • Slaves 3+
      • Dual quad-core CPUs or better
      • 64+ GB RAM
      • 2x internal SATA drives for the OS
      • 6x 2 TB+ SATA drives as JBOD (opinions differ on this point; it is one alternative to a parallel-disk setup)
    • Hardware Guidelines for Master Nodes
      • Master nodes run master service components. As a result, availability is a primary concern. Availability is enhanced through redundancy. Where possible, all hardware components should be configured for redundancy.
        Use RAID 10 storage for both the operating system and all data disks. Configure dual, bonded Ethernet NICs. Consider dual power supplies and cooling fans. Also use ECC-protected memory.
        Some Hadoop master service components support high availability configurations across multiple hosts. You may even consider virtualizing the master servers to gain the benefits of live virtual machine migration or virtual machine high availability solutions like VMware HA.
        While not all master service components have the same CPU, memory, storage, or network hardware requirements, consider using the same hardware specifications for all master nodes. This enables an administrator to more easily migrate master service components to any master node as a result of maintenance requirements or following a system failure.
    • Hardware Guidelines for Worker Nodes
      • Worker nodes perform data processing so throughput is more important than availability. Worker nodes are already redundant within the cluster so hardware redundancy in the individual nodes is not as necessary. If a worker node fails, Hadoop automatically takes action to protect data and restart any failed processing jobs. For example, by default, HDFS maintains three copies of all data blocks and ensures that not all copies reside on the same worker node. If a worker node fails, HDFS automatically makes new copies of all data blocks that resided on the failed machine.
        Because throughput is so important for performance, plan for parallel computing and data paths. This includes using dual-CPU socket servers, using multiple disk drives (as many as 8-12) and disk controllers, using fast drives with a fast disk interconnect, and using multiple, bonded Ethernet NICs.
        To simplify cluster configuration, try to use the same hardware specification for all worker nodes. Many Hadoop configuration properties are tied to the number of CPUs, the amount of memory, or the number of disks on the worker node. If different groups of worker nodes have different specifications, then each group must use separate configuration files with different property settings. The good news is that the Ambari Configuration Groups feature accommodates this, though it still adds some administrative effort; the sketch below shows how such per-group settings might be derived.
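        As a concrete illustration of why uniform worker specifications simplify configuration, this minimal Python sketch derives a few per-group NodeManager/DataNode settings from hardware specs. The reservation rule of thumb (roughly 20% of RAM and two cores for the OS and Hadoop daemons) and the /grid/<n> mount convention are assumptions for illustration, not values from this page:

            # Sketch: derive per-group worker settings from hardware specs.
            def worker_settings(ram_gb, cores, data_disks):
                reserved_gb = max(4, int(ram_gb * 0.2))   # OS + DataNode/NodeManager daemons
                return {
                    "yarn.nodemanager.resource.memory-mb": (ram_gb - reserved_gb) * 1024,
                    "yarn.nodemanager.resource.cpu-vcores": max(1, cores - 2),
                    "dfs.datanode.data.dir": ",".join(
                        f"/grid/{i}/hadoop/hdfs/data" for i in range(data_disks)),
                }

            # Two heterogeneous worker groups -> two Ambari configuration groups.
            print(worker_settings(ram_gb=64, cores=16, data_disks=6))
            print(worker_settings(ram_gb=128, cores=24, data_disks=12))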
  • OS
    • Required
      • Configure NTP on all cluster nodes to ensure synchronized time.
      • Configure all cluster nodes for forward and reverse DNS lookups.
      • Configure the system that will run Ambari for password-less SSH access to cluster nodes. If this is not possible for security or other reasons, manually install and register the Ambari agents before HDP installation.
      • Open HDP-specific network ports or disable the firewall.
      • Disable IPv6 on all cluster nodes.
      • For the duration of the installation process, disable Security-Enhanced Linux (SELinux). It can be re-enabled after installation. (A preflight sketch covering several of these items follows this list.)
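      Several of the required items above lend themselves to an automated preflight check. The Python sketch below assumes a typical RHEL/CentOS node where the getenforce and ntpstat commands are available; it verifies forward/reverse DNS agreement, SELinux mode, and NTP synchronization:

          import socket, subprocess

          def dns_ok(hostname):
              """Forward lookup must resolve and the reverse lookup must match."""
              try:
                  ip = socket.gethostbyname(hostname)
                  reverse_name = socket.gethostbyaddr(ip)[0]
                  return reverse_name.split(".")[0] == hostname.split(".")[0]
              except socket.error:
                  return False

          def selinux_off():
              out = subprocess.run(["getenforce"], capture_output=True, text=True)
              return out.stdout.strip() in ("Permissive", "Disabled")

          def ntp_synced():
              # ntpstat exits 0 when the clock is synchronized to an NTP server.
              return subprocess.run(["ntpstat"], capture_output=True).returncode == 0

          for name, ok in [("DNS", dns_ok(socket.getfqdn())),
                           ("SELinux off", selinux_off()),
                           ("NTP synced", ntp_synced())]:
              print(f"{name:12s} {'OK' if ok else 'CHECK'}")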
    • Recommended
      • Linux file systems record the last access time for all files, but this recording carries a small performance cost. To disable it, use the noatime option when mounting a file system. Use the instructions in your vendor documentation to add the noatime mount option.
      • The ext3 and ext4 file systems normally reserve five percent of their disk space for the exclusive use of the root user. On multi-terabyte data disks this reservation wastes a significant amount of space. You may disable or lower this reservation when creating a file system, or afterwards by tuning the file system. Use the instructions in your vendor documentation to change the root-reserved space.
      • Linux kernels have a feature named transparent huge pages that is not recommended for Hadoop workloads. Use the instructions in your operating system documentation to disable it.
      • Ethernet jumbo frames increase an Ethernet packet’s maximum payload from 1500 bytes to approximately 9000 bytes. This larger payload improves network throughput. Use the instructions in your vendor documentation to enable jumbo frames.
      • BIOS-based power management commonly has the ability to increase or decrease CPU clock speeds under certain conditions. Because Hadoop operates as a cluster of machines, having some machines running at slower clock speeds can have an adverse effect on total cluster processing throughput. To avoid this, use the instructions in your vendor documentation to disable BIOS-based power management.
      • Linux systems also place limits on the total number of files a process may have open at the same time, and on the total number of processes a user may run at the same time. These limits can interfere with cluster operation. Use the instructions in your vendor documentation to increase these limits as necessary. (A verification sketch for several of these settings follows this list.)
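      A similar verification sketch works for the recommended settings. In the Python below, the /grid mount prefix and the 65536 file-descriptor target are common conventions assumed for illustration, and the sysfs path for transparent huge pages can differ by distribution (for example, older RHEL releases use redhat_transparent_hugepage):

          import resource

          def mounts_missing_noatime(prefix="/grid"):
              """Data mounts under the given prefix that lack the noatime option."""
              bad = []
              with open("/proc/mounts") as f:
                  for line in f:
                      device, mountpoint, fstype, options = line.split()[:4]
                      if mountpoint.startswith(prefix) and "noatime" not in options.split(","):
                          bad.append(mountpoint)
              return bad

          def thp_enabled():
              # The kernel brackets the active choice, e.g. "always madvise [never]".
              try:
                  with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
                      return "[never]" not in f.read()
              except FileNotFoundError:
                  return False          # path differs on some distributions

          nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
          print("mounts without noatime:", mounts_missing_noatime())
          print("transparent huge pages enabled:", thp_enabled())
          print("open-file limit:", nofile_soft, "(raise via limits.conf if below 65536)")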
  • Cluster

