Hadoop Deployment Cheat Sheet

If you are using, or planning to use, the Hadoop framework for big data and Business Intelligence (BI), this document can help you navigate some of the technology and terminology, and guide you in setting up and configuring a system.
It provides background on the framework: the key distributions, modules, components, and related products. It also provides single-node and multi-node Hadoop installation commands and configuration parameters.
The final section includes tips and tricks to help you get started, and guidance on setting up a Hadoop project.

To learn how we make BI work on Hadoop, see the Jethro Data Sheet.

Key Hadoop Distributions

Vendor – Strength

Apache Hadoop – The open source distribution from Apache
Cloudera – The leading vendor, with proprietary components for enterprise needs
MapR – Committed to ease of use, while supporting high performance and scalability
Hortonworks – 100% open source package
IBM – Integration with IBM analytics products
Pivotal – Integration with Greenplum and Cloud Foundry (CF)

Hadoop Modules

Module – Description

Common – Common utilities that support the other Hadoop modules
HDFS – Hadoop Distributed File System: provides high-throughput access to application data on commodity hardware
YARN – Yet Another Resource Negotiator: a framework for cluster resource management, including job scheduling
MapReduce – A YARN-based software framework for parallel processing of large data sets

Hadoop Components

Component (Module) – Description

NameNode (HDFS) – Holds the directory tree of the HDFS file system (a.k.a. the Hadoop inode)
Secondary NameNode (HDFS) – Checkpointing helper for the NameNode (not a failover standby): periodically merges the edits log into the fsimage file
JournalNode (HDFS) – Shared-edits node that supports automatic failover between NameNodes in an HA setup
DataNode (HDFS) – The nodes (servers) that store the actual data blocks
NFS3 Gateway (HDFS) – Daemons that enable NFSv3 access to HDFS
ResourceManager (YARN) – Global daemon that arbitrates resources among all the applications in the Hadoop cluster
ApplicationMaster (YARN) – Per-application process: negotiates resources from the ResourceManager and works with the NodeManagers to run and monitor the application's tasks
NodeManager (YARN) – Per-machine agent responsible for containers, monitoring their resource usage (CPU, memory, disk), and reporting back to the ResourceManager
Container (YARN) – An allocation of resources that runs a specific task on a specific machine for a specific application

Hadoop Ecosystem – Related Products

Product – Description

Ambari – Completely open-source management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters
Apex – YARN-based platform for big data in motion (stream processing)
Azkaban – Workflow job scheduling and management system for Hadoop
Flume – Reliable, distributed, highly available service for streaming logs into HDFS
Knox – Authentication and access gateway service for Hadoop
HBase – Distributed non-relational database that runs on top of HDFS
Hive – Data warehouse system built on Hadoop
Mahout – Machine learning algorithm implementations (clustering, classification, batch-based collaborative filtering) based on MapReduce
Pig – High-level platform (with a script-like language) for creating and running programs on MapReduce, Tez, and Spark
Impala – Enables low-latency SQL queries on data in HBase and HDFS
Oozie – Workflow job scheduling and management system for Hadoop
Ranger – Access policy manager for HDFS files and folders, and for databases, tables, and columns
Spark – Cluster computing framework that utilizes YARN and HDFS; supports streaming and batch jobs, an SQL-like interface, and a machine learning library
Sqoop – CLI application for migrating data between RDBMSs and Hadoop
Tez – YARN-based application framework for running complex directed acyclic graphs (DAGs) of tasks
ZooKeeper – Distributed name registry, synchronization service, and configuration service, used as a subsystem of Hadoop

Major Hadoop Cloud Providers

Cloud Operator – Service Name

Amazon Web Services – EMR (Elastic MapReduce)
IBM SoftLayer – IBM BigInsights
Microsoft Azure – HDInsight

Common Data Formats

Format – Description

Avro – JSON-based format with RPC and serialization support; designed for systems that exchange data
Parquet – Columnar storage format
ORC – Optimized Row Columnar: fast columnar storage format
RCFile – Record Columnar File: data placement format for relational tables
SequenceFile – Binary data format with records of specific data types
Unstructured – Hadoop also supports various unstructured data formats

Single Node Installation

Java installation

  Check version:
    java -version
  Install:
    sudo apt-get -y update && sudo apt-get -y install default-jdk

Create user and permissions

  Create user:
    useradd hadoop
    passwd hadoop
    mkdir /home/hadoop
    chown -R hadoop:hadoop /home/hadoop
  Create keys:
    su - hadoop
    ssh-keygen -t rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 0600 ~/.ssh/authorized_keys

Install from release tarball

    wget http://apache.spd.co.il/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
    tar xzf hadoop-2.7.2.tar.gz
    mv hadoop-2.7.2 hadoop

Environment

  Add the environment variables to ~/.bashrc, then reload it with source ~/.bashrc:
    export HADOOP_HOME=/home/hadoop/hadoop
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
  Set JAVA_HOME in the Hadoop environment file:
    vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
    export JAVA_HOME=/opt/jdk1.8.0_05/

Configuration files (in $HADOOP_HOME/etc/hadoop/)

    core-site.xml
    hdfs-site.xml
    mapred-site.xml
    yarn-site.xml

Format the NameNode

    hdfs namenode -format

Start the system

    cd $HADOOP_HOME/sbin/
    start-dfs.sh
    start-yarn.sh

Test the system

    bin/hdfs dfs -mkdir /user
    bin/hdfs dfs -mkdir /user/hadoop
    bin/hdfs dfs -put /var/log/httpd logs

Multi-node Installation

Configure hosts on each node

    vi /etc/hosts
      192.168.1.11 hadoop-master
      192.168.1.12 hadoop-slave-1
      192.168.1.13 hadoop-slave-2

Enable cross-node authentication

    su - hadoop
    ssh-keygen -t rsa
    ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
    ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
    ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
    chmod 0600 ~/.ssh/authorized_keys
    exit

Copy the system to the slaves

    su - hadoop
    cd /opt/hadoop
    scp -r hadoop hadoop-slave-1:/opt/hadoop
    scp -r hadoop hadoop-slave-2:/opt/hadoop

Configure the master

    su - hadoop
    cd /opt/hadoop/hadoop
    vi conf/masters        (add: hadoop-master)
    vi conf/slaves         (add: hadoop-slave-1 and hadoop-slave-2)
    bin/hadoop namenode -format

Start the system

    bin/start-all.sh
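With more than two workers, typing the key-distribution and copy commands by hand gets error-prone. A small sketch that generates them into a reviewable script (hostnames are the illustrative ones from the /etc/hosts example above):

```shell
# Sketch: emit a script that distributes the SSH key and copies the
# Hadoop tree to every worker, so it can be reviewed before running.
# Hostnames and paths match the illustrative multi-node example.
WORKERS="hadoop-slave-1 hadoop-slave-2"
OUT=/tmp/distribute-hadoop.sh
{
  echo '#!/bin/sh'
  echo 'set -e'   # stop on the first failed node
  for w in $WORKERS; do
    echo "ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$w"
    echo "scp -r /opt/hadoop/hadoop $w:/opt/hadoop"
  done
} > "$OUT"
chmod +x "$OUT"
```

Inspect /tmp/distribute-hadoop.sh, then run it once as the hadoop user.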

Backup HDFS Metadata

Stop the cluster

    stop-all.sh

Perform a cold backup of the metadata directories

    cd /data/dfs/nn
    tar -czvf /tmp/backup.tar.gz .

Start the cluster

    start-all.sh
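The tar step above can be tried end to end without a cluster. A sketch using a scratch directory that stands in for the NameNode metadata directory (/data/dfs/nn in the example; the file names mimic the real fsimage/edits layout):

```shell
# Sketch: cold-backup of the NameNode metadata directory with tar.
# A scratch copy stands in for /data/dfs/nn so the commands can be
# tried without a real cluster.
NN_DIR=/tmp/demo-dfs-nn
mkdir -p "$NN_DIR/current"
: > "$NN_DIR/current/fsimage"   # placeholder metadata files
: > "$NN_DIR/current/edits"

# -c create, -z gzip (the .tar.gz name implies compression), -f file;
# -C changes into the directory and "." archives its contents.
tar -czf /tmp/backup.tar.gz -C "$NN_DIR" .

# Verify the archive actually contains the metadata.
tar -tzf /tmp/backup.tar.gz | grep fsimage
```

Note that the original one-liner (`tar -cvf /tmp/backup.tar.gz`) names no input files and writes an uncompressed archive despite the .tar.gz suffix; the form above fixes both.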

HDFS Basic Commands

List the contents of the /data directory

    hdfs dfs -ls /data/

Upload a file from the local file system to HDFS

    hdfs dfs -put logs.csv /data/

Read a file from HDFS

    hdfs dfs -cat /data/logs.csv

Change the permissions of a file

    hdfs dfs -chmod 744 /data/logs.csv

Set the replication factor of a file to 3

    hdfs dfs -setrep -w 3 /data/logs.csv

Check the size of a file

    hdfs dfs -du -h /data/logs.csv

Move a file to a sub-directory

    hdfs dfs -mv logs.csv logs/

Remove a directory from HDFS

    hdfs dfs -rm -r logs

HDFS Administration

Balance the cluster storage

    hdfs balancer -threshold <percent>

Run the NameNode

    hdfs namenode

Run the secondary NameNode

    hdfs secondarynamenode

Run a DataNode

    hdfs datanode

Run the NFS3 gateway

    hdfs nfs3

Run the RPC portmap for the NFS3 gateway

    hdfs portmap

YARN

Show yarn help

    yarn

Use a specific configuration directory

    yarn [--config confdir]

Set the log level

    yarn [--loglevel loglevel]   (loglevel is one of FATAL, ERROR, WARN, INFO, DEBUG, or TRACE)

User commands

Show the Hadoop classpath

    yarn classpath

Show and kill applications

    yarn application

Show application attempts

    yarn applicationattempt

Show container information

    yarn container

Show node information

    yarn node

Show queue information

    yarn queue

Administration commands

Start a NodeManager

    yarn nodemanager

Start the web proxy server

    yarn proxyserver

Start the ResourceManager

    yarn resourcemanager

Run the ResourceManager admin client

    yarn rmadmin

Start the Shared Cache Manager

    yarn sharedcachemanager

Start the TimelineServer

    yarn timelineserver

MapReduce

Submit the WordCount example job to the cluster

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input logs-output

Check the output of the job in HDFS

    hadoop fs -cat logs-output/*

Submit a Scalding job

    hadoop jar scalding.jar com.twitter.scalding.Tool Scalding

Kill a MapReduce job

    yarn application -kill <application_id>
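WordCount is the canonical MapReduce example, and its three phases map directly onto a classic shell pipeline. A sketch (file paths and input are illustrative) that is handy for sanity-checking what the job should produce:

```shell
# Sketch: what the WordCount job computes, expressed as the classic
# map -> shuffle -> reduce shell pipeline analogy.
printf 'hello world\nhello hadoop\n' > /tmp/wc-input.txt

tr -s ' ' '\n' < /tmp/wc-input.txt |  # map: emit one word per line
  sort |                              # shuffle: group identical keys together
  uniq -c > /tmp/wc-output.txt        # reduce: count occurrences per key

cat /tmp/wc-output.txt
```

Each cluster mapper does the `tr` step on its input split, the framework's shuffle does the `sort` across the cluster, and each reducer does the `uniq -c` step for its share of the keys.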

Resource Manager UI
