Hadoop Deployment Cheat Sheet
If you are using, or planning to use, the Hadoop framework for big data and Business Intelligence (BI), this document can help you navigate some of the technology and terminology, and guide you in setting up and configuring the system.
In this document we provide some background information about the framework, the key distributions, modules, components, and related products. We also provide you with single and multi-node Hadoop installation commands and configuration parameters.
The final section includes some tips and tricks to help you get started, and provides guidance in setting up a Hadoop project.
To learn how we make BI work on Hadoop, see the Jethro Data Sheet.
Key Hadoop Distributions
Vendor | Strength |
---|---|
Apache Hadoop | The open source distribution from Apache |
Cloudera | The leading vendor, with proprietary components for enterprise needs |
MapR | Committed to ease of use, while supporting high performance and scalability |
Hortonworks | 100% open source package |
IBM | Integration with IBM analytics products |
Pivotal | Integration with Greenplum and Cloud Foundry (CF) |
Hadoop Modules
Module | Description |
---|---|
Common | Common utilities. Supports other Hadoop modules |
HDFS | Hadoop Distributed File System: provides high-throughput access to application data based on commodity hardware |
YARN | Yet Another Resource Negotiator: a framework for cluster resource management including job scheduling |
MapReduce | Software framework for parallel processing of large data sets based on YARN |
Hadoop Components
Component | Module | Description |
---|---|---|
NameNode | HDFS | Maintains the directory tree and file metadata (inodes) of the HDFS file system |
Secondary NameNode | HDFS | Checkpointing helper for the NameNode (not a failover or high-availability node). It creates checkpoints of the namespace by merging the edits log into the fsimage file |
JournalNode | HDFS | Arbiter node that supports auto failover between NameNodes |
DataNode | HDFS | Nodes (or servers) that store the actual data |
NFS3 Gateway | HDFS | Daemons that enable NFS3 support |
ResourceManager | YARN | Global daemon that arbitrates resources among all the applications in the Hadoop cluster |
ApplicationMaster | YARN | Takes care of a single application: gets resources for it from the ResourceManager and works with the NodeManager to consume them and monitor the tasks |
NodeManager | YARN | Single machine agent that is responsible for the containers as well as allocation and monitoring of resource usage such as CPU and disk, and reporting back to the ResourceManager |
Container | YARN | A bundle of allocated resources (CPU, memory) on a specific machine, in which an application runs a specific task |
Hadoop Ecosystem – Related Products
Product | Description |
---|---|
Ambari | A completely open-source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters |
Apex | Big data in motion platform based on YARN |
Azkaban | Workflow job scheduling and management system for Hadoop |
Flume | Reliable, distributed and available service that streams logs into HDFS |
Knox | Authentication and Access gateway service for Hadoop |
HBase | Distributed non-relational database that runs on top of HDFS |
Hive | Data warehouse system based on Hadoop |
Mahout | Implementations of machine learning algorithms (clustering, classification and batch-based collaborative filtering) on top of MapReduce |
Pig | High level platform (and script-like language) to create and run programs on MapReduce, Tez and Spark |
Impala | Enables low-latency SQL queries on HBase and HDFS |
Oozie | Workflow job scheduling and management system for Hadoop |
Ranger | Access policy manager for HDFS files, folders, databases, tables and columns |
Spark | Cluster computing framework that utilizes YARN and HDFS. Supports streaming and batch jobs, and provides an SQL interface and a machine learning library |
Sqoop | Command-line tool for transferring bulk data between relational databases and Hadoop |
Tez | Application framework for running complex Directed Acyclic Graph (DAG) of tasks based on YARN |
ZooKeeper | Distributed name registry, synchronization service and configuration service that is used as a sub-system in Hadoop |
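To see how some of these tools fit together in practice, here is a minimal sketch that imports a relational table into HDFS with Sqoop and exposes it to SQL users with Hive. The connection string, credentials, table and column names are placeholders, not part of the cheat sheet.

```
# Hypothetical example: import an RDBMS table into HDFS with Sqoop,
# then expose it to SQL users through a Hive external table.
sqoop import \
  --connect jdbc:mysql://db-host/sales \
  --username etl \
  --password-file /user/hadoop/.db-password \
  --table orders \
  --target-dir /data/orders

# Query the imported comma-delimited files with Hive
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
  id BIGINT, customer_id BIGINT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/orders';
SELECT COUNT(*) FROM orders;"
```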
Major Hadoop Cloud Providers
Cloud Operator | Service Name |
---|---|
Amazon Web Services | Amazon EMR (Elastic MapReduce) |
IBM SoftLayer | IBM BigInsights |
Microsoft Azure | HDInsight |
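For example, a basic Hadoop cluster can be provisioned on Amazon EMR from the AWS CLI. The sketch below assumes a configured AWS CLI; the cluster name, release label, key pair and instance sizes are example values.

```
# Hypothetical example: provision a small Hadoop cluster on Amazon EMR.
# Cluster name, release label, key pair and instance sizes are placeholders.
aws emr create-cluster \
  --name "hadoop-poc" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-keypair
```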
Common Data Formats
Format | Description |
---|---|
Avro | JSON-based format that includes RPC and serialization support. Designed for systems that exchange data. |
Parquet | Columnar storage format |
ORC | Optimized Row Columnar: a fast columnar storage format |
RCFile | Record Columnar File: data placement format for relational tables |
SequenceFile | Binary file format that stores records as typed key-value pairs |
Unstructured | Hadoop also supports various unstructured data formats |
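If you work through Hive, choosing between these formats is often just a matter of the STORED AS clause. The sketch below is a hypothetical example; the table and column names are placeholders.

```
# Hypothetical example: store the same data as text, Parquet and ORC tables in Hive.
hive -e "
CREATE TABLE logs_text (ts STRING, level STRING, msg STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE logs_parquet STORED AS PARQUET AS SELECT * FROM logs_text;
CREATE TABLE logs_orc STORED AS ORC AS SELECT * FROM logs_text;"
```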
Single Node Installation
Requirement | Task | Command |
---|---|---|
Java Installation | Check version | java -version |
Java Installation | Install | sudo apt-get -y update && sudo apt-get -y install default-jdk |
Create User and Permissions | Create user | useradd hadoop && passwd hadoop && mkdir /home/hadoop && chown -R hadoop:hadoop /home/hadoop |
Create User and Permissions | Create keys | su - hadoop, then: ssh-keygen -t rsa && cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && chmod 0600 ~/.ssh/authorized_keys |
Install from source | Download and extract | wget http://apache.spd.co.il/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz && tar xzf hadoop-2.7.2.tar.gz && mv hadoop-2.7.2 hadoop |
Environment | Env vars | Add to ~/.bashrc, then run source ~/.bashrc: export HADOOP_HOME=/home/hadoop/hadoop export HADOOP_INSTALL=$HADOOP_HOME export HADOOP_MAPRED_HOME=$HADOOP_HOME export HADOOP_COMMON_HOME=$HADOOP_HOME export HADOOP_HDFS_HOME=$HADOOP_HOME export YARN_HOME=$HADOOP_HOME export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin |
Environment | Java home | vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh and set export JAVA_HOME=/opt/jdk1.8.0_05/ (adjust to your JDK path) |
Configuration files | Edit | core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml in $HADOOP_HOME/etc/hadoop/ (minimal examples below) |
Format NameNode | Format | hdfs namenode -format |
Start System | Start daemons | cd $HADOOP_HOME/sbin/ && ./start-dfs.sh && ./start-yarn.sh |
Test System | Create directories and upload data | bin/hdfs dfs -mkdir /user && bin/hdfs dfs -mkdir /user/hadoop && bin/hdfs dfs -put /var/log/httpd logs |
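The configuration files row above only lists the file names. A minimal single-node configuration might look like the following sketch; the fs.defaultFS port, data directories and replication factor are common single-node choices, not values taken from this cheat sheet.

```
# Minimal single-node configuration sketch; paths, port and replication
# factor are common choices, adjust them to your environment.

# core-site.xml: point clients at the local NameNode
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: one replica is enough on a single node
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
EOF

# mapred-site.xml: run MapReduce on YARN
cat > $HADOOP_HOME/etc/hadoop/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

# yarn-site.xml: enable the shuffle service MapReduce needs
cat > $HADOOP_HOME/etc/hadoop/yarn-site.xml <<'EOF'
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
```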
Multi-node Installation
Task | Command |
---|---|
Configure hosts on each node | > vi /etc/hosts 192.168.1.11 hadoop-master 192.168.1.12 hadoop-slave-1 192.168.1.13 hadoop-slave-2 |
Enable cross node authentication | > su - hadoop > ssh-keygen -t rsa > ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master > ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1 > ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2 > chmod 0600 ~/.ssh/authorized_keys > exit |
Copy system | > su - hadoop > cd /opt/hadoop (copy the Hadoop installation and environment to each slave; see the sketch after this table) |
Configure master and slave files | > su - hadoop > cd /opt/hadoop/hadoop > vi conf/masters (add hadoop-master) > vi conf/slaves (add hadoop-slave-1 and hadoop-slave-2) |
Format the NameNode | > su - hadoop > cd /opt/hadoop/hadoop > bin/hadoop namenode -format |
Start system | bin/start-all.sh |
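The Copy system step above does not show the actual copy. One common approach is to push the configured installation from the master to each slave; the sketch below reuses the hostnames and paths from the table and is otherwise an assumption.

```
# Hypothetical sketch: push the configured installation from the master
# to each slave (run as the hadoop user on hadoop-master).
for host in hadoop-slave-1 hadoop-slave-2; do
  rsync -az /opt/hadoop/hadoop/ "$host":/opt/hadoop/hadoop/
  scp ~/.bashrc "$host":~/   # the environment variables are needed on the slaves too
done
```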
Backup HDFS Metadata
Task | Command |
---|---|
Stop the cluster | stop-all.sh |
Perform a cold backup of the metadata directories | cd /data/dfs/nn && tar -czvf /tmp/backup.tar.gz . |
Start the cluster | start-all.sh |
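If stopping the cluster is not possible, a recent fsimage checkpoint can also be pulled from a running NameNode. A sketch, assuming HDFS 2.x and HDFS superuser privileges; the backup directory is a placeholder.

```
# Hypothetical alternative: back up NameNode metadata from a running cluster.
# Requires HDFS superuser privileges; /backup/nn is a placeholder directory.
hdfs dfsadmin -safemode enter        # pause namespace changes
hdfs dfsadmin -saveNamespace         # force a fresh fsimage checkpoint
hdfs dfsadmin -safemode leave
hdfs dfsadmin -fetchImage /backup/nn # download the latest fsimage locally
```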
HDFS Basic Commands
Task | Command |
---|---|
List the contents of a directory | hdfs dfs -ls /data/ |
Upload a file from the local file system to HDFS | hdfs dfs -put logs.csv /data/ |
Read the content of the file from HDFS | hdfs dfs -cat /data/logs.csv |
Change the permission of a file | hdfs dfs -chmod 744 /data/logs.csv |
Set the replication factor of a file to 3 | hdfs dfs -setrep -w 3 /data/logs.csv |
Check the size of the file | hdfs dfs -du -h /data/logs.csv |
Move the file to a newly-created sub-directory | hdfs dfs -mkdir /data/logs && hdfs dfs -mv /data/logs.csv /data/logs/ |
Remove the directory from HDFS | hdfs dfs -rm -r /data/logs |
HDFS Administration
Task | Command |
---|---|
Balance the cluster storage | hdfs balancer [-threshold <percent>] |
Run the NameNode | hdfs namenode |
Run the secondary NameNode | hdfs secondarynamenode |
Run a datanode | hdfs datanode |
Run the NFS3 gateway | hdfs nfs3 |
Run the RPC portmap for the NFS3 gateway | hdfs portmap |
YARN
Task | Command |
---|---|
Show yarn help | yarn |
Define configuration file | yarn [--config confdir] |
Define log level | yarn [--loglevel loglevel] where loglevel is one of FATAL, ERROR, WARN, INFO, DEBUG or TRACE |
User commands | |
Show Hadoop classpath | yarn classpath |
Show and kill application | yarn application |
Show application attempt | yarn applicationattempt |
Show container information | yarn container |
Show node information | yarn node |
Show queue information | yarn queue |
Administration commands | |
Start NodeManager | yarn nodemanager |
Start Proxy web server | yarn proxyserver |
Start ResourceManager | yarn resourcemanager |
Run ResourceManager admin client | yarn rmadmin |
Start Shared Cache Manager | yarn sharedcachemanager |
Start TimeLineServer | yarn timelineserver |
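A typical troubleshooting sequence with the user commands above might look like this; the application ID is a placeholder.

```
# Hypothetical troubleshooting sequence; the application ID is a placeholder.
yarn application -list -appStates RUNNING                  # find the application ID
yarn logs -applicationId application_1490000000000_0001    # inspect its logs
yarn application -kill application_1490000000000_0001      # kill it if needed
```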
MapReduce
Task | Command |
---|---|
Submit the WordCount MapReduce job to the cluster | hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input logs-output |
Check the output of this job in HDFS | hadoop fs -cat logs-output/* |
Submit a Scalding job | hadoop jar scalding.jar com.twitter.scalding.Tool <JobClass> --hdfs |
Kill a MapReduce job | yarn application -kill <application_id> |
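Putting the commands above together, an end-to-end run of the bundled WordCount example might look like the following sketch; the input files and the examples jar path vary by environment and distribution.

```
# End-to-end WordCount sketch; the input files and the examples jar path
# vary by environment and distribution.
hdfs dfs -mkdir -p input
hdfs dfs -put /var/log/httpd/*.log input/
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  wordcount input logs-output
hdfs dfs -cat logs-output/part-r-00000 | head
```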
Hadoop Web UIs
Resource | Default URI |
---|---|
NameNode | http://<namenode-host>:50070/ |
DataNode | http://<datanode-host>:50075/ |
Secondary NameNode | http://<secondary-namenode-host>:50090/ |
Resource Manager | http://<resourcemanager-host>:8088/ |
HBase Master | http://<hbase-master-host>:60010/ |
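These same daemons also expose HTTP endpoints that are handy for scripted health checks. A sketch, assuming default ports; replace the host placeholders as in the table above.

```
# Hypothetical health checks; replace the host placeholders before running.
curl -s http://<namenode-host>:50070/jmx | head                # NameNode metrics over HTTP
curl -s http://<resourcemanager-host>:8088/ws/v1/cluster/info  # ResourceManager REST API
```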
Secure Hadoop
Aspect | Best Practice |
---|---|
Authentication | Enable Kerberos authentication for all users and services (see the sketch below) |
Authorization | Enforce HDFS permissions and ACLs, and manage fine-grained access policies with a tool such as Ranger or Sentry |
Audit | Enable and retain HDFS and component audit logs, and review them regularly |
Data Protection | Encrypt data in transit (RPC/TLS) and at rest (HDFS transparent encryption) |
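As a starting point for the authentication row, Kerberos is enabled through core-site.xml. The fragment below is a minimal sketch; keytabs, principals and the KDC setup are not covered here.

```
<!-- Add to core-site.xml: switch from simple to Kerberos authentication.
     Keytabs, principals and KDC configuration are not shown here. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```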
Hadoop Tips and Tricks
Area | Tip |
---|---|
Project Concept | Iterate cluster sizing to optimize performance and meet actual load patterns |
Hardware | Clusters with more nodes recover faster from failures |
Hardware | The higher the storage per node, the longer the recovery time |
Hardware | Use commodity hardware for worker nodes |
Hardware | Invest in reliable hardware for the NameNodes |
Hardware | NameNode RAM should be 2GB plus 1GB for every 100TB of raw disk space |
Hardware | Networking should account for about 20% of the hardware budget |
Hardware | Around 40 nodes is the critical mass needed to achieve the best performance/cost ratio |
Hardware | Plan for net storage capacity of about 25% of raw capacity: 25% spare leaves 75%, and 3-way replication divides that by 3 |
Operating System and JVM | Use a 64-bit OS and JVM |
Operating System and JVM | Set the file descriptor limit to 64K (ulimit) |
Operating System and JVM | Enable time synchronization using NTP |
Operating System and JVM | Speed up reads by mounting disks with the noatime option |
Operating System and JVM | Disable transparent huge pages |
System | Enable monitoring using Ambari |
System | Monitor NameNode checkpoints to verify that they occur on schedule; this is what enables you to recover your cluster when needed |
System | Avoid exceeding 90% cluster disk utilization |
System | Balance the cluster periodically using the balancer |
System | Edit metadata files only with Hadoop utilities, to avoid corruption |
System | Keep the replication factor at 3 or higher |
System | Place quotas and limits on user and project directories, as well as on tasks, to avoid cluster starvation |
System | Clean /tmp regularly; it tends to fill up with junk files |
System | Tune the number of reducers to avoid system starvation |
System | Verify that the file system you selected is supported by your Hadoop vendor |
Data and System Recovery | Disk failures are not an issue |
Data and System Recovery | DataNode failures are not a major issue |
Data and System Recovery | NameNode failure is an issue, even in a clustered environment |
Data and System Recovery | Make regular backups of the NameNode metadata |
Data and System Recovery | Enable NameNode high availability with automatic failover using ZooKeeper |
Data and System Recovery | Provide sufficient disk space for NameNode logging |
Data and System Recovery | Enable trash in core-site.xml to protect against accidental permanent deletion (rm -r); see the sketch after this table |
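For the last tip, trash is enabled with the fs.trash.interval property in core-site.xml. The fragment below is a minimal sketch; the retention period is an example value.

```
<!-- Add to core-site.xml: keep deleted files in .Trash for 1440 minutes (24 hours)
     before permanent removal. The interval is an example value. -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```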