Hadoop Deployment Cheat Sheet

If you are using, or planning to use, the Hadoop framework for big data and Business Intelligence (BI), this document can help you navigate some of the technology and terminology, and guide you in setting up and configuring a system.
It provides background on the framework: the key distributions, modules, components, and related products. It also provides single-node and multi-node Hadoop installation commands and configuration parameters.
The final section includes tips and tricks to help you get started, and guidance on setting up a Hadoop project.

To learn how we make BI work on Hadoop, see the Jethro Data Sheet.

Key Hadoop Distributions

Vendor – Strength

Apache Hadoop – The open source distribution from Apache
Cloudera – The leading vendor, with proprietary components for enterprise needs
MapR – Committed to ease of use, while supporting high performance and scalability
Hortonworks – 100% open source package
IBM – Integration with IBM analytics products
Pivotal – Integration with Greenplum and Cloud Foundry (CF)

Hadoop Modules

Module – Description

Common – Common utilities that support the other Hadoop modules
HDFS – Hadoop Distributed File System: provides high-throughput access to application data on commodity hardware
YARN – Yet Another Resource Negotiator: a framework for cluster resource management, including job scheduling
MapReduce – A YARN-based software framework for parallel processing of large data sets

Hadoop Components

Component (Module) – Description

NameNode (HDFS) – Holds the directory tree of the HDFS file system (a.k.a. the Hadoop inode)
Secondary NameNode (HDFS) – Checkpointing helper for the NameNode (not a failover standby): periodically merges the edits log into the fsimage file
JournalNode (HDFS) – Shared-edits node that supports automatic failover between NameNodes in an HA setup
DataNode (HDFS) – The nodes (servers) that store the actual data blocks
NFS3 Gateway (HDFS) – Daemons that enable NFSv3 access to HDFS
ResourceManager (YARN) – Global daemon that arbitrates resources among all the applications in the Hadoop cluster
ApplicationMaster (YARN) – Per-application process: negotiates resources from the ResourceManager and works with the NodeManagers to run and monitor the application's tasks
NodeManager (YARN) – Per-machine agent responsible for containers, monitoring their resource usage (CPU, memory, disk), and reporting back to the ResourceManager
Container (YARN) – An allocation of resources that runs a specific task on a specific machine for a specific application

Hadoop Ecosystem – Related Products

Product – Description

Ambari – Completely open-source management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters
Apex – YARN-based platform for big data in motion (stream processing)
Azkaban – Workflow job scheduling and management system for Hadoop
Flume – Reliable, distributed, highly available service for streaming logs into HDFS
Knox – Authentication and access gateway service for Hadoop
HBase – Distributed non-relational database that runs on top of HDFS
Hive – Data warehouse system built on Hadoop
Mahout – Machine learning algorithm implementations (clustering, classification, batch-based collaborative filtering) based on MapReduce
Pig – High-level platform (with a script-like language) for creating and running programs on MapReduce, Tez, and Spark
Impala – Enables low-latency SQL queries on data in HBase and HDFS
Oozie – Workflow job scheduling and management system for Hadoop
Ranger – Access policy manager for HDFS files and folders, and for databases, tables, and columns
Spark – Cluster computing framework that utilizes YARN and HDFS; supports streaming and batch jobs, an SQL-like interface, and a machine learning library
Sqoop – CLI application for migrating data between RDBMSs and Hadoop
Tez – YARN-based application framework for running complex directed acyclic graphs (DAGs) of tasks
ZooKeeper – Distributed name registry, synchronization service, and configuration service, used as a subsystem of Hadoop

Major Hadoop Cloud Providers

Cloud Operator – Service Name

Amazon Web Services – EMR (Elastic MapReduce)
IBM SoftLayer – IBM BigInsights
Microsoft Azure – HDInsight

Common Data Formats

Format – Description

Avro – JSON-based format with RPC and serialization support; designed for systems that exchange data
Parquet – Columnar storage format
ORC – Optimized Row Columnar: fast columnar storage format
RCFile – Record Columnar File: data placement format for relational tables
SequenceFile – Binary data format with records of specific data types
Unstructured – Hadoop also supports various unstructured data formats

Single Node Installation

Java installation

  Check version:
    java -version
  Install:
    sudo apt-get -y update && sudo apt-get -y install default-jdk

Create user and permissions

  Create user:
    useradd hadoop
    passwd hadoop
    mkdir /home/hadoop
    chown -R hadoop:hadoop /home/hadoop
  Create keys:
    su - hadoop
    ssh-keygen -t rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 0600 ~/.ssh/authorized_keys

Install from release tarball

    wget http://apache.spd.co.il/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
    tar xzf hadoop-2.7.2.tar.gz
    mv hadoop-2.7.2 hadoop

Environment

  Add the environment variables to ~/.bashrc, then reload it with source ~/.bashrc:
    export HADOOP_HOME=/home/hadoop/hadoop
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
  Set JAVA_HOME in the Hadoop environment file:
    vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
    export JAVA_HOME=/opt/jdk1.8.0_05/

Configuration files (in $HADOOP_HOME/etc/hadoop/)

    core-site.xml
    hdfs-site.xml
    mapred-site.xml
    yarn-site.xml

Format the NameNode

    hdfs namenode -format

Start the system

    cd $HADOOP_HOME/sbin/
    start-dfs.sh
    start-yarn.sh

Test the system

    bin/hdfs dfs -mkdir /user
    bin/hdfs dfs -mkdir /user/hadoop
    bin/hdfs dfs -put /var/log/httpd logs

Multi-node Installation

Configure hosts on each node

    vi /etc/hosts
      192.168.1.11 hadoop-master
      192.168.1.12 hadoop-slave-1
      192.168.1.13 hadoop-slave-2

Enable cross-node authentication

    su - hadoop
    ssh-keygen -t rsa
    ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
    ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
    ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
    chmod 0600 ~/.ssh/authorized_keys
    exit

Copy the system to the slaves

    su - hadoop
    cd /opt/hadoop
    scp -r hadoop hadoop-slave-1:/opt/hadoop
    scp -r hadoop hadoop-slave-2:/opt/hadoop

Configure the master

    su - hadoop
    cd /opt/hadoop/hadoop
    vi conf/masters        (add: hadoop-master)
    vi conf/slaves         (add: hadoop-slave-1 and hadoop-slave-2)
    bin/hadoop namenode -format

Start the system

    bin/start-all.sh
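With more than two workers, typing the key-distribution and copy commands by hand gets error-prone. A small sketch that generates them into a reviewable script (hostnames are the illustrative ones from the /etc/hosts example above):

```shell
# Sketch: emit a script that distributes the SSH key and copies the
# Hadoop tree to every worker, so it can be reviewed before running.
# Hostnames and paths match the illustrative multi-node example.
WORKERS="hadoop-slave-1 hadoop-slave-2"
OUT=/tmp/distribute-hadoop.sh
{
  echo '#!/bin/sh'
  echo 'set -e'   # stop on the first failed node
  for w in $WORKERS; do
    echo "ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$w"
    echo "scp -r /opt/hadoop/hadoop $w:/opt/hadoop"
  done
} > "$OUT"
chmod +x "$OUT"
```

Inspect /tmp/distribute-hadoop.sh, then run it once as the hadoop user.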

Backup HDFS Metadata

Stop the cluster

    stop-all.sh

Perform a cold backup of the metadata directories

    cd /data/dfs/nn
    tar -czvf /tmp/backup.tar.gz .

Start the cluster

    start-all.sh
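The tar step above can be tried end to end without a cluster. A sketch using a scratch directory that stands in for the NameNode metadata directory (/data/dfs/nn in the example; the file names mimic the real fsimage/edits layout):

```shell
# Sketch: cold-backup of the NameNode metadata directory with tar.
# A scratch copy stands in for /data/dfs/nn so the commands can be
# tried without a real cluster.
NN_DIR=/tmp/demo-dfs-nn
mkdir -p "$NN_DIR/current"
: > "$NN_DIR/current/fsimage"   # placeholder metadata files
: > "$NN_DIR/current/edits"

# -c create, -z gzip (the .tar.gz name implies compression), -f file;
# -C changes into the directory and "." archives its contents.
tar -czf /tmp/backup.tar.gz -C "$NN_DIR" .

# Verify the archive actually contains the metadata.
tar -tzf /tmp/backup.tar.gz | grep fsimage
```

Note that the original one-liner (`tar -cvf /tmp/backup.tar.gz`) names no input files and writes an uncompressed archive despite the .tar.gz suffix; the form above fixes both.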

HDFS Basic Commands

List the contents of the /data directory

    hdfs dfs -ls /data/

Upload a file from the local file system to HDFS

    hdfs dfs -put logs.csv /data/

Read a file from HDFS

    hdfs dfs -cat /data/logs.csv

Change the permissions of a file

    hdfs dfs -chmod 744 /data/logs.csv

Set the replication factor of a file to 3

    hdfs dfs -setrep -w 3 /data/logs.csv

Check the size of a file

    hdfs dfs -du -h /data/logs.csv

Move a file to a sub-directory

    hdfs dfs -mv logs.csv logs/

Remove a directory from HDFS

    hdfs dfs -rm -r logs

HDFS Administration

Balance the cluster storage

    hdfs balancer -threshold <percent>

Run the NameNode

    hdfs namenode

Run the secondary NameNode

    hdfs secondarynamenode

Run a DataNode

    hdfs datanode

Run the NFS3 gateway

    hdfs nfs3

Run the RPC portmap for the NFS3 gateway

    hdfs portmap

YARN

Show yarn help

    yarn

Use a specific configuration directory

    yarn [--config confdir]

Set the log level

    yarn [--loglevel loglevel]   (loglevel is one of FATAL, ERROR, WARN, INFO, DEBUG, or TRACE)

User commands

Show the Hadoop classpath

    yarn classpath

Show and kill applications

    yarn application

Show application attempts

    yarn applicationattempt

Show container information

    yarn container

Show node information

    yarn node

Show queue information

    yarn queue

Administration commands

Start a NodeManager

    yarn nodemanager

Start the web proxy server

    yarn proxyserver

Start the ResourceManager

    yarn resourcemanager

Run the ResourceManager admin client

    yarn rmadmin

Start the Shared Cache Manager

    yarn sharedcachemanager

Start the TimelineServer

    yarn timelineserver

MapReduce

Submit the WordCount example job to the cluster

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input logs-output

Check the output of the job in HDFS

    hadoop fs -cat logs-output/*

Submit a Scalding job

    hadoop jar scalding.jar com.twitter.scalding.Tool Scalding

Kill a MapReduce job

    yarn application -kill <application_id>
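WordCount is the canonical MapReduce example, and its three phases map directly onto a classic shell pipeline. A sketch (file paths and input are illustrative) that is handy for sanity-checking what the job should produce:

```shell
# Sketch: what the WordCount job computes, expressed as the classic
# map -> shuffle -> reduce shell pipeline analogy.
printf 'hello world\nhello hadoop\n' > /tmp/wc-input.txt

tr -s ' ' '\n' < /tmp/wc-input.txt |  # map: emit one word per line
  sort |                              # shuffle: group identical keys together
  uniq -c > /tmp/wc-output.txt        # reduce: count occurrences per key

cat /tmp/wc-output.txt
```

Each cluster mapper does the `tr` step on its input split, the framework's shuffle does the `sort` across the cluster, and each reducer does the `uniq -c` step for its share of the keys.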

Resource Manager UI
