Preparing for Installing Jethro on Hadoop

Jethro is designed to work with all modern Hadoop 2.x distributions, and supports both unsecured and secured clusters (using Kerberos).

Hadoop Installation Prerequisites

You can skip these instructions if you allocate a Hadoop DataNode as the Jethro host. Such a host already has the correct version of your Hadoop binaries and libraries installed and configured, and its Java version is identical to that of the Hadoop cluster. Before reusing an existing Hadoop DataNode for a Jethro installation, first decommission the DataNode from the Hadoop cluster. This prevents Jethro's or Hadoop's processes from being killed under occasional memory pressure.

Before starting the installation of Jethro on Hadoop, ensure that the conditions specified below are met:

Java Installation

The host must have Java installed, with the same version that is installed on the Hadoop cluster.
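
One way to compare the versions is to extract the version string from the output of java -version on the Jethro host and on a node of the Hadoop cluster. The following is a minimal sketch and not part of the Jethro tooling. The helper name java_version_of is hypothetical, and the sketch assumes the usual first-line format of the java -version output (for example: java version "1.7.0_67").

```shell
#!/bin/sh
# Hypothetical helper: extract the quoted version string from the first line
# of `java -version` output, passed in as a single argument.
java_version_of() {
  printf '%s\n' "$1" | sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Usage on a host that has Java installed (java -version writes to stderr):
#   java_version_of "$(java -version 2>&1)"
# Run the same command on a node of the Hadoop cluster and compare the results.
```

If the two strings differ, align the Java version on the Jethro host with the cluster before continuing.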

Supported Hadoop Distributions

Jethro is designed to work with all modern Hadoop 2.x distributions, and supports both unsecured and secured clusters (using Kerberos). To work with Jethro, you only need to allow Jethro to access HDFS, because Jethro does not run MapReduce, Tez, or Spark jobs on the cluster. In addition, Jethro runs on its own dedicated nodes and does not install any software on the Hadoop cluster.
The current release is certified on the following Hadoop distributions:

  • Cloudera CDH 4.x and 5.x
  • Hortonworks HDP 2.x
  • Amazon EMR 3.x
  • MapR 4.0.x

    MapR support is delivered by a dedicated installer (a separate RPM).

Cloud Instance

If you install Jethro on a cloud instance, use a memory-optimized instance with at least the following configuration:

  • 8 virtual processors (vCPUs); it is advisable to have at least 16 vCPUs
  • 64 GB RAM; it is advisable to have at least 128 GB RAM
  • Local SSD

Additional requirements apply to the specific type of instance mentioned below:

  • Amazon AWS: Minimum instance type is r3.2xlarge; recommended r3.4xlarge / r3.8xlarge.
  • Microsoft Azure: Minimum instance type is D13; recommended D14.
  • When using Amazon Elastic MapReduce (EMR) as the Hadoop distribution, it is best to allocate the Jethro hosts as part of the master instance group. This ensures that the EMR Hadoop client software is pre-installed and pre-configured. For details, see Instance Groups on Amazon AWS documentation.

Physical Hardware

If you use a physical server for Jethro software, ensure that your hardware meets the following criteria:

  • CPU – Use at least a modern 2-socket server (for example: 2 x 8-core Intel server).
  • Memory – Minimum 64GB of RAM, recommended 128GB/256GB (or higher).
  • Disk – At least 5GB free in both /opt and /var (indexes are stored on HDFS, not locally).

In addition, allocate storage for local caching. It is recommended to provide a dedicated SSD drive (200 GB or more) on its own mount point.
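
The free-space requirement for /opt and /var can be checked with a short script. The following is a minimal sketch; the helper name has_min_free is hypothetical, and the 5 GB threshold is taken from the Disk criterion above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if avail_kb (available space in 1 KB blocks)
# is at least min_gb gigabytes.
has_min_free() {
  avail_kb=$1
  min_gb=$2
  [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# Usage: df -Pk prints the available space in 1 KB blocks in column 4.
#   for dir in /opt /var; do
#     has_min_free "$(df -Pk "$dir" | awk 'NR==2 {print $4}')" 5 \
#       || echo "Less than 5 GB free in $dir"
#   done
```

The comparison is kept separate from the df call so that the threshold logic can be reused for other mount points, such as the one used for local caching.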

Network Bandwidth

The network bandwidth requirements are:

  • For Hadoop cluster - At least 1Gb/s link, recommended 10Gb/s link.
  • For SQL clients (BI tools etc) – Recommended 1Gb/s link.

Location

Jethro software should be installed on a host located next to the Hadoop cluster, namely:

  • For on-premises installations - Same data center.
  • For AWS installations - Same region and availability zone.
  • For cloud environments other than AWS - The equivalent of same region and availability zone.

Operating System

Jethro is certified on the following Linux flavors:

  • 64-bit RedHat/CentOS 6.x
  • Amazon Linux (when using Amazon AWS)

Configuring Jethro Connection with Hadoop


You can skip this section if you have decommissioned a Hadoop DataNode and allocated it as the Jethro host, because in this case your host already has the correct version of your Hadoop binaries and libraries installed and configured, and its version of Java is identical to that of the Hadoop cluster.

This section specifies how to install the distribution-specific set of Hadoop Client libraries, and then configure the newly installed Hadoop Client to point to your Hadoop cluster.

Connecting Jethro to the Hadoop cluster requires the Jethro host to meet the following criteria:

  • Java is installed, with the same version that is installed on the Hadoop cluster.
  • The Hadoop distribution-specific set of Hadoop Client libraries is installed.
  • The newly installed Hadoop Client is configured to point to your Hadoop cluster.

Cloudera CDH 4.x

To install and configure Cloudera CDH 4.x:

  1. Switch to user root.
  2. Run the following commands:

    rpm --quiet --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
    wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm
    yum -y localinstall cloudera-cdh-4-0.x86_64.rpm
    yum -y install hadoop-client
    yum -y install hive-jdbc
  3. Go to the /etc/hadoop/conf/ directory on the Jethro host.

  4. Copy the files core-site.xml and hdfs-site.xml from any DataNode of your Hadoop cluster into the /etc/hadoop/conf/ directory, overwriting the existing files, to configure the Hadoop Client to point to your Hadoop cluster.
  5. Validate connectivity to the Hadoop cluster by running the command hadoop fs -ls /jethro, which displays the files under the directory /jethro on the Hadoop cluster.
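
The copy of the two configuration files can be sketched as follows. This is an illustration only: the helper name print_conf_copy_cmds and the host name datanode01 are hypothetical, and the sketch assumes that the files are located under /etc/hadoop/conf on the DataNode as well, which may not match your cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the scp commands that would copy the two client
# configuration files from a cluster node into the local Hadoop conf directory.
# Assumption: the files live under /etc/hadoop/conf on the cluster node as well.
print_conf_copy_cmds() {
  src_host=$1
  dest_dir=${2:-/etc/hadoop/conf}
  for f in core-site.xml hdfs-site.xml; do
    printf 'scp %s:/etc/hadoop/conf/%s %s/%s\n' "$src_host" "$f" "$dest_dir" "$f"
  done
}

# Usage (datanode01 is a placeholder host name); review the output, then run it:
#   print_conf_copy_cmds datanode01
#   print_conf_copy_cmds datanode01 | sh
```

Printing the commands before running them makes it easy to confirm the source host and the target directory before any file is overwritten.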

Cloudera CDH 5.x

To install and configure Cloudera CDH 5.x:

  1. Switch to user root.
  2. Retrieve the installation repo file by running the following commands:

    rpm --quiet --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
    wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
  3. Install hadoop client components as follows:

    yum -y localinstall cloudera-cdh-5-0.x86_64.rpm
    yum -y install hadoop-client
    yum -y install hive-jdbc
  4. Go to the /etc/hadoop/conf/ directory on the Jethro host.
  5. Copy the files core-site.xml and hdfs-site.xml from any DataNode of your Hadoop cluster into the /etc/hadoop/conf/ directory, overwriting the existing files, to configure the Hadoop Client to point to your Hadoop cluster.
  6. Validate connectivity to the Hadoop cluster by running the command hadoop fs -ls /jethro, which displays the files under the directory /jethro on the Hadoop cluster.

Hortonworks HDP

To install and configure Hortonworks HDP:

  1. Switch to user root.

  2. Retrieve the installation repo file with the wget command, specifying the repo URL for your OS version and HDP version.
    URL template:

    http://public-repo-1.hortonworks.com/HDP/{OS-version}/2.x/updates/{hdp-version:X.X.X.X}/hdp.repo

    For specific versions, search for: "HDP Configure the Remote repositories"

    # Examples:
    # For Hortonworks HDP 2.4.2 / CentOS 6.x
    wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.4.2.0/hdp.repo -O /etc/yum.repos.d/hdp.repo

    # For Hortonworks HDP 2.6.5 / CentOS 7.x
    wget -nv http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.5.0/hdp.repo -O /etc/yum.repos.d/hdp.repo
  3. Install hadoop client components as follows:

    yum -y install hadoop-client
    yum -y install hive
  4. Go to the /etc/hadoop/conf/ directory on the Jethro host.

  5. Copy the files core-site.xml and hdfs-site.xml from any DataNode of your Hadoop cluster into the /etc/hadoop/conf/ directory, overwriting the existing files, to configure the Hadoop Client to point to your Hadoop cluster.

  6. Validate connectivity to the Hadoop cluster by running the command hadoop fs -ls /jethro, which displays the files under the directory /jethro on the Hadoop cluster.
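
The URL template in step 2 can be filled in with a small helper. This is a minimal sketch; the function name hdp_repo_url is hypothetical, and it only substitutes the OS version and HDP version into the template shown above.

```shell
#!/bin/sh
# Hypothetical helper: fill in the HDP repo URL template from step 2.
#   $1 = OS version directory (for example centos6 or centos7)
#   $2 = four-part HDP version (for example 2.6.5.0)
hdp_repo_url() {
  printf 'http://public-repo-1.hortonworks.com/HDP/%s/2.x/updates/%s/hdp.repo\n' "$1" "$2"
}

# Usage, as root:
#   wget -nv "$(hdp_repo_url centos7 2.6.5.0)" -O /etc/yum.repos.d/hdp.repo
```

The helper does not check that the resulting URL exists, so confirm the OS and HDP version values against the Hortonworks documentation.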

MapR 4.x (all versions)

All instructions below are based on the "MapR Client" section on the MapR website. To install and configure MapR 4.x:

  1. Switch to user root.
  2. Go to http://package.mapr.com/ and use the wget command to download the latest 64-bit MapR Client software for your environment; for example:

    wget http://package.mapr.com/releases/v4.0.2/redhat/mapr-client-4.0.2.29870.GA-1.x86_64.rpm

    For other options, see Preparing Packages and Repositories on MapR website.

  3. Remove any existing MapR software, for example:

    rpm -e mapr-fileserver mapr-core mapr-client
  4. Install the mapr-client package that you downloaded in step 2, for example:

    yum install mapr-client-4.0.2.29870.GA-1.x86_64.rpm
  5. Configure the MapR Hadoop Client by running configure.sh:

    # Use the -c option for client configuration and the -C option to list all the CLDB nodes:
    /opt/mapr/server/configure.sh -c -C mynode01:7222

    If using Kerberos, add the -secure option after -c (see "configure.sh" on the MapR website for details).

  6. Validate connectivity to the Hadoop cluster by running the following command to display files under the directory /jethro on the Hadoop cluster: 

    hadoop fs -ls /jethro

Amazon EMR (all versions)

To have the appropriate Hadoop libraries installed, it is recommended to create the Jethro host(s) as Master hosts in EMR. This ensures that the Jethro host has the right Hadoop software and that the Hadoop configuration files point to the Hadoop cluster.
For details, see Instance Groups in the Amazon AWS documentation.

Preparing the Hadoop Cluster

Configuring connection to the Hadoop cluster requires carrying out the steps described in the following sections:

Creating jethro OS User for the Hadoop Cluster

  • Non-Kerberos - Create a Hadoop OS user and a group called jethro on the NameNode, and on any node that may serve as NameNode (such as a Standby NameNode, if configured).
    If the jethro OS user and group do not exist on the NameNode, each access from Jethro to the NameNode generates a warning and a stack trace in the NameNode logs, flooding the logs and significantly impacting NameNode performance. For details, see HADOOP-10755.
  • Kerberos - Create a Kerberos principal called jethro and generate a keytab file for it.

For example, if using MIT KDC:

kadmin.local -q "addprinc -randkey jethro"
kadmin.local -q "ktadd -k jethro.hadoop.keytab jethro"

Later on you will securely copy the keytab file to the Jethro host and make it owned by the jethro OS user. It will be used to run the "JethroAdmin set-host-config" command, as described in Managing Jethro.

If you cannot use jethro as the Hadoop user name, contact us for a workaround.

Creating Directories on HDFS

As the Hadoop HDFS user, create a root HDFS directory for Jethro files, owned by the jethro Hadoop user. In this example, the directory is /user/jethro/instances:

hadoop fs -mkdir /user/jethro 
hadoop fs -mkdir /user/jethro/instances
hadoop fs -chmod -R 740 /user/jethro
hadoop fs -chown -R jethro /user/jethro


Verifying Network Access

Verify that there is access from the Jethro host to the NameNode and all DataNodes; it might require changing firewall rules to open ports.
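
A basic reachability check can be scripted as shown below. This is a minimal sketch with several assumptions: the helper name port_open and the host names are hypothetical, the helper needs bash and the timeout command on the Jethro host, and the ports 8020 (NameNode RPC) and 50010 (DataNode data transfer) are common Hadoop 2.x defaults that may differ on your cluster.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a TCP connection to host:port can be opened
# within 3 seconds. Requires bash (for /dev/tcp) and the coreutils timeout command.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage. The host names are placeholders, and the ports are common Hadoop 2.x
# defaults; check your cluster configuration for the actual values.
#   port_open namenode01 8020  || echo "Cannot reach NameNode"
#   port_open datanode01 50010 || echo "Cannot reach DataNode"
```

A failed check only shows that the connection could not be opened. It does not say whether a firewall rule, a wrong port, or a stopped service is the cause.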

Testing HDFS Connectivity

Verify write permissions as the jethro OS user on the HDFS directory. Use the following commands to create a new file in the HDFS directory and then delete it:

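# Before the test, the file should not exist yet (expect an error)
hadoop fs -ls /user/jethro/instances/testfile
# Create an empty file
hadoop fs -touchz /user/jethro/instances/testfile
# The file should now be listed
hadoop fs -ls /user/jethro/instances/testfile
# Delete the file, bypassing the trash
hadoop fs -rm -skipTrash /user/jethro/instances/testfile
# The file should be gone again (expect an error)
hadoop fs -ls /user/jethro/instances/testfile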


Ensure that the testfile was successfully created and then deleted.

Note: If using Kerberos, first run kinit to obtain a Kerberos ticket, so that your terminal session can access HDFS:

kinit -k -t ~/jethro.hadoop.keytab jethro
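
The create-and-delete check above can also be wrapped in a function that reports which step failed. This is a minimal sketch and not part of the Jethro tooling. The function name hdfs_write_test and its return codes are hypothetical, and it uses hadoop fs -test -e instead of -ls to check whether the file exists.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual check: create a test file, confirm it
# exists, delete it, and confirm it is gone. Returns 0 only if every step works.
hdfs_write_test() {
  testfile=$1
  hadoop fs -touchz "$testfile"        || return 1  # cannot create the file
  hadoop fs -test -e "$testfile"       || return 2  # file missing after create
  hadoop fs -rm -skipTrash "$testfile" || return 3  # cannot delete the file
  if hadoop fs -test -e "$testfile"; then
    return 4                                        # file still present after delete
  fi
  return 0
}

# Usage, as the jethro OS user (after kinit on a Kerberos cluster):
#   hdfs_write_test /user/jethro/instances/testfile && echo "HDFS write test passed"
```

A non-zero return code identifies the first step that failed, which helps to tell a permissions problem (codes 1 and 3) from an unexpected state of the directory (codes 2 and 4).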