Install a Multi Node Hadoop Cluster on Ubuntu 14.04

This article is about multi-node installation of Hadoop cluster.  You would need minimum of 2 ubuntu machines or virtual images to complete a multi-node installation.  If you want to just try out a single node cluster, follow this article on Installing Hadoop on Ubuntu 14.04.

I used Hadoop Stable version 2.6.0 for this article. I did this setup on a 3 node cluster.  For simplicity, i will designate one node as master, and 2 nodes as slaves (slave-1, and slave-2). Make sure all slave nodes are reachable from master node.  To avoid any unreachable hosts error, make sure you add the slave hostnames and ip addresses in /etc/hosts file. Similarly, slave nodes should be able to resolve master hostname.

Installing Java on Master and Slaves

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
# Updata Java runtime
$ sudo update-java-alternatives -s java-7-oracle

Disable IPv6

As of now Hadoop does not support IPv6, and is tested to work only on IPv4 networks.   If you are using IPv6, you need to switch Hadoop host machines to use IPv4.  The Hadoop Wiki link provides a one liner command to disable the IPv6.  If you are not using IPv6, skip this step:

sudo sed -i 's/net.ipv6.bindv6only\ =\ 1/net.ipv6.bindv6only\ =\ 0/' \
/etc/sysctl.d/bindv6only.conf && sudo invoke-rc.d procps restart

Setting up a Hadoop User

Hadoop talks to other nodes in the cluster using no-password ssh.   By having Hadoop run under a specific user context, it will be easy to distribute the ssh keys around in the Hadoop cluster.  Lets’s create a user hadoopuser on master as well as slave nodes.

# Create hadoopgroup
$ sudo addgroup hadoopgroup
# Create hadoopuser user
$ sudo adduser —ingroup hadoopgroup hadoopuser

Our next step will be to generate a ssh key for password-less login between master and slave nodes.  Run the following commands only on master node.  Run the last two commands for each slave node.  Password less ssh should be working before you can proceed with further steps.

# Login as hadoopuser
$ su - hadoopuser
#Generate a ssh key for the user
$ ssh-keygen -t rsa -P ""
#Authorize the key to enable password less ssh 
$ cat /home/hadoopuser/.ssh/id_rsa.pub >> /home/hadoopuser/.ssh/authorized_keys
$ chmod 600 authorized_keys
#Copy this key to slave-1 to enable password less ssh 
$ ssh-copy-id -i ~/.ssh/id_rsa.pub slave-1
#Make sure you can do a password less ssh using following command.
$ ssh slave-1

Download and Install Hadoop binaries on Master and Slave nodes

Pick the best mirror site to download the binaries from Apache Hadoop, and download the stable/hadoop-2.6.0.tar.gz for your installation.  Do this step on master and every slave node.  You can download the file once and the distribute to each slave node using scp command.

$ cd /home/hadoopuser
$ wget http://www.webhostingjams.com/mirror/apache/hadoop/core/stable/hadoop-2.2.0.tar.gz
$ tar xvf hadoop-2.2.0.tar.gz
$ mv hadoop-2.2.0 hadoop

Setup Hadoop Environment on Master and Slave Nodes

Copy and paste following lines into your .bashrc file under /home/hadoopuser. Do this step on master and every slave node.

# Set HADOOP_HOME
export HADOOP_HOME=/home/hduser/hadoop
# Set JAVA_HOME 
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Add Hadoop bin and sbin directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin;$HADOOP_HOME/sbin

Update hadoop-env.sh on Master and Slave Nodes

Update JAVA_HOME in /home/hadoopuser/hadoop/etc/hadoop/hadoop_env.sh to following. Do this step on master and every slave node.

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Common Terminologies
Before we start getting into configuration details, lets discuss some of the basic terminologies used in Hadoop.

  • Hadoop Distributed File System: A distributed file system that provides high-throughput access to application data. A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. If you compare HDFS to a traditional storage structures ( e.g. FAT, NTFS), then NameNode is analogous to a Directory Node structure, and DataNode is analogous to actual file storage blocks.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Update Configuration Files
Add/update core-site.xml on Master and Slave nodes with following options.  Master and slave nodes should all be using the same value for this property fs.defaultFS,  and should be pointing to master node only.

  /home/hadoopuser/hadoop/etc/hadoop/core-site.xml (Other Options)
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoopuser/tmp</value>
  <description>Temporary Directory.</description>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:54310</value>
  <description>Use HDFS as file storage engine</description>
</property>

 

Add/update mapred-site.xml on Master node only with following options.

  /home/hadoopuser/hadoop/etc/hadoop/mapred-site.xml (Other Options)
<property>
 <name>mapreduce.jobtracker.address</name>
 <value>master:54311</value>
 <description>The host and port that the MapReduce job tracker runs
  at. If “local”, then jobs are run in-process as a single map
  and reduce task.
</description>
</property>
<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 <description>The framework for running mapreduce jobs</description>
</property>

 

Add/update hdfs-site.xml on Master and Slave Nodes. We will be adding following three entries to the file.

  • dfs.replication– Here I am using a replication factor of 2. That means for every file stored in HDFS, there will be one redundant replication of that file on some other node in the cluster.
  • dfs.namenode.name.dir – This directory is used by Namenode to store its metadata file.  Here i manually created this directory /hadoop-data/hadoopuser/hdfs/namenode on master and slave node, and use the directory location for this configuration.
  • dfs.datanode.data.dir – This directory is used by Datanode to store hdfs data blocks.  Here i manually created this directory /hadoop-data/hadoopuser/hdfs/datanode on master and slave node, and use the directory location for this configuration.
  /home/hadoopuser/hadoop/etc/hadoop/hdfs-site.xml (Other Options)
<property>
 <name>dfs.replication</name>
 <value>2</value>
 <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
 </description>
</property>
<property>
 <name>dfs.namenode.name.dir</name>
 <value>/hadoop-data/hadoopuser/hdfs/namenode</value>
 <description>Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
 </description>
</property>
<property>
 <name>dfs.datanode.data.dir</name>
 <value>/hadoop-data/hadoopuser/hdfs/datanode</value>
 <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
 </description>
</property>

 

Add yarn-site.xml on Master and Slave Nodes.  This file is required for a Node to work as a Yarn Node.  Master and slave nodes should all be using the same value for the following properties,  and should be pointing to master node only.

  /home/hadoopuser/hadoop/etc/hadoop/yarn-site.xml
<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
</property>
<property>
 <name>yarn.resourcemanager.scheduler.address</name>
 <value>master:8030</value>
</property> 
<property>
 <name>yarn.resourcemanager.address</name>
 <value>master:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>master:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>master:8033</value>
</property>

 

Add/update slaves file on Master node only.  Add just name, or ip addresses of master and all slave node.  If file has an entry for localhost, you can remove that.  This file is just helper file that are used by hadoop scripts to start appropriate services on master and slave nodes.

  /home/hadoopuser/hadoop/etc/hadoop/slave
master
slave-1
slave-2

Format the Namenode
Before starting the cluster, we need to format the Namenode. Use the following command only on master node:

$ hdfs namenode -format

Start the Distributed Format System 

Run the following on master node command to start the DFS.

$ ./home/hadoopuser/hadoop/sbin/start-dfs.sh

You should observe the output to ascertain that it tries to start datanode on slave nodes one by one.   To validate the success, run following command on master nodes, and slave node.

$ su - hadoopuser
$ jps

The output of this command should list NameNode, SecondaryNameNode, DataNode on master node, and DataNode on all slave nodes.  If you don’t see the expected output, review the log files listed in Troubleshooting section.

Start the Yarn MapReduce Job tracker

Run the following command to start the Yarn mapreduce framework.

$ ./home/hadoopuser/hadoop/sbin/start-yarn.sh

To validate the success, run jps command again on master nodes, and slave node.The output of this command should list NodeManager, ResourceManager on master node, and NodeManager, on all slave nodes.  If you don’t see the expected output, review the log files listed in Troubleshooting section.

Review Yarn Web console

If all the services started successfully on all nodes, then you should see all of your nodes listed under Yarn nodes.  You can hit the following url on your browser and verify that:

http://master:8088/cluster/nodes

Lets’s execute a MapReduce example now

You should be all set to run a MapReduce example now. Run the following command

$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 30 100

Once the job is submitted you can validate that its running on the cluster by accessing following url.

http://master:8088/cluster/apps

Troubleshooting
Hadoop uses $HADOOP_HOME/logs directory. In case you get into any issues with your installation, that should be the first point to look at. In case, you need help with anything else, do leave me a comment.

Feedback and Questions?

if you have any feedback, or questions do leave a comment

Related Articles

Installing Hadoop on Ubuntu 14.04 ( Single Node Installation)

Hadoop Java HotSpot execstack warning

References

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html

 

Advertisements

124 thoughts on “Install a Multi Node Hadoop Cluster on Ubuntu 14.04

  1. $ sudo adduser —ingroup hadoopgroup hadoopuser
    adduser: Only one or two names allowed.
    sir i’m getting this first
    and this error
    cat /home/hadoopuser/.ssh/id_rsa.pub >> /home/hadoopuser/.ssh/authorized_keys
    bash: /home/hadoopuser/.ssh/authorized_keys: No such file or directory
    pls help

    Like

  2. Starting namenodes on [hadoopmaster]
    hadoopmaster: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    hadoopmaster: @ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
    hadoopmaster: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    hadoopmaster: The ECDSA host key for hadoopmaster has changed,
    hadoopmaster: and the key for the corresponding IP address 192.168.30.150
    hadoopmaster: is unknown. This could either mean that
    hadoopmaster: DNS SPOOFING is happening or the IP address for the host
    hadoopmaster: and its host key have changed at the same time.
    hadoopmaster: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    hadoopmaster: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
    hadoopmaster: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    hadoopmaster: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
    hadoopmaster: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    hadoopmaster: It is also possible that a host key has just been changed.
    hadoopmaster: The fingerprint for the ECDSA key sent by the remote host is
    hadoopmaster: e6:f7:96:6b:c9:13:45:5a:f3:06:fd:b2:f4:c6:94:a0.
    hadoopmaster: Please contact your system administrator.

    Even getting port 22:connection refused error

    In master node Namenode and nodemanager services are not running.Resourcemanager is running in master node and in slave node datanode is running successfully
    Please help ASAP..your help will be appreciated

    Like

  3. Hi, thanks for this tutorial.

    when I try to submit a new mapreduce task, It fails and I get these messages :

    Number of Maps = 16
    Samples per Map = 10000000
    16/03/10 11:41:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    Wrote input for Map #0
    Wrote input for Map #1
    Wrote input for Map #2
    Wrote input for Map #3
    Wrote input for Map #4
    Wrote input for Map #5
    Wrote input for Map #6
    Wrote input for Map #7
    Wrote input for Map #8
    Wrote input for Map #9
    Wrote input for Map #10
    Wrote input for Map #11
    Wrote input for Map #12
    Wrote input for Map #13
    Wrote input for Map #14
    Wrote input for Map #15
    Starting Job
    16/03/10 11:41:27 INFO client.RMProxy: Connecting to ResourceManager at diagnostix/192.168.68.31:8032
    16/03/10 11:41:27 INFO input.FileInputFormat: Total input paths to process : 16
    16/03/10 11:41:27 INFO mapreduce.JobSubmitter: number of splits:16
    16/03/10 11:41:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1457628055978_0001
    16/03/10 11:41:28 INFO impl.YarnClientImpl: Submitted application application_1457628055978_0001
    16/03/10 11:41:28 INFO mapreduce.Job: The url to track the job: http://diagnostix:8088/proxy/application_1457628055978_0001/
    16/03/10 11:41:28 INFO mapreduce.Job: Running job: job_1457628055978_0001
    16/03/10 11:42:00 INFO mapreduce.Job: Job job_1457628055978_0001 running in uber mode : false
    16/03/10 11:42:00 INFO mapreduce.Job: map 0% reduce 0%
    16/03/10 11:42:05 INFO mapreduce.Job: map 13% reduce 0%
    16/03/10 11:42:05 INFO mapreduce.Job: Task Id : attempt_1457628055978_0001_m_000010_0, Status : FAILED
    null

    16/03/10 11:42:05 INFO mapreduce.Job: Task Id : attempt_1457628055978_0001_m_000009_0, Status : FAILED
    null

    16/03/10 11:42:06 INFO mapreduce.Job: map 0% reduce 0%
    16/03/10 11:42:07 INFO mapreduce.Job: Task Id : attempt_1457628055978_0001_m_000010_1, Status : FAILED
    null

    16/03/10 11:42:09 INFO mapreduce.Job: Task Id : attempt_1457628055978_0001_m_000009_1, Status : FAILED
    null

    16/03/10 11:42:09 INFO mapreduce.Job: Task Id : attempt_1457628055978_0001_m_000010_2, Status : FAILED
    null

    Any idea ?

    thanks

    Like

  4. Hi sumit,
    I had set up a two node cluster and I tried to execute the traditional word count program which shows the following error…

    16/03/16 11:21:40 INFO mapreduce.Job: map 0% reduce 0%
    16/03/16 11:21:40 INFO mapreduce.Job: Job job_1458107299826_0002 failed with state FAILED due to: Application application_1458107299826_0002 failed 2 times due to Error launching appattempt_1458107299826_0002_000002. Got exception: java.net.ConnectException: Call From abc-OptiPlex-3020/127.0.1.1 to abc-OptiPlex-3020:52890 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl….:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorI…:45)

    my /etc/hosts file is as follows:

    127.0.0.1 localhost
    127.0.1.1 abc-OptiPlex-3020

    # The following lines are desirable for IPv6 capable hosts
    ::1 ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    192.168.12.106 HadoopMaster
    192.168.12.105 HadoopSlave1

    What can be the root cause for this? Any suggestions to rectify this will be much appreciated..

    Like

    • Hi John

      Did you check on what port are the hadoop processes listening? Doing following command should help

      netstat -anp tcp | grep 52890

      Did you check output of jps command? Another option i would try is to put the actual IP in /etc/hosts for abc-OptiPlex-3020 instead of 127.0.1.1.

      Like

  5. when I run map reduce wordcount program, it stops at the point
    INFO mapreduce.job: The url to track the job: http://master:8088/proxy/application_1459310483348_001/
    INFO mapreduce.job: Running job_1459310483348_001

    no error at all. But when I checked the url “master:8088/cluster/nodes” ,yarnapplicationstate is “ACCEPTED:waiting for AM container to be allocated, launched and register with RM”. It shows “memory total” is 0 bytes and active nodes as 0 and unhealthy nodes as 1. jps command lists “namenode, datanode, resourcemanager, nodemanager and historyserver” on master and on slave “datanode and nodemanager”

    Please reply as soon as possible

    Like

  6. /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

    /usr/bin/ssh-copy-id: ERROR: ssh: connect to host hadoopmaster port 22: No route to host

    what to do know

    Like

  7. hi
    thanks for the wonderful post….
    the cluster is up and running….
    but i do find that the cluster is too slow….
    when i tried to run the sample
    $ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 30 100
    the cluster takes about 15 min to give result….
    is this timing normal…….
    note
    i am running the master node on a 32 bit machine

    Like

  8. Hi,
    I have 3 node cluster (1 master, 2 slaves). I have successfully executed a word count program but it is not showing any job information in resource manager (locahost:8088). Please can you help me. Thank you.

    Like

  9. I tried to setup multinode cluster. I have executed all the steps successfully. But when i try to start-dfs.sh command, i’m experiencing connection to master and slave nodes are timed out with port no: 22. Can you help me how to address this?

    Like

  10. Everything worked for me and i am able to see the DataNode. But on the log i am getting
    “2016-11-28 17:40:43,849 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: hadoop-master:54310”

    I have the infrastructure on Digitalocean

    Like

  11. this is a bullshit process
    User created was hadoopuser: sudo adduser —ingroup hadoopgroup hadoopuser
    and then in HOME variable it hduser
    export HADOOP_HOME=/home/hduser/hadoop

    in the hdfs-site.xml config:
    hadoop-data/hadoopuser/hdfs/datanode
    Wahts is hadoop-data???

    completely copy pasted code…

    WARNING: don’t use this crap process

    Like

  12. i am new in hadoop. i want to know that for multi node in hadoop , multi machines will be used? or will it be install on single machine with multi nodes?

    Like

  13. hi !!
    commande:
    nn@nn-VirtualBox:~/hadoop-2.6.3/share/hadoop/mapreduce$ ./hadoop-mapreduce-examples-2.6.3.jar wordcount bigtext.txt output
    errore:
    bash: ./hadoop-mapreduce-examples-2.6.3.jar: Permission non accordée !!!

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s