This article is about multi-node installation of Hadoop cluster. You would need minimum of 2 ubuntu machines or virtual images to complete a multi-node installation. If you want to just try out a single node cluster, follow this article on Installing Hadoop on Ubuntu 14.04.
I used Hadoop Stable version 2.6.0 for this article. I did this setup on a 3 node cluster. For simplicity, i will designate one node as master, and 2 nodes as slaves (slave-1, and slave-2). Make sure all slave nodes are reachable from master node. To avoid any unreachable hosts error, make sure you add the slave hostnames and ip addresses in /etc/hosts file. Similarly, slave nodes should be able to resolve master hostname.
Installing Java on Master and Slaves
$ sudo add-apt-repository ppa:webupd8team/java $ sudo apt-get update $ sudo apt-get install oracle-java7-installer # Updata Java runtime $ sudo update-java-alternatives -s java-7-oracle
As of now Hadoop does not support IPv6, and is tested to work only on IPv4 networks. If you are using IPv6, you need to switch Hadoop host machines to use IPv4. The Hadoop Wiki link provides a one liner command to disable the IPv6. If you are not using IPv6, skip this step:
sudo sed -i 's/net.ipv6.bindv6only\ =\ 1/net.ipv6.bindv6only\ =\ 0/' \ /etc/sysctl.d/bindv6only.conf && sudo invoke-rc.d procps restart
Setting up a Hadoop User
Hadoop talks to other nodes in the cluster using no-password ssh. By having Hadoop run under a specific user context, it will be easy to distribute the ssh keys around in the Hadoop cluster. Lets’s create a user hadoopuser on master as well as slave nodes.
# Create hadoopgroup $ sudo addgroup hadoopgroup # Create hadoopuser user $ sudo adduser —ingroup hadoopgroup hadoopuser
Our next step will be to generate a ssh key for password-less login between master and slave nodes. Run the following commands only on master node. Run the last two commands for each slave node. Password less ssh should be working before you can proceed with further steps.
# Login as hadoopuser $ su - hadoopuser #Generate a ssh key for the user $ ssh-keygen -t rsa -P "" #Authorize the key to enable password less ssh $ cat /home/hadoopuser/.ssh/id_rsa.pub >> /home/hadoopuser/.ssh/authorized_keys $ chmod 600 authorized_keys #Copy this key to slave-1 to enable password less ssh $ ssh-copy-id -i ~/.ssh/id_rsa.pub slave-1 #Make sure you can do a password less ssh using following command. $ ssh slave-1
Download and Install Hadoop binaries on Master and Slave nodes
Pick the best mirror site to download the binaries from Apache Hadoop, and download the stable/hadoop-2.6.0.tar.gz for your installation. Do this step on master and every slave node. You can download the file once and the distribute to each slave node using scp command.
$ cd /home/hadoopuser $ wget http://www.webhostingjams.com/mirror/apache/hadoop/core/stable/hadoop-2.2.0.tar.gz $ tar xvf hadoop-2.2.0.tar.gz $ mv hadoop-2.2.0 hadoop
Setup Hadoop Environment on Master and Slave Nodes
Copy and paste following lines into your .bashrc file under /home/hadoopuser. Do this step on master and every slave node.
# Set HADOOP_HOME export HADOOP_HOME=/home/hduser/hadoop # Set JAVA_HOME export JAVA_HOME=/usr/lib/jvm/java-7-oracle # Add Hadoop bin and sbin directory to PATH export PATH=$PATH:$HADOOP_HOME/bin;$HADOOP_HOME/sbin
Update hadoop-env.sh on Master and Slave Nodes
Update JAVA_HOME in /home/hadoopuser/hadoop/etc/hadoop/hadoop_env.sh to following. Do this step on master and every slave node.
Before we start getting into configuration details, lets discuss some of the basic terminologies used in Hadoop.
- Hadoop Distributed File System: A distributed file system that provides high-throughput access to application data. A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. If you compare HDFS to a traditional storage structures ( e.g. FAT, NTFS), then NameNode is analogous to a Directory Node structure, and DataNode is analogous to actual file storage blocks.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Update Configuration Files
Add/update core-site.xml on Master and Slave nodes with following options. Master and slave nodes should all be using the same value for this property fs.defaultFS, and should be pointing to master node only.
Add/update mapred-site.xml on Master node only with following options.
Add/update hdfs-site.xml on Master and Slave Nodes. We will be adding following three entries to the file.
- dfs.replication– Here I am using a replication factor of 2. That means for every file stored in HDFS, there will be one redundant replication of that file on some other node in the cluster.
- dfs.namenode.name.dir – This directory is used by Namenode to store its metadata file. Here i manually created this directory /hadoop-data/hadoopuser/hdfs/namenode on master and slave node, and use the directory location for this configuration.
- dfs.datanode.data.dir – This directory is used by Datanode to store hdfs data blocks. Here i manually created this directory /hadoop-data/hadoopuser/hdfs/datanode on master and slave node, and use the directory location for this configuration.
Add yarn-site.xml on Master and Slave Nodes. This file is required for a Node to work as a Yarn Node. Master and slave nodes should all be using the same value for the following properties, and should be pointing to master node only.
Add/update slaves file on Master node only. Add just name, or ip addresses of master and all slave node. If file has an entry for localhost, you can remove that. This file is just helper file that are used by hadoop scripts to start appropriate services on master and slave nodes.
master slave-1 slave-2
Format the Namenode
Before starting the cluster, we need to format the Namenode. Use the following command only on master node:
$ hdfs namenode -format
Start the Distributed Format System
Run the following on master node command to start the DFS.
You should observe the output to ascertain that it tries to start datanode on slave nodes one by one. To validate the success, run following command on master nodes, and slave node.
$ su - hadoopuser $ jps
The output of this command should list NameNode, SecondaryNameNode, DataNode on master node, and DataNode on all slave nodes. If you don’t see the expected output, review the log files listed in Troubleshooting section.
Start the Yarn MapReduce Job tracker
Run the following command to start the Yarn mapreduce framework.
To validate the success, run jps command again on master nodes, and slave node.The output of this command should list NodeManager, ResourceManager on master node, and NodeManager, on all slave nodes. If you don’t see the expected output, review the log files listed in Troubleshooting section.
Review Yarn Web console
If all the services started successfully on all nodes, then you should see all of your nodes listed under Yarn nodes. You can hit the following url on your browser and verify that:
Lets’s execute a MapReduce example now
You should be all set to run a MapReduce example now. Run the following command
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 30 100
Once the job is submitted you can validate that its running on the cluster by accessing following url.
Hadoop uses $HADOOP_HOME/logs directory. In case you get into any issues with your installation, that should be the first point to look at. In case, you need help with anything else, do leave me a comment.
Feedback and Questions?
if you have any feedback, or questions do leave a comment