In this article, I document my first-hand experience of installing Hadoop on Ubuntu 14.04. I am using the stable Hadoop version 2.2.0. This article covers a single-node installation of Hadoop. If you want to do a multi-node installation, follow my other article here – Install a Multi Node Hadoop Cluster on Ubuntu 14.04
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

# Update Java runtime
$ sudo update-java-alternatives -s java-7-oracle
As of now, Hadoop does not support IPv6 and is tested to work only on IPv4 networks. If you are using IPv6, you need to switch the Hadoop host machines to IPv4. The Hadoop Wiki provides a one-liner command to disable IPv6. If you are not using IPv6, skip this step:
$ sudo sed -i 's/net.ipv6.bindv6only\ =\ 1/net.ipv6.bindv6only\ =\ 0/' \
    /etc/sysctl.d/bindv6only.conf && sudo invoke-rc.d procps restart
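If you want to see what that one-liner actually changes before touching the real config, you can run the same sed substitution against a throwaway copy of the file (the temp file below is purely illustrative, not the real /etc/sysctl.d/bindv6only.conf):

```shell
# Apply the Wiki one-liner's sed substitution to a scratch copy,
# so the effect can be inspected safely.
tmpconf=$(mktemp)
echo "net.ipv6.bindv6only = 1" > "$tmpconf"
sed -i 's/net.ipv6.bindv6only\ =\ 1/net.ipv6.bindv6only\ =\ 0/' "$tmpconf"
cat "$tmpconf"    # prints: net.ipv6.bindv6only = 0
rm -f "$tmpconf"
```

With bindv6only set to 0, sockets bind to IPv4 addresses as well, which is what Hadoop expects.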
Setting up a Hadoop User
Hadoop talks to other nodes in the cluster using passwordless SSH. Running Hadoop under a dedicated user account makes it easy to distribute the SSH keys around the Hadoop cluster.
# Create hadoopgroup
$ sudo addgroup hadoopgroup

# Create hadoopuser user
$ sudo adduser --ingroup hadoopgroup hadoopuser

# Login as hadoopuser
$ su - hadoopuser

# Generate an ssh key for the user
$ ssh-keygen -t rsa -P ""

# Authorize the key to enable passwordless ssh
$ cat /home/hadoopuser/.ssh/id_rsa.pub >> /home/hadoopuser/.ssh/authorized_keys
$ chmod 600 /home/hadoopuser/.ssh/authorized_keys
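If you want to rehearse the key setup without touching the real ~/.ssh, the same steps work against a scratch directory (the temp path below is purely illustrative). The essential part is that the public key is appended to authorized_keys and that the file has 600 permissions; sshd refuses keys in a world-readable authorized_keys file:

```shell
# Rehearse the key setup in a scratch directory: generate a key pair with an
# empty passphrase, authorize the public half, and lock down the permissions.
tmpdir=$(mktemp -d)
ssh-keygen -t rsa -P "" -f "$tmpdir/id_rsa" > /dev/null
cat "$tmpdir/id_rsa.pub" >> "$tmpdir/authorized_keys"
chmod 600 "$tmpdir/authorized_keys"
ls -l "$tmpdir/authorized_keys"   # mode shows as -rw-------
rm -rf "$tmpdir"
```

On the real account you can then confirm passwordless login with `ssh localhost`, which should not prompt for a password.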
Download and Install Hadoop
Pick the best mirror site to download the binaries from Apache Hadoop, and download the stable/hadoop-2.2.0.tar.gz for your installation.
$ cd /home/hadoopuser
$ wget http://www.webhostingjams.com/mirror/apache/hadoop/core/stable/hadoop-2.2.0.tar.gz
$ tar xvf hadoop-2.2.0.tar.gz
$ mv hadoop-2.2.0 hadoop
Setup Hadoop Environment
Copy and paste the following lines into your .bashrc file under /home/hadoopuser.
# Set HADOOP_HOME
export HADOOP_HOME=/home/hadoopuser/hadoop

# Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Add Hadoop bin and sbin directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
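After sourcing .bashrc, a quick sanity check confirms both Hadoop directories landed on PATH. Note that PATH entries are separated by colons, not semicolons; a stray semicolon silently breaks the entries after it:

```shell
# Reproduce the exports from .bashrc and list the PATH entries one per line,
# confirming the Hadoop bin and sbin directories are present.
export HADOOP_HOME=/home/hadoopuser/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
echo "$PATH" | tr ':' '\n' | grep hadoop
# prints /home/hadoopuser/hadoop/bin and /home/hadoopuser/hadoop/sbin
```

Once the PATH is correct, `hadoop version` should run from any directory.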
Update JAVA_HOME in /home/hadoopuser/hadoop/etc/hadoop/hadoop-env.sh to the following:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
Before we get into configuration details, let's discuss some of the basic terminology used in Hadoop.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. If you compare HDFS to traditional storage structures (e.g. FAT, NTFS), the NameNode is analogous to the directory node structure, and the DataNodes are analogous to the actual file storage blocks.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Update Configuration Files
The Hadoop Wiki provides the set of configuration settings needed to start a single-node cluster. That documentation is outdated, however, and the file structure has changed since it was written. Add the following settings to the respective files, under the <configuration> section, to apply them in the new file scheme. Make sure to replace machine-name with the name of your machine.
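As a starting point, here is a minimal single-node sketch of the four files under /home/hadoopuser/hadoop/etc/hadoop. The property names are standard Hadoop 2.x settings, but the port number and the machine-name placeholder are assumptions you should adjust for your setup; each snippet goes inside that file's <configuration> element:

```xml
<!-- core-site.xml: where clients find the file system -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://machine-name:9000</value>
</property>

<!-- hdfs-site.xml: a single node can only hold one replica -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- yarn-site.xml: enable the shuffle service for MapReduce -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```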
Format the Namenode
Before starting the cluster, we need to format the Namenode. Use the following command:
$ hdfs namenode -format
Start the Distributed File System
Run the following command to start the DFS:

$ start-dfs.sh
After this command runs successfully, you can run the jps command and see that NameNode, SecondaryNameNode, and DataNode are now running.
Start the YARN MapReduce Job Tracker
Run the following command to start YARN:

$ start-yarn.sh
After this command runs successfully, you can run the jps command and see that NodeManager and ResourceManager are now running.
Let's execute a MapReduce example now
You should be all set to run a MapReduce example now. Run the following command:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 3 10
Feedback and Questions?
If you have any feedback or questions, do leave a comment.
Hadoop writes its logs to the $HADOOP_HOME/logs directory. If you run into any issues with your installation, that should be the first place to look. If you need help with anything else, do leave me a comment.