Hadoop – GC overhead limit exceeded error

In our Hadoop setup, we ended up with more than 1 million files in a single folder.  The folder had so many files that any hdfs dfs command run against it, such as -ls or -copyToLocal, failed with the following error:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:2367)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
        at java.lang.StringBuffer.append(StringBuffer.java:237)
        at java.net.URI.appendAuthority(URI.java:1852)
        at java.net.URI.appendSchemeSpecificPart(URI.java:1890)
        at java.net.URI.toString(URI.java:1922)
        at java.net.URI.<init>(URI.java:749)
        at org.apache.hadoop.fs.Path.initialize(Path.java:203)
        at org.apache.hadoop.fs.Path.<init>(Path.java:116)
        at org.apache.hadoop.fs.Path.<init>(Path.java:94)
        at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:230)
        at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.makeQualified(HdfsFileStatus.java:263)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:732)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
        at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
        at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
        at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268)
        at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347)
        at org.apache.hadoop.fs.shell.CommandWithDestination.recursePath(CommandWithDestination.java:291)
        at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:308)
        at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278)
        at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:243)
        at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
        at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
        at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:220)
        at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)

After some research, we added the following environment variable to update the Hadoop runtime options.

export HADOOP_OPTS="-XX:-UseGCOverheadLimit"

Adding this option fixed the GC overhead error, but the command then failed with the following error about insufficient Java heap space.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1351)
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1413)
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1524)
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1533)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:557)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy15.getListing(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
        at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:724)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
        at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
        at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
        at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268)
        at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347)
        at org.apache.hadoop.fs.shell.CommandWithDestination.recursePath(CommandWithDestination.java:291)
        at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:308)
        at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278)
        at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:243)
        at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
        at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
        at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:220)
        at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

We modified the above export and tried the following instead.  Note that instead of HADOOP_OPTS, we needed to set HADOOP_CLIENT_OPTS to fix this error.  This is because all the hadoop shell commands run as clients: HADOOP_OPTS is used to modify the runtime options of the actual Hadoop daemons, while HADOOP_CLIENT_OPTS modifies the runtime options of the Hadoop command-line client.

export HADOOP_CLIENT_OPTS="-XX:-UseGCOverheadLimit -Xmx4096m"
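To avoid exporting this in every shell session, the same setting can also be placed in hadoop-env.sh.  A minimal sketch, assuming the default etc/hadoop layout and the 4 GB client heap used above:

# In $HADOOP_HOME/etc/hadoop/hadoop-env.sh (path assumed; adjust to your install)
# Raise the heap only for command-line clients such as "hdfs dfs"
export HADOOP_CLIENT_OPTS="-XX:-UseGCOverheadLimit -Xmx4096m $HADOOP_CLIENT_OPTS"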

 

How to fix NameNode – SafeModeException

When you try to perform an HDFS operation that modifies the file system, you may get the following exception:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): 
Cannot create directory /user/hadoopuser/dir/in. 
Name node is in safe mode.

What is Safe Mode in Hadoop?

Safe Mode in Hadoop is a maintenance state of the NameNode.  During Safe Mode, the HDFS cluster is read-only and does not allow any changes to the namespace.  It also does not replicate or delete blocks.

When the NameNode starts, it automatically enters Safe Mode and performs the following initialization tasks:

  • Loads the file system namespace from the last saved fsimage.
  • Loads the edit log file.
  • Applies the edit log changes to the fsimage and builds a new in-memory file system namespace.
  • Receives block reports containing block location information from all DataNodes.

To leave Safe Mode, the NameNode must collect block reports for at least a configurable threshold percentage of blocks, and those blocks must satisfy the minimum replication condition.  Even if this threshold is reached quickly, Safe Mode is extended by a configurable amount of time to make sure the remaining DataNodes check in before the NameNode starts replicating missing blocks or deleting over-replicated ones.  After this block maintenance activity completes, the NameNode leaves Safe Mode automatically.
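The threshold and the extension time are controlled by configuration properties.  As an illustration (the property names below are taken from hdfs-default.xml; the defaults may differ on your distribution), you can inspect the values your cluster uses with hdfs getconf:

# Fraction of blocks that must meet minimal replication before safe mode can exit (default 0.999)
hdfs getconf -confKey dfs.namenode.safemode.threshold-pct
# Extra time, in milliseconds, that safe mode stays on after the threshold is met (default 30000)
hdfs getconf -confKey dfs.namenode.safemode.extension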

 

You can check whether your Hadoop cluster is in Safe Mode by running the following command:

hdfs dfsadmin -safemode get

If you just restarted your cluster, give it ample time to come out of Safe Mode; the time needed varies with the size of your cluster.  If it stays stuck in that state, you can force the NameNode out of Safe Mode with the following command:

hdfs dfsadmin -safemode leave
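If you simply want to wait for the NameNode to finish on its own, for example in a startup script, hdfs dfsadmin also has a wait option that blocks until Safe Mode is off (a small sketch; verify the option against your Hadoop version):

# Block until the NameNode has left safe mode, then continue
hdfs dfsadmin -safemode wait
hdfs dfs -mkdir -p /user/hadoopuser/dir/in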

Install a Multi Node Hadoop Cluster on Ubuntu 14.04

This article covers a multi-node installation of a Hadoop cluster.  You need a minimum of 2 Ubuntu machines or virtual images to complete a multi-node installation.  If you just want to try out a single-node cluster, follow this article on Installing Hadoop on Ubuntu 14.04.

I used the Hadoop stable version 2.6.0 for this article and did this setup on a 3-node cluster.  For simplicity, I will designate one node as master and two nodes as slaves (slave-1 and slave-2).  Make sure all slave nodes are reachable from the master node.  To avoid unreachable-host errors, add the slave hostnames and IP addresses to the /etc/hosts file on the master; similarly, the slave nodes should be able to resolve the master hostname (see the example below).
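For example, /etc/hosts on each node could contain entries like the following (the IP addresses here are made up; replace them with the real addresses of your machines):

# /etc/hosts -- example cluster entries (hypothetical addresses)
192.168.1.10    master
192.168.1.11    slave-1
192.168.1.12    slave-2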

Installing Java on Master and Slaves

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
# Update Java runtime
$ sudo update-java-alternatives -s java-7-oracle

Disable IPv6

As of now, Hadoop does not support IPv6 and is tested to work only on IPv4 networks.  If you are using IPv6, you need to switch the Hadoop host machines to IPv4.  The Hadoop Wiki link provides a one-liner command to disable IPv6.  If you are not using IPv6, skip this step:

sudo sed -i 's/net.ipv6.bindv6only\ =\ 1/net.ipv6.bindv6only\ =\ 0/' \
/etc/sysctl.d/bindv6only.conf && sudo invoke-rc.d procps restart

Setting up a Hadoop User

Hadoop talks to other nodes in the cluster using password-less SSH.  By running Hadoop under a specific user context, it is easy to distribute the SSH keys around the Hadoop cluster.  Let's create a user hadoopuser on the master as well as the slave nodes.

# Create hadoopgroup
$ sudo addgroup hadoopgroup
# Create hadoopuser user
$ sudo adduser --ingroup hadoopgroup hadoopuser

Our next step is to generate an SSH key for password-less login between the master and slave nodes.  Run the following commands only on the master node, and repeat the last two commands once for each slave node (substituting the slave hostname).  Password-less SSH must be working before you proceed with the remaining steps.

# Login as hadoopuser
$ su - hadoopuser
#Generate a ssh key for the user
$ ssh-keygen -t rsa -P ""
#Authorize the key to enable password less ssh 
$ cat /home/hadoopuser/.ssh/id_rsa.pub >> /home/hadoopuser/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
#Copy this key to slave-1 to enable password less ssh 
$ ssh-copy-id -i ~/.ssh/id_rsa.pub slave-1
#Make sure you can do a password less ssh using following command.
$ ssh slave-1

Download and Install Hadoop binaries on Master and Slave nodes

Pick the best mirror site to download the binaries from Apache Hadoop, and download the stable hadoop-2.6.0.tar.gz for your installation.  Do this step on the master and every slave node.  Alternatively, you can download the file once and then distribute it to each slave node using scp (see the example after the commands below).

$ cd /home/hadoopuser
$ wget http://www.webhostingjams.com/mirror/apache/hadoop/core/stable/hadoop-2.6.0.tar.gz
$ tar xvf hadoop-2.6.0.tar.gz
$ mv hadoop-2.6.0 hadoop
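If you downloaded the tarball on the master only, you can push it to the slaves with scp and repeat the extraction there, roughly like this (the hostnames assume the slave-1/slave-2 naming used above):

# Copy the tarball from the master to each slave (run on the master)
$ scp hadoop-2.6.0.tar.gz hadoopuser@slave-1:/home/hadoopuser/
$ scp hadoop-2.6.0.tar.gz hadoopuser@slave-2:/home/hadoopuser/
# Then extract and rename it on each slave exactly as shown above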

Setup Hadoop Environment on Master and Slave Nodes

Copy and paste the following lines into the .bashrc file under /home/hadoopuser. Do this step on the master and every slave node.

# Set HADOOP_HOME
export HADOOP_HOME=/home/hadoopuser/hadoop
# Set JAVA_HOME 
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Add Hadoop bin and sbin directories to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
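After saving .bashrc, reload it so the variables take effect in your current shell; a quick sanity check could look like this:

# Reload the environment and verify the variables
$ source ~/.bashrc
$ echo $HADOOP_HOME
/home/hadoopuser/hadoop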

Update hadoop-env.sh on Master and Slave Nodes

Update JAVA_HOME in /home/hadoopuser/hadoop/etc/hadoop/hadoop-env.sh to the following. Do this step on the master and every slave node.

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Common Terminologies
Before we get into the configuration details, let's discuss some of the basic terminology used in Hadoop.

  • Hadoop Distributed File System: A distributed file system that provides high-throughput access to application data. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. If you compare HDFS to traditional storage structures (e.g. FAT, NTFS), the NameNode is analogous to the directory structure, and the DataNodes are analogous to the actual file storage blocks.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Update Configuration Files
Add/update core-site.xml on the master and slave nodes with the following options.  Master and slave nodes must all use the same value for the fs.defaultFS property, and it should point to the master node only.

  /home/hadoopuser/hadoop/etc/hadoop/core-site.xml (Other Options)
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoopuser/tmp</value>
  <description>Temporary Directory.</description>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:54310</value>
  <description>Use HDFS as file storage engine</description>
</property>

 

Add/update mapred-site.xml on the master node only with the following options.

  /home/hadoopuser/hadoop/etc/hadoop/mapred-site.xml (Other Options)
<property>
 <name>mapreduce.jobtracker.address</name>
 <value>master:54311</value>
 <description>The host and port that the MapReduce job tracker runs
  at. If “local”, then jobs are run in-process as a single map
  and reduce task.
</description>
</property>
<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 <description>The framework for running mapreduce jobs</description>
</property>

 

Add/update hdfs-site.xml on the master and slave nodes. We will add the following three entries to the file.

  • dfs.replication – Here I am using a replication factor of 2. That means for every file stored in HDFS, there will be one redundant copy of that file on some other node in the cluster.
  • dfs.namenode.name.dir – This directory is used by the NameNode to store its metadata files.  I manually created the directory /hadoop-data/hadoopuser/hdfs/namenode on the master and slave nodes and used that location for this configuration (see the commands after this list).
  • dfs.datanode.data.dir – This directory is used by the DataNode to store HDFS data blocks.  I manually created the directory /hadoop-data/hadoopuser/hdfs/datanode on the master and slave nodes and used that location for this configuration.
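
If these directories do not exist yet, one way to create them and hand them over to the Hadoop user is shown below (a sketch; adjust the paths, user, and group to your own setup):

# Create the NameNode and DataNode directories and give them to hadoopuser
$ sudo mkdir -p /hadoop-data/hadoopuser/hdfs/namenode /hadoop-data/hadoopuser/hdfs/datanode
$ sudo chown -R hadoopuser:hadoopgroup /hadoop-data/hadoopuser
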
  /home/hadoopuser/hadoop/etc/hadoop/hdfs-site.xml (Other Options)
<property>
 <name>dfs.replication</name>
 <value>2</value>
 <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
 </description>
</property>
<property>
 <name>dfs.namenode.name.dir</name>
 <value>/hadoop-data/hadoopuser/hdfs/namenode</value>
 <description>Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
 </description>
</property>
<property>
 <name>dfs.datanode.data.dir</name>
 <value>/hadoop-data/hadoopuser/hdfs/datanode</value>
 <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
 </description>
</property>

 

Add yarn-site.xml on the master and slave nodes.  This file is required for a node to work as a YARN node.  Master and slave nodes must all use the same values for the following properties, and they should point to the master node only.

  /home/hadoopuser/hadoop/etc/hadoop/yarn-site.xml
<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
</property>
<property>
 <name>yarn.resourcemanager.scheduler.address</name>
 <value>master:8030</value>
</property> 
<property>
 <name>yarn.resourcemanager.address</name>
 <value>master:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>master:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>master:8033</value>
</property>

 

Add/update the slaves file on the master node only.  Add the hostnames or IP addresses of the master and all slave nodes, one per line.  If the file has an entry for localhost, you can remove it.  This file is just a helper used by the Hadoop scripts to start the appropriate services on the master and slave nodes.

  /home/hadoopuser/hadoop/etc/hadoop/slaves
master
slave-1
slave-2

Format the Namenode
Before starting the cluster, we need to format the NameNode. Use the following command only on the master node:

$ hdfs namenode -format

Start the Distributed File System

Run the following command on the master node to start the DFS.

$ /home/hadoopuser/hadoop/sbin/start-dfs.sh

Watch the output to confirm that it tries to start a DataNode on the slave nodes one by one.  To validate success, run the following commands on the master node and on each slave node.

$ su - hadoopuser
$ jps

The output of this command should list NameNode, SecondaryNameNode, and DataNode on the master node, and DataNode on all slave nodes.  If you don't see the expected output, review the log files mentioned in the Troubleshooting section.
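For reference, the jps output on the master node could look roughly like this (the process IDs below are made up; only the process names matter):

$ jps
4850 NameNode
4971 DataNode
5076 SecondaryNameNode
5410 Jps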

Start the Yarn MapReduce Job tracker

Run the following command to start the Yarn mapreduce framework.

$ /home/hadoopuser/hadoop/sbin/start-yarn.sh

To validate success, run the jps command again on the master node and on each slave node.  The output should list NodeManager and ResourceManager on the master node, and NodeManager on all slave nodes.  If you don't see the expected output, review the log files mentioned in the Troubleshooting section.

Review Yarn Web console

If all the services started successfully on all nodes, you should see all of your nodes listed under YARN nodes.  Open the following URL in your browser to verify:

http://master:8088/cluster/nodes

Let's execute a MapReduce example now

You should be all set to run a MapReduce example now. Run the following command from the Hadoop install directory (the jar path below is relative to it):

$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 30 100

Once the job is submitted, you can validate that it is running on the cluster by accessing the following URL:

http://master:8088/cluster/apps

Troubleshooting
Hadoop writes its log files to the $HADOOP_HOME/logs directory. If you run into any issues with your installation, that should be the first place to look. If you need help with anything else, do leave me a comment.

Feedback and Questions?

If you have any feedback or questions, do leave a comment.

Related Articles

Installing Hadoop on Ubuntu 14.04 ( Single Node Installation)

Hadoop Java HotSpot execstack warning

References

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html

 

Hadoop Java HotSpot execstack warning

The pre-built Hadoop native libraries are compiled for a 32-bit runtime.  If you run these binaries on a 64-bit machine, you might notice the following warning when running any hadoop command.

For help with installing a Hadoop cluster, check out my article on Hadoop.

[Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /home/hadoopuser/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
0.0.0.0]

The warning can further manifest as the following messages:

[Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /home/hadoopuser/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
0.0.0.0]
sed: -e expression #1, char 6: unknown option to `s'
64-Bit: ssh: Could not resolve hostname 64-bit: Name or service not known
warning:: ssh: Could not resolve hostname warning:: Name or service not known
HotSpot(TM): ssh: Could not resolve hostname hotspot(tm): Name or service not known
VM: ssh: Could not resolve hostname vm: Name or service not known
You: ssh: Could not resolve hostname you: Name or service not known
Java: ssh: Could not resolve hostname java: Name or service not known
have: ssh: Could not resolve hostname have: Name or service not known
Server: ssh: Could not resolve hostname server: Name or service not known
loaded: ssh: Could not resolve hostname loaded: Name or service not known
which: ssh: Could not resolve hostname which: Name or service not known
library: ssh: Could not resolve hostname library: Name or service not known
might: ssh: Could not resolve hostname might: Name or service not known
disabled: ssh: Could not resolve hostname disabled: Name or service not known
have: ssh: Could not resolve hostname have: Name or service not known
stack: ssh: Could not resolve hostname stack: Name or service not known
guard.: ssh: Could not resolve hostname guard.: Name or service not known
will: ssh: Could not resolve hostname will: Name or service not known
The: ssh: Could not resolve hostname the: Name or service not known
VM: ssh: Could not resolve hostname vm: Name or service not known
try: ssh: Could not resolve hostname try: Name or service not known
to: ssh: Could not resolve hostname to: Name or service not known
-c: Unknown cipher type 'cd'
fix: ssh: Could not resolve hostname fix: Name or service not known
stack: ssh: Could not resolve hostname stack: Name or service not known
guard: ssh: Could not resolve hostname guard: Name or service not known
the: ssh: Could not resolve hostname the: Name or service not known
now.: ssh: Could not resolve hostname now.: Name or service not known
It's: ssh: Could not resolve hostname it's: Name or service not known
recommended: ssh: Could not resolve hostname recommended: Name or service not known
highly: ssh: Could not resolve hostname highly: Name or service not known
that: ssh: Could not resolve hostname that: Name or service not known
you: ssh: Could not resolve hostname you: Name or service not known
the: ssh: Could not resolve hostname the: Name or service not known
fix: ssh: Could not resolve hostname fix: Name or service not known
with: ssh: Could not resolve hostname with: Name or service not known
library: ssh: Could not resolve hostname library: Name or service not known
'execstack: ssh: Could not resolve hostname 'execstack: Name or service not known
',: ssh: Could not resolve hostname ',: Name or service not known
or: ssh: Could not resolve hostname or: Name or service not known
it: ssh: Could not resolve hostname it: Name or service not known
link: ssh: Could not resolve hostname link: Name or service not known
with: ssh: Could not resolve hostname with: Name or service not known
'-z: ssh: Could not resolve hostname '-z: Name or service not known
noexecstack'.: ssh: Could not resolve hostname noexecstack'.: Name or service not known

To fix this annoying warning, update the hadoop-env.sh file under etc/hadoop and replace the following line

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

with the following line, which adds -XX:-PrintWarnings:

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -XX:-PrintWarnings -Djava.net.preferIPv4Stack=true"
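Alternatively, instead of suppressing the warning, you could follow the advice in the message itself and clear the executable-stack flag on the native library with execstack.  This is a sketch that assumes the library path from the warning above and that the execstack package is available on your distribution:

# Install the execstack tool (package name assumed for Ubuntu)
$ sudo apt-get install execstack
# Clear the executable-stack flag on the Hadoop native library
$ sudo execstack -c /home/hadoopuser/hadoop/lib/native/libhadoop.so.1.0.0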

Related Articles

Installing Hadoop on Ubuntu 14.04

Install a Multi Node Hadoop Cluster on Ubuntu 14.04

Installing Hadoop on Ubuntu 14.04

In this article, I wanted to document my first-hand experience of installing Hadoop on Ubuntu 14.04.  I am using the Hadoop stable version 2.2.0 for this article.  This article covers a single-node installation of Hadoop.  If you want to do a multi-node installation, follow my other article here – Install a Multi Node Hadoop Cluster on Ubuntu 14.04.

Installing Java 

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
# Update Java runtime
$ sudo update-java-alternatives -s java-7-oracle

Disable IPv6

As of now, Hadoop does not support IPv6 and is tested to work only on IPv4 networks.  If you are using IPv6, you need to switch the Hadoop host machines to IPv4.  The Hadoop Wiki link provides a one-liner command to disable IPv6.  If you are not using IPv6, skip this step:

sudo sed -i 's/net.ipv6.bindv6only\ =\ 1/net.ipv6.bindv6only\ =\ 0/' \
/etc/sysctl.d/bindv6only.conf && sudo invoke-rc.d procps restart

Setting up a Hadoop User

Hadoop talks to other nodes in the cluster using password-less SSH.  By running Hadoop under a specific user context, it is easy to distribute the SSH keys around the Hadoop cluster.

# Create hadoopgroup
$ sudo addgroup hadoopgroup
# Create hadoopuser user
$ sudo adduser --ingroup hadoopgroup hadoopuser
# Login as hadoopuser
$ su - hadoopuser
#Generate a ssh key for the user
$ ssh-keygen -t rsa -P ""
#Authorize the key to enable password less ssh 
$ cat /home/hadoopuser/.ssh/id_rsa.pub >> /home/hadoopuser/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys

Download and Install Hadoop

Pick the best mirror site to download the binaries from Apache Hadoop, and download the stable/hadoop-2.2.0.tar.gz for your installation.

$ cd /home/hadoopuser
$ wget http://www.webhostingjams.com/mirror/apache/hadoop/core/stable/hadoop-2.2.0.tar.gz
$ tar xvf hadoop-2.2.0.tar.gz
$ mv hadoop-2.2.0 hadoop

Setup Hadoop Environment

Copy and paste the following lines into the .bashrc file under /home/hadoopuser.

# Set HADOOP_HOME
export HADOOP_HOME=/home/hadoopuser/hadoop
# Set JAVA_HOME 
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Add Hadoop bin and sbin directories to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Update hadoop-env.sh

Update JAVA_HOME in /home/hadoopuser/hadoop/etc/hadoop/hadoop-env.sh to the following:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Common Terminologies
Before we get into the configuration details, let's discuss some of the basic terminology used in Hadoop.

  • Hadoop Distributed File System: A distributed file system that provides high-throughput access to application data. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. If you compare HDFS to traditional storage structures (e.g. FAT, NTFS), the NameNode is analogous to the directory structure, and the DataNodes are analogous to the actual file storage blocks.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Update Configuration Files
The Hadoop Wiki provides a set of configurations needed to start a single-node cluster.  That documentation is outdated, and the file structure has changed since it was written.  Add the following settings to the respective files, inside the <configuration> section, following the new file layout.  Make sure to replace machine-name with the name of your machine.

  /home/hadoopuser/hadoop/etc/hadoop/core-site.xml (Other Options)
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoopuser/tmp</value>
  <description>Temporary Directory.</description>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://machine-name:54310</value>
  <description>Use HDFS as file storage engine</description>
</property>
  /home/hadoopuser/hadoop/etc/hadoop/mapred-site.xml (Other Options)
<property>
 <name>mapreduce.jobtracker.address</name>
 <value>machine-name:54311</value>
 <description>The host and port that the MapReduce job tracker runs
  at. If “local”, then jobs are run in-process as a single map
  and reduce task.
</description>
</property>
  /home/hadoopuser/hadoop/etc/hadoop/hdfs-site.xml (Other Options)
<property>
 <name>dfs.replication</name>
 <value>1</value>
 <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
 </description>
</property>

Format the Namenode
Before starting the cluster, we need to format the NameNode. Use the following command:

$ hdfs namenode -format

Start the Distributed File System

Run the following command to start the DFS.

$ /home/hadoopuser/hadoop/sbin/start-dfs.sh

After this command runs successfully, you can run the jps command and see that NameNode, SecondaryNameNode, and DataNode are now running.

Start the Yarn MapReduce Job tracker

Run the following command to start the YARN MapReduce framework.

$ /home/hadoopuser/hadoop/sbin/start-yarn.sh

After this command runs successfully, you can run the jps command and see that NodeManager and ResourceManager are now running.

Let's execute a MapReduce example now

You should be all set to run a MapReduce example now. Run the following command from the Hadoop install directory:

$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 3 10

Feedback and Questions?

If you have any feedback or questions, do leave a comment.

Troubleshooting
Hadoop writes its log files to the $HADOOP_HOME/logs directory. If you run into any issues with your installation, that should be the first place to look. If you need help with anything else, do leave me a comment.

Related Articles

Installing a Multi Node Hadoop Cluster on Ubuntu 14.04

Hadoop Java HotSpot execstack warning

References
http://wiki.apache.org/hadoop/GettingStartedWithHadoop
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/