Installing Hadoop on a Single-Node Cluster



Hadoop is a technology that enables distributed processing of large data sets across clusters ranging from one server to thousands of servers, while ensuring a high degree of fault tolerance.

Hadoop is a framework that consists of the following basic modules:

1) Hadoop Common – contains the libraries and utilities needed by the other Hadoop modules.
2) Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
3) Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications.
4) Hadoop MapReduce – a programming model for large-scale data processing.

In this tutorial we will install and configure Hadoop on Ubuntu (12.10/13.04/13.10). Follow the steps below:

Step:1 Update your machine

root@hadoop1:~# apt-get update

Install the python-software-properties package

root@hadoop1:~# apt-get install python-software-properties

Add the Oracle (Sun) Java repository

root@hadoop1:~# add-apt-repository ppa:webupd8team/java 
root@hadoop1:~# apt-get update && apt-get upgrade
root@hadoop1:~# apt-get install oracle-java6-installer

Check the installed Java version

root@hadoop1:~# java -version 
java version "1.6.0_45" 
Java(TM) SE Runtime Environment (build 1.6.0_45-b06) 
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
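
Because later steps depend on a specific Java release, the version check above can also be scripted. The following is a minimal sketch and not part of the original steps: the helper name java_major_version is hypothetical, and it assumes the version string looks either like the one shown above (1.6.0_45) or like the newer form without the leading "1." (for example 11.0.2).

```shell
# Hypothetical helper: map a Java version string to its major version.
java_major_version() {
    v="$1"
    case "$v" in
        1.*) v="${v#1.}"; echo "${v%%.*}" ;;   # legacy scheme: 1.6.0_45 -> 6
        *)   echo "${v%%[.-]*}" ;;             # newer scheme: 11.0.2 -> 11
    esac
}

# java -version writes to stderr, so redirect it before parsing.
if command -v java >/dev/null 2>&1; then
    reported=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    echo "Java major version: $(java_major_version "$reported")"
fi
```

If no java binary is on the PATH, the script prints nothing, which is itself a hint that the installation step did not complete.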

If more than one version of Java is installed (as listed by the update-java-alternatives command), run the following command to select the version to use:

root@hadoop1:~# update-alternatives --config java

There is only one alternative in link group java: /usr/lib/jvm/java-6-oracle/jre/bin/java
Nothing to configure.

Step:2 Add a group to the system

root@hadoop1:~#  addgroup hadoopgroup

Add a hadoop user to the group created earlier

root@hadoop1:~# adduser --ingroup hadoopgroup hadoopuser 
Adding user `hadoopuser' ... 
Adding new user `hadoopuser' (1001) with group `hadoopgroup' ... 
Creating home directory `/home/hadoopuser' ... 
Copying files from `/etc/skel' ... 
Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully 
Changing the user information for hadoopuser 
Enter the new value, or press ENTER for the default 
Full Name []: HADOOP USER 
Room Number []: 
Work Phone []: 
Home Phone []: 
Other []: 
Is the information correct? [Y/n] Y

Step:3 Create passwordless SSH authentication

root@hadoop1:~# su - hadoopuser
hadoopuser@hadoop1:~$ ssh-keygen -t rsa -P "" 
Generating public/private rsa key pair. 
Enter file in which to save the key (/home/hadoopuser/.ssh/id_rsa): 
Created directory '/home/hadoopuser/.ssh'. 
Your identification has been saved in /home/hadoopuser/.ssh/id_rsa. 
Your public key has been saved in /home/hadoopuser/.ssh/id_rsa.pub. 
The key fingerprint is: 
82:a0:cb:f4:fa:1f:ac:f5:29:54:34:e7:56:ee:b0:9f hadoopuser@hadoop1.example.com 
The key's randomart image is: 
+--[ RSA 2048]----+ 
|                 | 
|       o . .     | 
|  .   . + o      | 
| . . . . + .     | 
|..  . o S +      | 
|o.. .. . . .     | 
|.. ..+    . .    | 
|  . o.o .  E     | 
| ..o...o         | 
hadoopuser@hadoop1:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 
hadoopuser@hadoop1:~$ ssh localhost 
On the first login we will get the message below:
The authenticity of host 'localhost (' can't be established. 
ECDSA key fingerprint is ee:be:18:ef:e6:3d:e3:8d:8a:17:ca:d1:a3:d6:d6:49. 
Are you sure you want to continue connecting (yes/no)? yes 
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts. 
Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-23-generic x86_64)
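
If the login above still asks for a password, overly open permissions on the .ssh directory are one possible cause, since sshd may ignore authorized_keys in that case. The sketch below is an illustration rather than part of the original procedure: the function names secure_ssh_dir and check_passwordless_ssh are hypothetical, and the five-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: restrict an SSH directory and its authorized_keys file
# to the owning user.
secure_ssh_dir() {
    dir="$1"
    chmod 700 "$dir" || return 1
    if [ -f "$dir/authorized_keys" ]; then
        chmod 600 "$dir/authorized_keys" || return 1
    fi
}

# Hypothetical helper: non-interactive login test. BatchMode makes ssh fail
# instead of prompting for a password, so the exit status shows whether
# key-based login works.
check_passwordless_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true
}
```

As hadoopuser, calling secure_ssh_dir "$HOME/.ssh" followed by check_passwordless_ssh gives a quick yes/no answer once the host key has been accepted as shown above.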

Step:4 Disable IPv6

As root, append the following lines to the file /etc/sysctl.conf

root@hadoop1:~# vi /etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1 
Save & exit
root@hadoop1:~# sysctl -p 
net.ipv6.conf.all.disable_ipv6 = 1 
net.ipv6.conf.default.disable_ipv6 = 1 
net.ipv6.conf.lo.disable_ipv6 = 1

To verify that IPv6 has been disabled, run the following command (it should print 1)

root@hadoop1:~# cat /proc/sys/net/ipv6/conf/all/disable_ipv6 
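
The same check can be wrapped in a small script that prints a readable result. This is only a sketch; the helper name ipv6_state is hypothetical, and the script assumes the flag file holds a single 0 or 1.

```shell
# Hypothetical helper: translate the kernel flag into a readable state.
ipv6_state() {
    case "$1" in
        1) echo "disabled" ;;
        0) echo "enabled" ;;
        *) echo "unknown" ;;
    esac
}

flag_file=/proc/sys/net/ipv6/conf/all/disable_ipv6
if [ -r "$flag_file" ]; then
    echo "IPv6 is $(ipv6_state "$(cat "$flag_file")")"
else
    # The file is absent when the kernel has no IPv6 support loaded.
    echo "IPv6 is $(ipv6_state "")"
fi
```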

Step:5 Adding the hadoop repository

root@hadoop1:~# add-apt-repository ppa:hadoop-ubuntu/stable

Update and upgrade

root@hadoop1:~# apt-get update && apt-get upgrade

Step:6 Install Hadoop

root@hadoop1:~# apt-get install hadoop

Verify the hadoopuser details

root@hadoop1:~# id hadoopuser 
uid=1001(hadoopuser) gid=1002(hadoopgroup) groups=1002(hadoopgroup)
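
The membership shown by id can also be tested in a script, which is handy when the setup is repeated on several machines. A minimal sketch follows; the helper name group_list_contains is hypothetical, and it assumes group names contain no spaces.

```shell
# Hypothetical helper: succeed if group $2 appears in the space-separated
# group list $1 (the format printed by "id -nG").
group_list_contains() {
    # -x matches whole lines, -F treats the name as a fixed string.
    printf '%s\n' "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

# id -nG prints the names of all groups the user belongs to.
if group_list_contains "$(id -nG hadoopuser 2>/dev/null)" hadoopgroup; then
    echo "hadoopuser is a member of hadoopgroup"
else
    echo "hadoopuser is missing or not in hadoopgroup"
fi
```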

Add the hadoop user to the sudoers file so that it has root-level permissions

root@hadoop1:~# visudo
and add the following line:
hadoopuser ALL=(ALL:ALL) ALL

Set the environment in the hadoop user's .bashrc file as follows

root@hadoop1:~# vi /home/hadoopuser/.bashrc 
and add the following lines:
# Set Hadoop-related environment variables   
export HADOOP_HOME=/home/hadoopuser/hadoop  
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-oracle/ 
# Add Hadoop bin/ directory to PATH  
export PATH=$PATH:$HADOOP_HOME/bin  
export PATH=$PATH:/usr/lib/hadoop/bin/

# Set some aliases

unalias fs &> /dev/null   
alias fs="hadoop fs"    
unalias hls &> /dev/null  
alias hls="fs -ls"
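
Editing .bashrc by hand works, but rerunning a setup script that simply appends lines would duplicate them. One way to avoid that is sketched below; the helper name append_once is hypothetical, and the demonstration writes to a scratch file rather than the real .bashrc.

```shell
# Hypothetical helper: append a line to a file unless it is already present.
append_once() {
    file="$1"
    line="$2"
    touch "$file" || return 1
    # -x matches whole lines, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a scratch file: the second call changes nothing.
scratch=$(mktemp)
append_once "$scratch" 'export HADOOP_HOME=/home/hadoopuser/hadoop'
append_once "$scratch" 'export HADOOP_HOME=/home/hadoopuser/hadoop'
cat "$scratch"
```

Pointing the first argument at /home/hadoopuser/.bashrc applies the same logic to the real file.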

Step:7 Configure Hadoop

root@hadoop1:~# chown -R hadoopuser:hadoopgroup /var/log/hadoop/ 
root@hadoop1:~# chmod -R 755 /var/log/hadoop/ 
root@hadoop1:~# cd /usr/lib/hadoop/conf/ 
root@hadoop1:/usr/lib/hadoop/conf# ls -ltr 
total 76 
-rw-r--r-- 1 root hadoop  382 Mar 24  2012 taskcontroller.cfg 
-rw-r--r-- 1 root hadoop 1195 Mar 24  2012 ssl-server.xml.example 
-rw-r--r-- 1 root hadoop 1243 Mar 24  2012 ssl-client.xml.example 
-rw-r--r-- 1 root hadoop   10 Mar 24  2012 slaves 
-rw-r--r-- 1 root hadoop   10 Mar 24  2012 masters 
-rw-r--r-- 1 root hadoop  178 Mar 24  2012 mapred-site.xml 
-rw-r--r-- 1 root hadoop 2033 Mar 24  2012 mapred-queue-acls.xml 
-rw-r--r-- 1 root hadoop 4441 Mar 24  2012 log4j.properties 
-rw-r--r-- 1 root hadoop  178 Mar 24  2012 hdfs-site.xml 
-rw-r--r-- 1 root hadoop 4644 Mar 24  2012 hadoop-policy.xml 
-rw-r--r-- 1 root hadoop 1488 Mar 24  2012 hadoop-metrics2.properties 
-rw-r--r-- 1 root hadoop 2237 Mar 24  2012 hadoop-env.sh 
-rw-r--r-- 1 root hadoop  327 Mar 24  2012 fair-scheduler.xml 
-rw-r--r-- 1 root hadoop  178 Mar 24  2012 core-site.xml 
-rw-r--r-- 1 root hadoop  535 Mar 24  2012 configuration.xsl 
-rw-r--r-- 1 root hadoop 7457 Mar 24  2012 capacity-scheduler.xml 

Before we start using Hadoop, we need to modify several files in the conf folder.

In hadoop-env.sh, replace the JAVA_HOME line with the line below (the path matches the Java 6 installation from Step 1).
export JAVA_HOME=/usr/lib/jvm/java-6-oracle

Create a base directory for temporary files
root@hadoop1:/usr/lib/hadoop/conf# mkdir /home/hadoopuser/tmp 
root@hadoop1:/usr/lib/hadoop/conf# chown hadoopuser:hadoopgroup /home/hadoopuser/tmp/ 
root@hadoop1:/usr/lib/hadoop/conf# chmod 755 /home/hadoopuser/tmp/ 

Replace the original contents of core-site.xml, mapred-site.xml and hdfs-site.xml so that they read as follows

root@hadoop1:/usr/lib/hadoop/conf# cat core-site.xml 
<?xml version="1.0"?> 
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 
<!-- Put site-specific property overrides in this file. --> 
<description>A base for other temporary directories.</description> 
<description>The name of the default file system.  A URI whose  scheme and authority determine the FileSystem implementation.  The  uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class.  The uri's authority is used to 
determine the host, port, etc. for a filesystem.</description> 
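
The two descriptions above belong to the hadoop.tmp.dir and fs.default.name properties. As a sketch of what a complete file could look like, the script below generates one from a template. The helper name write_core_site is hypothetical, the temporary directory is the one created earlier, and the URI hdfs://localhost:54310 is an assumption (any free port would do). The file is written to a scratch directory so that it can be reviewed before being copied into the conf folder.

```shell
# Hypothetical helper: write a minimal core-site.xml into the given directory.
write_core_site() {
    conf_dir="$1"
    tmp_dir="$2"    # base for temporary directories (hadoop.tmp.dir)
    fs_uri="$3"     # default file system URI (fs.default.name)
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$tmp_dir</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>$fs_uri</value>
    <description>The name of the default file system.</description>
  </property>
</configuration>
EOF
}

# Write into a scratch directory first; port 54310 is an assumed value.
out_dir=$(mktemp -d)
write_core_site "$out_dir" /home/hadoopuser/tmp hdfs://localhost:54310
cat "$out_dir/core-site.xml"
```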


root@hadoop1:/usr/lib/hadoop/conf# cat mapred-site.xml
<?xml version="1.0"?> 
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 
<!-- Put site-specific property overrides in this file. --> 
<description>The host and port that the MapReduce job tracker runs 
at.  If "local", then jobs are run in-process as a single map 
and reduce task.</description>


root@hadoop1:/usr/lib/hadoop/conf# cat hdfs-site.xml 
<?xml version="1.0"?> 
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 
<!-- Put site-specific property overrides in this file. --> 
<description>Default block replication. 
The actual number of replications can be specified when the file is created. 
The default is used if replication is not specified in create time.</description>
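
The descriptions in mapred-site.xml and hdfs-site.xml belong to the mapred.job.tracker and dfs.replication properties. The sketch below shows one way such single-property files could be generated. The helper name write_single_property is hypothetical, and both values are assumptions: localhost:54311 for the JobTracker address, and a replication factor of 1 because a single-node cluster has only one DataNode to hold each block.

```shell
# Hypothetical helper: write a Hadoop site file containing a single property.
# Arguments: target file, property name, property value.
write_single_property() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Assumed values: JobTracker on localhost:54311, one replica per block.
site_dir=$(mktemp -d)
write_single_property "$site_dir/mapred-site.xml" mapred.job.tracker localhost:54311
write_single_property "$site_dir/hdfs-site.xml" dfs.replication 1
cat "$site_dir/mapred-site.xml" "$site_dir/hdfs-site.xml"
```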

root@hadoop1:/usr/lib# chown -R hadoopuser:hadoopgroup /usr/lib/hadoop/ 
root@hadoop1:/usr/lib# ls -ltr hadoop/ 
total 16 
lrwxrwxrwx 1 hadoopuser hadoopgroup   15 Apr 24  2012 pids -> /var/run/hadoop 
lrwxrwxrwx 1 hadoopuser hadoopgroup   15 Apr 24  2012 logs -> /var/log/hadoop 
lrwxrwxrwx 1 hadoopuser hadoopgroup   41 Apr 24  2012 hadoop-tools-1.0.2.jar -> ../../share/hadoop/hadoop-tools-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   40 Apr 24  2012 hadoop-test-1.0.2.jar -> ../../share/hadoop/hadoop-test-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   44 Apr 24  2012 hadoop-examples-1.0.2.jar -> ../../share/hadoop/hadoop-examples-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   40 Apr 24  2012 hadoop-core.jar -> ../../share/hadoop/hadoop-core-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   40 Apr 24  2012 hadoop-core-1.0.2.jar -> ../../share/hadoop/hadoop-core-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   39 Apr 24  2012 hadoop-ant-1.0.2.jar -> ../../share/hadoop/hadoop-ant-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   26 Apr 24  2012 contrib -> ../../share/hadoop/contrib 
lrwxrwxrwx 1 hadoopuser hadoopgroup   16 Apr 24  2012 conf -> /etc/hadoop/conf 
drwxr-xr-x 9 hadoopuser hadoopgroup 4096 Dec 15 05:16 webapps 
drwxr-xr-x 2 hadoopuser hadoopgroup 4096 Dec 15 05:16 libexec 
drwxr-xr-x 2 hadoopuser hadoopgroup 4096 Dec 15 05:16 bin 
drwxr-xr-x 3 hadoopuser hadoopgroup 4096 Dec 15 05:16 lib 
root@hadoop1:/etc/hadoop/conf# chown -R hadoopuser:hadoopgroup /etc/hadoop/ 
root@hadoop1:~# su - hadoopuser 
hadoopuser@hadoop1:~$ mkdir hadoop

Commands To Manage Hadoop Services:

  • start-dfs.sh – Starts the Hadoop DFS daemons, the namenode and datanodes. Use this before start-mapred.sh
  • stop-dfs.sh – Stops the Hadoop DFS daemons.
  • start-mapred.sh – Starts the Hadoop Map/Reduce daemons, the jobtracker and tasktrackers.
  • stop-mapred.sh – Stops the Hadoop Map/Reduce daemons.
  • start-all.sh – Starts all Hadoop daemons, the namenode, datanodes, the jobtracker and tasktrackers. Deprecated; use start-dfs.sh then start-mapred.sh
  • stop-all.sh – Stops all Hadoop daemons. Deprecated; use stop-mapred.sh then stop-dfs.sh
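
The ordering described in the list above (DFS before Map/Reduce when starting, the reverse when stopping) can be captured in a small wrapper. This is a sketch only: the names service_order and hadoop_ctl are hypothetical, and the wrapper assumes the Hadoop scripts are on the PATH.

```shell
# Hypothetical helper: print the control scripts in the order they should run.
service_order() {
    case "$1" in
        start) printf '%s\n' start-dfs.sh start-mapred.sh ;;
        stop)  printf '%s\n' stop-mapred.sh stop-dfs.sh ;;
        *)     echo "usage: service_order start|stop" >&2; return 1 ;;
    esac
}

# Hypothetical wrapper: run each script in order, stopping at the first failure.
hadoop_ctl() {
    scripts=$(service_order "$1") || return 1
    for script in $scripts; do
        "$script" || return 1
    done
}
```

Running hadoop_ctl start as hadoopuser would then take the place of the deprecated start-all.sh, and hadoop_ctl stop the place of stop-all.sh.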

Format the Hadoop file system

root@hadoop1:/usr/lib/hadoop/conf# su - hadoopuser 
hadoopuser@hadoop1:~$ hadoop namenode -format 
13/12/15 19:53:16 INFO namenode.NameNode: STARTUP_MSG: 
STARTUP_MSG: Starting NameNode 
STARTUP_MSG:   host = java.net.UnknownHostException: hadoop1.example.com: hadoop1.example.com 
STARTUP_MSG:   args = [-format] 
STARTUP_MSG:   version = 1.0.2 
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.2 -r 1304954; compiled by 'hortonfo' on Sat Mar 24 23:58:21 UTC 2012 
13/12/15 19:53:17 INFO util.GSet: VM type       = 64-bit 
13/12/15 19:53:17 INFO util.GSet: 2% max memory = 19.33375 MB 
13/12/15 19:53:17 INFO util.GSet: capacity      = 2^21 = 2097152 entries 
13/12/15 19:53:17 INFO util.GSet: recommended=2097152, actual=2097152 
13/12/15 19:53:37 INFO namenode.FSNamesystem: fsOwner=hadoopuser 
13/12/15 19:53:37 INFO namenode.FSNamesystem: supergroup=supergroup 
13/12/15 19:53:37 INFO namenode.FSNamesystem: isPermissionEnabled=true 
13/12/15 19:53:37 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100 
13/12/15 19:53:37 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s) 
13/12/15 19:53:37 INFO namenode.NameNode: Caching file names occuring more than 10 times 
13/12/15 19:53:58 INFO common.Storage: Image file of size 116 saved in 0 seconds. 
13/12/15 19:53:58 INFO common.Storage: Storage directory /home/hadoopuser/tmp/dfs/name has been successfully formatted. 
13/12/15 19:53:58 INFO namenode.NameNode: SHUTDOWN_MSG: 
SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: hadoop1.example.com: hadoop1.example.com 
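
The java.net.UnknownHostException in the output above suggests that the host name hadoop1.example.com does not resolve on this machine. One possible check is sketched below, under the assumption that names are resolved through /etc/hosts rather than DNS; the helper name hosts_file_has is hypothetical.

```shell
# Hypothetical helper: succeed if a hosts-format file maps some address
# to the given name. Arguments: hosts file, host name.
hosts_file_has() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                                  # drop comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# uname -n prints the node name of this machine.
if hosts_file_has /etc/hosts "$(uname -n)"; then
    echo "$(uname -n) is listed in /etc/hosts"
else
    echo "$(uname -n) is not listed in /etc/hosts"
fi
```

If the name is not listed, adding a line that maps the machine's address to its host name in /etc/hosts is the usual remedy.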

Start the Hadoop cluster using start-all.sh

hadoopuser@hadoop1:~$ start-all.sh 
Warning: $HADOOP_HOME is deprecated. 
starting namenode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-namenode-hadoop1.example.com.out 
localhost: starting datanode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-datanode-hadoop1.example.com.out 
localhost: starting secondarynamenode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-secondarynamenode-hadoop1.example.com.out 
starting jobtracker, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-jobtracker-hadoop1.example.com.out 
localhost: starting tasktracker, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-tasktracker-hadoop1.example.com.out 

or run start-dfs.sh followed by start-mapred.sh.


To check whether Hadoop is running, use the command below

hadoopuser@hadoop1:~$ jps 
35401 NameNode 
35710 JobTracker 
35627 SecondaryNameNode 
35928 Jps
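
The start-all.sh output earlier lists five daemons (namenode, datanode, secondarynamenode, jobtracker and tasktracker), so the jps listing can be compared against that set. The following sketch does this; the helper name missing_daemons is hypothetical, and it assumes jps prints one "PID Name" pair per line as shown above.

```shell
# Hypothetical helper: read jps output on stdin and print every expected
# daemon that does not appear in it.
missing_daemons() {
    out=$(cat)
    for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Compare against the whole second field so that NameNode does not
        # match SecondaryNameNode.
        printf '%s\n' "$out" |
            awk -v d="$d" '$2 == d { ok = 1 } END { exit (ok ? 0 : 1) }' ||
            echo "$d"
    done
}

if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```

Any name printed by the script is a daemon worth looking up in the log files under /var/log/hadoop.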