Overview:
Hadoop is a framework that enables distributed processing of large data sets across clusters ranging from a single server to thousands of servers, while ensuring a high degree of fault tolerance.
Hadoop consists of the following basic modules:
1) Hadoop Common – libraries and utilities needed by the other Hadoop modules.
2) Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
3) Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and scheduling users' applications on them.
4) Hadoop MapReduce – a programming model for large-scale data processing (illustrated just below).
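To make the MapReduce model concrete, here is how the classic word-count example would be run once the installation below is complete. This is only a sketch: the HDFS paths are hypothetical, and the examples jar is the one that ships with the Hadoop 1.0.2 package listed in Step 7.

hadoopuser@hadoop1:~$ hadoop fs -put /var/log/syslog input            # copy a sample file into HDFS (hypothetical input)
hadoopuser@hadoop1:~$ hadoop jar /usr/lib/hadoop/hadoop-examples-1.0.2.jar wordcount input output
hadoopuser@hadoop1:~$ hadoop fs -cat output/part-r-00000              # each line: word<TAB>count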
In this tutorial we will install and configure Hadoop on Ubuntu (12.10/13.04/13.10). Follow the steps below.
Step:1 Update your machine
root@hadoop1:~# apt-get update
Install the python-software-properties package
root@hadoop1:~# apt-get install python-software-properties
Add the Oracle (Sun) Java repository
root@hadoop1:~# add-apt-repository ppa:webupd8team/java
root@hadoop1:~# apt-get update && apt-get upgrade
root@hadoop1:~# apt-get install oracle-java6-installer
Check the installed Java version
root@hadoop1:~# java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
If more than one Java version is installed (as shown by the update-java-alternatives command), run the following command to select the desired version:
root@hadoop1:~# update-alternatives --config java
There is only one alternative in link group java: /usr/lib/jvm/java-6-oracle/jre/bin/java
Nothing to configure.
Step:2 Add a group to the system
root@hadoop1:~# addgroup hadoopgroup
Add a hadoop user to the newly created group
root@hadoop1:~# adduser --ingroup hadoopgroup hadoopuser
Adding user `hadoopuser' ...
Adding new user `hadoopuser' (1001) with group `hadoopgroup' ...
Creating home directory `/home/hadoopuser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hadoopuser
Enter the new value, or press ENTER for the default
        Full Name []: HADOOP USER
        Room Number []:
        Work Phone []:
        Home Phone []:
        Other []:
Is the information correct? [Y/n] Y
Step:3 Create passwordless SSH authentication
root@hadoop1:~# su - hadoopuser
hadoopuser@hadoop1:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoopuser/.ssh/id_rsa):
Created directory '/home/hadoopuser/.ssh'.
Your identification has been saved in /home/hadoopuser/.ssh/id_rsa.
Your public key has been saved in /home/hadoopuser/.ssh/id_rsa.pub.
The key fingerprint is:
82:a0:cb:f4:fa:1f:ac:f5:29:54:34:e7:56:ee:b0:9f hadoopuser@hadoop1.example.com
The key's randomart image is:
+--[ RSA 2048]----+
| |
| o . . |
| . . + o |
| . . . . + . |
|.. . o S + |
|o.. .. . . . |
|.. ..+ . . |
| . o.o . E |
| ..o...o |
+-----------------+
hadoopuser@hadoop1:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hadoopuser@hadoop1:~$ ssh localhost
On the first login we will get a message like the following:
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is ee:be:18:ef:e6:3d:e3:8d:8a:17:ca:d1:a3:d6:d6:49.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-23-generic x86_64)
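As an alternative to appending the public key by hand with cat, the ssh-copy-id helper (shipped with the openssh-client package on Ubuntu) achieves the same result:

hadoopuser@hadoop1:~$ ssh-copy-id hadoopuser@localhost    # appends ~/.ssh/id_rsa.pub to authorized_keys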
Step:4 Disable IPv6
As root, edit /etc/sysctl.conf and append the following lines:
root@hadoop1:~# vi /etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Save and exit, then apply the settings:
root@hadoop1:~# sysctl -p
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
To verify that IPv6 has been disabled (a value of 1 means disabled):
root@hadoop1:~# cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
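Alternatively, instead of disabling IPv6 system-wide, you can make only Hadoop prefer IPv4 by adding the line below to conf/hadoop-env.sh. This is an optional alternative, not required if you applied the sysctl settings above.

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true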
Step:5 Add the Hadoop repository
root@hadoop1:~# add-apt-repository ppa:hadoop-ubuntu/stable
Update and upgrade
root@hadoop1:~# apt-get update && apt-get upgrade
Step:6 Install Hadoop
root@hadoop1:~# apt-get install hadoop
Verify the hadoopuser details
root@hadoop1:~# id hadoopuser
uid=1001(hadoopuser) gid=1002(hadoopgroup) groups=1002(hadoopgroup)
Add the hadoop user to the sudoers file so that it has root-level permissions
root@hadoop1:~# visudo
and add the following line:
hadoopuser ALL=(ALL:ALL) ALL
Set the environment in the hadoop user's .bashrc file as follows
root@hadoop1:~# vi /home/hadoopuser/.bashrc

and add the following lines:

# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hadoopuser/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-oracle/
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:/usr/lib/hadoop/bin/
# Set some convenient aliases
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
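After reloading the file, the aliases work as shorthand for the hadoop fs commands; a quick hypothetical check once HDFS is up:

hadoopuser@hadoop1:~$ source ~/.bashrc
hadoopuser@hadoop1:~$ fs -ls /     # expands to: hadoop fs -ls /
hadoopuser@hadoop1:~$ hls /        # same listing via the hls alias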
Step:7 Configure Hadoop
root@hadoop1:~# chown -R hadoopuser:hadoopgroup /var/log/hadoop/
root@hadoop1:~# chmod -R 755 /var/log/hadoop/
root@hadoop1:~# cd /usr/lib/hadoop/conf/
root@hadoop1:/usr/lib/hadoop/conf# ls -ltr
total 76
-rw-r--r-- 1 root hadoop  382 Mar 24  2012 taskcontroller.cfg
-rw-r--r-- 1 root hadoop 1195 Mar 24  2012 ssl-server.xml.example
-rw-r--r-- 1 root hadoop 1243 Mar 24  2012 ssl-client.xml.example
-rw-r--r-- 1 root hadoop   10 Mar 24  2012 slaves
-rw-r--r-- 1 root hadoop   10 Mar 24  2012 masters
-rw-r--r-- 1 root hadoop  178 Mar 24  2012 mapred-site.xml
-rw-r--r-- 1 root hadoop 2033 Mar 24  2012 mapred-queue-acls.xml
-rw-r--r-- 1 root hadoop 4441 Mar 24  2012 log4j.properties
-rw-r--r-- 1 root hadoop  178 Mar 24  2012 hdfs-site.xml
-rw-r--r-- 1 root hadoop 4644 Mar 24  2012 hadoop-policy.xml
-rw-r--r-- 1 root hadoop 1488 Mar 24  2012 hadoop-metrics2.properties
-rw-r--r-- 1 root hadoop 2237 Mar 24  2012 hadoop-env.sh
-rw-r--r-- 1 root hadoop  327 Mar 24  2012 fair-scheduler.xml
-rw-r--r-- 1 root hadoop  178 Mar 24  2012 core-site.xml
-rw-r--r-- 1 root hadoop  535 Mar 24  2012 configuration.xsl
-rw-r--r-- 1 root hadoop 7457 Mar 24  2012 capacity-scheduler.xml
Before we can start using Hadoop, we need to modify several files in this conf folder.
hadoop-env.sh
Replace the JAVA_HOME line with the line below, pointing at the Java 6 installation from Step 1:
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
core-site.xml

Create a base directory for temporary files:

root@hadoop1:/usr/lib/hadoop/conf# mkdir /home/hadoopuser/tmp
root@hadoop1:/usr/lib/hadoop/conf# chown hadoopuser:hadoopgroup /home/hadoopuser/tmp/
root@hadoop1:/usr/lib/hadoop/conf# chmod 755 /home/hadoopuser/tmp/
Then replace the original contents of core-site.xml with the following:
root@hadoop1:/usr/lib/hadoop/conf# cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoopuser/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme
    and authority determine the FileSystem implementation. The uri's scheme
    determines the config property (fs.SCHEME.impl) naming the FileSystem
    implementation class. The uri's authority is used to determine the
    host, port, etc. for a filesystem.</description>
  </property>
</configuration>
mapred-site.xml

root@hadoop1:/usr/lib/hadoop/conf# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce task.
    </description>
  </property>
</configuration>
hdfs-site.xml

root@hadoop1:/usr/lib/hadoop/conf# cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications
    can be specified when the file is created. The default is used if
    replication is not specified in create time.
    </description>
  </property>
</configuration>
Change the ownership of the Hadoop directories to the hadoop user:

root@hadoop1:/usr/lib# chown -R hadoopuser:hadoopgroup /usr/lib/hadoop/
root@hadoop1:/usr/lib# ls -ltr hadoop/
total 16
lrwxrwxrwx 1 hadoopuser hadoopgroup   15 Apr 24  2012 pids -> /var/run/hadoop
lrwxrwxrwx 1 hadoopuser hadoopgroup   15 Apr 24  2012 logs -> /var/log/hadoop
lrwxrwxrwx 1 hadoopuser hadoopgroup   41 Apr 24  2012 hadoop-tools-1.0.2.jar -> ../../share/hadoop/hadoop-tools-1.0.2.jar
lrwxrwxrwx 1 hadoopuser hadoopgroup   40 Apr 24  2012 hadoop-test-1.0.2.jar -> ../../share/hadoop/hadoop-test-1.0.2.jar
lrwxrwxrwx 1 hadoopuser hadoopgroup   44 Apr 24  2012 hadoop-examples-1.0.2.jar -> ../../share/hadoop/hadoop-examples-1.0.2.jar
lrwxrwxrwx 1 hadoopuser hadoopgroup   40 Apr 24  2012 hadoop-core.jar -> ../../share/hadoop/hadoop-core-1.0.2.jar
lrwxrwxrwx 1 hadoopuser hadoopgroup   40 Apr 24  2012 hadoop-core-1.0.2.jar -> ../../share/hadoop/hadoop-core-1.0.2.jar
lrwxrwxrwx 1 hadoopuser hadoopgroup   39 Apr 24  2012 hadoop-ant-1.0.2.jar -> ../../share/hadoop/hadoop-ant-1.0.2.jar
lrwxrwxrwx 1 hadoopuser hadoopgroup   26 Apr 24  2012 contrib -> ../../share/hadoop/contrib
lrwxrwxrwx 1 hadoopuser hadoopgroup   16 Apr 24  2012 conf -> /etc/hadoop/conf
drwxr-xr-x 9 hadoopuser hadoopgroup 4096 Dec 15 05:16 webapps
drwxr-xr-x 2 hadoopuser hadoopgroup 4096 Dec 15 05:16 libexec
drwxr-xr-x 2 hadoopuser hadoopgroup 4096 Dec 15 05:16 bin
drwxr-xr-x 3 hadoopuser hadoopgroup 4096 Dec 15 05:16 lib
root@hadoop1:/etc/hadoop/conf# chown -R hadoopuser:hadoopgroup /etc/hadoop/

Then, as the hadoop user, create the HADOOP_HOME directory referenced in .bashrc:

root@hadoop1:~# su - hadoopuser
hadoopuser@hadoop1:~$ mkdir hadoop
Commands To Manage Hadoop Services:
- start-dfs.sh – Starts the Hadoop DFS daemons, the namenode and datanodes. Use this before start-mapred.sh
- stop-dfs.sh – Stops the Hadoop DFS daemons.
- start-mapred.sh – Starts the Hadoop Map/Reduce daemons, the jobtracker and tasktrackers.
- stop-mapred.sh – Stops the Hadoop Map/Reduce daemons.
- start-all.sh – Starts all Hadoop daemons, the namenode, datanodes, the jobtracker and tasktrackers. Deprecated; use start-dfs.sh then start-mapred.sh
- stop-all.sh – Stops all Hadoop daemons. Deprecated; use stop-mapred.sh then stop-dfs.sh
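Putting the non-deprecated sequence together, a typical start/stop cycle on this single-node setup looks like this:

hadoopuser@hadoop1:~$ start-dfs.sh      # bring up HDFS first (namenode, datanodes)
hadoopuser@hadoop1:~$ start-mapred.sh   # then the jobtracker and tasktrackers
... run your jobs ...
hadoopuser@hadoop1:~$ stop-mapred.sh    # stop Map/Reduce first
hadoopuser@hadoop1:~$ stop-dfs.sh       # then HDFS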
Format the Hadoop file system (do this only once; reformatting destroys all data in HDFS)
root@hadoop1:/usr/lib/hadoop/conf# su - hadoopuser
hadoopuser@hadoop1:~$ hadoop namenode -format
13/12/15 19:53:16 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = java.net.UnknownHostException: hadoop1.example.com: hadoop1.example.com
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.2 -r 1304954; compiled by 'hortonfo' on Sat Mar 24 23:58:21 UTC 2012
************************************************************/
13/12/15 19:53:17 INFO util.GSet: VM type       = 64-bit
13/12/15 19:53:17 INFO util.GSet: 2% max memory = 19.33375 MB
13/12/15 19:53:17 INFO util.GSet: capacity      = 2^21 = 2097152 entries
13/12/15 19:53:17 INFO util.GSet: recommended=2097152, actual=2097152
13/12/15 19:53:37 INFO namenode.FSNamesystem: fsOwner=hadoopuser
13/12/15 19:53:37 INFO namenode.FSNamesystem: supergroup=supergroup
13/12/15 19:53:37 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/12/15 19:53:37 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/12/15 19:53:37 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/12/15 19:53:37 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/12/15 19:53:58 INFO common.Storage: Image file of size 116 saved in 0 seconds.
13/12/15 19:53:58 INFO common.Storage: Storage directory /home/hadoopuser/tmp/dfs/name has been successfully formatted.
13/12/15 19:53:58 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: hadoop1.example.com: hadoop1.example.com
************************************************************/
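Note the java.net.UnknownHostException for hadoop1.example.com in the STARTUP_MSG above. The format still succeeds, but it is worth mapping the hostname in /etc/hosts before starting the daemons; a minimal sketch (the IP address shown is an assumption, substitute your machine's own):

root@hadoop1:~# echo "192.168.1.10  hadoop1.example.com  hadoop1" >> /etc/hosts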
Starting the Hadoop Cluster using start-all.sh
hadoopuser@hadoop1:~$ start-all.sh
Warning: $HADOOP_HOME is deprecated.
starting namenode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-namenode-hadoop1.example.com.out
localhost: starting datanode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-datanode-hadoop1.example.com.out
localhost: starting secondarynamenode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-secondarynamenode-hadoop1.example.com.out
starting jobtracker, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-jobtracker-hadoop1.example.com.out
localhost: starting tasktracker, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-tasktracker-hadoop1.example.com.out
or run the two scripts separately:

start-dfs.sh
start-mapred.sh
To check whether Hadoop is running, use the jps command:
hadoopuser@hadoop1:~$ jps
35401 NameNode
35710 JobTracker
35627 SecondaryNameNode
35928 Jps
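On a fully healthy single-node setup, jps should also list DataNode and TaskTracker; if either is missing, check its log under /var/log/hadoop. As a final smoke test, you can run one of the bundled example jobs, for instance the Pi estimator from the examples jar listed in Step 7 (the arguments, number of maps and samples per map, are illustrative):

hadoopuser@hadoop1:~$ hadoop jar /usr/lib/hadoop/hadoop-examples-1.0.2.jar pi 2 10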