Sunday, October 4, 2015

How to Set Up Hadoop on Ubuntu

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. 
This article walks you through installing and configuring a single-node Hadoop cluster with Hadoop 1.2.1, step by step.
Step 1. Install Java

First, check whether Java is already installed:
$ java -version

java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
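The output above is from Oracle JDK 1.7.0_75 installed under /opt/jdk1.7.0_75, the path assumed in Step 5.4. If Java is not installed yet, one option on Ubuntu is the OpenJDK 7 package (note that its JAVA_HOME path will differ from the Oracle layout used later):
$ sudo apt-get update
$ sudo apt-get install openjdk-7-jdk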
Step 2. Create User Account
Now create a system user account for the Hadoop installation; this guide uses the name "hadoop". The -m flag creates a home directory, which the ssh setup in Step 3 requires.
$ sudo useradd -m hadoop
$ sudo passwd hadoop
Step 3. Configuring Key Based Login
The hadoop user must be able to ssh to itself without a password. The following commands enable key-based login for the hadoop user.
$ sudo su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
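You can confirm that key-based login works before leaving the hadoop shell; the first connection may ask you to accept the host key, and the exit here closes the ssh session you just opened:
$ ssh localhost
$ exit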
$ exit
Step 4. Download and Extract Hadoop Source
Download Hadoop 1.2.1 (the latest 1.x release at the time of writing) from an Apache mirror, then follow the steps below.
$ sudo mkdir /opt/hadoop
$ cd /opt/hadoop/
$ sudo wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
$ sudo tar -xzf hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 hadoop
$ sudo chown -R hadoop /opt/hadoop
$ cd /opt/hadoop/hadoop/
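Optionally, as the hadoop user, you can add Hadoop's bin directory to your PATH so commands can be run without the bin/ prefix (a convenience only; the rest of this guide keeps the bin/ prefix):
$ echo 'export HADOOP_HOME=/opt/hadoop/hadoop' >> ~/.bashrc
$ echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.bashrc
$ source ~/.bashrc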
Step 5: Configure Hadoop
First edit hadoop configuration files and make following changes.
5.1 Edit core-site.xml
$ sudo vim conf/core-site.xml
# Add the following inside the configuration tag
<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000/</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
Setting dfs.permissions to false disables HDFS permission checks, which is convenient for a single-node test setup but should not be used in production.
5.2 Edit hdfs-site.xml
$ sudo vim conf/hdfs-site.xml
# Add the following inside the configuration tag
<property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/hadoop/dfs/name/data</value>
    <final>true</final>
</property>
<property>
    <name>dfs.name.dir</name>
    <value>/opt/hadoop/hadoop/dfs/name</value>
    <final>true</final>
</property>
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
Note that a replication factor of 2 cannot actually be satisfied on a single-node cluster with only one DataNode; a value of 1 is sufficient here.

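The dfs.name.dir location will be created when the NameNode is formatted below, but you can pre-create the directories and make sure the hadoop user owns them (a precaution, not strictly required):
$ sudo mkdir -p /opt/hadoop/hadoop/dfs/name/data
$ sudo chown -R hadoop /opt/hadoop/hadoop/dfs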
5.3 Edit mapred-site.xml
$ sudo vim conf/mapred-site.xml
# Add the following inside the configuration tag
<property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
</property>
5.4 Edit hadoop-env.sh
$ sudo vim conf/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.7.0_75
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
Set JAVA_HOME to the Java installation path on your system.
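If you are unsure where Java is installed, one way to locate it (assuming java is on your PATH) is:
$ readlink -f $(which java)
JAVA_HOME is the directory above bin/java in that output; for an Oracle JDK layout, also drop a trailing /jre.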
Next, format the NameNode:
$ sudo su - hadoop
$ cd /opt/hadoop/hadoop
$ bin/hadoop namenode -format
13/06/02 22:53:48 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = srv1.tecadmin.net/192.168.1.90
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473; compiled by 'hortonfo' on Mon May  6 06:59:37 UTC 2013
STARTUP_MSG:   java = 1.7.0_75
************************************************************/
13/06/02 22:53:48 INFO util.GSet: Computing capacity for map BlocksMap
13/06/02 22:53:48 INFO util.GSet: VM type       = 32-bit
13/06/02 22:53:48 INFO util.GSet: 2.0% max memory = 1013645312
13/06/02 22:53:48 INFO util.GSet: capacity      = 2^22 = 4194304 entries
13/06/02 22:53:48 INFO util.GSet: recommended=4194304, actual=4194304
13/06/02 22:53:49 INFO namenode.FSNamesystem: fsOwner=hadoop
13/06/02 22:53:49 INFO namenode.FSNamesystem: supergroup=supergroup
13/06/02 22:53:49 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/06/02 22:53:49 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/06/02 22:53:49 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/06/02 22:53:49 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
13/06/02 22:53:49 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/06/02 22:53:49 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/06/02 22:53:49 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/opt/hadoop/hadoop/dfs/name/current/edits
13/06/02 22:53:49 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/opt/hadoop/hadoop/dfs/name/current/edits
13/06/02 22:53:49 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name has been successfully formatted.
13/06/02 22:53:49 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at srv1.tecadmin.net/192.168.1.90
************************************************************/
Step 6: Start Hadoop Services
Use the following command to start all hadoop services.
$ bin/start-all.sh
[sample output]
starting namenode, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-namenode-ns1.tecadmin.net.out
localhost: starting datanode, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-datanode-ns1.tecadmin.net.out
localhost: starting secondarynamenode, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-ns1.tecadmin.net.out
starting jobtracker, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-ns1.tecadmin.net.out
localhost: starting tasktracker, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-ns1.tecadmin.net.out

Step 7: Test and Access Hadoop Services
Use the jps command to check whether all services started successfully.
$ jps
or
$ $JAVA_HOME/bin/jps
26049 SecondaryNameNode
25929 DataNode
26399 Jps
26129 JobTracker
26249 TaskTracker
25807 NameNode
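If any of these daemons is missing, check its log file under the logs directory; for example, for the NameNode (the exact filename depends on your username and hostname):
$ tail -n 50 /opt/hadoop/hadoop/logs/hadoop-hadoop-namenode-*.log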
Web Access URLs for Services
  http://srv1.tecadmin.net:50030/   for the Jobtracker
  http://srv1.tecadmin.net:50070/   for the Namenode
  http://srv1.tecadmin.net:50060/   for the Tasktracker
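You can also run a quick smoke test against HDFS to confirm the cluster accepts filesystem operations (the /test directory name here is arbitrary):
$ bin/hadoop fs -mkdir /test
$ bin/hadoop fs -ls /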
Step 8: Stop Hadoop Services
If you no longer need Hadoop, stop all of its services with the following command.
$ bin/stop-all.sh
