Monday, January 2, 2012

Large-Scale and Big Data Analysis with a Hadoop Cluster (Hadoop Cluster Setup on Ubuntu 11.10 Server)

Welcome back, guys. Last time, when I posted about OpenStack and DevStack, I promised a Hadoop cluster setup on Ubuntu Server (64-bit).
Here it finally is: an install-and-run tutorial, again from scratch.

While everybody was celebrating the New Year, I was setting up my Hadoop cluster and processing unstructured data with Hadoop.

Welcome, administrators, developers, students, and especially hobbyists and passionate people interested in learning about the cloud.


The purpose of this document is to help you get a single-node Hadoop installation up and running very quickly so that you can get a flavour of the Hadoop Distributed File System (see HDFS Architecture) and the Map/Reduce framework; that is, perform simple operations on HDFS and run example jobs.

Our prerequisites for this setup:

1) You need Ubuntu Server (64-bit) or Debian (64-bit). {I am using Ubuntu Server for this tutorial.}
2) You must have ssh (OpenSSH) and rsync installed (run these as root):
     $ sudo apt-get install ssh
     $ sudo apt-get install rsync
3) Java(TM) 1.6.x, preferably from Sun, must be installed.

    Installing Java
    $ sudo add-apt-repository "deb lucid partner"
    $ sudo apt-get update 
    $ sudo apt-get install sun-java6-jdk sun-java6-plugin
    (After the download, the installer opens a screen with the Java terms and conditions; press Tab to select OK and accept them.)

Check that Sun Java is there :)

root@ruhil:~# sudo apt-get install sun-java6-jdk sun-java6-plugin 
Reading package lists... Done
Building dependency tree      
Reading state information... Done
sun-java6-plugin is already the newest version.
sun-java6-jdk is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.
root@ruhil:~# java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
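
If you want to check all three prerequisites in one go, here is a minimal sketch. The function name missing_tools is my own invention for illustration; it only looks for the commands on your PATH and does not check versions.

```shell
# Print the name of every command from the argument list that is not on PATH.
# Returns 0 when nothing is missing, 1 otherwise.
# (missing_tools is a hypothetical helper name, not part of Hadoop or Ubuntu.)
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: run `missing_tools ssh rsync java`. No output means all three commands were found.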

Add a Dedicated User 

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

Configure SSH and check that it works for localhost

(Just press Enter at the prompt; you do not need to specify a file name for the key.)
root@ruhil:~# su - hduser
hduser@ruhil:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ruhil
The key's randomart image is:

After that, do this step carefully:

hduser@ruhil:~$ cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

hduser@ruhil:~$ ssh localhost
The authenticity of host 'localhost (' can't be established.
ECDSA key fingerprint is b8:be:26:41:44:7d:9b:82:02:fd:13:61:3c:ac:d4:0a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ruhil 3.0.0-14-server #23-Ubuntu SMP Mon Nov 21 20:49:05 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
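
The key-authorization step is easy to get wrong if you run it twice, because the same key gets appended again. A minimal sketch of a repeatable version follows. The helper name authorize_key is hypothetical, and the public key file name id_rsa.pub in the usage line is an assumption (it is the name ssh-keygen normally gives the public half of id_rsa).

```shell
# Append a public key to an authorized_keys file unless it is already there.
# $1: public key file, $2: authorized_keys file.
# (authorize_key is a hypothetical helper; both paths are supplied by the caller.)
authorize_key() {
    pub=$1
    auth=$2
    [ -r "$pub" ] || { echo "cannot read $pub" >&2; return 1; }
    mkdir -p "$(dirname "$auth")" || return 1
    touch "$auth" || return 1
    # -x: match whole lines, -F: fixed strings, -f: take the patterns from the key file
    if ! grep -qxFf "$pub" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    # Keep the file private; sshd can reject key files that others may write to.
    chmod 600 "$auth"
}
```

Usage, assuming the default key name: `authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"`. Running it a second time leaves the file unchanged.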

In Ubuntu 11.10, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Reboot your machine (or reload the settings with sudo sysctl -p) to make the changes take effect.

You can check whether IPv6 is enabled on your machine with the following command:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).
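
If you prefer a readable answer over a bare 0 or 1, the check can be wrapped as below. This is only a sketch; the function name ipv6_state is made up for illustration, and it reads the same file as the cat command above unless you pass another path.

```shell
# Report IPv6 state from a disable_ipv6 flag file (0 = enabled, 1 = disabled).
# $1: path to the flag file; defaults to the path used in this tutorial.
# (ipv6_state is a hypothetical helper name.)
ipv6_state() {
    flag_file=${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}
    [ -r "$flag_file" ] || { echo "unknown"; return 2; }
    case "$(cat "$flag_file")" in
        0) echo "enabled" ;;
        1) echo "disabled" ;;
        *) echo "unknown"; return 2 ;;
    esac
}
```

Usage: `ipv6_state` should print "disabled" once the sysctl settings are active.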

Installation of Hadoop (perform these steps as root, e.g. root@ruhil)
{Note: the 0.20 release is stable; the others, mainly 0.23, are not stable yet, so I would go with 0.20 :)}
root@ruhil:~# mkdir -p /usr/local
root@ruhil:~# cd /usr/local
root@ruhil:/usr/local# wget -O hadoop-0.20.2.tar.gz
root@ruhil:/usr/local# tar xzf hadoop-0.20.2.tar.gz
root@ruhil:/usr/local# mv hadoop-0.20.2 hadoop
root@ruhil:/usr/local# chown -R hduser:hadoop hadoop

Create a .bashrc, or if you already have one, paste the lines below into it for Hadoop. (Note: you need to paste them for both root and hduser; if you like, you can paste them for all users.) My .bashrc additions are:

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
# Requires installed 'lzop' command.
lzohead () {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Configuration (Note: make all of the configuration settings as hduser):

The following picture gives an overview of the most important HDFS components.

[Figure: HDFS Architecture]

Our goal in this tutorial is a single-node setup of Hadoop. More information on what we do in this section is available on the Hadoop Wiki.

The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open /conf/ in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.


Change:

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to:

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
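
A wrong JAVA_HOME is a common reason for Hadoop refusing to start, so it can be worth a quick sanity check before going further. Here is a minimal sketch; the function name is_java_home is hypothetical, and all it verifies is that an executable bin/java exists under the directory you pass in.

```shell
# Succeed only if the given directory looks like a JDK/JRE home,
# i.e. it contains a regular, executable bin/java file.
# (is_java_home is a hypothetical helper name.)
is_java_home() {
    [ -n "$1" ] && [ -f "$1/bin/java" ] && [ -x "$1/bin/java" ]
}
```

Usage: `is_java_home /usr/lib/jvm/java-6-sun || echo "JAVA_HOME looks wrong"`.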

Now we create the /app/hadoop/tmp directory and set the required ownership and permissions:

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 755 /app/hadoop/tmp
{Adjust the chmod mode to suit your own security settings.}

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
Note: you need to make these changes in each of the config files listed below.

In file conf/core-site.xml (all of these config files live under /usr/local/hadoop):

<!-- In: conf/core-site.xml -->
<!-- property: hadoop.tmp.dir (the /app/hadoop/tmp directory created above) -->
  <description>A base for other temporary directories.</description>

<!-- property: fs.default.name -->
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>

In file conf/mapred-site.xml:

<!-- In: conf/mapred-site.xml -->
<!-- property: mapred.job.tracker -->
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>

In file conf/hdfs-site.xml:

<!-- In: conf/hdfs-site.xml -->
<!-- property: dfs.replication -->
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
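
To make the shape of the three files concrete, here is a rough sketch that writes a minimal single-node configuration into a directory of your choice. The four property names (hadoop.tmp.dir, fs.default.name, mapred.job.tracker, dfs.replication) are the ones the descriptions above belong to, and /app/hadoop/tmp is the directory created earlier. The ports 54310 and 54311 and the replication value of 1 are my assumptions for illustration, so adjust them to your setup. The function name write_single_node_conf is hypothetical as well.

```shell
# Write minimal single-node config files into the directory given as $1.
# ASSUMPTIONS (not taken from the text above): the ports 54310 and 54311,
# and a replication factor of 1 because a single node has only one DataNode.
# WARNING: files with the same names in that directory are overwritten.
write_single_node_conf() {
    conf_dir=$1
    [ -n "$conf_dir" ] || { echo "usage: write_single_node_conf DIR" >&2; return 1; }
    mkdir -p "$conf_dir" || return 1

    cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
EOF

    cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
EOF

    cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}
```

A cautious way to use it is to write into a scratch directory first, for example `write_single_node_conf /tmp/hadoop-conf-draft`, compare the result with your existing files, and copy over only what you need.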

Before starting the cluster for the first time, format the HDFS filesystem via the NameNode:

hduser@ruhil:~$ /usr/local/hadoop/bin/hadoop namenode -format

The output will look like this:

hduser@ruhil:/usr/local/hadoop$ bin/hadoop namenode -format
1/01/12 1:30:41 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ruhil/
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = -r 911707; compiled by 'ruhil' on Sun Jan 1 01:30:41 UTC 2012
1/01/12 1:30:41 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
1/01/12 1:30:41  INFO namenode.FSNamesystem: supergroup=supergroup
1/01/12 1:30:41  INFO namenode.FSNamesystem: isPermissionEnabled=true
1/01/12 1:30:41  INFO common.Storage: Image file of size 96 saved in 0 seconds.
1/01/12 1:30:41  INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
1/01/12 1:30:41 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at ruhil/

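Starting your single-node cluster

Once the daemons have been started, use jps to check which Hadoop processes are running:

hduser@ruhil:/usr/local/hadoop$ jps

The running daemons are listed in your shell.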
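
To avoid eyeballing the jps listing, a small filter can tell you which daemons are absent. This is a sketch under the assumption that a single-node Hadoop 0.20 setup runs the five daemons named in the stop output further down (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker). The function name missing_daemons is hypothetical.

```shell
# Read `jps` output on stdin and print each expected Hadoop 0.20 daemon
# that does not appear in it. Returns 0 only when all five are present.
# (missing_daemons is a hypothetical helper name.)
missing_daemons() {
    jps_output=$(cat)
    rc=0
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Compare the whole second column so that NameNode does not
        # accidentally match SecondaryNameNode.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: `jps | missing_daemons`. No output means every expected daemon showed up.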

Stop Hadoop using the command below:

hduser@ruhil:~$ /usr/local/hadoop/bin/
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Now run a MapReduce job:

Just watch the shell output carefully.
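
If you have never seen what a MapReduce job computes, the classic word count can be imitated on a single machine with an ordinary pipeline. This is only a local analogy and not a Hadoop command: splitting into words plays the role of the map step, sort stands in for the shuffle, and uniq -c does the reduce. The function name wordcount_local is made up for this sketch.

```shell
# Count words on stdin and print "word<TAB>count", sorted by word.
# Local analogy of a word-count MapReduce job; it does not use Hadoop.
# (wordcount_local is a hypothetical helper name.)
wordcount_local() {
    tr -s '[:space:]' '\n' |      # "map": emit one word per line
        sed '/^$/d' |             # drop empty lines from leading whitespace
        LC_ALL=C sort |           # "shuffle": bring equal words together
        uniq -c |                 # "reduce": count each group
        awk '{ printf "%s\t%s\n", $2, $1 }'
}
```

Usage: `wordcount_local < some-text-file.txt`. Hadoop distributes the same map, shuffle, and reduce steps across nodes and reads its input from HDFS.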

Now, finally, look at your browser. Yeah :-)

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
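
For quick reference, here is a small sketch that prints the web UI addresses for a given host. It assumes the stock Hadoop 0.20 default ports (50070 for the NameNode, 50030 for the JobTracker, 50060 for the TaskTracker). Those numbers are my assumption and are not stated in the text above, and they will differ if you override the HTTP address settings in your config. The function name hadoop_web_uis is hypothetical.

```shell
# Print the default web UI addresses for a host (default: localhost).
# ASSUMPTION: stock Hadoop 0.20 ports; they change if the HTTP address
# properties are overridden in the configuration.
# (hadoop_web_uis is a hypothetical helper name.)
hadoop_web_uis() {
    host=${1:-localhost}
    printf 'NameNode    http://%s:50070/\n' "$host"
    printf 'JobTracker  http://%s:50030/\n' "$host"
    printf 'TaskTracker http://%s:50060/\n' "$host"
}
```

Usage: `hadoop_web_uis` for the local machine, or `hadoop_web_uis ruhil` for another host name.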

Enjoy :) Process your data with ease and at super speed :)
Looking for any kind of help? (Reach me on Hadoop IRC with #cloudgeek.)
Feel free to mail me.