Sunday, 6 October 2013

Running Hadoop on CentOS Linux (Multi-Node Cluster)




In this tutorial I will describe the required steps for setting up a distributed, multi-node Apache Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on CentOS Linux.


Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop's HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
In a previous tutorial, I described how to set up a Hadoop single-node cluster on a CentOS box. The main goal of this tutorial is to get a more sophisticated Hadoop installation up and running, namely building a multi-node cluster from three CentOS boxes.
This tutorial has been tested with the following software versions:
  • CentOS/RHEL 6
  • Hadoop 1.2.0, released May 2013
  • Oracle JDK 7u25




The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Our earlier article about Hadoop described how to set up a single-node cluster. This article will walk you step by step through installing and configuring a Hadoop multi-node cluster on CentOS/RHEL 6.

Setup Details:

Hadoop Master: 192.168.1.15 ( hadoop-master )
Hadoop Slave : 192.168.1.16 ( hadoop-slave-1 )
Hadoop Slave : 192.168.1.17 ( hadoop-slave-2 )

Step 1. Install Java

Before installing Hadoop, make sure you have Java installed on all of your systems. If you do not, use the following steps to install Java.

Steps to install JAVA on CentOS 5/6 or RHEL 5/6

Step 1: Download Archive File

Download the latest version of Java from
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.

# cd /opt/
# wget http://download.oracle.com/otn-pub/java/jdk/7u25-b15/jdk-7u25-linux-i586.tar.gz?AuthParam=1372657186_d532b6d28fdb7f35ec7150a1d6df6778


Extract the downloaded archive using the following command.

# tar xzf jdk-7u25-linux-i586.tar.gz

Step 2: Install JAVA using Alternatives

After extracting the Java archive, we need to tell the system to use the newer version of Java via alternatives. Use the following commands to do so.

# cd /opt/jdk1.7.0_25
# alternatives --install /usr/bin/java java /opt/jdk1.7.0_25/bin/java 2
# alternatives --config java

Step 3: Check Version of JAVA

Use the following command to check which version of Java is currently being used by the system.

# java -version
java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) Client VM (build 23.25-b01, mixed mode)

Step 4: Setup Environment Variables

Most Java-based applications rely on environment variables to work. Use the following commands to set them up.

Setup JAVA_HOME Variable

# export JAVA_HOME=/opt/jdk1.7.0_25

Setup JRE_HOME Variable

# export JRE_HOME=/opt/jdk1.7.0_25/jre

Setup PATH Variable

# export PATH=$PATH:/opt/jdk1.7.0_25/bin:/opt/jdk1.7.0_25/jre/bin
I hope the above steps help you install Java on your Linux system. You can follow them to install multiple versions of Java side by side, but only one version can be active at a time.
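
Note that export only affects the current shell session. A minimal sketch to make these variables permanent, assuming the JDK path used above, is to put them in a file under /etc/profile.d so they are loaded at every login:

# cat > /etc/profile.d/java.sh <<'EOF'
export JAVA_HOME=/opt/jdk1.7.0_25
export JRE_HOME=/opt/jdk1.7.0_25/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
EOF
# source /etc/profile.d/java.sh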

Step 5: Add FQDN Mapping

Edit the /etc/hosts file on the master and all slave servers and add the following entries.

# vi /etc/hosts
192.168.1.15 hadoop-master
192.168.1.16 hadoop-slave-1
192.168.1.17 hadoop-slave-2
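
To confirm the mapping works, you can try resolving each hostname from every server; a quick check might look like this, where each ping should reach the IP address mapped above:

$ for host in hadoop-master hadoop-slave-1 hadoop-slave-2; do ping -c 1 $host; done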

Step 6: Configure Key Based Login

Hadoop requires the hadoop user to be able to ssh to the cluster nodes without a password. Use the following commands to configure key-based login between all of the cluster servers.
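
These commands assume a hadoop user already exists on every server. If it does not, a minimal sketch to create one (run as root on each node):

# useradd hadoop
# passwd hadoop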

# su - hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
$ chmod 0600 ~/.ssh/authorized_keys
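
To verify the key-based login, try running a command on each server over ssh; none of these should prompt for a password:

$ ssh hadoop-master hostname
$ ssh hadoop-slave-1 hostname
$ ssh hadoop-slave-2 hostname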

$ exit


Step 7: Download and Extract Hadoop Source

Download the latest available Hadoop version from its official site, on the hadoop-master server only.

# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.0/hadoop-1.2.0.tar.gz
# tar -xzf hadoop-1.2.0.tar.gz
# mv hadoop-1.2.0 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/
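
As a quick sanity check, you can ask the extracted copy to print its version (this assumes the JAVA_HOME variable from Step 4 is set); the first line of output should report Hadoop 1.2.0:

# bin/hadoop version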

Step 8: Configure Hadoop

First, edit the Hadoop configuration files on hadoop-master and make the following changes. In short: fs.default.name points every node at the NameNode on hadoop-master, and mapred.job.tracker points them at the JobTracker.

8.1 Edit core-site.xml

# vi conf/core-site.xml
# Add the following inside the configuration tag
<property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:9000/</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

8.2 Edit hdfs-site.xml

# vi conf/hdfs-site.xml
# Add the following inside the configuration tag
<property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/hadoop/dfs/name/data</value>
    <final>true</final>
</property>
<property>
    <name>dfs.name.dir</name>
    <value>/opt/hadoop/hadoop/dfs/name</value>
    <final>true</final>
</property>
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

Note: dfs.replication controls how many copies of each block HDFS keeps. With two slave DataNodes you may prefer a value of 2 so that every block is stored on both slaves; a value of 1 keeps a single copy only.

8.3 Edit mapred-site.xml

# vi conf/mapred-site.xml
# Add the following inside the configuration tag
<property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:9001</value>
</property>



8.4 Edit hadoop-env.sh

# vi conf/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.7.0_25
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf

Set the JAVA_HOME path as per your system's Java configuration (here, the JDK installed in Step 1).

Step 9: Copy Hadoop Source to Slave Servers

After updating the above configuration, we need to copy the Hadoop directory to all slave servers.
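
The scp commands below assume that /opt/hadoop already exists on each slave and is writable by the hadoop user. If it does not, a quick preparation sketch (run as root on each slave):

# mkdir -p /opt/hadoop
# chown hadoop /opt/hadoop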

# su - hadoop
$ cd /opt/hadoop
$ scp -r hadoop hadoop-slave-1:/opt/hadoop
$ scp -r hadoop hadoop-slave-2:/opt/hadoop

Step 10: Configure Hadoop on Master Server Only

Go to the Hadoop folder on hadoop-master and make the following settings. The masters file lists the host that runs the secondary NameNode; the slaves file lists the hosts that each run a DataNode and TaskTracker.

# su - hadoop
$ cd /opt/hadoop/hadoop
$ vi conf/masters

hadoop-master
$ vi conf/slaves

hadoop-slave-1
hadoop-slave-2


Step 11: Format Name Node on Hadoop Master Only

# su - hadoop
$ cd /opt/hadoop/hadoop

$ bin/hadoop namenode -format

13/07/13 10:58:07 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop-master/192.168.1.15
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.0
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473; compiled by 'hortonfo' on Mon May  6 06:59:37 UTC 2013
STARTUP_MSG:   java = 1.7.0_25
************************************************************/
13/07/13 10:58:08 INFO util.GSet: Computing capacity for map BlocksMap
13/07/13 10:58:08 INFO util.GSet: VM type       = 32-bit
13/07/13 10:58:08 INFO util.GSet: 2.0% max memory = 1013645312
13/07/13 10:58:08 INFO util.GSet: capacity      = 2^22 = 4194304 entries
13/07/13 10:58:08 INFO util.GSet: recommended=4194304, actual=4194304
13/07/13 10:58:08 INFO namenode.FSNamesystem: fsOwner=hadoop
13/07/13 10:58:08 INFO namenode.FSNamesystem: supergroup=supergroup
13/07/13 10:58:08 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/07/13 10:58:08 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/07/13 10:58:08 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/07/13 10:58:08 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
13/07/13 10:58:08 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/07/13 10:58:08 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/07/13 10:58:08 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/opt/hadoop/hadoop/dfs/name/current/edits
13/07/13 10:58:08 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/opt/hadoop/hadoop/dfs/name/current/edits
13/07/13 10:58:08 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name has been successfully formatted.
13/07/13 10:58:08 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.1.15

************************************************************/


Step 12: Start Hadoop Services

Use the following command to start all Hadoop services on hadoop-master.


$ bin/start-all.sh
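
The start-all.sh script starts the NameNode, SecondaryNameNode, and JobTracker on the master and a DataNode and TaskTracker on each slave. A quick way to verify, assuming the JDK's jps tool is on the PATH of each node, is to list the Java processes and ask HDFS for a cluster report:

$ jps                          # on hadoop-master: NameNode, SecondaryNameNode, JobTracker
$ ssh hadoop-slave-1 jps       # on each slave: DataNode, TaskTracker
$ bin/hadoop dfsadmin -report  # should list both DataNodes as live

If both DataNodes show up in the report, the multi-node cluster is up and running.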



