
Running Hadoop on Ubuntu 14.04 (Multi-Node Cluster)


This is an introductory post on Hadoop for beginners who want step-by-step instructions for deploying Hadoop on the latest Ubuntu 14.04 box. Hadoop allows for the distributed processing of large data sets across clusters of computers using the MapReduce programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

HDFS is the distributed file system that ships with Hadoop. MapReduce tasks use HDFS to read and write data. An HDFS deployment includes a single Name Node and multiple Data Nodes. In this section, we will set up a Name Node and multiple Data Nodes.

Hadoop Architecture Design:

Machine IP      Type of Node     Hostname
192.168.1.5     Master Node      master.hadoopnode.com
192.168.1.6     Data Node 1      datanode1.hadoopnode.com
192.168.1.4     Data Node 2      datanode2.hadoopnode.com

Let’s talk about YARN..

In simple terms, YARN is Hadoop's next-generation MapReduce, also called MapReduce v2. In short, it is a cluster management technology. YARN combines a central Resource Manager, which reconciles the way applications use Hadoop system resources, with Node Manager agents that monitor the processing operations of individual cluster nodes.

The fundamental idea of YARN is to split the two major responsibilities of the JobTracker/TaskTracker, resource management and job scheduling/monitoring, into separate daemons:

  • a global Resource Manager
  • a per-application Application Master
  • a per-node slave Node Manager
  • a per-application Container running on a Node Manager

Putting it all together, YARN consists of a global Resource Manager on the master node, a Node Manager on every worker node, and one Application Master with its Containers per application.

What are the prerequisites?

1. Install three Ubuntu 14.04.1 VMs on VirtualBox. While installing, ensure that the OpenSSH server package is selected, which configures the SSH service automatically.

Ensure that the Bridged Adapter network option is configured for each VM. This ensures that all the nodes can communicate with each other.
Setting up Master Node

1. Log in to master.hadoopnode.com as a normal user through PuTTY. The master node has the IP address 192.168.1.5. Ensure that the FQDN is provided as the hostname for this host.

I logged in as user1, which was created by default at installation time. We will soon create a dedicated user and group for Hadoop.

2. Open the /etc/hosts file in the vi editor and add the following entries:
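Based on the cluster table above, the entries would look roughly like this (adjust the IPs and hostnames to your own setup):

```
192.168.1.5    master.hadoopnode.com
192.168.1.6    datanode1.hadoopnode.com
192.168.1.4    datanode2.hadoopnode.com
```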

You need to add the hostname and IP address of every node so that the nodes can identify and ping each other by hostname as well as by IP address.

Setting up User and Group for Hadoop

3. Let's create a user for Hadoop. First, create a group called hadoop and add a new user called hduser to the newly created hadoop group as shown below:
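On Ubuntu, a typical way to do this is:

```
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
```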

4. Ensure that the newly created hduser is added to the sudo group (shown below):
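One way to grant the sudo rights (a sketch):

```
sudo adduser hduser sudo
```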

This is an important step and shouldn't be skipped.

Enabling Password-less SSH

5. Make sure that hduser can SSH to its own account without a password.
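A typical way to set this up (the empty passphrase is deliberate so that the Hadoop scripts can log in non-interactively):

```
su - hduser
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
```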

  1. The first time you SSH to localhost by running ssh hduser@localhost, it will ask you to confirm adding the host to the list of known hosts. Run the exit command and try to SSH again; this time it shouldn't ask for a password.

Now hduser can SSH to its own account without any password.

Disabling IPv6

[OPTIONAL] It is recommended to disable IPv6, since Hadoop's use of 0.0.0.0 in various configurations can end up bound to IPv6 addresses on IPv6-enabled systems. Follow the steps below to disable IPv6 on the master node.

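A common way to do this is to append the following lines to /etc/sysctl.conf:

```
# disable IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
```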

8. Reboot the machine so that the kernel parameters take effect.

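After the reboot, you can check whether IPv6 is disabled; a value of 1 means it is:

```
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
```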

Remember that you can skip the IPv6 step if you only have a test environment.

9. Re-SSH to the master node through PuTTY again.

Configuring JAVA

10. Download the JDK from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html (shown below).

I downloaded jdk-7u71-linux-i586.tar.gz as per my machine architecture. If you are running an x86_64 machine, download the x64 tar.gz from the same link instead.

11. Create a directory called java under /usr/local using the mkdir utility.
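For example:

```
sudo mkdir -p /usr/local/java
```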

12. Upload the Oracle JDK archive to the Ubuntu machine through WinSCP or whichever file-transfer tool is available.

13. Unpack the compressed JDK software as shown below:

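Assuming the i586 archive downloaded above, something like:

```
tar -xvzf jdk-7u71-linux-i586.tar.gz
```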

Once unpacked, you will see the listing of JDK files under a directory named something like jdk1.7.0_71.

14. Copy the unpacked Oracle JDK directory into /usr/local/java as shown below:
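For example (the jdk1.7.0_71 name is what the 7u71 archive typically unpacks to):

```
sudo cp -r jdk1.7.0_71 /usr/local/java/
```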

Verify that all the files were copied (for example, with ls /usr/local/java).

15. Set up the JAVA_HOME environment variable. Open /etc/profile in the nano or vi editor and add the following lines at the end of the file:
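The entries would look roughly like this (the jdk1.7.0_71 path assumes the directory copied above):

```
JAVA_HOME=/usr/local/java/jdk1.7.0_71
PATH=$PATH:$JAVA_HOME/bin
export JAVA_HOME PATH
```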

16. Save the file.

17. Run the following commands to point the system to the correct Oracle JDK location:
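On Ubuntu this is typically done with update-alternatives (a sketch, assuming the paths above):

```
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java/jdk1.7.0_71/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/java/jdk1.7.0_71/bin/javac" 1
```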

18. To make this JDK the default for use, run the following commands:
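For example:

```
sudo update-alternatives --set java /usr/local/java/jdk1.7.0_71/bin/java
sudo update-alternatives --set javac /usr/local/java/jdk1.7.0_71/bin/javac
```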

19. It is very important to reload the system-wide profile so that the PATH set in /etc/profile takes effect:
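For example:

```
. /etc/profile
```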

20. You can also verify whether JAVA_HOME is working:
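For example:

```
echo $JAVA_HOME
java -version
```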

NOTE: Java needs to be configured in the same way, following the steps above, on all the nodes.

21. Before configuring Hadoop, we need to make data node 1 and data node 2 ready. Let’s configure them too.

Setting up Data Node 1

1. Log in to one of the data nodes, say datanode1.hadoopnode.com, as a normal user through PuTTY. This machine has the IP address 192.168.1.6.

I logged in as user2, which was created by default at installation time. We will soon create the Hadoop user and group here as well.

2. Open the /etc/hosts file in the vi editor and add the same entries as on the master node:

NOTE: Follow the above step on Data Node 2 too.

3. Just as we created the hduser user and hadoop group on the master node, follow the same steps on data node 1 and data node 2.

Ensure you don't miss the step of granting sudo access to hduser, as on the master node.

4. This is an important part of the data node configuration. We are going to configure passwordless SSH so that the master node can SSH to all the data nodes without a password.

Note: Run the below step on Master Node only.

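One way to copy the master's public key to the data node (a sketch, run as hduser on the master node):

```
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@datanode1.hadoopnode.com
```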

Try logging in to the slave node from the master node without a password, as shown below:

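For example:

```
ssh hduser@datanode1.hadoopnode.com
```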

5. Follow steps 10 to 20 above to configure JAVA_HOME on this node too.

Setting up Data Node 2:

1. Log in to data node 2 as a normal user through PuTTY.

2. Configure /etc/hosts in the same way as on data node 1.

3. Configure the user and group for Hadoop, as before.

4. Again, this is an important step which IS TO BE RUN ON MASTER ONLY.

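As for data node 1, copy the master's public key across (run as hduser on the master node):

```
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@datanode2.hadoopnode.com
```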

The above command enables passwordless SSH from the master node to data node 2.

5. Follow steps 10 to 20 above to configure JAVA_HOME on this node too. Once both data nodes are configured, java -version should report the Oracle JDK on each of them.

Configuring Hadoop: 

NOTE: The commands below are to be run on the master node and on all the data nodes.

22. Download the Hadoop binaries from http://mirror.olnevhost.net/pub/apache/hadoop/core/hadoop-2.3.0/hadoop-2.3.0.tar.gz. Run the wget utility (shown below) to fetch the archive from the remote mirror.
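For example:

```
wget http://mirror.olnevhost.net/pub/apache/hadoop/core/hadoop-2.3.0/hadoop-2.3.0.tar.gz
```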

23. Extract the Hadoop archive as shown below:
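For example:

```
tar -xvzf hadoop-2.3.0.tar.gz
```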

Running the above command extracts the binaries into the current directory. They need to end up under the /usr/local directory.

24. Extract the Hadoop tarball directly into the /usr/local/hadoop-2.3.0 folder:
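For example:

```
sudo tar -xvzf hadoop-2.3.0.tar.gz -C /usr/local/
```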

25. Create a symbolic link /usr/local/hadoop pointing to the Hadoop directory, as shown below:
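For example:

```
sudo ln -s /usr/local/hadoop-2.3.0 /usr/local/hadoop
```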

26. Give ownership to hduser and the hadoop group so that they can execute the Hadoop binaries:
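For example:

```
sudo chown -R hduser:hadoop /usr/local/hadoop-2.3.0
```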

27. Switch to hduser through the following command:
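For example:

```
su - hduser
```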

28. Open the .bashrc file in hduser's home directory and add the following entries at the end of the file:
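A typical set of entries for Hadoop 2.x looks like this (a sketch; HADOOP_INSTALL matches the /usr/local/hadoop path and the variable used later in this post):

```
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
```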

29. Save the file. Run the bash command so that the new environment variables take effect, as shown below:

30. Edit hadoop-env.sh so that Hadoop knows where JAVA_HOME resides. Once the entry is in place, Hadoop commands will pick up the correct JDK.
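The relevant line in $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh would be (assuming the JDK path set up earlier):

```
export JAVA_HOME=/usr/local/java/jdk1.7.0_71
```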

31. Verify the Hadoop installation through the following command:
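For example:

```
hadoop version
```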

This shows that Hadoop is properly configured.

32. As the above Hadoop configuration is run on all the nodes, ensure that /usr/local/hadoop is the path where Hadoop resides on every node. Follow the same steps on all the data nodes too; for example, running the same check on data node 1 should produce the same result.

Now we have the master node and the data nodes ready with the basic Hadoop installation.

PLEASE NOTE: In newer versions of Hadoop, there is one small additional step for the JAVA_HOME setup to work. Open the file hadoop-config.sh under /usr/local/hadoop/libexec and make the JAVA_HOME entry there too.
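The entry would be the same export as in hadoop-env.sh (a sketch, assuming the same JDK path):

```
export JAVA_HOME=/usr/local/java/jdk1.7.0_71
```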

Configuring Master Node:

33. Let us first configure the Hadoop configuration files on the master node.

34. Ensure that you are logged in as hduser while running the commands below.

Create or edit the required files on the master node as shown below.

35. Open the hdfs-site.xml file and add the following entries:
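As a minimal sketch, an entry that sets the HDFS replication factor to match the two data nodes might look like this (the value 2 is an assumption; the NameNode and DataNode directories fall back to Hadoop's defaults under hadoop.tmp.dir unless you configure them explicitly):

```
cat > $HADOOP_INSTALL/etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF
```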
36. Open the file $HADOOP_INSTALL/etc/hadoop/core-site.xml and let Hadoop know where the master node (name node) resides.
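A sketch of the relevant entry; the port 9000 is an assumption, so use whatever port you choose consistently on every node:

```
cat > $HADOOP_INSTALL/etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master.hadoopnode.com:9000</value>
  </property>
</configuration>
EOF
```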

Put the entries only inside the <configuration> tags and not outside them.

37. Format the HDFS filesystem on the master node as shown below:
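For example, as hduser on the master node:

```
hdfs namenode -format
```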

It takes a few seconds before the final result gets displayed.

38. Edit the file $HADOOP_INSTALL/etc/hadoop/slaves on the master node with the entries for all the data nodes.
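Based on the cluster table above, the slaves file would contain one data node hostname per line:

```
datanode1.hadoopnode.com
datanode2.hadoopnode.com
```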

Configuring Data Nodes

39. Log in to one of the data nodes (say, datanode1) and create or edit the following files:

Once you make the entry, the file should look like the example below.
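As a sketch, the data node can carry the same minimal hdfs-site.xml as the master node (the replication value is the same assumption as before):

```
cat > $HADOOP_INSTALL/etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF
```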

41. Let the data node know where the master node (namenode) resides by editing the core-site.xml file.
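A sketch, pointing at the master's IP address with the same assumed port:

```
cat > $HADOOP_INSTALL/etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.1.5:9000</value>
  </property>
</configuration>
EOF
```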

Instead of the IP address above, you can use the hostname of the master node if you have the correct entries under /etc/hosts.

42. Now open a session on the master node and run the commands below:
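With Hadoop 2.x, a common way to start the HDFS and YARN daemons is (run as hduser on the master node):

```
start-dfs.sh
start-yarn.sh
```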

As the startup scripts run, they SSH to each data node and start the respective services there.

Ensure that the required services (name node, data node and YARN) are running by using the jps command:

43. One can verify this on the master node:
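A sketch of the check and the processes you would roughly expect on the master node (the exact PIDs will differ):

```
jps
# typically lists NameNode, SecondaryNameNode, ResourceManager and Jps itself
```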

44. Ensure that the required services are running at the data node end too.

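On a data node, the same check would roughly show:

```
jps
# typically lists DataNode, NodeManager and Jps itself
```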

Also, verify on data node 2 in the same way.

45. You can access Hadoop Name Node details under http://<masternode>:50070.

Under this link, you can access the data nodes, the logs for the name node, a snapshot of events, and the overall DFS details.

On the overview page, the two Data Nodes are reported as Live Nodes.

Click on the Datanodes section at the top of the web UI to see the data node status.


You can visualize the secondary namenode through its own web UI, which by default listens on port 50090 (e.g. http://<masternode>:50090).

You can see the data node 1 status under its own web UI, which by default listens on port 50075 (e.g. http://datanode1.hadoopnode.com:50075).

In a similar manner, you can see the data node 2 status (e.g. http://datanode2.hadoopnode.com:50075).

Before I wrap up..

We will now look at basic HDFS shell commands, which form the basis for running MapReduce jobs. The Hadoop file system commands, usually referred to as fs shell commands, are used to perform various file operations: copying files, changing permissions, viewing file contents, changing ownership, creating directories and much more. You can see the various options available through the command below:
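For example:

```
hadoop fs -help
```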

Listing the size of DFS:

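One way to do this (a sketch):

```
hadoop fs -df /
```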

Creating a directory in HDFS is very similar to the Unix command:
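For example (the /sample directory name is just an illustration):

```
hadoop fs -mkdir /sample
```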

Listing the HDFS file system:

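For example:

```
hadoop fs -ls /
```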

Copy a file from your local system to HDFS file system:

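A sketch, using the alpha file described below and the /sample directory created above:

```
echo "some sample contents" > alpha
hadoop fs -copyFromLocal alpha /sample/alpha
```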

As shown above, first create an empty file called alpha in some local folder and add some contents to it through an editor. Then use the -copyFromLocal option to copy it from the local file system to the HDFS file system.

Copy a file from HDFS to local system:

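A sketch:

```
rm alpha
hadoop fs -copyToLocal /sample/alpha alpha
```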

As shown above, first delete the alpha file from your local machine, then use the -copyToLocal option to copy it back from HDFS to your local machine.

Displaying the length of the files contained in a directory:
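For example:

```
hadoop fs -du /sample
```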

Displaying the stat information for an HDFS path:
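For example:

```
hadoop fs -stat /sample/alpha
```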

You can always refer to the HDFS shell documentation for the detailed options of these command-line utilities for operating on HDFS.

In our next post, we will cover the various components of the Hadoop ecosystem.
