This is an introductory post on Hadoop for beginners who want step-by-step instructions for deploying Hadoop on the latest Ubuntu 14.04 box. Hadoop allows for the distributed processing of large data sets across clusters of computers using the MapReduce programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
HDFS is the distributed file system that ships with Hadoop. MapReduce tasks use HDFS to read and write data. An HDFS deployment includes a single Name Node and multiple Data Nodes. In this section, we will set up a Name Node and multiple Data Nodes.
Hadoop Architecture Design:
Machine IP | Type of Node | Hostname |
192.168.1.5 | Master Node | master.hadoopnode.com |
192.168.1.6 | Data Node 1 | datanode1.hadoopnode.com |
192.168.1.4 | Data Node 2 | datanode2.hadoopnode.com |
Let’s talk about YARN.
In simple terms, YARN is the next-generation Hadoop MapReduce, also called MapReduce v2. In short, it is a cluster management technology. YARN combines a central resource manager, which reconciles the way applications use Hadoop system resources, with node manager agents that monitor the processing operations of individual cluster nodes.
The fundamental idea of YARN is to split up the two major functionalities of the Job Tracker, resource management and job scheduling/monitoring, into separate daemons. YARN splits the responsibilities of the Job Tracker/Task Tracker into separate entities:
- a global Resource Manager
- a per-application Application Master
- a per-node slave Node Manager
- a per-application Container running on a Node Manager
Put together, the YARN components can be visualized as shown below:
What are the prerequisites:
1. Install three Ubuntu 14.04.1 VMs on VirtualBox. While installing, ensure that the OpenSSH server package is selected, which configures the SSH service automatically.
Ensure that the Bridged Adapter option is configured (as shown below). This ensures that all the nodes can communicate with each other.
1. Login to master.hadoopnode.com as a normal user through PuTTY. As you see below, the master node has the IP address 192.168.1.5. Ensure that the full FQDN is provided for this host.
As shown above, I logged in as user1, which was created by default during installation. We will soon create a dedicated user and group for Hadoop.
2. Open the /etc/hosts file with the vi editor and add the following entries:
As shown above, you need to add the hostname and IP address of each node so that the nodes can identify and ping each other by both hostname and IP address.
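Based on the architecture table above, the /etc/hosts entries added on every node would look like this (the short aliases at the end are optional and just a convenience):

    192.168.1.5    master.hadoopnode.com      master
    192.168.1.6    datanode1.hadoopnode.com   datanode1
    192.168.1.4    datanode2.hadoopnode.com   datanode2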
Setting up User and Group for Hadoop
3. Let’s create a user for Hadoop. First, create a group called hadoop and then add a new user called hduser to the newly created hadoop group, as shown below:
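On Ubuntu, this typically boils down to:

    sudo addgroup hadoop                      # create the hadoop group
    sudo adduser --ingroup hadoop hduser      # create hduser and place it in the hadoop group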
4. Ensure that the newly created hadoop user is added to the sudo group (shown below):
The above step is an important step and shouldn’t be skipped.
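One way to grant that sudo access is:

    sudo adduser hduser sudo      # add hduser to the sudo group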
Enabling Password-less SSH
5. Make sure that hduser can SSH to its own account without a password.
- The first time, try SSH to localhost by running ssh hduser@localhost. It will prompt you and add this host to the list of known hosts. Run the exit command and try to SSH again. This time it shouldn’t ask for a password (as shown below).
As shown above, the hduser can SSH to its own account without any password.
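If you have not yet generated a key pair for hduser, a common sequence for enabling this (run as hduser) is; the empty passphrase is purely a convenience for this lab setup:

    ssh-keygen -t rsa -P ""                              # generate an RSA key pair with an empty passphrase
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys      # authorize the key for the local account
    ssh hduser@localhost                                 # should no longer prompt for a password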
Disabling IPv6
[OPTIONAL] It is recommended to disable IPv6, since the system is going to use 0.0.0.0 for various Hadoop configuration settings. Follow the steps below to disable IPv6 on the master node.
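One common way to do this on Ubuntu is to append the following kernel parameters to /etc/sysctl.conf:

    # /etc/sysctl.conf -- disable IPv6 on all interfaces
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1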
8. Reboot the machine so that the kernel parameters take effect correctly.
Remember that you may skip the IPv6 step above if you are only working in a test environment.
9. Re-SSH to the master node through PuTTY again.
Configuring JAVA
10. Download the JDK from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html (shown below)
I downloaded jdk-7u71-linux-i586.tar.gz as per my machine architecture. If you are running an x86_64 machine, you will need to download the x64 tar.gz from the same link.
11. Create a directory called java under /usr/local using the mkdir utility.
12. Upload the Oracle JDK binaries into the java directory of the Ubuntu machine through WinSCP or any other available tool.
13. Unpack the compressed JDK software as shown below:
Once unzipped, you will see the following listing of files.
14. Copy the unpacked Oracle JDK binaries into the /usr/local/java directory as shown below:
Verify that all the binaries are copied.
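As a quick recap, steps 11, 13 and 14 amount to something like the following (the jdk1.7.0_71 directory name comes from the 7u71 archive; adjust if you downloaded a different build):

    sudo mkdir -p /usr/local/java                   # step 11: create the target directory
    tar -xvzf jdk-7u71-linux-i586.tar.gz            # step 13: unpack the JDK archive
    sudo cp -r jdk1.7.0_71 /usr/local/java/         # step 14: copy the unpacked JDK into /usr/local/java
    ls /usr/local/java/jdk1.7.0_71                  # verify the binaries were copied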
15. Set up the JAVA_HOME environment variable. Open /etc/profile with the nano or vi editor and add the following lines at the end of the file.
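The added lines would look roughly like this (the JDK path assumes the jdk1.7.0_71 directory created above):

    JAVA_HOME=/usr/local/java/jdk1.7.0_71
    PATH=$PATH:$JAVA_HOME/bin
    export JAVA_HOME
    export PATH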
16. Save the file.
17. Run the following command to point the system to the correct Oracle JDK location.
18. For the JDK to be available for use, run the following command.
19. It is very important to run the command below to reload your system-wide PATH from /etc/profile.
20. You can also verify whether JAVA_HOME is working or not.
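For reference, steps 17 through 20 typically boil down to commands like these (again assuming the jdk1.7.0_71 path):

    # step 17: register the Oracle JDK binaries with the alternatives system
    sudo update-alternatives --install /usr/bin/java java /usr/local/java/jdk1.7.0_71/bin/java 1
    sudo update-alternatives --install /usr/bin/javac javac /usr/local/java/jdk1.7.0_71/bin/javac 1
    # step 18: make the Oracle JDK the default
    sudo update-alternatives --set java /usr/local/java/jdk1.7.0_71/bin/java
    sudo update-alternatives --set javac /usr/local/java/jdk1.7.0_71/bin/javac
    # step 19: reload the system-wide PATH
    . /etc/profile
    # step 20: verify
    echo $JAVA_HOME
    java -version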
NOTE: Java needs to be configured the same way, following the steps above, on all the nodes.
21. Before configuring Hadoop, we need to make data node 1 and data node 2 ready. Let’s configure them too.
Setting up Data Node 1
1. Login to one of the data nodes, say datanode1.hadoopnode.com, as a normal user through PuTTY. As you see below, this machine has the IP address 192.168.1.6.
As shown above, I logged in as user2, which was created by default during installation. We will soon create a user and group for Hadoop.
2. Open the /etc/hosts file with the vi editor and add the following entries:
NOTE: Follow the above step on Data Node 2 too.
3. Just as we created the hduser and hadoop group for the master node, follow the same steps for data node 1 and data node 2 too.
Ensure you don’t miss the step below that allows sudo access for hduser.
4. This is an important part of the data node configuration. We are going to configure passwordless SSH so that the master node can SSH to all data nodes without a password.
Note: Run the step below on the Master Node only.
Try logging in to the slave node from the master node without a password as shown below:
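One straightforward way to push the master's public key to a data node is ssh-copy-id (run as hduser on the master node):

    ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@datanode1.hadoopnode.com   # copy the master's public key to data node 1
    ssh hduser@datanode1.hadoopnode.com                                # should now log in without a password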
5. Follow steps 10 to 20 discussed above to configure JAVA_HOME on this node too.
Setting up Data Node 2:
1. Login to data node 2 as shown below:
2. Configure /etc/hosts just as we configured it for data node 1.
3. Configure the user and group for Hadoop.
4. Again, this is an important step, which IS TO BE RUN ON THE MASTER ONLY.
The above command enables passwordless SSH from the master to data node 2.
5. Follow steps 10 to 20 discussed above to configure JAVA_HOME on this node too. Once you have configured both data nodes, you should see something like what is shown below:
Configuring Hadoop:
NOTE: The commands below are to be run on the master node and all the data nodes.
22. Download the Hadoop binaries from http://mirror.olnevhost.net/pub/apache/hadoop/core/hadoop-2.3.0/hadoop-2.3.0.tar.gz. Run the wget utility (shown below) to download the Hadoop binaries from the remote mirror.
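That is:

    wget http://mirror.olnevhost.net/pub/apache/hadoop/core/hadoop-2.3.0/hadoop-2.3.0.tar.gz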
23. Unzip the hadoop binaries as shown below:
Once you run the above command, it will extract the binaries and place them in the same location. You need to copy them to the /usr/local directory.
24. Unpack the hadoop tar directly into the /usr/local/hadoop-2.3.0 folder:
25. Create a symbolic link /usr/local/hadoop pointing to the hadoop directory, as shown below:
26. Give ownership of the hadoop binaries to hduser and the hadoop group so they can execute them:
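Taken together, steps 24 to 26 look roughly like this:

    sudo tar -xvzf hadoop-2.3.0.tar.gz -C /usr/local      # step 24: unpack into /usr/local/hadoop-2.3.0
    sudo ln -s /usr/local/hadoop-2.3.0 /usr/local/hadoop  # step 25: convenience symlink /usr/local/hadoop
    sudo chown -R hduser:hadoop /usr/local/hadoop-2.3.0   # step 26: hand ownership to hduser:hadoop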
27. Switch to hduser through the following command:
28. Open the .bashrc file in the home directory of hduser and add the following entries at the end of the file:
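A typical set of entries for Hadoop 2.x looks like the following; these are the commonly used variable names, so adjust them if your layout differs:

    export HADOOP_INSTALL=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
    export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_HOME=$HADOOP_INSTALL
    export HADOOP_HDFS_HOME=$HADOOP_INSTALL
    export YARN_HOME=$HADOOP_INSTALL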
29. Save the file. Run the bash command to make the environment variables effective, as shown below:
30. Edit hadoop-env.sh to let Hadoop know where JAVA_HOME resides. Once the entry is made, you should be able to see the following results.
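The entry in $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh is simply the JAVA_HOME export, using the JDK path assumed earlier:

    export JAVA_HOME=/usr/local/java/jdk1.7.0_71    # point Hadoop at the Oracle JDK installed above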
31. Verify the Hadoop installation through the following command:
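For example:

    hadoop version    # prints the Hadoop release, build details and checksum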
This shows that Hadoop is properly configured.
32. As the above Hadoop configuration is run on all the nodes, ensure that /usr/local/hadoop is the path where Hadoop resides on every node. Follow the same steps for all the data nodes too. For example, if you follow the steps above on data node 1, you should expect the following results:
Now we have the master node and the data nodes ready with the basic Hadoop installation.
PLEASE NOTE: In newer versions of Hadoop, there is one small additional step for the JAVA_HOME environment variable to take effect. Open the file hadoop-config.sh under /usr/local/hadoop/libexec and make the following entry too.
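The entry is the same JAVA_HOME export used earlier, placed near the top of the file:

    export JAVA_HOME=/usr/local/java/jdk1.7.0_71    # assumed JDK path from the earlier steps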
Configuring Master Node:
33. Let us first work through the master node configuration files.
34. Ensure that you are logged in as hduser when running the commands below.
Create the required files on the master node as shown below.
35. Open the hdfs-site.xml file and add the following entries:
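The exact values depend on your environment, but a minimal set of entries for this two-data-node cluster could look like the following; the namenode directory here is just an example path:

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
    </property>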
36. Open the file $HADOOP_INSTALL/etc/hadoop/core-site.xml and let Hadoop know where the master node (name node) resides.
Put the entries only inside the configuration tags and not outside.
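For instance, the entry between the <configuration> tags could be the following; Hadoop 2.x uses the fs.defaultFS property (the older name fs.default.name also still works), and the port 9000 is simply a common choice:

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://master.hadoopnode.com:9000</value>
    </property>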
37. Format the HDFS filesystem on the master node as shown below:
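On Hadoop 2.x the format command, run as hduser, is:

    hdfs namenode -format    # initializes the HDFS metadata directories configured above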
It takes a few seconds and the final result is displayed:
38. Edit the file $HADOOP_INSTALL/etc/hadoop/slaves on the master node with entries for all the data nodes.
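With the hostnames from the table above, the slaves file simply lists the data nodes, one per line:

    datanode1.hadoopnode.com
    datanode2.hadoopnode.com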
Configuring Data Nodes
39. Login to one of the data nodes (say datanode1) and create the following files:
Once you make the entry, the file should look as shown below:
41. Let the data node know where the master node (namenode) resides by editing the core-site.xml file.
Instead of the IP address above, you can use the hostname of the master node if you have the correct entries in /etc/hosts.
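As a rough sketch, the corresponding data node entries could look like the following; the directory path and port are assumptions, kept consistent with the master node example above:

    <!-- hdfs-site.xml on each data node -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
    </property>

    <!-- core-site.xml on each data node -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://master.hadoopnode.com:9000</value>
    </property>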
42. Now open the master node session and run the command below:
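The cluster is typically brought up with the start scripts from the sbin directory, which is already on the PATH via .bashrc:

    start-dfs.sh     # starts the name node, secondary name node and the data nodes listed in the slaves file
    start-yarn.sh    # starts the resource manager and the node managers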
As shown below, it tries to SSH to the data nodes and start the respective services on them.
Ensure that the required services (name node, data node and YARN) are running, using the jps command:
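For example:

    jps    # on the master you should typically see NameNode, SecondaryNameNode and ResourceManager listed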
43. One can verify this as shown below:
44. Ensure that the required services are running at the data node end too.
Also, verify on data node 2 as shown below:
45. You can access the Hadoop Name Node details at http://<masternode>:50070.
Under this link, you can access the Datanodes, logs for each namenode, a snapshot of events and overall DFS details too.
As shown above, there are two Data Nodes represented as Live Nodes.
Click on the Datanodes section at the top of the web UI to see the data node status.
You can view the secondary namenode through the following URL:
You can see the datanode 1 status under the URL:
In a similar manner, you can see the data node 2 status through the URL:
Before I wrap up..
We will now look into basic HDFS shell commands, which form the basis for running MapReduce jobs. The Hadoop file system shell commands, usually referred to as fs shell commands, are used to perform various file operations: copying files, changing permissions, viewing file contents, changing ownership of files, creating directories and much more. You can see the various options available through the command below:
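For example:

    hadoop fs -help    # lists all available file system shell commands and their options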
Listing the size of DFS:
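One way to do this is:

    hadoop fs -df /    # shows the configured capacity, used and available space of the DFS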
Creating a directory in HDFS. This is very similar to the unix mkdir command.
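For example (the demo path is just an illustration):

    hadoop fs -mkdir -p /user/hduser/demo    # create a directory tree in HDFS, much like mkdir -p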
Listing the HDFS file system:
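For example:

    hadoop fs -ls /user/hduser    # list the contents of an HDFS directory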
Copy a file from your local system to HDFS file system:
As shown above, first create an empty file called alpha in some local folder. Add some content to it through an editor. Use the -copyFromLocal option to copy it from the local file system to the HDFS file system.
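A sketch of that sequence (the file name alpha comes from the text, the target path is the demo directory assumed above):

    echo "hello hdfs" > alpha                              # create a small local file with some content
    hadoop fs -copyFromLocal alpha /user/hduser/demo/      # copy it from the local file system into HDFS
    hadoop fs -ls /user/hduser/demo                        # confirm the file landed in HDFS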
Copy a file from HDFS to local system:
As shown above, first delete the alpha file from your local machine. Now try running the -copyToLocal option to copy it from HDFS to your local machine.
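For example:

    rm alpha                                               # remove the local copy first
    hadoop fs -copyToLocal /user/hduser/demo/alpha .       # pull the file back from HDFS into the local directory
    cat alpha                                              # verify the contents came back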
Displaying the length of the file contained in a directory:
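For example:

    hadoop fs -du /user/hduser/demo    # show the length (in bytes) of each file in the directory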
Displaying the stat information for an HDFS path:
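For example:

    hadoop fs -stat /user/hduser/demo    # print stat information (by default the modification time) for the path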
You can always refer to the HDFS man pages for detailed options of the command-line utilities that operate on HDFS.
In our next episode, we will cover the various components of the Hadoop ecosystem.