Running Hadoop over Lustre

Estimated Reading Time: 7 minutes

This blog post describes how to run Apache Hadoop over the open source distributed parallel file system like Lustre.

Hadoop is a large-scale distributed, open source framework for parallel processing of data on large cluster built of commodity hardware. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.

Hadoop is built on two main parts. A special file system called Hadoop Distributed File System (HDFS) and the MapReduce Framework. HDFS is an optimized file system for distributed processing of very large datasets on commodity hardware. HDFS stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.

Hadoop runs on master-slave architecture. HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients. There are a number of DataNodes usually one per node in a cluster. The DataNodes manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks and set of blocks are stored in DataNodes. DataNodes serves read, write requests, and performs block creation, deletion, and replication upon instruction from Namenode.
On the other hand, Lustre is an open-source distributed parallel filesystem.Lustre is a scalable, secure, robust, and highly-available cluster file system that addresses the I/O needs, such as low latency and extreme performance, of large computing clusters. Lustre is basically an object-based file system. It is composed of three functional components: Metadata servers (MDSs), object storage servers (OSSs), and clients.MDS (metadata servers) provides metadata services MDS store file system metadata such as file names, directories, and permissions. An availability of the Meta Data Server is critical for file system data. In a typical configuration, there are two MDSs configured for high-availability failover. Since an MDS stores only metadata, the storage (MDT) attached to the file system need only store hundreds of gigabytes for a multi-terabyte file system. One MDS per file system manages one metadata target (MDT). Each MDT stores file metadata, such as file names, directory structures, and access permissions.

Why Lustre and not HDFS?

Hadoop distributed File System fits well for general statistics application but it might show performance bottleneck for computational complexity applications. For example, High Performance computing based application which generates large even increasing outputs. Secondly, HDFS is not POSIX complaint, which means it cannot be used as a normal file system which makes it difficult to extend. Also, HDFS has a WORM (write-once, read-many) access model, changing a small part of a file requires that all file data be copied, resulting in very time-intensive file modifications. Hadoop implements a computational paradigm named Map/Reduce where Reduce node uses HTTP to shuffle all related big Map Task outputs before real task begins. This will surely consume mass resources and generate lot of I/O and merge spill operation.

Setting up Hadoop over Lustre

Image-1-Lustre

Image-2-Lustre

Setting up Metadata Server

I assume that you have CentOS/RHEL 6.x installed on your hardware. I have RHEL 6.x available and will be using it for this demonstration. This should work for CentOS 6.x versions too. Firewall and SELinux both need to be disabled. You can disable firewall through iptables command whereas SELinux can be disabled through the below command:

Lustre-1

Install the lustre related packages through YUM repository.

lustre-2

Reboot the machine to boot into Lustre kernel. Once the machine is up, verify the lustre kernel through the following command:

lustre-3

Create a LVM partition on /dev/sda2 (ensure this is a LVM partition type at your end)

lustre-4

Run mkfs.lustre utility to create a file system named lustre on the server including the metadata target and the management server as shown below:

lustre-0

Now, create a mount point and start MDS node as shown below:

lustre-01

Run the mount command to verify the overall setting:

lustre-7

Ensure the MDS related services are enabled:

[root@MDS ~] # modprobe lnet
[root@MDS ~] # lctl network up
[root@MDS ~] # lctl list_nids
192.168.1.185@tcp

Hence, this completes MDS configuration.

Setting up Object Storage Server (OSS1)

Object Storage Server need to be setup on a separate machine. I assume lustre is installed on a separate box and booted into lustre kernel. As we did earlier for MDS, we need to create several logical volumes namely ost1 to ost6 as shown below:

lustre-8

Use the mkfs.lustre command to create the lustre file systems as shown below:
[root@oss1-0 ~] # mkfs.lustre –fsname lustre –ost –mgsnode=192.168.1.185@tcp0 /dev/vg00/ost1

lustre-9

Run the above command for all the ost1-ost6 in similar way.
Verify the various logical volumes created as shown below:

#mount –t lustre /dev/vg00/ost1 /mnt/ost1
mkfs.lustre –fsname lustre –ost –mgsnode=192.168.1.185@tcp0 /dev/vg00/ost2
mkfs.lustre –fsname lustre –ost –mgsnode=192.168.1.185@tcp0 /dev/vg00/ost3
mkfs.lustre –fsname lustre –ost –mgsnode=192.168.1.185@tcp0 /dev/vg00/ost4
mkfs.lustre –fsname lustre –ost –mgsnode=192.168.1.185@tcp0 /dev/vg00/ost5
mkfs.lustre –fsname lustre –ost –mgsnode=192.168.1.185@tcp0 /dev/vg00/ost6

It’s time to start the OSS by mounting the OSTs to the corresponding mount point.

Finally the mount command displays the following logical volumes.

Verify the relative devices displays as shown:

This completes OSS1 configuration.
Follow the similar step for OSS2 (shown above). It is always recommended to perform striping over all the OSTs run this command on lustreclient:

#lfs setstripe -c -1 /mnt/lustre

Setting up Lustre Client #1
All client mounts to the same file system identified by the MDS. Use the following commands, specifying the IP address of the MDS server:
#mount –t lustre 192.168.1.185@tcp0:/lustre /mnt/lustre

You can use lfs utility to manage the entire files system information at the client system.(as shown below).

It shows that the overall file system size of /mnt/lustre is around 70GB. Striping of data is an important aspect of the scalability and performance of Lustre File System. The data gets stripped over the blocks of multiple OSTs. The stripe count can be set on a file system, directory or file level.
You can view the stripping details through the below command:

[root@lustreclient1 ~] # lfs getstripe /mnt/lustre
/mnt/lustre
Stripe_count: 1 stripe size: 1048576 stripe_offset: -1

Hence we have Lustre setup ready. Now, it’s time to run Hadoop over lustre instead of HDFS.
I assume that Hadoop is already running with 1 name node and 4 data nodes.
On the master node, let’s perform the following file configuration changes.

File: /usr/local/hadoop/conf/core-site.xml

core-site1

File: /usr/local/hadoop/conf/mapred-site.xml

mapred-1

On every slave nodes (data nodes),, let’s perform the following configuration changes

File: /usr/local/hadoop/conf/core-site.xml

core-site2

File: /usr/local/hadoop/conf/mapred-site.xml

mapred-2

We are good to start the hadoop related services.
On master node, run the mapred service without starting HDFS (since we are going to use lustre only) as shown below:

mapred_3

You can ensure the service running through the jps utility:

mapred-4

Start the tasktracker on the slave nodes through the following command:

mapred-5

You can just run a simple hadoop word count example (as shown below):

mapred-Last

Hence, we have successfully run Hadoop runs over Lustre.

Reference:
http://hadoop.apache.org/docs/r1.1.1/mapred_tutorial.html
http://wiki.lustre.org/index.php/Configuring_the_Lustre_File_System
http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf

Clap