Directions for using HOD, Hadoop On Demand, on the NMSU CS Bigdat cluster
written by Jonathan Cook, May 29, 2014, joncook@nmsu.edu

These are the basic, scripted directions. I am neither an expert on nor even a regular user of Hadoop, so I have no insight beyond the knowledge that this scripted example works. If you are going to use HOD regularly, you will need to become an HOD and Hadoop expert. I am sure there are many, many options and configuration possibilities that will help you!

1. In your home directory, make a ".hod" directory

>  mkdir .hod

2. Copy ~act/.hod/hodrc into your .hod directory

>  cp ~act/.hod/hodrc .hod/

3. Edit the hodrc file and change "act" to your username everywhere
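The substitution can also be scripted with sed. Below is a sketch on a scratch copy with made-up contents (your real hodrc will differ); it assumes GNU sed's -i flag, and uses word-boundary matching so that words merely containing "act" are left alone. On the real file you would run the same sed command against ~/.hod/hodrc, but review the result afterward.

```shell
# Demo on a scratch copy (hypothetical contents -- your hodrc will differ).
mkdir -p /tmp/hod-demo
printf 'userid = act\ntemp-dir = /tmp/hod-act\n' > /tmp/hod-demo/hodrc
# \b word boundaries (GNU sed) avoid touching words that merely contain "act"
sed -i 's/\bact\b/youruser/g' /tmp/hod-demo/hodrc
cat /tmp/hod-demo/hodrc
```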

4. Edit your .bashrc or .cshrc (whichever shell you are using) and add the directory /usr/share/hadoop/contrib/hod/bin/ to your PATH. E.g., add this line to the end of your .bashrc (in .cshrc the equivalent is "setenv PATH ${PATH}:/usr/share/hadoop/contrib/hod/bin/"):

export PATH=$PATH:/usr/share/hadoop/contrib/hod/bin/

Source your .bashrc to make the change current.
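A quick sanity check that the PATH change took effect in your current shell (the export line is the same one as above):

```shell
# Re-apply the PATH addition from above, then check that the "hod"
# script resolves; if it does not, re-check the directory spelling.
export PATH=$PATH:/usr/share/hadoop/contrib/hod/bin/
command -v hod && echo "hod found on PATH" || echo "hod NOT found -- re-check step 4"
```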

5. Make a directory in which you want Hadoop to manage its instance; you can call it anything you want and put it anywhere you want

>  mkdir hadoop

6. Use the "hod" command to create and start up a Hadoop instance, such as

>  hod allocate -d ~/hadoop/ -n 4

This uses Torque to allocate the requested number of nodes (4 in this case) and starts a Hadoop instance with a fresh HDFS filesystem. (You can probably tell it to use an existing, predefined filesystem with data already in it -- that would be set in the hodrc file, plus some other parameters that I don't know.) You can confirm that a Torque job is running by using the "qstat" command.

7. Use your Hadoop instance, always referencing your configuration directory using "--config <yourdirectory>". For example, we did this:

See that it works and list available commands:

>  hadoop --config ~/hadoop 

List root of our HDFS filesystem:

>  hadoop --config ~/hadoop fs -ls /

Put a file into HDFS as the name test.dat:

>  hadoop --config ~/hadoop fs -put ./hod-jcook.log /test.dat

See that it is there:

>  hadoop --config ~/hadoop fs -ls /

Run the sample wordcount Hadoop program on our file:

>  hadoop --config ~/hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar wordcount /test.dat /testout

See that the result directory "testout" was created by the wordcount program:

>  hadoop --config ~/hadoop fs -ls /

Look at what is in it:

>  hadoop --config ~/hadoop fs -ls /testout

Look at one of the output files that contains the wordcount:

>  hadoop --config ~/hadoop fs -cat /testout/part-r-00000
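If you are curious what the wordcount example computes, it can be previewed locally without Hadoop. This pure-shell sketch is my own illustration (not part of Hadoop); it produces the same kind of tab-separated word/count pairs, sorted by word, that you will see in part-r-00000:

```shell
# Local illustration of what the wordcount example computes:
# one "<word><TAB><count>" line per distinct word, sorted by word.
printf 'hadoop on demand\nhadoop on torque\n' > /tmp/sample.txt
tr -s ' ' '\n' < /tmp/sample.txt | sort | uniq -c | awk '{printf "%s\t%s\n", $2, $1}'
# prints:
# demand  1
# hadoop  2
# on      2
# torque  1
```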

8. When you are done with your Hadoop instance, deallocate it:

>  hod deallocate -d ~/hadoop

If you want to somehow keep your HDFS filesystem, you may need some extra options; I just don't know which. You will need to read up on HOD and Hadoop.

9. Use the "qstat" command to see whether your Torque job for the HOD instance has finished -- IF IT SHOWS UP, it is still running! Wait a little while, but if it does not finish on its own, YOU MUST DELETE IT MANUALLY. Our experience shows that HOD sometimes leaves the job running.

>  qstat
>  qdel 142   # 142 is an example job number
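If you script this cleanup, the job id can be pulled out of qstat's tabular output (a two-line header, then one row per job with the job id in column 1). The listing below is a hypothetical sample -- your site's real qstat output will differ in detail -- and the qdel call is left commented out:

```shell
# Hypothetical qstat listing for illustration; adjust the awk line
# if your site's qstat output has a different layout.
qstat_sample='Job id            Name      User   Time Use S Queue
----------------  --------- ------ -------- - -----
142.bigdat        HOD       jdoe   00:00:10 R batch'
jobid=$(printf '%s\n' "$qstat_sample" | awk 'NR>2 {print $1; exit}')
echo "would run: qdel $jobid"
# qdel "$jobid"    # uncomment to actually delete the lingering job
```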