With the start of a new month comes some new numbers out of Valve as part of their Steam hardware/software survey…
Conky is a light-weight system monitoring tool combined with a fully-customized desktop theme, which can completely change your Linux desktop experience. Using Conky, you will get a fully personalized desktop theme, populated with an eye-catching smart clock, current date and time, as well as the current status of your Linux…
Learning map/reduce frameworks like Hadoop is a useful skill to have but not everyone has the resources to implement and test a full system. Thanks to cheap arm based boards, it is now more feasible for developers to set up a full Hadoop cluster. I coupled my existing knowledge of setting up and running single node Hadoop installs with my BeagleBone cluster from my previous post to create my second project. This tutorial goes through the steps I took to set up and run Hadoop on my Ubuntu cluster. It may not be a practical application for everyone to learn but currently distributed map/reduce experience is a good skill to have. All the machines in my cluster are already set up with Java and SSH from my first project so you may need to install them if you don’t have them.
The first step naturally is to download Hadoop from apache’s site on each machine in the cluster and untar it. I used version 1.2.1 but version 2.0 and above is now available. I placed the resulting files in /usr/local and named the directory hadoop. With the files on the system, we can create a new user called hduser to actual run our jobs and a group for it called hadoop:
sudo addgroup hadoop sudo adduser --ingroup hadoop hduser
With the user created, we will make hduser the owner of the directory containing hadoop:
sudo chown -R hduser:hadoop /usr/local/hadoop
Then we create a temp directory to hold files and make the hduser the owner of it as well:
sudo mkdir -p /hadooptemp sudo chown hduser:hadoop /hadooptemp
With the directories set up, log in as hduser. We will start by updating the .bashrc file in the home directory. We need to add two export lines at the top to point to our hadoop and java locations. My java installation was openjdk7 but yours may be different:
# Set Hadoop-related environment variables export HADOOP_HOME=/usr/local/hadoop # Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on) export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf # Add Hadoop bin/ directory to PATH export PATH=$PATH:$HADOOP_HOME/bin
Next we can navigate to our hadoop installation directory and locate the conf directory. Once there we need to edit the hadoop-env.sh file and uncomment the java line to point to the location of the java installation again:
Next we can update core-site.xml to point to the temp location we created above and specify the root of the file system. Note that the name in the default url is the name of the master node:
<description>Root temporary directory.</description>
<description>URI pointing to the default file system.</description>
Next we can edit mapred-site.xml:
mapred.job.trackerbeaglebone1:54311The host and port that the MapReduce job tracker runs at.
Finally we can edit hdfs-site.xml to list how many replication nodes we want. In this case I chose all 3:
dfs.replication3Default block replication.
Once this has been done on all machines we can edit some configuration files on the master to tell it which nodes are slaves. Start by editing the masters file to make sure it just contains our host name:
Now edit the slaves file to list all the nodes in our cluster:
beaglebone1 beaglebone2 beaglebone3
Next we need to make sure that our master can communicate to its slaves. Make sure hosts file on your nodes contains the names of all the nodes in your cluster. Now on the master, create a key and copy it to the authorized_keys:
ssh-keygen -t rsa -P "" cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Then copy the key to other nodes:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@beaglebone2
Finally we can test the connections from the master to itself and others using ssh.
The first step is to format HDFS on the master. From the bin directory in hadoop run the following:
hadoop namenode -format
Once that is finished Now we can start the NameNode and JobTracker daemons on the master. We simply need to execute these two commands from the bin directory:
To stop them later we can use:
Running an Example
With Hadoop running on the master and slaves, we can test out one of the examples. First we need to create some files on our system. Create some text files in a directory of your chosing with a few words in each. We can then copy the files to hdfs. From the main hadoop directory, execute the following:
bin/hadoop dfs -mkdir /user/hduser/test bin/hadoop dfs -copyFromLocal /tmp/*.txt /user/hduser/test
The first line will call mkdir on HDFS to create a directory /user/hduser/test. The second line will copy files I created in /tmp to the new HDFS directory. Now we can run the wordcount sample against it:
bin/hadoop hadoop-examples-1.2.1.jar wordcount /user/hduser/test /user/hduser/testout
The jar file name will vary based on what version of hadoop you downoaded. Once the job is finished, it will output the results in HDFS to /user/hduser/testout. To view the resulting files we can do this:
bin/hadoop dfs -ls /user/hduser/testout
We can then use the cat command to show the contents of the output:
bin/hadoop dfs -cat /user/hduser/testout/part-r-00000
This file will show us each word found and the number of times it was found. If we want to see proof that the job ran on all nodes, we can view the logs on the slaves from the hadoop/logs directory. For example, on the beaglebone2 node I can do this:
When I examined the file, I could see messages at the end showing the jobname and data received and sent, letting me know that all was well.
If you through all of this and it worked, congratulations on setting up a working cluster. Due to the slow performance of the BeagleBone’s SD card, it is not the best device for getting actual work done. However, these steps are applicable to faster arm devices as they come along. In the meantime, the BeagleBone Black is a great platform for practice and learning how to set up distributed systems.
Midmarket companies often have neither the resource nor the staff to construct their own complete solutions and are increasingly seeking cloud computing solutions. An IBM sponsored Cloud Forum brought together journalists, analysts, developers and midmarket executives to discuss the future of cloud computing.