In this tutorial, Michael will describe how to set up a single-node Hadoop cluster.
- What we want to do
- Prerequisites
- Sun Java 6
- Adding a dedicated Hadoop system user
- Configuring SSH
- Disabling IPv6
- Alternative
- Hadoop
- Installation
- Alternative
- Excursus: Hadoop Distributed File System (HDFS)
- Configuration
- hadoop-env.sh
- conf/*-site.xml
- Formatting the name node
- Starting your single-node cluster
- Stopping your single-node cluster
- Running a MapReduce job
- Download example input data
- Restart the Hadoop cluster
- Copy local example data to HDFS
- Run the MapReduce job
- Retrieve the job result from HDFS
- Hadoop Web Interfaces
- MapReduce Job Tracker Web Interface
- Task Tracker Web Interface
- HDFS Name Node Web Interface
- What’s next?
- Related Links
- Changelog
What we want to do
In this short tutorial, I will describe the required steps for setting up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux.
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and like Hadoop designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.
The main goal of this tutorial is to get a "simple" Hadoop installation up and running so that you can play around with the software and learn more about it.
This tutorial has been tested with the following software versions:
- Ubuntu Linux 10.04 LTS (deprecated: 8.10, 8.04 LTS, 7.10, 7.04)
- Hadoop 0.20.2, released February 2010 (deprecated: 0.13.x – 0.19.x)
You can find the time of the last document update at the very bottom of this page.
Prerequisites
Sun Java 6
Hadoop requires a working Java 1.5.x (aka 5.0.x) installation. However, using Java 1.6.x (aka 6.0.x aka 6) is recommended for running Hadoop. For the sake of this tutorial, I will therefore describe the installation of Java 1.6.
In Ubuntu 10.04 LTS, the package sun-java6-jdk has been dropped from the Multiverse section of the Ubuntu archive. You have to perform the following four steps to install the package.
1. Add the Canonical Partner Repository to your apt repositories:
$ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
2. Update the source list
$ sudo apt-get update
3. Install sun-java6-jdk
$ sudo apt-get install sun-java6-jdk
4. Select Sun’s Java as the default on your machine.
$ sudo update-java-alternatives -s java-6-sun
The full JDK will be placed in /usr/lib/jvm/java-6-sun (well, this directory is actually a symlink on Ubuntu).
After installation, make a quick check whether Sun’s JDK is correctly set up:
user@ubuntu:~$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
Adding a dedicated Hadoop system user
We will use a dedicated Hadoop user account for running Hadoop. While that’s not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop
This will add the user hadoop and the group hadoop to your local machine.
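If you want to verify that the account was created as expected, a quick check like the following can help (a minimal sketch; the exact uid/gid numbers and group list will differ on your machine):

$ id hadoop
uid=1001(hadoop) gid=1001(hadoop) groups=1001(hadoop)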
Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hadoop user we created in the previous section.
I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are several guides available.
First, we have to generate an SSH key for the hadoop user.
user@ubuntu:~$ su - hadoop
hadoop@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hadoop@ubuntu
The key's randomart image is:
[...snipp...]
hadoop@ubuntu:~$
The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).
Second, you have to enable SSH access to your local machine with this newly created key.
hadoop@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine with the hadoop user. The step is also needed to save your local machine’s host key fingerprint to the hadoop user’s known_hosts file. If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
hadoop@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
hadoop@ubuntu:~$
If the SSH connection fails, these general tips might help (a quick permission check is sketched after the list):
- Enable debugging with ssh -vvv localhost and investigate the error in detail.
- Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hadoop user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
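A common culprit is wrong permissions on the hadoop user's SSH files: with the default StrictModes setting, sshd refuses public key authentication if the .ssh directory or the authorized_keys file is group- or world-writable. A minimal sketch of tightening the permissions (assuming the default ~/.ssh location):

hadoop@ubuntu:~$ chmod 700 $HOME/.ssh
hadoop@ubuntu:~$ chmod 600 $HOME/.ssh/authorized_keys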
Disabling IPv6
One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of the Ubuntu box.
In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.
To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine in order to make the changes take effect.
You can check whether IPv6 is enabled on your machine with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).
Alternative
You can also disable IPv6 only for Hadoop, as documented in HADOOP-3437, by adding the following line to conf/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
Hadoop
Installation
You have to download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hadoop user and group, for example:
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop
$ sudo chown -R hadoop:hadoop hadoop
(just to give you the idea, YMMV – personally, I create a symlink from hadoop-0.20.2 to hadoop)
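If you prefer the symlink variant mentioned above, the installation could look roughly like this instead (just a sketch of the alternative, not an additional required step):

$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo ln -s hadoop-0.20.2 hadoop
$ sudo chown -R hadoop:hadoop hadoop-0.20.2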
Alternative
Update March 2010: I have been notified by some readers that they’ve run into problems using the Cloudera package for setting up multi-node Hadoop clusters according to my tutorials. Falling back to installing from source solved their problems.
Update June 2009: The folks over at Cloudera notified me that they have bundled up Hadoop as an open source Deb package. If you add their repository to APT, you can use apt-get to install the needed packages for Hadoop and related subprojects like Pig or Hive. According to Jeff Hammerbacher from Cloudera, they are actually working with the Canonical team to get these packages added to the vanilla distribution of Ubuntu.
Excursus: Hadoop Distributed File System (HDFS)
From The Hadoop Distributed File System: Architecture and Design:
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which is part of the Apache Lucene project.
The following picture gives an overview of the most important HDFS components.
Configuration
Our goal in this tutorial is a single-node setup of Hadoop. More information about what we do in this section is available on the Hadoop Wiki.
hadoop-env.sh
The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.
Change
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to
# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
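To double-check that JAVA_HOME points at a working JDK, a quick sanity check like the following can help (a sketch, assuming the Sun JDK path used above):

$ ls -ld /usr/lib/jvm/java-6-sun
$ /usr/lib/jvm/java-6-sun/bin/java -version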
conf/*-site.xml
Note: As of Hadoop 0.20.0, the configuration settings previously found in hadoop-site.xml were moved to core-site.xml (hadoop.tmp.dir, fs.default.name), mapred-site.xml (mapred.job.tracker) and hdfs-site.xml (dfs.replication).
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.
You can leave the settings below "as is" with the exception of the hadoop.tmp.dir variable which you have to change to the directory of your choice, for example /usr/local/hadoop-datastore/hadoop-${user.name}. Hadoop will expand ${user.name} to the system user which is running Hadoop, so in our case this will be hadoop and thus the final path will be /usr/local/hadoop-datastore/hadoop-hadoop.
Note: Depending on your choice of location, you might have to create the directory manually with sudo mkdir /your/path; sudo chown hadoop:hadoop /your/path (and maybe also sudo chmod 750 /your/path) in case the hadoop user does not have the required permissions to do so (otherwise, you will see a java.io.IOException when you try to format the name node in the next section).
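Spelled out for the example path used above, the directory preparation could look like this (a sketch; adjust the path to whatever you picked for hadoop.tmp.dir):

$ sudo mkdir -p /usr/local/hadoop-datastore/hadoop-hadoop
$ sudo chown hadoop:hadoop /usr/local/hadoop-datastore/hadoop-hadoop
$ sudo chmod 750 /usr/local/hadoop-datastore/hadoop-hadoop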
In file conf/core-site.xml (note that these <property> elements, and those in the two files below, belong between the <configuration> and </configuration> tags):
<!-- In: conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml:
<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
In file conf/hdfs-site.xml:
<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
See Getting Started with Hadoop and the documentation in Hadoop’s API Overview if you have any questions about Hadoop’s configuration options.
Formatting the name node
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem; this will erase all your data.
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
hadoop@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
The output will look like this:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hadoop/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hadoop@ubuntu:/usr/local/hadoop$
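If you used the example hadoop.tmp.dir from the configuration section, you can verify that the name node image was written where you expect it (a sketch; adjust the path if you chose a different directory):

hadoop@ubuntu:~$ ls -l /usr/local/hadoop-datastore/hadoop-hadoop/dfs/name/current/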
Starting your single-node cluster
Run the command:
hadoop@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
This will start up a NameNode, a DataNode, a SecondaryNameNode, a JobTracker and a TaskTracker on your machine.
The output will look like this:
hadoop@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-ubuntu.out
hadoop@ubuntu:/usr/local/hadoop$
A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0). See also How to debug MapReduce programs.
hadoop@ubuntu:/usr/local/hadoop$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
You can also check with netstat if Hadoop is listening on the configured ports.
hadoop@ubuntu:~$ sudo netstat -plten | grep java
tcp   0   0 0.0.0.0:50070     0.0.0.0:*   LISTEN   1001   9236   2471/java
tcp   0   0 0.0.0.0:50010     0.0.0.0:*   LISTEN   1001   9998   2628/java
tcp   0   0 0.0.0.0:48159     0.0.0.0:*   LISTEN   1001   8496   2628/java
tcp   0   0 0.0.0.0:53121     0.0.0.0:*   LISTEN   1001   9228   2857/java
tcp   0   0 127.0.0.1:54310   0.0.0.0:*   LISTEN   1001   8143   2471/java
tcp   0   0 127.0.0.1:54311   0.0.0.0:*   LISTEN   1001   9230   2857/java
tcp   0   0 0.0.0.0:59305     0.0.0.0:*   LISTEN   1001   8141   2471/java
tcp   0   0 0.0.0.0:50060     0.0.0.0:*   LISTEN   1001   9857   3005/java
tcp   0   0 0.0.0.0:49900     0.0.0.0:*   LISTEN   1001   9037   2785/java
tcp   0   0 0.0.0.0:50030     0.0.0.0:*   LISTEN   1001   9773   2857/java
hadoop@ubuntu:~$
If there are any errors, examine the log files in the logs/ directory (i.e. /usr/local/hadoop/logs/ if you followed this tutorial).
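For example, to look at the most recent name node log messages you could tail the corresponding log file (a sketch; the exact file name contains your user and host name, as shown in the start-all.sh output above):

hadoop@ubuntu:/usr/local/hadoop$ tail -n 50 logs/hadoop-hadoop-namenode-ubuntu.log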
Stopping your single-node cluster
Run the command
hadoop@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
to stop all the daemons running on your machine.
Exemplary output:
hadoop@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hadoop@ubuntu:/usr/local/hadoop$
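After stop-all.sh has finished, jps should no longer list any Hadoop daemons; only the jps process itself remains (a quick sanity check, the PID will differ on your machine):

hadoop@ubuntu:/usr/local/hadoop$ jps
2979 Jps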
Running a MapReduce job
We will now run our first Hadoop MapReduce job. We will use the WordCount example job which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information about what happens behind the scenes is available at the Hadoop Wiki.
Download example input data
We will use three ebooks from Project Gutenberg for this example:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
Download each ebook as a plain text file in us-ascii encoding and store the uncompressed files in a temporary directory of your choice, for example /tmp/gutenberg.
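A rough sketch of how /tmp/gutenberg could be populated; the download URL below is only a placeholder, so grab the plain-text versions of the three ebooks from the Project Gutenberg website yourself:

$ mkdir -p /tmp/gutenberg
$ cd /tmp/gutenberg
$ # wget <plain-text URL of each ebook from gutenberg.org>
$ ls -l /tmp/gutenberg/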
hadoop@ubuntu:~$ ls -l /tmp/gutenberg/
total 3592
-rw-r--r-- 1 hadoop hadoop  674425 2007-01-22 12:56 20417-8.txt
-rw-r--r-- 1 hadoop hadoop 1423808 2006-08-03 16:36 7ldvc10.txt
-rw-r--r-- 1 hadoop hadoop 1561677 2004-11-26 09:48 ulyss12.txt
hadoop@ubuntu:~$
Restart the Hadoop cluster
Restart your Hadoop cluster if it’s not running already.
hadoop@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2010-05-08 17:40 /user/hadoop/gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg
Found 3 items
-rw-r--r--   1 hadoop supergroup     674762 2010-05-08 17:40 /user/hadoop/gutenberg/20417.txt
-rw-r--r--   1 hadoop supergroup    1573044 2010-05-08 17:40 /user/hadoop/gutenberg/4300.txt
-rw-r--r--   1 hadoop supergroup    1391706 2010-05-08 17:40 /user/hadoop/gutenberg/7ldvc10.txt
hadoop@ubuntu:/usr/local/hadoop$
Run the MapReduce job
Now, we actually run the WordCount example job.
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount gutenberg gutenberg-output
This command will read all the files in the HDFS directory gutenberg, process them, and store the result in the HDFS directory gutenberg-output.
Exemplary output of the previous command in the console:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount gutenberg gutenberg-output
10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
10/05/08 17:43:02 INFO mapred.JobClient:  map 0% reduce 0%
10/05/08 17:43:14 INFO mapred.JobClient:  map 66% reduce 0%
10/05/08 17:43:17 INFO mapred.JobClient:  map 100% reduce 0%
10/05/08 17:43:26 INFO mapred.JobClient:  map 100% reduce 100%
10/05/08 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
10/05/08 17:43:28 INFO mapred.JobClient:   Job Counters
10/05/08 17:43:28 INFO mapred.JobClient:     Launched reduce tasks=1
10/05/08 17:43:28 INFO mapred.JobClient:     Launched map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient:     Data-local map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient:   FileSystemCounters
10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_READ=2214026
10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_READ=3639512
10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3687918
10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880330
10/05/08 17:43:28 INFO mapred.JobClient:   Map-Reduce Framework
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input groups=82290
10/05/08 17:43:28 INFO mapred.JobClient:     Combine output records=102286
10/05/08 17:43:28 INFO mapred.JobClient:     Map input records=77934
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce shuffle bytes=1473796
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce output records=82290
10/05/08 17:43:28 INFO mapred.JobClient:     Spilled Records=255874
10/05/08 17:43:28 INFO mapred.JobClient:     Map output bytes=6076267
10/05/08 17:43:28 INFO mapred.JobClient:     Combine input records=629187
10/05/08 17:43:28 INFO mapred.JobClient:     Map output records=629187
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input records=102286
Check if the result is successfully stored in HDFS directory gutenberg-output:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2010-05-08 17:40 /user/hadoop/gutenberg
drwxr-xr-x   - hadoop supergroup          0 2010-05-08 17:43 /user/hadoop/gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg-output
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2010-05-08 17:43 /user/hadoop/gutenberg-output/_logs
-rw-r--r--   1 hadoop supergroup     880330 2010-05-08 17:43 /user/hadoop/gutenberg-output/part-r-00000
hadoop@ubuntu:/usr/local/hadoop$
If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the "-D" option:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount -D mapred.reduce.tasks=16 gutenberg gutenberg-output
An important note about mapred.map.tasks: Hadoop treats this setting only as a hint, whereas it accepts the user-specified mapred.reduce.tasks value and does not change it. In other words, you cannot force mapred.map.tasks, but you can set mapred.reduce.tasks.
Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat gutenberg-output/part-r-00000
to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.
hadoop@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge gutenberg-output /tmp/gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
"(Lo)cra"   1
"1490       1
"1498,"     1
"35"        1
"40,"       1
"A          2
"AS-IS".    1
"A_         1
"Absoluti   1
"Alack!     1
hadoop@ubuntu:/usr/local/hadoop$
Note that in this specific output the quote signs (") enclosing the words in the head output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the part-r-00000 file further to see it for yourself.
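Because each output line is simply a word and its count separated by a tab, standard Unix tools are handy for a quick look at the results, for example at the most frequent words (a sketch, using the merged file created above):

hadoop@ubuntu:/usr/local/hadoop$ sort -k2 -nr /tmp/gutenberg-output/gutenberg-output | head -20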
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see the default settings in core-default.xml, hdfs-default.xml and mapred-default.xml) available at these locations:
- http://localhost:50030/ – web UI for MapReduce job tracker(s)
- http://localhost:50060/ – web UI for task tracker(s)
- http://localhost:50070/ – web UI for HDFS name node(s)
These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.
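If you are working on a headless server without a browser, you can still verify that the web interfaces are up, for example with curl (a sketch; install curl first if it is missing):

$ curl -s http://localhost:50030/ | head
$ curl -s http://localhost:50070/ | head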
MapReduce Job Tracker Web Interface
The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the "local machine's" Hadoop log files (the machine the web UI is running on).
By default, it’s available at http://localhost:50030/.
Task Tracker Web Interface
The task tracker web UI shows you running and non-running tasks. It also gives access to the "local machine's" Hadoop log files.
By default, it’s available at http://localhost:50060/.
HDFS Name Node Web Interface
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the "local machine's" Hadoop log files.
By default, it’s available at http://localhost:50070/.
What’s next?
If you’re feeling comfortable, you can continue your Hadoop experience with my follow-up tutorial Running Hadoop On Ubuntu Linux (Multi-Node Cluster) where I describe how to build a Hadoop "multi-node" cluster with two Ubuntu boxes (this will increase your current cluster size by 100% :-P).
In addition, I wrote a tutorial on how to code a simple MapReduce job in the Python programming language which can serve as the basis for writing your own MapReduce programs.
Related Links
From other people:
- Hadoop home page
- Project Description @ Hadoop Wiki
- Getting Started with Hadoop @ Hadoop Wiki
- How to debug MapReduce programs @ Hadoop Wiki
- Hadoop API Overview
Changelog
Only major changes are listed here. For the full changelog, click on the “History” link in the footer at the very bottom of this web page.
- 2010-05-08: updated tutorial for Hadoop 0.20.2 and Ubuntu 10.04 LTS
- 2009-11-16: clarified configuration sections
- 2009-07-06: tested tutorial with Hadoop 0.20.0
- 2009-01-04: tested tutorial with Hadoop 0.19.0
- 2008-08-31: tested tutorial with Hadoop 0.18.0
- 2008-07-14: tested tutorial with Hadoop 0.17.1
- 2008-03-03: tested tutorial with Hadoop 0.16.0
- 2008-01-09: tested tutorial with Hadoop 0.15.2
- 2007-10-26: updated tutorial for Hadoop 0.14.2 (formerly 0.14.1)
- 2007-09-26: added screenshots of Hadoop web interfaces
- 2007-09-21: updated tutorial for Hadoop 0.14.1 (formerly 0.13.0)