Getting Started: HDFS
This section describes how to get started with a pre-instrumented version of HDFS that has all of the instrumentation libraries added to it. We will run a version of Hadoop 2.7.2 instrumented with Baggage, and run an example X-Trace, Retro, and Pivot Tracing application.
Clone the tracing-framework git repository:
git clone git@github.com:brownsys/tracing-framework.git
From the directory you cloned into, build and install with the following command:
mvn clean package install -DskipTests
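If the build succeeds, the pub/sub and X-Trace server used later in this guide should have been assembled. A quick sanity check is to confirm that the launcher script exists:
ls tracingplane/pubsub/target/appassembler/bin/server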
Next, download our pre-instrumented fork of Hadoop 2.7.2:
git clone git@github.com:brownsys/hadoop.git
We want to use the branch brownsys-pivottracing-2.7.2, which should be the default branch.
Build and install Hadoop using the following command:
mvn clean package install -Pdist -DskipTests -Dmaven.javadoc.skip="true"
If you encounter any problems while building, try building the non-instrumented branch branch-2.7.2. This will determine whether it is a Hadoop build issue or a problem with our extra instrumentation.
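For example, a minimal way to test this (assuming the stock branch builds with the same flags; run from the Hadoop repository):
git checkout branch-2.7.2
mvn clean package install -Pdist -DskipTests -Dmaven.javadoc.skip="true"
git checkout brownsys-pivottracing-2.7.2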
Configuring HDFS
The following instructions will configure a minimally working version of HDFS. From the base directory of the Hadoop git repository, the built version of Hadoop will be located in hadoop-dist/target/hadoop-2.7.2. Within the build directory, etc/hadoop contains the default config. Copy this directory to somewhere outside of the HDFS build directory, otherwise any changes you make will be overwritten whenever you rebuild HDFS. Edit core-site.xml to the following:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
</configuration>
Set the following two environment variables:
HADOOP_HOME to the build directory, e.g. hadoop-dist/target/hadoop-2.7.2
HADOOP_CONF_DIR to the location of your copied HDFS config.
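For example, a minimal sketch (the paths below are hypothetical; substitute the location of your Hadoop clone and the directory you want to copy the config into):
export HADOOP_HOME=/path/to/hadoop/hadoop-dist/target/hadoop-2.7.2
cp -r $HADOOP_HOME/etc/hadoop /path/to/my-hadoop-conf    # copy the default config out of the build tree
export HADOOP_CONF_DIR=/path/to/my-hadoop-conf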
Running PubSub
All of the instrumentation libraries use a pub/sub system to communicate. Agents running in the instrumented Hadoop receive commands over pub/sub and send their output back over it. The pubsub project is in tracingplane/pubsub.
The following command will start an X-Trace server, which also includes a pub/sub server:
tracingplane/pubsub/target/appassembler/bin/server
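If the server runs in the foreground in your terminal, you can either leave it running in its own terminal or push it into the background; for example (the log file name here is arbitrary):
nohup tracingplane/pubsub/target/appassembler/bin/server > xtrace-server.log 2>&1 &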
To check which Java processes are running, use the jps command. This is an easy way to check whether something has crashed. You should expect to see a process called StandaloneServer.
Start HDFS
From the base directory of the Hadoop git repository, the built version of Hadoop will be located in hadoop-dist/target/hadoop-2.7.2. Within the build directory, bin contains various command line utilities, while sbin contains some useful scripts for starting and stopping processes. Before starting HDFS, we must format its data directory:
${HADOOP_HOME}/bin/hdfs --config $HADOOP_CONF_DIR namenode -format
Then start an HDFS NameNode and HDFS DataNode:
${HADOOP_HOME}/sbin/hadoop-daemon.sh start namenode
${HADOOP_HOME}/sbin/hadoop-daemon.sh start datanode
You should expect to see messages in the output stream along the lines of the following:
Pivot Tracing initialized
Resource reporting executor started
ZmqReporter: QUEUE-disk -> queue
ZmqReporter: DISK- -> disk
ZmqReporter: CPU- -> cpu
/META-INF/lib/libthreadcputimer.dylib extracted to temporary file /var/folders/d3/38f0syys5yjcvynz8p4n3t4w0000gn/T/jni_file_2410736199844535966.dll
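Because hadoop-daemon.sh starts the NameNode and DataNode in the background, these messages end up in the daemon log files rather than on your terminal. By default the logs are typically written under ${HADOOP_HOME}/logs (or HADOOP_LOG_DIR if set), with file names that depend on your user and host name; for example:
ls ${HADOOP_HOME}/logs
tail ${HADOOP_HOME}/logs/hadoop-*-datanode-*.log ${HADOOP_HOME}/logs/hadoop-*-datanode-*.out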
Again, check which Java processes are running using the jps command. In addition to XTraceServer, you should expect to see NameNode and DataNode.
For reference, HDFS runs a Web UI by default at localhost:50070.
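As a quick sanity check that HDFS is up, you can run a few filesystem commands against it (the /smoketest path here is arbitrary):
${HADOOP_HOME}/bin/hdfs --config $HADOOP_CONF_DIR dfs -mkdir -p /smoketest
${HADOOP_HOME}/bin/hdfs --config $HADOOP_CONF_DIR dfs -put ${HADOOP_CONF_DIR}/core-site.xml /smoketest/
${HADOOP_HOME}/bin/hdfs --config $HADOOP_CONF_DIR dfs -ls /smoketest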
Optional: Start HDFS with multiple DataNodes
A more interesting HDFS setup will have you running one NameNode process and multiple DataNode processes. This can be done on the same machine; however, you must specify a separate configuration for each DataNode you start (otherwise they might try to use the same data directory, for example). The following script will start multiple DataNodes. Change the NUM_DATANODES and BASE_DATA_DIR variables as appropriate.
NUM_DATANODES=3;                  # number of DataNode processes to start
BASE_DATA_DIR=/Users/jon/deploy   # change to a writable directory on your machine
echo "===== Starting HDFS with $NUM_DATANODES datanodes =====";
${HADOOP_HOME}/sbin/hadoop-daemon.sh start namenode;
for i in $(seq $NUM_DATANODES)
do
    # Give each DataNode its own log, pid, and data directories, and its own ports
    export HADOOP_LOG_DIR=$BASE_DATA_DIR/logs/datanode_$i
    export HADOOP_PID_DIR=$BASE_DATA_DIR/pid/datanode_$i
    export HADOOP_OPTS="\
    -Dhadoop.tmp.dir=$BASE_DATA_DIR/data/datanode_$i \
    -Ddfs.datanode.address=0.0.0.0:5001$i \
    -Ddfs.datanode.http.address=0.0.0.0:5008$i \
    -Ddfs.datanode.ipc.address=0.0.0.0:5002$i"
    # Pass the per-datanode -D options to the datanode command as well
    ${HADOOP_HOME}/sbin/hadoop-daemon.sh --script bin/hdfs start datanode $HADOOP_OPTS
done
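To stop the DataNodes started by this script, hadoop-daemon.sh needs to find each process's pid file, so the same per-datanode HADOOP_PID_DIR must be exported before each stop. A corresponding sketch (assuming NUM_DATANODES and BASE_DATA_DIR are set to the same values as above):
for i in $(seq $NUM_DATANODES)
do
    export HADOOP_PID_DIR=$BASE_DATA_DIR/pid/datanode_$i
    ${HADOOP_HOME}/sbin/hadoop-daemon.sh --script bin/hdfs stop datanode
done
${HADOOP_HOME}/sbin/hadoop-daemon.sh stop namenode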