Publish Spark Streaming System and Application Metrics From AWS EMR to Datadog - Part 2

This post is the second part in the series on getting an AWS EMR cluster running a Spark Streaming application ready for production by enabling monitoring. In the first part of this series we looked at how to publish EMR-specific metrics to the Datadog service. In this post I will show you how to set up your EMR cluster to enable the Datadog Spark check, which publishes Spark driver, executor, and RDD metrics that can be graphed on a Datadog dashboard.

To accomplish this task, we will leverage EMR Bootstrap actions. From the AWS Documentation:
You can use a bootstrap action to install additional software on your cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Amazon EMR installs specified applications and the node begins processing data. If you add nodes to a running cluster, bootstrap actions run on those nodes also. You can create custom bootstrap actions and specify them when you create your cluster.
This is a two-step process.
  1. Install the Datadog agent on each node in the EMR cluster.
  2. Configure the Datadog agent on the master node to run the Spark check at regular intervals and publish Spark metrics.

I have created a gist for each of the two steps. The first script is launched as a bootstrap action during EMR cluster launch; it downloads and installs the Datadog agent on each node of the cluster. Simple! It then executes the second script as a background process.
#!/bin/bash
# Clean install Datadog agent
sudo yum -y erase datadog-agent
sudo rm -rf /etc/dd-agent
# INPUT: Datadog account key
DD_API_KEY=$1 bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/dd-agent/master/packaging/datadog-agent/source/install_agent.sh)"
sudo /etc/init.d/datadog-agent info
# INPUT: EMR Cluster name used to tag metrics
CLUSTER_NAME=$2
# INPUT: Env name e.g. stage or prod, used to tag metrics
INSTANCE_TAG=$3
# INPUT: S3 bucket name where spark check configuration script for datadog agent is uploaded
S3_BUCKET=$4
# Spark check configuration script path in above S3 bucket
S3_LOCATION_SPARK_CHECK_SETUP_SCRIPT="s3://${S3_BUCKET}/bootstrap-actions/"
SCRIPT_NAME="emr-bootstrap-datadog-spark-check-setup.sh"
# Copy the spark check configuration script from S3 to current path
aws s3 cp ${S3_LOCATION_SPARK_CHECK_SETUP_SCRIPT}${SCRIPT_NAME} .
# Make the script executable
chmod +x ${SCRIPT_NAME}
# Bootstrap step occurs on EMR before any software is configured.
# Software configuration is a pre-requisite in order to successfully setup the datadog spark check setup
# Allow bootstrap to complete, so that software configuration can proceed.
./${SCRIPT_NAME} ${CLUSTER_NAME} ${INSTANCE_TAG} > spark_check.out 2>&1 &

The second script, emr-bootstrap-datadog-spark-check-setup.sh, runs on every node, but performs the Spark check configuration only on the master node of the cluster:
#!/bin/bash
IS_MASTER=false
if [ $(grep "\"isMaster\": true" /mnt/var/lib/info/instance.json -wc) = 1 ]; then
echo "Running on the master node."
IS_MASTER=true
fi
# Execute spark check configuration only on master node of EMR cluster
if [ "$IS_MASTER" = true ]; then
# Datadog-Spark Integration
# https://docs.datadoghq.com/integrations/spark/
YARN_SITE_XML_LOCATION="/etc/hadoop/conf/yarn-site.xml"
YARN_PROPERTY="resourcemanager.hostname"
DD_AGENT_CONF_DIR="/etc/dd-agent/conf.d"
SPARK_YAML_FILE="${DD_AGENT_CONF_DIR}/spark.yaml"
# Commandline Parameters
CLUSTER_NAME=$1
INSTANCE_TAG=$2
CLUSTER_NAME_WITH_ENV_SUFFIX=`echo ${CLUSTER_NAME}-${INSTANCE_TAG}`
# Wait until yarn-site.xml is available
while [ ! -f ${YARN_SITE_XML_LOCATION} ]
do
sleep 1
done
#Debug
echo "DEBUG: Found: ${YARN_SITE_XML_LOCATION}"
cat ${YARN_SITE_XML_LOCATION}
# Wait until yarn-site.xml has expected content
while [ -z `cat ${YARN_SITE_XML_LOCATION} | grep ${YARN_PROPERTY}` ]
do
sleep 1
done
#Debug
cat ${YARN_SITE_XML_LOCATION} | grep ${YARN_PROPERTY}
# Read the Yarn resource manager hostname to create value for spark_url
YARN_RM_HOSTNAME_RAW=`cat ${YARN_SITE_XML_LOCATION} | grep -A1 ${YARN_PROPERTY} | grep value`
YARN_RM_HOSTNAME=`echo ${YARN_RM_HOSTNAME_RAW}|sed -e 's-value--g' -e 's-<--g' -e 's->--g' -e 's-\/-:-g'`
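# The sed above strips 'value', '<' and '>' and turns the '/' of the closing tag into ':',
# so YARN_RM_HOSTNAME ends with 'hostname:'; appending the port below yields http://hostname:8088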
SPARK_URL=`echo http://${YARN_RM_HOSTNAME}8088`
#Debug
echo "DEBUG: Constructed spark_url: ${SPARK_URL}"
# Create the spark.yaml contents in home directory
cat > spark.yaml << EOL
init_config:

instances:
  - spark_url: ${SPARK_URL}
    cluster_name: ${CLUSTER_NAME_WITH_ENV_SUFFIX}
    spark_cluster_mode: spark_yarn_mode
    tags:
      - instance: ${INSTANCE_TAG}
EOL
#Debug
ls -l spark.yaml
cat spark.yaml
# Set permissions to move spark.yaml to datadog agent conf.d and reset permissions
sudo chmod 665 ${DD_AGENT_CONF_DIR}
sudo mv spark.yaml ${DD_AGENT_CONF_DIR}
sudo chmod 644 ${SPARK_YAML_FILE}
sudo chown dd-agent:dd-agent ${SPARK_YAML_FILE}
sudo chown dd-agent:dd-agent ${DD_AGENT_CONF_DIR}
sudo chmod 755 ${DD_AGENT_CONF_DIR}
sudo /etc/init.d/datadog-agent stop
sudo /etc/init.d/datadog-agent start
sudo /etc/init.d/datadog-agent info
fi

Why do we need to run the configuration as a second step?

Remember ☝ that bootstrap actions run before any applications are installed on the EMR nodes. In the first step we installed new software. The second step, however, requires that YARN and Spark are already installed before the Datadog configuration can be completed.

yarn-site.xml does not exist at the time the Datadog agent is installed, hence we launch the Spark check setup script as a background process. It waits until yarn-site.xml is created and contains a value for the YARN property 'resourcemanager.hostname'. Once found, it creates the spark.yaml file and moves it under /etc/dd-agent/conf.d, sets the appropriate ownership and permissions on spark.yaml, and restarts the Datadog agent. Restarting the agent picks up the new configuration, and the agent's info subcommand can be used to verify that the Spark check is running. πŸ™†
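For reference, here is a minimal sketch of what the generated file looks like on the master node once the variables are resolved; the hostname, cluster name, and environment tag shown are hypothetical examples:

# Inspect the generated Spark check configuration (hypothetical values shown)
sudo cat /etc/dd-agent/conf.d/spark.yaml
# init_config:
#
# instances:
#   - spark_url: http://ip-10-0-0-123.ec2.internal:8088
#     cluster_name: my-streaming-cluster-prod
#     spark_cluster_mode: spark_yarn_mode
#     tags:
#       - instance: prod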

Add Custom Bootstrap Actions

There are three ways to launch an EMR cluster, and bootstrap actions can be specified with each of them. Refer to the AWS guide for invoking bootstrap actions when launching a cluster from the AWS Console or via the AWS CLI; a minimal CLI sketch is shown below for reference. After that, I have included a gist showing our specific bootstrap action script invocation when launching the EMR cluster programmatically.
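A minimal sketch of the CLI route, assuming the install script from above has been uploaded to an S3 bucket; the bucket name, cluster settings, and API key placeholder are illustrative, and the Args list mirrors the positional parameters the install script expects (API key, cluster name, environment, S3 bucket):

# Minimal sketch: launch a cluster with the Datadog install bootstrap action attached.
# Bucket name, release label, instance settings, and YOUR_DD_API_KEY are placeholders.
aws emr create-cluster \
  --name "my-streaming-cluster" \
  --release-label emr-5.8.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --log-uri s3://my-bucket/emr-logs/ \
  --bootstrap-actions Path=s3://my-bucket/bootstrap-actions/emr-bootstrap-datadog-install.sh,Name=DatadogInstaller,Args=[YOUR_DD_API_KEY,my-streaming-cluster,prod,my-bucket]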
val newClusterJobFlowRequest = new RunJobFlowRequest()
newClusterJobFlowRequest
  .withBootstrapActions(configureBootstrapActions(config): _*)
  .withLogUri(logUri)
  .with...

private def configureBootstrapActions(emrConfig: Config): Seq[BootstrapActionConfig] = {
  val scriptAbsolutePath = s"s3://${emrConfig.s3Bucket.bucketName}/bootstrap-actions/emr-bootstrap-datadog-install.sh"
  // Arguments are passed positionally to the bootstrap script
  // (note: the install script above also expects the Datadog API key as its first argument)
  val scriptBootstrapActionConfig = new ScriptBootstrapActionConfig()
    .withPath(scriptAbsolutePath)
    .withArgs(emrConfig.cluster_name,
      emrConfig.stage_env,
      emrConfig.s3Bucket.bucketName)
  val bootstrapAction = new BootstrapActionConfig()
    .withScriptBootstrapAction(scriptBootstrapActionConfig)
    .withName("DatadogInstaller")
  List(bootstrapAction)
}

Validation

Finally, to confirm that the bootstrap actions completed successfully, check the EMR logs in the S3 log directory you specified while launching the cluster. Bootstrap action logs can be found under a path like <S3_BUCKET>/<emr_cluster_log_folder>/<emr_cluster_id>/node/<instance_id>/bootstrap-actions.
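For example, the AWS CLI makes it easy to list and fetch these logs (the bucket name, log folder, cluster ID, and instance ID below are placeholders):

# List and download the bootstrap action logs for one node (placeholder bucket and IDs)
aws s3 ls s3://MY_BUCKET/emr-logs/j-1ABC2DEF3GHIJ/node/i-0123456789abcdef0/bootstrap-actions/ --recursive
aws s3 cp s3://MY_BUCKET/emr-logs/j-1ABC2DEF3GHIJ/node/i-0123456789abcdef0/bootstrap-actions/ . --recursive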

Within a few minutes of deploying your Spark Streaming application on this cluster, you should also start receiving the Spark metrics in Datadog, as shown in the screenshot below:


Another way to validate is to SSH into an EMR instance and execute
sudo /etc/init.d/datadog-agent info

In the output, you should see the spark check being run, as follows:

Now that we have the Datadog agent installed on the driver and executor nodes of the EMR cluster, we have done the groundwork to publish metrics from our application to Datadog. In the next part of this series, I will demonstrate how to publish metrics from your application code.

If you have questions or suggestions, please leave a comment below.
