Publish Spark Streaming System and Application Metrics From AWS EMR to Datadog - Part 2
This post is the second part in a series on getting an AWS EMR cluster running a Spark Streaming application ready for production by enabling monitoring.
In the first part of this series we looked at how to publish EMR-specific metrics to the Datadog service. In this post I will show you how to set up your EMR cluster to enable the Datadog spark check, which publishes Spark driver, executor, and RDD metrics so they can be graphed on a Datadog dashboard.
To accomplish this task, we will leverage EMR Bootstrap actions. From the AWS Documentation:
You can use a bootstrap action to install additional software on your cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Amazon EMR installs specified applications and the node begins processing data. If you add nodes to a running cluster, bootstrap actions run on those nodes also. You can create custom bootstrap actions and specify them when you create your cluster.
This is a two-step process:
- Install the Datadog agent on each node in the EMR cluster.
- Configure the Datadog agent on the master node to run the spark check at regular intervals and publish Spark metrics.
I have created a gist for each of the two steps. The first script is launched by the bootstrap action during cluster launch; it downloads and installs the Datadog agent on each node of the cluster. Simple! It then executes the second script as a background process.
#!/bin/bash
# Clean install of the Datadog agent
sudo yum -y erase datadog-agent
sudo rm -rf /etc/dd-agent
# INPUT: Datadog account API key
DD_API_KEY=$1 bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/dd-agent/master/packaging/datadog-agent/source/install_agent.sh)"
sudo /etc/init.d/datadog-agent info
# INPUT: EMR cluster name, used to tag metrics
CLUSTER_NAME=$2
# INPUT: Environment name, e.g. stage or prod, used to tag metrics
INSTANCE_TAG=$3
# INPUT: S3 bucket where the spark check configuration script for the Datadog agent is uploaded
S3_BUCKET=$4
# Spark check configuration script path in the above S3 bucket
S3_LOCATION_SPARK_CHECK_SETUP_SCRIPT="s3://${S3_BUCKET}/bootstrap-actions/"
SCRIPT_NAME="emr-bootstrap-datadog-spark-check-setup.sh"
# Copy the spark check configuration script from S3 to the current path
aws s3 cp ${S3_LOCATION_SPARK_CHECK_SETUP_SCRIPT}${SCRIPT_NAME} .
# Make the script executable
chmod +x ${SCRIPT_NAME}
# The bootstrap step occurs on EMR before any software is configured.
# Software configuration is a prerequisite for the datadog spark check setup,
# so run the setup script in the background and allow bootstrap to complete.
./${SCRIPT_NAME} ${CLUSTER_NAME} ${INSTANCE_TAG} > spark_check.out 2>&1 &
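Both scripts need to be uploaded to the bootstrap-actions path in your S3 bucket before the cluster is launched, so that EMR and the installer script can fetch them. A minimal sketch, with a placeholder bucket name:

# Upload the bootstrap scripts to the path the bootstrap action and the
# installer script expect; "my-artifacts-bucket" is a placeholder.
aws s3 cp emr-bootstrap-datadog-install.sh s3://my-artifacts-bucket/bootstrap-actions/
aws s3 cp emr-bootstrap-datadog-spark-check-setup.sh s3://my-artifacts-bucket/bootstrap-actions/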
#!/bin/bash
IS_MASTER=false
if [ $(grep "\"isMaster\": true" /mnt/var/lib/info/instance.json -wc) = 1 ]; then
echo "Running on the master node."
IS_MASTER=true
fi
# Execute the spark check configuration only on the master node of the EMR cluster
if [ "$IS_MASTER" = true ]; then
# Datadog-Spark Integration
# https://docs.datadoghq.com/integrations/spark/
YARN_SITE_XML_LOCATION="/etc/hadoop/conf/yarn-site.xml"
YARN_PROPERTY="resourcemanager.hostname"
DD_AGENT_CONF_DIR="/etc/dd-agent/conf.d"
SPARK_YAML_FILE="${DD_AGENT_CONF_DIR}/spark.yaml"
# Command-line parameters
CLUSTER_NAME=$1
INSTANCE_TAG=$2
CLUSTER_NAME_WITH_ENV_SUFFIX="${CLUSTER_NAME}-${INSTANCE_TAG}"
# Wait until yarn-site.xml is available
while [ ! -f ${YARN_SITE_XML_LOCATION} ]
do
sleep 1
done
# Debug
echo "DEBUG: Found: ${YARN_SITE_XML_LOCATION}"
cat ${YARN_SITE_XML_LOCATION}
# Wait until yarn-site.xml has the expected content
while [ -z "$(grep ${YARN_PROPERTY} ${YARN_SITE_XML_LOCATION})" ]
do
sleep 1
done
# Debug
grep ${YARN_PROPERTY} ${YARN_SITE_XML_LOCATION}
# Read the YARN resource manager hostname to build the value for spark_url.
# The sed expression strips the <value> tags and turns the trailing "/" into ":",
# so the result looks like "ip-xx-xx-xx-xx.ec2.internal:".
YARN_RM_HOSTNAME_RAW=$(grep -A1 ${YARN_PROPERTY} ${YARN_SITE_XML_LOCATION} | grep value)
YARN_RM_HOSTNAME=$(echo ${YARN_RM_HOSTNAME_RAW} | sed -e 's-value--g' -e 's-<--g' -e 's->--g' -e 's-\/-:-g')
SPARK_URL="http://${YARN_RM_HOSTNAME}8088"
# Debug
echo "DEBUG: Constructed spark_url: ${SPARK_URL}"
# Create the spark.yaml contents in the home directory
cat > spark.yaml << EOL
init_config:

instances:
  - spark_url: ${SPARK_URL}
    cluster_name: ${CLUSTER_NAME_WITH_ENV_SUFFIX}
    spark_cluster_mode: spark_yarn_mode
    tags:
      - instance: ${INSTANCE_TAG}
EOL
# Debug
ls -l spark.yaml
cat spark.yaml
# Open up permissions to move spark.yaml into the datadog agent conf.d, then reset them
sudo chmod 665 ${DD_AGENT_CONF_DIR}
sudo mv spark.yaml ${DD_AGENT_CONF_DIR}
sudo chmod 644 ${SPARK_YAML_FILE}
sudo chown dd-agent:dd-agent ${SPARK_YAML_FILE}
sudo chown dd-agent:dd-agent ${DD_AGENT_CONF_DIR}
sudo chmod 755 ${DD_AGENT_CONF_DIR}
# Restart the agent to pick up the new check configuration
sudo /etc/init.d/datadog-agent stop
sudo /etc/init.d/datadog-agent start
sudo /etc/init.d/datadog-agent info
fi
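For reference, on a running cluster the generated spark.yaml ends up looking something like this; the hostname and names below are illustrative:

init_config:

instances:
  - spark_url: http://ip-10-0-0-1.ec2.internal:8088
    cluster_name: my-streaming-cluster-prod
    spark_cluster_mode: spark_yarn_mode
    tags:
      - instance: prod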
Why do we need to run the configuration as a second step?
Remember that bootstrap actions run before any application is installed on the EMR nodes. In the first step we installed new software. The second step, however, requires that YARN and Spark are already installed before the Datadog configuration can be completed.
yarn-site.xml does not exist at the time the Datadog agent is installed, hence we launch the spark check setup script as a background process. It waits until yarn-site.xml is created and contains a value for the YARN property 'resourcemanager.hostname'. Once found, it creates the spark.yaml file and moves it under /etc/dd-agent/conf.d. Then it sets the appropriate permissions on spark.yaml and restarts the Datadog agent. The agent's info subcommand runs the spark check.
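For context, the script is polling for the standard Hadoop property block below to appear in yarn-site.xml; the hostname value is illustrative:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>ip-10-0-0-1.ec2.internal</value>
</property>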
Add Custom Bootstrap Actions
There are three ways to launch an EMR cluster, and bootstrap actions can be invoked via each of them. Refer to the AWS guide for invoking bootstrap actions when launching a cluster from either the AWS Console or the AWS CLI. I have created a gist showing our specific bootstrap action script invocation while launching the EMR cluster programmatically.
import scala.collection.JavaConverters._

val newClusterJobFlowRequest = new RunJobFlowRequest()
newClusterJobFlowRequest.withBootstrapActions(configureBootstrapActions(config).asJava)
  .withLogUri(logUri)
  .with...

private def configureBootstrapActions(config: Config): Seq[BootstrapActionConfig] = {
  val scriptAbsolutePath = s"s3://${config.s3Bucket.bucketName}/bootstrap-actions/emr-bootstrap-datadog-install.sh"
  val scriptBootstrapActionConfig = new ScriptBootstrapActionConfig().withPath(scriptAbsolutePath)
  // The install script reads four positional arguments: the Datadog API key,
  // the cluster name, the environment, and the S3 bucket.
  // config.datadogApiKey is a placeholder for wherever you store the key.
  scriptBootstrapActionConfig.withArgs(config.datadogApiKey,
    config.cluster_name,
    config.stage_env,
    config.s3Bucket.bucketName)
  val bootstrapAction = new BootstrapActionConfig()
    .withScriptBootstrapAction(scriptBootstrapActionConfig)
    .withName("DatadogInstaller")
  List(bootstrapAction)
}
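For comparison, here is a minimal sketch of the equivalent invocation from the AWS CLI; the cluster name, bucket, release label, and instance settings are placeholders for your own values:

aws emr create-cluster \
    --name "my-streaming-cluster" \
    --release-label emr-5.12.0 \
    --applications Name=Spark \
    --instance-type m4.large \
    --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions Name="DatadogInstaller",Path="s3://my-artifacts-bucket/bootstrap-actions/emr-bootstrap-datadog-install.sh",Args=["<DD_API_KEY>","my-streaming-cluster","prod","my-artifacts-bucket"]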
Validation
Finally, to confirm that the bootstrap actions completed successfully, you can check the EMR logs in the S3 log directory you specified while launching the cluster. Bootstrap action logs can be found at a path like <S3_BUCKET>/<emr_cluster_log_folder>/<emr_cluster_id>/node/<instance_id>/bootstrap-actions.
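For example, you can list the bootstrap action logs with the AWS CLI; the bucket, cluster id, and instance id below are placeholders:

# List bootstrap action logs for one node of the cluster (all names are placeholders)
aws s3 ls s3://my-artifacts-bucket/emr-logs/j-1ABCDEF234567/node/i-0123456789abcdef0/bootstrap-actions/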
Within a few minutes of deploying your Spark Streaming application on this cluster, you should also start receiving Spark metrics in Datadog, as shown in the screenshot below:
One more way to validate is to SSH into an EMR instance and execute:
sudo /etc/init.d/datadog-agent info
In the output, you should see the spark check being run.
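The exact metric counts will vary, but the Checks section of the info output should contain an entry along these lines (illustrative):

Checks
======

  spark
  -----
      - instance #0 [OK]
      - Collected 26 metrics, 0 events & 1 service check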
Now that we have the Datadog agent installed on the driver and executor nodes of the EMR cluster, we have done the groundwork for publishing metrics from our application to Datadog. In the next part of this series, I will demonstrate how to publish metrics from your application code.
If you have questions or suggestions, please leave a comment below.