Easy Steps to Optimise your AWS EMR Performance and Reduce Cost

Devashish Mulye

Product Manager

Feb 2, 2019

A major challenge when using AWS EMR is reducing cost (or optimising performance). You need to understand concepts such as partitioning your RDDs, specialised instances, Spot Fleets, Spot Blocking, Ganglia, UI configuration, logging, pricing and many other eccentricities of EMR. So where do you start? These easy steps will get you some quick results.

Repartitioning your RDD

The first and most obvious thing is to partition your RDDs in Spark in a way that ensures maximum resource utilisation. But that’s easier said than done. Understanding the inner workings of Spark takes time.

A quick improvement in efficiency can be achieved by repartitioning your RDD to 2*(number of cores in your cluster) for mapping transformations. This may not be the optimal method every time, but it will usually be a significant improvement.
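As a minimal sketch of this rule of thumb, the helper below computes the partition count and applies it. The function names are hypothetical, and `rdd` is assumed to be a PySpark RDD; only its `repartition()` method is used, so no Spark import is needed to read the example.

```python
def partitions_for_cluster(total_cores, factor=2):
    """Partition count from the rule of thumb: factor * total cores."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return factor * total_cores


def repartition_for_cluster(rdd, total_cores, factor=2):
    """Return a new RDD spread over factor * total_cores partitions.

    `rdd` is assumed to be a pyspark RDD; repartition() triggers a shuffle,
    so call this once before a chain of map-style transformations.
    """
    return rdd.repartition(partitions_for_cluster(total_cores, factor))


# Example: 4 worker nodes with 8 cores each is 32 cores in total.
print(partitions_for_cluster(4 * 8))
```

Treat the factor of 2 as a starting point and adjust it after looking at task durations in the Spark UI.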

Use Specialised Instances

You can see an improvement by simply using the correct instance type for your use case. An instance from the c family would be appropriate for the TASK nodes of the EMR cluster in a compute-heavy Spark job.

However, Spark is incredibly memory intensive and you may often run into memory errors. It is important to provision enough memory for your executors and driver. You can refer to this insightful blog on allocating Spark resources. It’s a good idea to use an instance from the r family for the MASTER node of the EMR cluster, especially if your job requires shuffling.
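One way to reason about executor memory is to split each container's budget into heap and overhead. The sketch below assumes Spark's documented default overhead of 10% of executor memory with a 384 MiB floor; the function name and the 11000 MiB container size are illustrative assumptions, not values from this post.

```python
MIN_OVERHEAD_MB = 384  # assumed floor for executor memory overhead (Spark default)


def split_container_memory(container_mb):
    """Split a container budget (MiB) into (heap_mb, overhead_mb).

    The heap is chosen so that heap + max(384, 10% of heap) fits in the budget.
    """
    if container_mb <= MIN_OVERHEAD_MB:
        raise ValueError("container is too small to hold the minimum overhead")
    # heap + heap/10 <= container  =>  heap <= container * 10 / 11
    heap_mb = container_mb * 10 // 11
    if container_mb - heap_mb < MIN_OVERHEAD_MB:
        # Small containers: the 384 MiB floor dominates the 10% rule.
        heap_mb = container_mb - MIN_OVERHEAD_MB
    return heap_mb, container_mb - heap_mb


heap, overhead = split_container_memory(11000)
print(f"--executor-memory {heap}m --conf spark.executor.memoryOverhead={overhead}m")
```

The printed flags are in the form accepted by `spark-submit`; how many such containers fit on a node depends on the memory YARN is allowed to hand out on that instance type.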

Use Spot Instances

Spot Instances can help reduce your EC2 costs by 50–80%. If you already know what they are, you can move on to the next section.

Below is a description as given by AWS. You can read more here.

Using Spot Fleets

You will face 2 major issues while using Spot Instances.

AWS won’t provision a Spot Instance if none are available. If you are using your Spark job for regular ETL, this is a catastrophic situation.

AWS may reclaim your Spot Instances at any moment if it runs out of spare capacity and a higher bidder enters the market. If you are using your Spark job for regular ETL, this is an apocalyptic situation.

These issues are addressed by the next two topics: Spot Fleets and Spot Blocking.

A Spot Fleet is a group of multiple spot and on-demand instances of different types. You can configure your EMR cluster to comprise one or more fleets.

The advantage of using a Spot Fleet is that instead of specifying the instance types you want, you can specify your computing and memory capacity requirements. AWS will provision the available instances which fulfil that requirement. This greatly increases your chance of getting Spot Instances.

How to configure Fleets?

Requesting Instance Fleet from AWS Dashboard

Step 1 —

Define your target. If you need a total of 16 cores, set your target to 16.

Step 2 —

Define a list of instances which would be suitable for your Spark job and assign them weights. For example, a c4.xlarge has 4 cores, so the weight of c4.xlarge would be 4. Similarly, you can list more instance types, like c4.2xlarge with a weight of 8 and c4.4xlarge with a weight of 16.

This will result in AWS provisioning any of the following combinations, each of which adds up to the target of 16:

c4.xlarge (2), c4.2xlarge (1), c4.4xlarge (0)
c4.xlarge (0), c4.2xlarge (0), c4.4xlarge (1)
c4.xlarge (4), c4.2xlarge (0), c4.4xlarge (0)
c4.xlarge (0), c4.2xlarge (2), c4.4xlarge (0)
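The arithmetic behind these combinations can be checked with a short sketch. The function name is hypothetical, and it only lists combinations that hit the target exactly; it does not model how AWS actually chooses between them.

```python
from itertools import product


def exact_combinations(weights, target):
    """List instance counts whose weighted capacities sum exactly to target.

    `weights` maps an instance type to its weighted capacity.
    """
    names = list(weights)
    # No type can appear more often than target // weight times.
    count_ranges = [range(target // weights[name] + 1) for name in names]
    combos = []
    for counts in product(*count_ranges):
        total = sum(count * weights[name] for count, name in zip(counts, names))
        if total == target:
            combos.append(dict(zip(names, counts)))
    return combos


weights = {"c4.xlarge": 4, "c4.2xlarge": 8, "c4.4xlarge": 16}
for combo in exact_combinations(weights, 16):
    print(combo)
```

For the weights 4, 8 and 16 and a target of 16, this yields the same four combinations listed above.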

Configuring Spot Fleet from AWS Dashboard

You can follow the same process for your memory requirement. If your cumulative memory requirement is 160 GB, then set your target to 160. Your list of instances could be r5.xlarge (memory: 32 GB) with a weight of 32, r5.2xlarge (memory: 64 GB) with a weight of 64, and so on.

*It is not necessary to only specify instances from the same family.

Step 3—

Specify the TimeoutDurationMinutes. This is the amount of time AWS will spend trying to provision Spot capacity that fulfils your requirements before giving up. The EMR API reference lists an allowed range of 5 to 1440 minutes.

Step 4—

Specify the TimeoutAction. If AWS is unable to provision an EMR cluster within the TimeoutDurationMinutes, then it will carry out the TimeoutAction. TimeoutAction can either be SWITCH_TO_ON_DEMAND or TERMINATE_CLUSTER.

Here’s an example of a cluster configuration—

[
  {
    "InstanceFleetType": "MASTER",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 1,
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": 60,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND"
      }
    },
    "InstanceTypeConfigs": [
      {
        "WeightedCapacity": 1,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.xlarge"
      },
      {
        "WeightedCapacity": 1,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.2xlarge"
      },
      {
        "WeightedCapacity": 1,
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": {
                "SizeInGB": 32,
                "VolumeType": "gp2"
              },
              "VolumesPerInstance": 1
            }
          ]
        },
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "c5.4xlarge"
      },
      {
        "WeightedCapacity": 1,
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": {
                "SizeInGB": 32,
                "VolumeType": "gp2"
              },
              "VolumesPerInstance": 1
            }
          ]
        },
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "m4.2xlarge"
      },
      {
        "WeightedCapacity": 1,
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": {
                "SizeInGB": 32,
                "VolumeType": "gp2"
              },
              "VolumesPerInstance": 1
            }
          ]
        },
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "c4.4xlarge"
      }
    ],
    "Name": "MasterFleet"
  },
  {
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 160,
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": 60,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND"
      }
    },
    "InstanceTypeConfigs": [
      {
        "WeightedCapacity": 16,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.2xlarge"
      },
      {
        "WeightedCapacity": 32,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.4xlarge"
      }
    ],
    "Name": "CoreFleet"
  }
]

There are 2 spot fleets that together comprise the EMR cluster. I have named the MASTER node fleet ‘MasterFleet’ and the CORE nodes fleet ‘CoreFleet’.

The options given for the Core fleet are r3.4xlarge (weight = 32) and r3.2xlarge (weight = 16). AWS will provision a combination whose weights add up to at least the specified target of 160.

If AWS can’t provision the Spot capacity for me within 60 minutes, it will go ahead and provision on-demand instances. I would rather pay the on-demand price and keep my ETL running than save the money and not run the ETL at all.
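If you prefer to generate this configuration rather than hand-edit JSON, a small builder keeps the fleets consistent. This is a sketch under assumptions: `instance_fleet` and its parameters are hypothetical names, and EBS settings are left out for brevity. The field names mirror the configuration shown above.

```python
import json

VALID_TIMEOUT_ACTIONS = {"SWITCH_TO_ON_DEMAND", "TERMINATE_CLUSTER"}


def instance_fleet(name, fleet_type, target_spot, weighted_types,
                   timeout_minutes=60, timeout_action="SWITCH_TO_ON_DEMAND",
                   bid_percent=100):
    """Build one fleet definition; weighted_types maps instance type to weight."""
    if timeout_action not in VALID_TIMEOUT_ACTIONS:
        raise ValueError(f"unknown TimeoutAction: {timeout_action}")
    return {
        "Name": name,
        "InstanceFleetType": fleet_type,
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": target_spot,
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": timeout_action,
            }
        },
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": weight,
                "BidPriceAsPercentageOfOnDemandPrice": bid_percent,
            }
            for instance_type, weight in weighted_types.items()
        ],
    }


core_fleet = instance_fleet("CoreFleet", "CORE", 160,
                            {"r3.2xlarge": 16, "r3.4xlarge": 32})
print(json.dumps([core_fleet], indent=2))
```

The resulting list has the shape that the AWS CLI expects for `aws emr create-cluster --instance-fleets` and that boto3 expects under `Instances["InstanceFleets"]` in `run_job_flow`; check the current API reference before relying on either.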

Using Spot Blocking

Spot Blocks are spot instances that are provisioned for a definite amount of time. This can be anything from 1 to 6 hours. AWS charges more for blocking, but still much less than on-demand. This way, you can ensure that your ETL process won’t be interrupted, since AWS won’t reclaim your instances during the block.
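In an instance fleet, a block is requested through the `BlockDurationMinutes` field of the `SpotSpecification`, in whole hours. The sketch below is illustrative: the helper name is hypothetical, and you should confirm in the current EMR documentation that defined-duration Spot Instances are still offered to your account.

```python
def spot_specification(block_hours=None, timeout_minutes=60,
                       timeout_action="SWITCH_TO_ON_DEMAND"):
    """Build a SpotSpecification dict, optionally requesting a Spot Block."""
    spec = {
        "TimeoutDurationMinutes": timeout_minutes,
        "TimeoutAction": timeout_action,
    }
    if block_hours is not None:
        # Blocks are requested in whole hours, from 1 to 6.
        if block_hours not in range(1, 7):
            raise ValueError("block_hours must be a whole number from 1 to 6")
        spec["BlockDurationMinutes"] = block_hours * 60
    return spec


# A 2-hour block for an ETL job expected to finish well within that window.
print(spot_specification(block_hours=2))
```

Pick a block slightly longer than your job's worst-case runtime, since instances in a block are terminated when the duration ends.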

Cost of Instance Types

And there you have it. Happy Distributed Computing to you. If you are looking for more tips on reducing your AWS bills, check out this series of blogs.

["distributed computing"]["AWS"]
