Easy Steps to Optimise your AWS EMR Performance and Reduce Cost
Devashish Mulye
Product Manager
|
Feb 2, 2019
A major challenge while using AWS EMR is reducing cost (or optimising performance). You need to understand various concepts such as partitioning your RDDs, Specialised Instances, Spot Fleets, Spot Blocking, Ganglia, UI configuration, logging, pricing and many other eccentricities of EMR. So where do you start? These easy steps will fetch you some quick results.
Repartitioning your RDD
The first and most obvious thing is to partition your RDDs in Spark in such a way that it ensures maximum resource utilisation. But that’s easier said than done. Understanding the inner workings of Spark does take its own time.
A quick improvement in efficiency can be achieved by repartitioning your RDD to 2*(number of cores in your cluster) for mapping transformations. This may not be the optimal method every time, but it is usually a significant improvement.
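As a minimal sketch of the rule of thumb above, assuming PySpark: the helper names `target_partitions` and `repartition_for_cluster` are made up for illustration, and the core count is passed in by hand rather than discovered from the cluster. The only Spark call used is `RDD.repartition`.

```python
def target_partitions(total_cores, factor=2):
    """Rule of thumb from the text: factor * (number of cores in the cluster)."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return factor * total_cores


def repartition_for_cluster(rdd, total_cores):
    """Return the RDD repartitioned to 2 * total_cores partitions.

    `rdd` is expected to be a PySpark RDD. repartition() triggers a shuffle,
    so it is best called once, before the mapping transformations.
    """
    return rdd.repartition(target_partitions(total_cores))
```

For a cluster with 16 cores in total, `target_partitions(16)` gives 32 partitions.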
Use Specialised Instances
You can see an improvement simply by using the correct instance type for your use case. An instance from the c family would be appropriate for the TASK nodes of the EMR cluster in a compute-heavy Spark job.
However, Spark is incredibly memory intensive and you may often run into memory errors. It is important to provision enough memory for your executors and driver. You can refer to this insightful blog on allocating Spark resources. It’s a good idea to use an instance from the r family for the MASTER node of the EMR cluster, especially if your job requires shuffling.
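One way to make the memory settings explicit is to pass them as Spark properties at submit time. The sketch below only builds the `--conf` arguments. `spark.executor.memory`, `spark.driver.memory` and `spark.executor.cores` are standard Spark properties, while the helper name, the `job.py` script name and the example sizes are placeholders, not recommendations.

```python
def memory_conf_args(executor_memory, driver_memory, executor_cores):
    """Build spark-submit --conf arguments for memory and core settings.

    Memory values use Spark's size strings, e.g. "8g".
    """
    settings = {
        "spark.executor.memory": executor_memory,
        "spark.driver.memory": driver_memory,
        "spark.executor.cores": str(executor_cores),
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder sizes: tune them to the instance type you actually run on.
command = ["spark-submit"] + memory_conf_args("8g", "4g", 4) + ["job.py"]
print(" ".join(command))
```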
Use Spot Instances
Spot Instances can help reduce your EC2 costs by 50–80%. If you already know what they are, you can move on to the next section.
Below is a description as given by AWS. You can read more here.
Using Spot Fleets
You will face 2 major issues while using Spot Instances.
AWS won’t provision a Spot Instance if none are available. If you are using your Spark Job for regular ETL, this is a catastrophic situation.
AWS may snatch back your Spot Instances at any moment if it has run out of spare instances and a higher bidder enters the market. If you are using your Spark Job for regular ETL, this is an apocalyptic situation.
These issues are addressed in the last 2 topics: Spot Fleets and Spot Blocking.
A Spot Fleet is a group of multiple Spot and On-Demand instances of different types. You can configure your EMR cluster to comprise one or more fleets.
The advantage of using a Spot Fleet is that instead of committing to a single instance type, you specify your computing or memory capacity requirement along with a list of acceptable instance types. AWS will provision whichever of those instances are available to fulfil that requirement. This greatly increases your chance of getting Spot Instances.
How to configure Fleets?
Requesting Instance Fleet from AWS Dashboard

Step 1: Define your target. If you need a total of 16 cores, set your target to 16.
Step 2: Define a list of instances which would be suitable for your Spark Job and assign them weights. For example, a c4.xlarge has 4 cores, so the weight of c4.xlarge would be 4. Similarly, you can list more instance types, like c4.2xlarge with a weight of 8 and c4.4xlarge with a weight of 16.
This will result in AWS provisioning you any of the following combinations:
c4.xlarge (2), c4.2xlarge (1), c4.4xlarge (0)
c4.xlarge (0), c4.2xlarge (0), c4.4xlarge (1)
c4.xlarge (4), c4.2xlarge (0), c4.4xlarge (0)
c4.xlarge (0), c4.2xlarge (2), c4.4xlarge (0)
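The arithmetic behind these combinations can be checked with a short script. This is only an illustration of how weights add up to the target: the function name is hypothetical, it lists exact matches only, and it says nothing about which mix AWS will actually pick, since that depends on availability and price.

```python
from itertools import product


def fleet_combinations(weights, target):
    """List every instance-count mix whose weighted capacity equals target.

    weights maps instance type -> weight (cores per instance in this example).
    """
    types = list(weights)
    # An instance type can appear at most target // weight times.
    ranges = [range(target // weights[t] + 1) for t in types]
    combos = []
    for counts in product(*ranges):
        if sum(c * weights[t] for c, t in zip(counts, types)) == target:
            combos.append(dict(zip(types, counts)))
    return combos


weights = {"c4.xlarge": 4, "c4.2xlarge": 8, "c4.4xlarge": 16}
for combo in fleet_combinations(weights, 16):
    print(combo)
```

With a target of 16 and weights of 4, 8 and 16, the script finds the same four mixes as listed above.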

Configuring Spot Fleet from AWS Dashboard
You can follow the same process for your memory requirement. If your cumulative memory requirement is 160 GB, then set your target to 160. Your list of instances could be r5.xlarge (memory: 32 GB) with a weight of 32, r5.2xlarge (memory: 64 GB) with a weight of 64, and so on.
*It is not necessary to only specify instances from the same family.
Step 3: Specify the TimeoutDurationMinutes. This is the amount of time AWS will spend trying to provision Spot capacity that fulfils your requirements before giving up. The example below uses 60 minutes; the EMR API documents an allowed range of 5 to 1440 minutes.
Step 4: Specify the TimeoutAction. If AWS is unable to provision an EMR cluster within the TimeoutDurationMinutes, then it will carry out the TimeoutAction. TimeoutAction can be either SWITCH_TO_ON_DEMAND or TERMINATE_CLUSTER.
Here’s an example of a cluster configuration:
[
  {
    "InstanceFleetType": "MASTER",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 1,
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": 60,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND"
      }
    },
    "InstanceTypeConfigs": [
      {
        "WeightedCapacity": 1,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.xlarge"
      },
      {
        "WeightedCapacity": 1,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.2xlarge"
      },
      {
        "WeightedCapacity": 1,
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": {
                "SizeInGB": 32,
                "VolumeType": "gp2"
              },
              "VolumesPerInstance": 1
            }
          ]
        },
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "c5.4xlarge"
      },
      {
        "WeightedCapacity": 1,
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": {
                "SizeInGB": 32,
                "VolumeType": "gp2"
              },
              "VolumesPerInstance": 1
            }
          ]
        },
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "m4.2xlarge"
      },
      {
        "WeightedCapacity": 1,
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": {
                "SizeInGB": 32,
                "VolumeType": "gp2"
              },
              "VolumesPerInstance": 1
            }
          ]
        },
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "c4.4xlarge"
      }
    ],
    "Name": "MasterFleet"
  },
  {
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 160,
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": 60,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND"
      }
    },
    "InstanceTypeConfigs": [
      {
        "WeightedCapacity": 16,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.2xlarge"
      },
      {
        "WeightedCapacity": 32,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.4xlarge"
      }
    ],
    "Name": "CoreFleet"
  }
]
There are 2 fleets that together comprise the EMR cluster. I have named the MASTER node fleet ‘MasterFleet’ and the CORE node fleet ‘CoreFleet’.
The options given for the Core fleet are r3.4xlarge (weight = 32) and r3.2xlarge (weight = 16). AWS will provision a combination that adds up to (or exceeds) the specified target of 160.
If AWS can’t provision Spot Instances for me in 60 minutes, it will go ahead and provision on-demand instances. I would rather pay the on-demand price and keep my ETL running than save the money and not run the ETL at all.
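A configuration of this shape can also be generated rather than written by hand. The sketch below is one possible way to do it in Python. The `spot_fleet` helper is hypothetical, it omits the optional EbsConfiguration entries, and it fixes the bid at 100% of the on-demand price as in the example. The printed JSON should be usable wherever the hand-written version is, for instance saved to a file for the AWS CLI's `--instance-fleets` option, but treat that as an assumption to verify against your own setup.

```python
import json


def spot_fleet(name, fleet_type, target, weighted_types,
               timeout_minutes=60, timeout_action="SWITCH_TO_ON_DEMAND"):
    """Build one instance-fleet entry shaped like the JSON example.

    weighted_types maps instance type -> WeightedCapacity.
    """
    if timeout_action not in ("SWITCH_TO_ON_DEMAND", "TERMINATE_CLUSTER"):
        raise ValueError("unknown TimeoutAction: " + timeout_action)
    return {
        "Name": name,
        "InstanceFleetType": fleet_type,
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": target,
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": timeout_action,
            }
        },
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": weight,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
            }
            for instance_type, weight in weighted_types.items()
        ],
    }


fleets = [
    spot_fleet("MasterFleet", "MASTER", 1, {"r3.xlarge": 1, "r3.2xlarge": 1}),
    spot_fleet("CoreFleet", "CORE", 160, {"r3.2xlarge": 16, "r3.4xlarge": 32}),
]
print(json.dumps(fleets, indent=2))
```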
Using Spot Blocking
Spot Blocks are Spot Instances that are provisioned for a definite amount of time. This can be anything between 1 and 6 hours. AWS charges more for blocking, but still much less than on-demand. This way, you can ensure that your ETL process won’t be interrupted during the block, since AWS won’t snatch back your instances.
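In a fleet configuration, a block is requested inside the SpotSpecification. The sketch below assumes the `BlockDurationMinutes` field of EMR's Spot provisioning specification, which takes the block length in minutes in whole-hour steps. The helper name is hypothetical, and the field should be checked against the current EMR documentation before relying on it.

```python
ALLOWED_BLOCK_MINUTES = (60, 120, 180, 240, 300, 360)  # 1 to 6 hours


def spot_block_specification(block_hours, timeout_minutes=60,
                             timeout_action="SWITCH_TO_ON_DEMAND"):
    """Return a SpotSpecification dict that requests a fixed-duration block."""
    block_minutes = block_hours * 60
    if block_minutes not in ALLOWED_BLOCK_MINUTES:
        raise ValueError("block duration must be a whole number of hours from 1 to 6")
    return {
        "TimeoutDurationMinutes": timeout_minutes,
        "TimeoutAction": timeout_action,
        # Assumed field name; see the lead-in above.
        "BlockDurationMinutes": block_minutes,
    }
```

For an ETL job that finishes within two hours, `spot_block_specification(2)` would request a 120-minute block.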

Cost of Instance Types
And there you have it. Happy Distributed Computing to you. If you are looking for more tips on reducing your AWS bills, check out this series of blogs.