Methodology Of Autoscaling For Kinesis Streams

Written by hariohmprasath | Published 2020/07/03
Tech Story Tags: aws | kinesis | aws-lambda | stream-processing | autoscaling | event-streaming | tutorial | coding

TLDR: AWS doesn’t support auto-scaling for Kinesis streams, so most of the time we either over-provision the shards and pay more, or under-provision them and take a hit on performance. Using a combination of CloudWatch alarms, SNS topics, and Lambda we can implement automatic scale-up and scale-down. The same solution can also be built on the stream’s “IncomingBytes” or “IncomingRecords” metrics. We scale up quickly, so performance doesn’t suffer, and scale down slowly, so we avoid too many scale operations, hitting the right balance between cost and performance.

Problem statement

Since AWS doesn’t support auto-scaling for Kinesis streams, most of the time we either over-provision the shards and pay more, or under-provision them and take a hit on performance.

Solution:

Using a combination of CloudWatch alarms, SNS topics, and Lambda, we can implement auto-scaling for Kinesis streams, managing the shard count so that we hit the right balance between cost and performance.

Overview of the solution:

Both scale-up and scale-down are implemented by monitoring the “PutRecord.Bytes” metric of the stream:
Scale up (doubling the shards) will automatically happen once the stream is utilized at more than 80% of its capacity at least once in a 2-minute rolling window.
Scale down (halving the shards) will automatically happen once the stream stays at or below 40% of its capacity for 3 consecutive 15-minute periods.
Note:
We can also use the “IncomingBytes” or “IncomingRecords” metrics to implement the same solution.
We scale up quickly so we don’t take a hit on performance, and scale down slowly so we avoid too many scale-up and scale-down operations.

Implementation details:

Determine the % utilization for a Kinesis stream:
Let’s say:
Payload size = 100 KB
Total records (per second) = 500
AWS recommended shard count = 49
To determine what 80% utilization looks like, we do the following:
Maximum bytes that can be written in 2 minutes = (100 * 500) * 60 * 2 = 6,000,000 KB
80% of that would be 4,800,000 KB
Similarly, 40% of that would be 2,400,000 KB
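For reference, here is the same math in a few lines of Node.js. This is only a quick sketch using the example numbers above; adjust the constants to your own stream.

// Threshold math for the example above (all sizes in KB, as in this post).
const PAYLOAD_SIZE_KB = 100;     // average record size
const RECORDS_PER_SECOND = 500;  // expected write rate
const PERIOD_SECONDS = 120;      // CloudWatch alarm period (2 minutes)

// Maximum KB that can be written to the stream in one alarm period
const maxPerPeriod = PAYLOAD_SIZE_KB * RECORDS_PER_SECOND * PERIOD_SECONDS;

const scaleOutThreshold = maxPerPeriod * 0.8; // 4,800,000 KB
const scaleInThreshold = maxPerPeriod * 0.4;  // 2,400,000 KB

console.log({ maxPerPeriod, scaleOutThreshold, scaleInThreshold });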

Scale out

The diagram below shows the flow when we perform a scale-out operation.
Configuration for the “Scale out alarm” would be as follows (an SDK sketch follows the list):
  • Metric Name: PutRecord.Bytes
  • Stream Name: ETL Kinesis stream
  • Period: 2 minutes
  • Threshold >= 4800000
  • Datapoints: 1
  • Statistic: Sum
  • Action: Notify topic “Scale SNS topic”
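For illustration, creating this alarm with the AWS SDK for JavaScript (v2) could look roughly like the snippet below. The stream name, SNS topic ARN, and account ID are placeholders, not values from this post.

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

cloudwatch.putMetricAlarm({
  AlarmName: 'Scale out alarm',
  Namespace: 'AWS/Kinesis',
  MetricName: 'PutRecord.Bytes',
  Dimensions: [{ Name: 'StreamName', Value: 'ETL-kinesis-stream' }], // placeholder name
  Statistic: 'Sum',
  Period: 120,            // 2 minutes
  EvaluationPeriods: 1,   // 1 datapoint
  Threshold: 4800000,     // 80% of the 2-minute max (KB math from above; the metric itself reports bytes)
  ComparisonOperator: 'GreaterThanOrEqualToThreshold',
  AlarmActions: ['arn:aws:sns:us-east-1:123456789012:scale-sns-topic'], // placeholder ARN
}).promise()
  .then(() => console.log('Scale out alarm created'))
  .catch(console.error);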
So when we reach 80% of the stream capacity the following happens (a code sketch of the shard update follows this list):
  • Cloud watch alarm “Scale out alarm” will be triggered
  • Notification is sent to SNS topic “Scale SNS topic” which triggers “Scale out lambda”
  • Lambda will scale the number of shards to current shards * 2 and update the alarm threshold based on the new shard count. Let’s say the new shard count is 98, then:
  • Max bytes that can be written in 2 minutes (100 KB * 1,000 records per second * 120 seconds) = 12,000,000 KB
  • 80% of it would be 9,600,000 KB
  • 40% of it would be 4,800,000 KB
  • Reset the “Scale out alarm” back to “OK” state from “ALARM” state
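The shard update and the alarm reset in the last two steps map to two SDK calls (the threshold update reuses the putMetricAlarm call shown earlier). A minimal sketch, again assuming the AWS SDK for JavaScript (v2) and placeholder names:

const AWS = require('aws-sdk');
const kinesis = new AWS.Kinesis();
const cloudwatch = new AWS.CloudWatch();

async function scaleOut(streamName, currentShards) {
  // Double the shard count; UNIFORM_SCALING is the only supported scaling type.
  await kinesis.updateShardCount({
    StreamName: streamName,
    TargetShardCount: currentShards * 2,
    ScalingType: 'UNIFORM_SCALING',
  }).promise();

  // Move the alarm back to OK so the next breach can trigger another scale-out.
  await cloudwatch.setAlarmState({
    AlarmName: 'Scale out alarm',
    StateValue: 'OK',
    StateReason: 'Shard count doubled by scale-out Lambda',
  }).promise();
}

scaleOut('ETL-kinesis-stream', 49).catch(console.error); // placeholder values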

Scale in

The diagram below shows the flow when we perform a scale-in operation.
Note:
The 40% value is calculated for a 15-minute interval at the doubled capacity (1,000 records per second): (100 * 1000 * 15 * 60) * 40 / 100 = 36,000,000 KB
Configuration for the “Scale in alarm” would be as follows (an SDK sketch follows the list):
  • Metric Name: PutRecord.Bytes
  • Stream Name: ETL Kinesis stream
  • Period: 15 minutes
  • Threshold <= 36000000
  • Datapoints: 3
  • Statistic: Sum
  • Action: Notify topic “Scale in SNS topic”
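The scale-in alarm is created with the same putMetricAlarm call shown earlier; only a handful of parameters differ. Again a sketch with placeholder names and ARNs:

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

cloudwatch.putMetricAlarm({
  AlarmName: 'Scale in alarm',
  Namespace: 'AWS/Kinesis',
  MetricName: 'PutRecord.Bytes',
  Dimensions: [{ Name: 'StreamName', Value: 'ETL-kinesis-stream' }], // placeholder name
  Statistic: 'Sum',
  Period: 900,            // 15 minutes
  EvaluationPeriods: 3,   // 3 datapoints
  Threshold: 36000000,    // 40% of the 15-minute max (per the note above)
  ComparisonOperator: 'LessThanOrEqualToThreshold',
  AlarmActions: ['arn:aws:sns:us-east-1:123456789012:scale-in-sns-topic'], // placeholder ARN
}).promise()
  .catch(console.error);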
So when we utilize only 40% of the stream capacity the following happens (a small helper for the halving follows this list):
  • Cloud watch alarm “Scale in alarm” will be triggered
  • Notification is sent to SNS topic “Scale in SNS topic” which triggers “Scale in lambda”
  • Lambda will scale the number of shards to current shards / 2 and update the alarm threshold based on the new shard count.
  • Reset the “Scale in alarm” back to “OK” state from “ALARM” state
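When halving, it is worth clamping to a minimum shard count so repeated scale-ins cannot shrink the stream below what the workload needs. A tiny helper, following the MIN_SHARDS idea from the environment variables below:

// Halve the shard count, but never drop below the configured floor.
function nextShardCountForScaleIn(currentShardCount, minShards) {
  return Math.max(Math.floor(currentShardCount / 2), minShards);
}

console.log(nextShardCountForScaleIn(98, 10)); // 49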

Lambda (scale out)

The attached code “Lambda.js” is written in Node.js and is just for demo purposes, so instead of dynamically calculating the threshold value it is hardcoded; the same code can be enhanced to determine all of the things mentioned in this post (a minimal sketch of the handler is included after the pseudo code below).

Environment variables:

  • ALARM_NAME = “Scale out alarm”
  • MAX_SHARDS = 100
  • MIN_SHARDS = 10
  • STREAM_NAME = “ETL Kinesis stream”

Pseudo code:

  • Describe the stream and only continue if the stream is not “UPDATING” (i.e., a previous scale-up/scale-down action has not yet completed)
  • Calculate the new shardCount
  • Update the cloudwatch alarm threshold using “putMetricAlarm()”
  • Update the shard count using “updateShardCount()”
  • Wait for the shard scale-up or scale-down operation to complete using “describeStream()” (not implemented in the current code), then reset the alarm to the “OK” state using “setAlarmState()”
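Putting the pseudo code together, a minimal Node.js sketch of the scale-out handler could look like the following. It follows the example numbers in this post and is only a starting point: it uses describeStreamSummary() (which returns the open shard count directly) instead of describeStream(), approximates per-shard throughput at 1,000 KB/s when recomputing the threshold, and assumes an extra SCALE_SNS_TOPIC_ARN variable because putMetricAlarm replaces the whole alarm definition, including its action.

const AWS = require('aws-sdk');

const kinesis = new AWS.Kinesis();
const cloudwatch = new AWS.CloudWatch();

const ALARM_NAME = process.env.ALARM_NAME;    // e.g. "Scale out alarm"
const STREAM_NAME = process.env.STREAM_NAME;  // e.g. "ETL-kinesis-stream"
const MAX_SHARDS = Number(process.env.MAX_SHARDS || 100);
const PERIOD_SECONDS = 120;                   // 2-minute alarm period

exports.handler = async () => {
  // 1. Describe the stream and only continue if it is not already UPDATING
  //    (i.e. a previous scale operation is still in progress).
  const { StreamDescriptionSummary: stream } = await kinesis
    .describeStreamSummary({ StreamName: STREAM_NAME })
    .promise();

  if (stream.StreamStatus === 'UPDATING') {
    console.log('Stream is still updating from a previous scale operation, skipping');
    return;
  }

  // 2. Calculate the new shard count: double it, capped at MAX_SHARDS.
  const currentShards = stream.OpenShardCount;
  const newShardCount = Math.min(currentShards * 2, MAX_SHARDS);
  if (newShardCount === currentShards) {
    console.log('Already at MAX_SHARDS, nothing to do');
    return;
  }

  // 3. Update the alarm threshold for the new capacity.
  //    Approximation: one shard absorbs about 1,000 KB/s, so 80% of what the
  //    resized stream can take in one period becomes the new threshold.
  const newThreshold = Math.round(newShardCount * 1000 * PERIOD_SECONDS * 0.8);
  await cloudwatch.putMetricAlarm({
    AlarmName: ALARM_NAME,
    Namespace: 'AWS/Kinesis',
    MetricName: 'PutRecord.Bytes',
    Dimensions: [{ Name: 'StreamName', Value: STREAM_NAME }],
    Statistic: 'Sum',
    Period: PERIOD_SECONDS,
    EvaluationPeriods: 1,
    Threshold: newThreshold,
    ComparisonOperator: 'GreaterThanOrEqualToThreshold',
    // putMetricAlarm overwrites the alarm, so re-attach the SNS action.
    AlarmActions: [process.env.SCALE_SNS_TOPIC_ARN],
  }).promise();

  // 4. Update the shard count.
  await kinesis.updateShardCount({
    StreamName: STREAM_NAME,
    TargetShardCount: newShardCount,
    ScalingType: 'UNIFORM_SCALING',
  }).promise();

  // 5. Ideally, poll describeStream()/describeStreamSummary() here until the
  //    resharding completes, then reset the alarm back to OK.
  await cloudwatch.setAlarmState({
    AlarmName: ALARM_NAME,
    StateValue: 'OK',
    StateReason: 'Shard count updated by scale-out Lambda',
  }).promise();
};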

COGS reduction:

In order to get 1,000 TPS for a payload of record size 100 KB we need to pay $1,223.62; with this approach we can control scale-up and scale-down, which directly reduces the cost of these AWS resources.
I am happy to take any feedback or improvements on this approach. Thanks for reading!

Written by hariohmprasath | https://www.linkedin.com/in/hariohmprasath/
Published by HackerNoon on 2020/07/03