Zone Resiliency in Azure Kubernetes Service

Written by rumata | Published 2023/11/06

Introduction

Making the cluster highly available and minimizing downtime is fundamental for any modern service. In the context of Azure Kubernetes Service, achieving high availability involves the deployment and distribution of resources across various levels within the infrastructure:

  • Regions. Creating multiple clusters in different Azure regions is the most expensive yet reliable option.
  • Zones. Availability zones can be used to improve single-cluster availability.
  • Nodes. At this level, we can distribute workload between virtual machines.

In this article, I’m going to focus on zone resiliency. In Azure Kubernetes Service (AKS), zone resiliency plays a crucial role when we need to achieve high availability of applications. Before diving deeper into networking and storage aspects of AKS zone resiliency, I’d like to touch upon some common terminology.

Availability zone

I could invent my definition here, but in fact, availability zones are best described in the Microsoft documentation:

Azure availability zones are physically and logically separated datacenters with their own independent power source, network, and cooling. Connected with an extremely low-latency network, they become a building block to delivering high availability applications. Availability zones ensure that if there's an event impacting a datacenter site—for example, if someone cuts the power, or there are issues with cooling—your data will be protected.

In short, availability zones operate independently, ensuring that a failure in one zone does not affect the others. Such isolation significantly enhances fault tolerance and application availability. Not all regions have the same number of availability zones: while most have three zones, some have two or even one.

Control plane

A Kubernetes cluster consists of two main parts—the control plane and the nodes. Both components can use multi-zone availability.

The control plane is automatically distributed among zones. The Azure documentation describes this distribution in two slightly different ways, but from a practical point of view, the difference is negligible because, normally, we will have nodes in all the zones anyway.

Node pools

We can specify zone parameters when creating a Kubernetes cluster. This would affect the system node pool only.

az aks create … --zones

You need to specify the zone parameters separately for every additional node pool you create.

az aks nodepool add … --zones
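
For example, a minimal sketch that pins both the system node pool and an additional user node pool to all three zones (the resource group, cluster, and pool names are made up for illustration):

az aks create --resource-group dt-rg --name dt-cluster --node-count 3 --zones 1 2 3 --generate-ssh-keys
az aks nodepool add --resource-group dt-rg --cluster-name dt-cluster --name userpool1 --node-count 3 --zones 1 2 3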

The most obvious option is always to select all three zones for each node pool. It works fine in most cases, but some limitations related to networking and storage may arise. Let’s take a closer look at each type of limitation.

Zones and networking

Let’s say one of our pods needs access to the internet. There are several ways we can configure outbound networking in AKS:

  • Load Balancer
  • NAT Gateway
  • User Defined Routing

The last option can be implemented in several ways: you can route your traffic to an Azure Firewall, which can be zone-redundant, or to a single network virtual appliance VM, which would not be. We're not going to cover all the possible options in this article.

Using multiple zones per node pool works fine with the Load Balancer option because Azure Load Balancer is a zone-redundant resource.

Yet, it doesn't work that well with the NAT gateway. A virtual network in Azure is a regional, zone-agnostic resource: it spans all the zones in its region, so it's safe to use one virtual network (or one subnet) in multi-zone scenarios. A NAT gateway, however, is a zonal resource that cannot be made zone-redundant, meaning that even if the nodes are distributed among the zones, they are all connected to a single subnet and to a NAT gateway that lives in a single zone. Unfortunately, it's not possible to attach multiple NAT gateways to one subnet.

The solution to overcome this limitation is to create three node pools instead of one and pin each node pool to its own zone.

In this configuration, you must connect each node pool to a separate virtual network subnet.

The configuration with NAT is more complicated, so there should be a solid reason to choose it.

To understand why it can still be beneficial to use a NAT gateway, we need to dive deeper into the outbound traffic architecture in Azure. The Azure documentation describes four outbound connectivity options and compares their characteristics: the load balancer is a production-grade option, but the NAT gateway offers the best outbound behavior. The reason is very well explained in the documentation section titled "What are SNAT ports?" and the sections that follow it.

In brief, when only a small subset of nodes actively uses outbound connectivity, you can face SNAT port exhaustion, meaning that new outbound connections start to fail. For example, suppose you have a deployment with 5 replicas in a 120-node cluster, each pod replica needs to open about 500 simultaneous connections to the database, all the other pods in the cluster barely use outbound connectivity, and only one public IP address is assigned to the load balancer.

In this configuration, each node can open only 256 connections to the same destination IP and port, because the load balancer statically pre-allocates the SNAT ports of its public IP across all the nodes regardless of actual usage. So, only 256 of the required 500 connections would be opened.

The same configuration with a NAT gateway would be able to establish all the necessary connections, because ports are allocated dynamically based on actual demand. One public IP address gives us about 64,000 SNAT ports, while the demand is only 500 * 5 = 2,500, so a single IP is more than enough.

This can be a problem even for relatively small but highly loaded clusters: with the default allocation, a cluster of 1 to 50 nodes gets only 1,024 ports per node. Of course, if all the pods and nodes use outbound connectivity evenly, the static distribution itself is not an issue, and you simply add more public IP addresses to the load balancer or the NAT gateway when you run short of ports.
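
If you decide to stay with the load balancer, one mitigation is to add outbound public IPs and raise the per-node SNAT port allocation. A minimal sketch, assuming a cluster that uses the default managed outbound load balancer (the resource names are made up for illustration):

# 2 IPs give 2 x 64,000 SNAT ports; 1,024 ports per node then covers up to 125 nodes
az aks update --resource-group dt-rg --name dt-cluster --load-balancer-managed-outbound-ip-count 2 --load-balancer-outbound-ports 1024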

Here is the full PowerShell script to create the described configuration:

# Variables
$RESOURCE_GROUP="dt-test-cluster"
$LOCATION="westus2"
$VNET_NAME="dt-vnet"
$SUBNET1_NAME="dt-subnet1"
$SUBNET2_NAME="dt-subnet2"
$NAT_GW1_NAME="dt-natgw1"
$NAT_GW2_NAME="dt-natgw2"
$PIP1_NAME="dt-pip-natgw1"
$PIP2_NAME="dt-pip-natgw2"
$AKS_CLUSTER_NAME="dt-aks-cluster"
$NODE_POOL1_NAME="dtnodepool1"
$NODE_POOL2_NAME="dtnodepool2"

# Create Resource Group
az group create --name $RESOURCE_GROUP --location $LOCATION

# Create Virtual Network and Subnets
az network vnet create --resource-group $RESOURCE_GROUP --name $VNET_NAME --address-prefixes 10.1.0.0/16 --subnet-name $SUBNET1_NAME --subnet-prefix 10.1.2.0/24
az network vnet subnet create --resource-group $RESOURCE_GROUP --vnet-name $VNET_NAME --name $SUBNET2_NAME --address-prefix 10.1.3.0/24

# Create Public IPs for NAT Gateways
$publicIP1 = az network public-ip create --resource-group $RESOURCE_GROUP --name $PIP1_NAME --location $LOCATION --sku Standard --zone 1 --query publicIp.id --output tsv
$publicIP2 = az network public-ip create --resource-group $RESOURCE_GROUP --name $PIP2_NAME --location $LOCATION --sku Standard --zone 2 --query publicIp.id --output tsv

# Create NAT Gateways in different zones and associate them with the subnets
az network nat gateway create --resource-group $RESOURCE_GROUP --name $NAT_GW1_NAME --location $LOCATION --zone 1 --public-ip-addresses $publicIP1
az network nat gateway create --resource-group $RESOURCE_GROUP --name $NAT_GW2_NAME --location $LOCATION --zone 2 --public-ip-addresses $publicIP2
az network vnet subnet update --resource-group $RESOURCE_GROUP --vnet-name $VNET_NAME --name $SUBNET1_NAME --nat-gateway $NAT_GW1_NAME
az network vnet subnet update --resource-group $RESOURCE_GROUP --vnet-name $VNET_NAME --name $SUBNET2_NAME --nat-gateway $NAT_GW2_NAME

# Create AKS Cluster with two node pools in different zones and assign subnets
$subnet1Id = "/subscriptions/$(az account show --query id --output tsv)/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Network/virtualNetworks/$VNET_NAME/subnets/$SUBNET1_NAME"
$subnet2Id = "/subscriptions/$(az account show --query id --output tsv)/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Network/virtualNetworks/$VNET_NAME/subnets/$SUBNET2_NAME"

az aks create --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --node-count 1 --enable-addons monitoring --generate-ssh-keys --nodepool-name $NODE_POOL1_NAME --vnet-subnet-id $subnet1Id --zones 1 --outbound-type userAssignedNATGateway
az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $AKS_CLUSTER_NAME --name $NODE_POOL2_NAME --node-count 1 --vnet-subnet-id $subnet2Id --zones 2

Please note that this script is not production-ready because only the parameters related to the outbound networking were configured.

Once the configuration is created, you can test it with the following commands:

az aks get-credentials --resource-group dt-test-cluster --name dt-aks-cluster
kubectl run ubuntu1 --image=ubuntu -- sleep infinity
kubectl run ubuntu2 --image=ubuntu -- sleep infinity

Then, log in to each of the pods with:

kubectl exec -it ubuntu1 -- /bin/sh
kubectl exec -it ubuntu2 -- /bin/sh

And run

apt-get update
apt-get install -y curl
curl http://httpbin.org/ip

As we have two node pools, each pinned to its own zone and subnet, and the pods have no affinity constraints, the scheduler will typically spread them across the two node pools. So, the first curl should give you one of the public IPs, and the second should give you the other one.
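
The spread is not guaranteed for two standalone pods, so it's worth checking where they actually landed (agentpool is a label AKS sets on its nodes; topology.kubernetes.io/zone is the standard Kubernetes zone label):

kubectl get pods -o wide
kubectl get nodes -L agentpool,topology.kubernetes.io/zone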

Zones and storage

Another potential pitfall is storage. The most common and convenient way to persist data in Kubernetes is to use a storage class and a persistent volume claim. Azure provides several built-in storage classes: some of them, such as the Azure Files and Azure Blob Storage classes, are backed by remote network storage, while others provision an Azure managed disk, which behaves like a disk attached locally to the node VM.

Let's say we created a persistent volume claim (PVC) using one of the built-in Azure storage classes, managed-csi-premium. We can use the following command to understand what kind of Azure storage we actually get:

kubectl describe sc managed-csi-premium
Name:              	managed-csi-premium
IsDefaultClass:    	No
Annotations:       	<none>
Provisioner:       	disk.csi.azure.com
Parameters:        	skuname=Premium_LRS
AllowVolumeExpansion:  True
MountOptions:      	<none>
ReclaimPolicy:     	Delete
VolumeBindingMode: 	WaitForFirstConsumer
Events:            	<none>

This class attaches a managed disk to a node VM; we can tell because the disk.csi.azure.com provisioner is used. The SKU is Premium_LRS. This disk has local redundancy (LRS), meaning the data is stored within a single zone only.

Now, let's imagine that something happens to the VMs in zone 1 while the storage in zone 1 is unaffected. We will still lose access to the data, because the pods will be rescheduled to nodes in other zones, and cross-zone disk attachment is not supported in Azure.
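
To see which zone a provisioned LRS disk is pinned to, you can inspect the node affinity of the persistent volume behind the claim (the PV name is generated by the provisioner, so look it up first):

kubectl get pv
kubectl describe pv   # check the "Node Affinity" section of the volume backing your claim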

The best approach to avoid such an issue is to use ZRS, or zone-redundant storage. For this type of storage, Azure synchronously replicates every write operation across the zones, so we always have three up-to-date replicas, one per zone.

There is no such built-in storage class, so we need to create one:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: test-for-article
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_ZRS
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
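
Once the class exists, any claim that references it will get a ZRS disk. A minimal sketch in PowerShell, assuming the manifest above is saved as zrs-storage-class.yaml and using a made-up claim name:

kubectl apply -f zrs-storage-class.yaml
@"
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-zrs-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: test-for-article
  resources:
    requests:
      storage: 10Gi
"@ | kubectl apply -f -

Because of volumeBindingMode: WaitForFirstConsumer, the disk itself is provisioned only when the first pod that uses the claim is scheduled.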

The storage class above still attaches a managed disk to a virtual machine. We can use other provisioners, for example file.csi.azure.com, which mounts an Azure Files share instead; for such network-backed provisioners, the data can be accessed from any zone.

For the disk.csi.azure.com provisioner that mounts a managed disk to a virtual machine, we can choose one of the following SKUs:

  • Premium_LRS
  • Premium_ZRS
  • Standard_LRS
  • StandardSSD_LRS
  • StandardSSD_ZRS
  • UltraSSD_LRS
  • PremiumV2_LRS

As you can see, we have LRS and ZRS options.

For the file.csi.azure.com provisioner, which mounts a file share from an Azure storage account, three additional SKUs are available:

  • Standard_GRS: Standard geo-redundant storage
  • Standard_RAGRS: Standard read-access geo-redundant storage
  • Standard_RAGZRS: Standard read-access geo-zone-redundant storage

These three options replicate data to a secondary region. That can be useful when the whole primary region has problems, but the failover is manual, and it may cause some data loss because cross-region replication is asynchronous.

Conclusion

Zone resiliency in AKS is pivotal for ensuring the high availability of applications. When addressing AKS zone resiliency, you should consider the following limitations and best practices:

Networking

  • Azure Load Balancer is inherently zone-redundant, facilitating seamless deployment across multiple zones.
  • In contrast, the NAT gateway, which provides the best outbound traffic characteristics, is zonal. Implementing it requires creating a separate node pool for each zone and tying each one to its own virtual network subnet.

Storage

  • The best approach to persisting data in a zone-resilient AKS cluster is to use zone-redundant storage.
  • You may need to create custom storage classes to get zone-redundant storage.

While AKS offers various tools to ensure zone resiliency, it is important to consider which approaches will suit your cluster's technical characteristics and requirements best. Properly configuring node pools, understanding the intricacies of outbound traffic, and employing zone-redundant storage solutions can help achieve a genuinely resilient AKS deployment.

