This is the first article in the series Jenkins on Azure Kubernetes Cluster (AKS):

  1. Jenkins on Azure Kubernetes Cluster (AKS) - How to recover Jenkins controller from Azure availability zone or region failure

  2. Jenkins on Azure Kubernetes Cluster (AKS) - How to build container images with BuildKit, Jib, Kaniko and use Azure Files NFS, Azure Container Registry (ACR) and Azure Blob storage for caching

  3. Jenkins on Azure Kubernetes Cluster (AKS) - Flexible and cost effective Jenkins Agents with Kubernetes Cluster Autoscaler and Azure Spot Virtual Machines


Jenkins is still a very popular CI/CD tool used by a lot of companies. If you want to run it in Azure, there are the usual options to choose from, including standalone VMs or Azure Virtual Machine Scale Sets, but in this article we will use AKS for both the Jenkins controller and the agents.

When deploying any solution there are a lot of things to consider, for example security, maintenance overhead, and cost, but also reliability. Reliability is what we will focus on here; in particular, we will try to answer the question: how do we recover the Jenkins controller from an Azure zone or region failure?

Before we begin, we should put some boundaries around our expectations. Since this is a simple POC, we still want a near-instant RPO and RTO for zone failures, but we are fine with an hour or so of RPO and RTO in case of a regional failure; let's call it a "lunchtime" RTO (Jenkins has to be up before the dev team comes back from lunch).

Here are some of the storage offerings in Azure that can help us meet those requirements:

  • Azure NetApp Files - An enterprise-grade storage solution which can also be used with tools like Astra Control Service. In our case it might be overkill, especially since the minimum size of a capacity pool is 4 TB, which is far more than we need. It is worth considering, though, when planning a wider strategy for hosting stateful AKS workloads.

  • Azure Files - A PaaS offering for hosting SMB/NFS shares. The Premium tier might be fast enough to host the Jenkins controller home directory; however, there are a few problems I found with regional failover. You can only replicate to the paired region, which is sometimes not an option. Testing failover is potentially destructive if you do not keep an eye on synchronization. And once a failover is done, the storage account becomes LRS (locally redundant storage) and cannot be converted back to GZRS (geo-zone-redundant storage).

  • Managed Disks with Zone-redundant storage (ZRS) - ZRS managed disks are synchronously replicated across three availability zones inside a region, which simplifies zone failover: after a zone goes down, an AKS node in a different zone can attach the same disk. Unfortunately, ZRS disks are currently only available in a handful of Azure regions, but hopefully that will change soon.

Here are some of the components we will be using and configuring:

  • AKS deployed in a subnet with Azure Container Networking Interface (CNI) and NGINX Ingress Controller.
  • Azure Backup Vault (not to be confused with the Recovery Services vault)
  • Azure Zone Redundant Managed Disks and Snapshots

Setting up our environment

To test Availability Zone failover we will deploy an AKS cluster in the West US 2 region with two node pools, each in a different zone. After that we will deploy Jenkins on AKS, pinned to those zones through affinity rules, and at the end perform a test failover by deleting the active node and checking how Jenkins recovers.

Create VNet and AKS with NGINX Ingress Controller

Create a virtual network and subnets for AKS and the client test machine (which you might not need if you wish to use port forwarding instead).

az group create \
--location westus2 \
--name poc-rg

az network vnet create \
--name poc-vnet \
--resource-group poc-rg \
--location westus2 \
--address-prefix 10.0.0.0/16 \
--subnet-name aks-subnet \
--subnet-prefixes 10.0.1.0/24

az network vnet subnet create \
--address-prefix 10.0.2.0/24 \
--name client-subnet \
--resource-group poc-rg \
--vnet-name poc-vnet

Create AKS Cluster.

Note: In a real-life scenario you should also make the system node pool zone redundant (for example by passing --zones 1 2 3 to az aks create).

az aks create \
    -g poc-rg \
    -n aks-wu \
    --location westus2 \
    --enable-managed-identity \
    --load-balancer-sku standard \
    --node-count 1 \
    --nodepool-name systempool \
    --generate-ssh-keys \
    --network-plugin azure \
    --vnet-subnet-id /subscriptions/<subscription_id>/resourceGroups/poc-rg/providers/Microsoft.Network/virtualNetworks/poc-vnet/subnets/aks-subnet \
    --docker-bridge-address 172.17.0.1/16 \
    --dns-service-ip 10.2.0.10 \
    --service-cidr 10.2.0.0/24 \
    --node-vm-size Standard_B2s \
    --node-osdisk-size 30

az aks get-credentials --name aks-wu --resource-group poc-rg --overwrite-existing --admin

Provide permissions to the AKS managed identity so that it can attach managed disks located inside our resource group and allocate IPs in the cluster subnet.

export AZ_PRINCIPAL_ID=$(
az aks show -g poc-rg -n aks-wu \
    --query "identity.principalId" \
    --output tsv
)

az role assignment create --role "Contributor" --assignee $AZ_PRINCIPAL_ID --scope /subscriptions/<subscription_id>/resourceGroups/poc-rg
az role assignment create --role "Network Contributor" --assignee $AZ_PRINCIPAL_ID --scope /subscriptions/<subscription_id>/resourceGroups/poc-rg/providers/Microsoft.Network/virtualNetworks/poc-vnet

To deploy the NGINX Ingress Controller, first create a Helm values file internal-ingress.yaml with the following content. This will create another Azure Load Balancer and allocate a static private IP to the Ingress Controller service.

controller:
  service:
    loadBalancerIP: 10.0.1.251
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"

Then deploy NGINX Ingress Controller chart (we are using the community version):

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
    --namespace default \
    --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz \
    -f internal-ingress.yaml

Now we will add two new node pools to AKS, each in a different zone, to host our Jenkins controller.

az aks nodepool add \
    --name poolzone1 \
    --resource-group poc-rg \
    --cluster-name aks-wu \
    --node-count 1 \
    --vnet-subnet-id /subscriptions/<subscription_id>/resourceGroups/poc-rg/providers/Microsoft.Network/virtualNetworks/poc-vnet/subnets/aks-subnet \
    --node-vm-size Standard_B2s \
    --node-osdisk-size 30 \
    --zones 1

az aks nodepool add \
    --name poolzone2 \
    --resource-group poc-rg \
    --cluster-name aks-wu \
    --node-count 1 \
    --vnet-subnet-id /subscriptions/<subscription_id>/resourceGroups/poc-rg/providers/Microsoft.Network/virtualNetworks/poc-vnet/subnets/aks-subnet \
    --node-vm-size Standard_B2s \
    --node-osdisk-size 30 \
    --zones 2
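
To double-check that each new node really landed in the expected zone, you can look at the standard topology label that AKS sets on every node:

kubectl describe nodes | grep -e "Name:" -e "topology.kubernetes.io/zone"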

Create Managed Disk and Configure Jenkins Storage

Once the nodes are ready (you can verify this with kubectl get nodes), it is time to create the ZRS managed disk that will host the home directory of the Jenkins controller.

az disk create \
--name jenkins-md \
--resource-group poc-rg \
--sku StandardSSD_ZRS \
--location westus2 \
--size-gb 4

We also need to create an SC, PV, and PVC which will be used to attach this managed disk to the Jenkins controller pod. Here we also ensure that reclaimPolicy is set to Retain, since we are managing the disk externally to AKS.

jenkins-sc.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: jenkins-sc
provisioner: disk.csi.azure.com
parameters:
  skuname: StandardSSD_ZRS
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

jenkins-pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: jenkins-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 4Gi
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/<subscription_id>/resourceGroups/poc-rg/providers/Microsoft.Compute/disks/jenkins-md
  persistentVolumeReclaimPolicy: Retain
  claimRef:
    name: jenkins-pvc
    namespace: default
  storageClassName: jenkins-sc
  volumeMode: Filesystem

jenkins-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    meta.helm.sh/release-name: jenkins
    meta.helm.sh/release-namespace: default
    volume.beta.kubernetes.io/storage-provisioner: disk.csi.azure.com
    volume.kubernetes.io/storage-provisioner: disk.csi.azure.com
  labels:
    app.kubernetes.io/component: jenkins-controller
    app.kubernetes.io/instance: jenkins
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: jenkins
  name: jenkins-pvc
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
  storageClassName: jenkins-sc

Once the files are ready, apply them all with kubectl:

kubectl apply -f jenkins-sc.yaml
kubectl apply -f jenkins-pv.yaml
kubectl apply -f jenkins-pvc.yaml
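
You can sanity check the result with kubectl. Note that because the StorageClass uses volumeBindingMode: WaitForFirstConsumer, the PVC will typically stay Pending until the Jenkins pod is scheduled, which is expected:

kubectl get pv,pvc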

Deploy and configure Jenkins

Before we deploy Jenkins we need to prepare a Helm values file. Create a file jenkins.values.yaml with the following content.

We are setting a few things here:

  • Create an ingress resource
  • Specify affinity rules to ensure that the pod can only be scheduled in one of two specific Azure zones
  • Indicate we want to use our pre-created PVC jenkins-pvc

Content of jenkins.values.yaml

controller:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    hostName: jenkins.example.com
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - westus2-1
            - westus2-2
persistence:
  existingClaim: jenkins-pvc

Then we deploy Jenkins:

helm repo add jenkins https://charts.jenkins.io
helm repo update
helm install jenkins jenkins/jenkins \
  --namespace default \
  -f jenkins.values.yaml

To verify that the managed disk has been attached, check the latest events on the Jenkins pod with kubectl describe pod jenkins-0.

As you can also see, our pod has been scheduled onto the node pool in zone 2.

Type     Reason                  Age   From                     Message
----     ------                  ----  ----                     -------
Normal   Scheduled               67s   default-scheduler        Successfully assigned default/jenkins-0 to aks-poolzone2-27948138-vmss000000
Normal   SuccessfulAttachVolume  58s   attachdetach-controller  AttachVolume.Attach succeeded for volume "jenkins-pv"

Now, from a client machine that has private network access to the Jenkins subnet, we will try to open the Jenkins website. Since we are using an ingress resource, make sure to map the IP of the NGINX Ingress Controller (10.0.1.251 in our case) to the custom hostname we specified in the Helm values file.
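
If you don't have DNS set up for this test, one simple option on a Linux client is a static hosts entry mapping the hostname to the ingress IP:

echo "10.0.1.251 jenkins.example.com" | sudo tee -a /etc/hosts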

You can log in with the username admin and the generated password, which is stored in a secret called jenkins in the same namespace as the Jenkins deployment.
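
Assuming the chart's default secret layout (the password stored under the jenkins-admin-password key), you can retrieve it with:

kubectl get secret jenkins --namespace default \
  -o jsonpath="{.data.jenkins-admin-password}" | base64 --decode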

Once logged in, let's create a simple pipeline which we will use during the failover test. Choose New Item, then Pipeline, and copy and paste this content into the script box:

podTemplate (inheritFrom: 'default') {
    node(POD_LABEL) {
        stage('Run shell') {
            sh 'sleep 1h'
        }
    }
}

Now let's start the pipeline; you should see the following output in the pipeline console:

Running on test-pipline-sleep-26-hd5g0-vkbsf-zn1gb in /home/jenkins/agent/workspace/test-pipline-sleep
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Run shell)
[Pipeline] sh
+ sleep 1h

If you look at the running pods, you should see the one created for our Jenkins job test-pipline-sleep.

NAME                                        READY   STATUS    RESTARTS   AGE   IP          NODE                                 NOMINATED NODE   READINESS GATES
test-pipline-sleep-26-hd5g0-vkbsf-zn1gb     1/1     Running   0          71s   10.0.1.34   aks-poolzone1-11650587-vmss000000    <none>           <none>

Test Availability Zone failover

Now that the pipeline is running, let's simulate the loss of Availability Zone 2, which is where the Jenkins controller pod is running, by simply deleting the AKS node pool in that zone.
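
With the Azure CLI that looks like this (destructive, so only run it against the test environment):

az aks nodepool delete \
    --name poolzone2 \
    --cluster-name aks-wu \
    --resource-group poc-rg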

You will notice an error on the Jenkins website as the controller pod is gone.

If you run kubectl get pods -w you will notice that around two minutes pass between the original pod being deleted and the new one running.

NAME                                        READY   STATUS    RESTARTS   AGE
jenkins-0                                   2/2     Running   0          46s
jenkins-0                                   2/2     Terminating   0          65s
jenkins-0                                   0/2     Terminating   0          66s
jenkins-0                                   0/2     Terminating   0          66s
jenkins-0                                   0/2     Terminating   0          66s
jenkins-0                                   0/2     Pending       0          0s
jenkins-0                                   0/2     Pending       0          0s
jenkins-0                                   0/2     Init:0/1      0          0s
jenkins-0                                   0/2     Init:0/1      0          67s
jenkins-0                                   0/2     PodInitializing   0          81s
jenkins-0                                   1/2     Running           0          89s
jenkins-0                                   1/2     Running           0          105s
jenkins-0                                   2/2     Running           0          106s

When you describe the Jenkins pod you will see that we also hit a Multi-Attach error, which is most likely a race condition: the PV is still attached to the previous node while the new pod is trying to claim it, and since we set accessModes to ReadWriteOnce only one node can attach it at a time.

Eventually the pod recovers on its own and successfully attaches the volume (you can also watch the attachments directly, as shown after the events below).

Events:
  Type     Reason                  Age   From                     Message
  ----     ------                  ----  ----                     -------
  Normal   Scheduled               117s  default-scheduler        Successfully assigned default/jenkins-0 to aks-poolzone1-29081885-vmss000000
  Warning  FailedAttachVolume      117s  attachdetach-controller  Multi-Attach error for volume "jenkins-pv" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulAttachVolume  77s   attachdetach-controller  AttachVolume.Attach succeeded for volume "jenkins-pv"
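
If you want to follow the detach/attach cycle yourself, you can list the VolumeAttachment objects; the attachment to the old node should disappear shortly before the new one appears:

kubectl get volumeattachment -w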

Now that we have Jenkins running again, let's check the interface. You will notice that the pod running the job is still there since, luckily in our case, it was running on the node pool that wasn't affected.

Overall, the failover was quick and we retained all data up to the point when the zone went down.

Test Regional Failover

For a region outage we will rely on managed disk backups, which we will synchronize to another region; in our case the primary region is West US 2 and the failover region will be North Europe.

To back up the managed disk the pod is using, we could, for example, use the Azure Disks Container Storage Interface (CSI) driver to do it from within Kubernetes itself, but in this post, to make our life easier, we will use an Azure Backup Vault, since it also handles the orchestration for us, including scheduling and retention. The limitation here is that the shortest frequency of a scheduled backup is one hour, and on top of that you need to add the time it takes to copy the backup to another region; since the copy is incremental, it generally shouldn't take too long.

Configuring the Backup Vault and creating the first backup of our disk is actually quite easy; you can do it by following the Microsoft guide Back up Azure Managed Disks.

Once done, it will create incremental snapshots in the same region as the Backup Vault, which means we now need to copy the snapshots to our target region, in our case North Europe.

We can use Azure CLI for that:

snapshot_id=$(az snapshot show -n <source_snapshot_name> -g poc-rg --query [id] -o tsv)
az snapshot create -g poc-rg -n jenkins-disk-snapshot -l northeurope --source $snapshot_id --incremental --copy-start

Note: You can use the Azure CLI command az snapshot wait with the --custom flag to check whether the copy process has reached 100%.
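
As a rough sketch, something like the following should block until the copy completes (the exact JMESPath expression may need adjusting for your CLI version):

az snapshot wait \
    --resource-group poc-rg \
    --name jenkins-disk-snapshot \
    --custom 'completionPercent>=`100`'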

Once finished, you will have a zone-redundant snapshot copy, in our case called jenkins-disk-snapshot, in the North Europe region.

Now, to create a ZRS managed disk in North Europe from this snapshot, just choose Create Disk in the portal and make sure to pick the ZRS SKU, so that we can use a zone-redundant disk in our failover DR environment as well.
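
If you prefer the CLI over the portal, creating the ZRS disk from the copied snapshot should look roughly like this (the disk name jenkins-md-ne matches what we reference below):

ne_snapshot_id=$(az snapshot show -n jenkins-disk-snapshot -g poc-rg --query id -o tsv)

az disk create \
    --name jenkins-md-ne \
    --resource-group poc-rg \
    --location northeurope \
    --sku StandardSSD_ZRS \
    --source $ne_snapshot_id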

To test whether Jenkins works with the restored managed disk in the new region, deploy AKS the same way as we did at the beginning of this post, but in North Europe, and keep the following changes in mind:

The jenkins.values.yaml file will need its affinity rule updated to use the availability zone labels of North Europe:

controller:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    hostName: jenkins.example.com
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - northeurope-1
            - northeurope-2
persistence:
  existingClaim: jenkins-pvc

jenkins-pv.yaml will need to point to the new restored managed disk, in our case jenkins-md-ne:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: jenkins-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 4Gi
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/<subscription_id>/resourceGroups/poc-rg/providers/Microsoft.Compute/disks/jenkins-md-ne
  persistentVolumeReclaimPolicy: Retain
  claimRef:
    name: jenkins-pvc
    namespace: default
  storageClassName: jenkins-sc
  volumeMode: Filesystem

As a final confirmation that our environment in North Europe works, here is a screenshot from the client machine in the DR region pointing at Jenkins with the restored disk. As you can see, the pipeline we created is there.

Final Thoughts

We haven't covered failover of Jenkins agent storage, such as workspaces or cached artifacts, which you might want to preserve, so that is worth keeping in mind.

It might be worth keeping an eye on a new Azure offering that could be helpful here, AKS cluster persistent volume backup, but at the moment it is still in private preview.

We focused on Azure-native solutions in this post, but since Kubernetes has such a rich ecosystem of tools it is worth exploring at least some of them; here are a few examples: Trilio, Kasten, Portworx.