Site Reliability Engineering with Gitops & Kubernetes – Part 1

Intro

One of the key pillars regarding SRE is being able to make quantitative decisions based on key metrics.

The major challenge is what are key metrics and this is testament to the plethora of monitoring software out in the wild today.

At a foundational level you want to ensure your services are always running, however 100% availability in not practical.

Class SRE implements Devops
{
  MeasureEverything()
}

Availability/Error Budget

You then figure out what availability is practical for your application and services.

Availability

Your error budget will then be the downtime figure e.g. 99% is 7.2 hours of downtime a month that you can afford to have.

SLAs, SLOs and SLIs

This will be the starting point on your journey to implementing quantitative analysis to

  • Service Level Agreements
  • Service Level Objectives
  • Service Level Indicators

This blog post is all about how you can measure Service Level Objectives without breaking the bank. You do not need to spend millions of dollars on bloated monitoring solutions to observe key metrics that really impact your customers.

Just like baking a cake, these are the ingredients we will use to implement an agile, scaleable monitoring platform that is solely dedicated to doing one thing well.

Outcome

This is what we want our cake to deliver:

  • Measuring your SLA Compliance Level
  • Measuring your Error Budget Burn Rate
  • Measuring if you have exhausted your error budget
Service Level Compliance – SLAs -> SLOs -> SLIs

If you look at the cake above, you can see all your meaningful information in one dashboard.

  1. Around 11am the error budget burn rate went up. (A kin to your kids spending all their pocket money in one day!)
  2. Compliance was breached (99% availability) – The purple line (burn rate) went above the maximum budget (yellow line)

These are metrics you will want to ALERT on at any time of the day. These sort of metrics matter. They are violating a Service Level Agreement.

What about my other metrics?

Aaaah, like %Disk Queue Length, Processor Time, Kubernetes Nodes/Pods/Pools etc? Well…

I treat these metrics as second class citizens. Like a layered onion. Your initial metrics should be – Am I violating my SLA? If not, then you can use the other metrics that we have enjoyed over the decades to compliment your observeability into the systems and as a visual aid for diagnostics and troubleshooting.

Alerting

Another important consideration is the evolution of infrastructure. In 1999 you will have wanted to receive an alert if a server ran out of disk space. In 2020, you are running container orchestration clusters and other high availability systems. A container running out of disk space is not so critical as it used to be in 1999.

  • Evaluate every single alert you have and ask yourself. Do I really need to wake someone up at 3am for this problem?
  • Always alert on Service Level Compliance levels ABOUT to breach
  • Always alert on Error Budget Burn Rates going up sharply
  • Do not alert someone out of hours because the CPU is 100% for 5 minutes unless Service Level Compliance is being affected to

You will have happier engineers and a more productive team. You will be cooled headed during an incident because you know the different between a cluster node going down versus Service Level Compliance violations. Always solve the Service Level Compliance and then fix the other problems.

Ingredients

Where are the ingredients you promised? You said it will not break the bank, I am intrigued.

Summary

In this post we have touched on the FOUNDATION of what we want out of monitoring.

We know exactly what we want to measure – Service Level Compliance, Error Budget Burn Rate and Max Budget. All this revolves around deciding on the level of availability we give a service.

We touched on the basic ingredients that we can use to build this solution.

In my next blog post we will discuss how we mix all these ingredients together to provide a platform that is is good at doing one thing well.

Measuring Service Level Compliance & Error Budget Burn Rates

When you give your child $30 to spend a month and they need to save $10 a month. You need to be alerted if they spending too fast (Burn Rate).

Installing Kubernetes – The Hard Way – Visual Guide

This is a visual guide to compliment the process of setting up your own Kubernetes Cluster on Google Cloud. This is a visual guide to Kelsey Hightower GIT project called Kubernetes The Hard Way. It can be challenging to remember all the steps a long the way, I found having a visual guide like this valuable to refreshing my memory.

Provision the network in Google Cloud

VPC

Provision Network

Firewall Rules

External IP Address

Provision Controllers and Workers – Compute Instances

Controller and Worker Instances

Workers will have pod CIDR

10.200.0.0/24

10.200.1.0/24

10.200.2.0/24

Provision a CA and TLS Certificates

Certificate Authority

Client & Server Certificates

Kubelet Client Certificates

Controller Manager Client Certificates

Kube Proxy Client Certificates

Scheduler Client Certificates

Kubernetes API Server Certificate

Reference https://github.com/kelseyhightower/kubernetes-the-hard-way/blob/master/docs/04-certificate-authority.md

Service Account Key Pair

Certificate Distribution – Compute Instances

Generating Kubernetes Configuration Files for Authentication

Generating the Data Encryption Config and Key

Bootstrapping etcd cluster

Use TMUX set synchronize-panes on to run on multiple instances at same time. Saves time!

Notice where are using TMUX in a Windows Ubuntu

Linux Subsystem and running commands in parallel to save a lot of time.

The only manual command is actually ssh into each controller, once in, we activate tmux synchronize feature. So what you type in one panel will duplicate to all others.

Bootstrapping the Control Pane (services)

Bootstrapping the Control Pane (LB + Health)

Required Nginx as Google health checks does not support https

Bootstrapping the Control Pane (Cluster Roles)

Bootstrapping the Worker Nodes

Configure kubectl remote access

Provisioning Network Routes

DNS Cluster Add-On

First Pod deployed to cluster – using CoreDNS

Smoke Test

Once you have completed the install of your kubernetes cluster, ensure you tear it down after some time to ensure you do not get billed for the 6 compute instances, load balancer and public statis ip address.

A big thank you to Kelsey for setting up a really comprehensive instruction guide.

Minikube + CloudCode + VSCode – WindDevelopment Environment

As a developer you can deploy your docker containers to a local Kubernetes cluster on your laptop using minikube. You can then use Google Cloud Code extension for Visual Studio Code.

You can then make real time changes to your code and the app will deploy in the background automatically.

  1. Install kubectl – https://kubernetes.io/docs/tasks/tools/install-kubectl/
  2. Install minikube – https://kubernetes.io/docs/tasks/tools/install-minikube/
    1. For Windows users, I recommend the Chocolaty approach
  3. Configure Google Cloud Code to use minikube.
  4. Deploy your application to your local minikube cluster in Visual Studio Code
  5. Ensure you add your container registry in the .vscode\launch.json file – See Appendix

Ensure you running Visual Studio Code as Administrator.

Once deployed, you can make changes to your code and it will automatically be deployed to the cluster.

Quick Start – Create minikube Cluster in Windows (Hyper-V) and deploy a simple web server.

minikube start --vm-driver=hyperv
kubectl create deployment hello-minikube --image=k8s.gcr.io/echoserver:1.10
kubectl expose deployment hello-minikube --type=NodePort --port=8080
kubectl get pod
minikube service hello-minikube --url
minikube dashboard

Grab the output from minikube service hello-minikube –url and browse your web app/service.

Appendix

Starting the Cluster and deploying a default container.

VS Code Deployment

  • Setup your Container Registry in the .vscode\launch.json
  • Click Cloud Code on the bottom tray
  • Click β€œRun on Kubernetes”
  • Open a separate command prompt as administrator

.vscode\launch.json

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Run on Kubernetes",
            "type": "cloudcode.kubernetes",
            "request": "launch",
            "skaffoldConfig": "${workspaceFolder}\\skaffold.yaml",
            "watch": true,
            "cleanUp": true,
            "portForward": true,
            "imageRegistry": "romikocontainerregistry/minikube"
        },
        {
            "podSelector": {
                "app": "node-hello-world"
            },
            "type": "cloudcode",
            "language": "Node",
            "request": "attach",
            "debugPort": 9229,
            "localRoot": "${workspaceFolder}",
            "remoteRoot": "/hello-world",
            "name": "Debug on Kubernetes"
        }
    ]
}
minikube dashboard

We can see our new service is being deployed by VSCode Cloud Code extension. Whenever we make changes to the code, it will automatically deploy.

minikube service nodejs-hello-world-external --url

The above command will give us the url to browse the web app.

If I now change the text for Hello, world! It will automatically deploy. Just refresh the browser πŸ™‚

Here in the status bar we can see deployments as we update code.

Debugging

Once you have deployed your app to Minikube, you can then kick off debugging. This is pretty awesome. Basically your development environment is now a full Kubernetes stack with attached debugging proving a seamless experience.

Check out https://cloud.google.com/code/docs/vscode/debug#nodejs for more information.

You will notice in the launch.json file we setup the debugger port etc. Below I am using port 9229. So all I need to do is start the app with

CMD [“node”, “–inspect=9229”, “app.js”]

or in the launch.json set the “args”: [“–inspect=9229”]. Only supported in launch request type.

Also ensure the Pod Selector is correct. You can use the pod name or label. You can confirm the pod name using the minikube dashboard.

http://127.0.0.1:61668/api/v1/namespaces/kubernetes-dashboard/services/http:kubernetes-dashboard:/proxy/#/pod?namespace=default

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Run on Kubernetes",
            "type": "cloudcode.kubernetes",
            "request": "launch",
            "skaffoldConfig": "${workspaceFolder}\\skaffold.yaml",
            "watch": true,
            "cleanUp": true,
            "portForward": true,
            "imageRegistry": "dccausbcontainerregistry/minikube",
            "args": ["--inspect=9229"]
        },
        {
            "name": "Debug on Kubernetes",
            "podSelector": {
                "app": "nodejs-hello-world"
            },
            "type": "cloudcode",
            "language": "Node",
            "request": "attach",
            "debugPort": 9229,
            "localRoot": "${workspaceFolder}",
            "remoteRoot": "/hello-world"
        }
    ]
}

Now we can do the following

  1. Click Run on Kubernetes
  2. Set a Break Point
  3. Click Debug on Kubernetes

TIPS

  • Run command prompt, powershell and vscode as Administrator
  • Use Choco for Windows installs
  • If you going to reboot/sleep/shutdown your machine. Please run:
minikube stop

If you do not, you risk corrupting hyper-v and you will get all sorts of issues.