One of the key pillars of SRE is being able to make quantitative decisions based on key metrics.
The major challenge is deciding which metrics are key, and the plethora of monitoring software out in the wild today is testament to that.
At a foundational level you want to ensure your services are always running; however, 100% availability is not practical.
Class SRE implements DevOps
You then figure out what availability is practical for your application and services.
Your error budget will then be the downtime figure, e.g. a 99% target gives you 7.2 hours of downtime a month that you can afford to have.
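As a quick sketch, the downtime allowance for a given availability target can be computed directly. This assumes a 30-day (720-hour) month, which is where the 99% → 7.2 hours figure comes from:

```python
# Convert an availability target into a monthly error budget (downtime allowance).
# Assumes a 30-day month, i.e. 720 hours.

def error_budget_hours(availability_pct: float, hours_in_month: float = 720.0) -> float:
    """Hours of downtime per month allowed by the availability target."""
    return round((1.0 - availability_pct / 100.0) * hours_in_month, 2)

print(error_budget_hours(99.0))   # 7.2 hours per month
print(error_budget_hours(99.9))   # 0.72 hours, i.e. about 43 minutes
```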
SLAs, SLOs and SLIs
These will be the starting point on your journey to implementing quantitative analysis:
Service Level Agreements
Service Level Objectives
Service Level Indicators
This blog post is all about how you can measure Service Level Objectives without breaking the bank. You do not need to spend millions of dollars on bloated monitoring solutions to observe key metrics that really impact your customers.
Just like baking a cake, these are the ingredients we will use to implement an agile, scalable monitoring platform that is solely dedicated to doing one thing well.
This is what we want our cake to deliver:
Measuring your SLA Compliance Level
Measuring your Error Budget Burn Rate
Measuring if you have exhausted your error budget
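The first two of these can be sketched with a few lines of arithmetic. This is a minimal illustration, assuming a request-based SLI (good requests over total requests) and a 99% availability SLO; your own SLI may be time-based or latency-based instead:

```python
# Minimal sketch of SLA compliance and error budget consumption,
# assuming a request-based SLI and a 99% availability target.

SLO_TARGET = 0.99  # 99% availability

def compliance(good: int, total: int) -> float:
    """SLA compliance level: fraction of requests that were good."""
    return good / total

def budget_consumed(good: int, total: int) -> float:
    """Fraction of the error budget consumed.
    1.0 means the budget is exhausted; above 1.0 the SLO is breached."""
    allowed_errors = (1.0 - SLO_TARGET) * total
    actual_errors = total - good
    return actual_errors / allowed_errors

# 100,000 requests with 500 failures: still compliant (99.5%),
# but half the error budget is already gone.
print(compliance(99_500, 100_000))                # 0.995
print(round(budget_consumed(99_500, 100_000), 3)) # 0.5
```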
If you look at the cake above, you can see all your meaningful information in one dashboard.
Around 11am the error budget burn rate went up. (Akin to your kids spending all their pocket money in one day!)
Compliance was breached (99% availability) – The purple line (burn rate) went above the maximum budget (yellow line)
These are metrics you will want to ALERT on at any time of the day. These kinds of metrics matter: they show a Service Level Agreement being violated.
What about my other metrics?
Aaaah, like %Disk Queue Length, Processor Time, Kubernetes Nodes/Pods/Pools etc? Well…
I treat these metrics as second-class citizens, like the layers of an onion. Your first question should be: am I violating my SLA? If not, then you can use the other metrics that we have enjoyed over the decades to complement your observability into the systems and as a visual aid for diagnostics and troubleshooting.
Another important consideration is the evolution of infrastructure. In 1999 you would have wanted to receive an alert if a server ran out of disk space. In 2020, you are running container orchestration clusters and other highly available systems. A container running out of disk space is not as critical as it was in 1999.
Evaluate every single alert you have and ask yourself: do I really need to wake someone up at 3am for this problem?
Always alert on Service Level Compliance levels ABOUT to breach
Always alert on Error Budget Burn Rates going up sharply
Do not alert someone out of hours because the CPU has been at 100% for 5 minutes, unless Service Level Compliance is being affected too
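The alerting rules above can be sketched in a few lines. This is a hedged illustration loosely following the multi-window burn-rate idea popularised by Google's SRE material; the 14.4 threshold and the window sizes are illustrative assumptions, not values from this post:

```python
# Illustrative burn-rate paging check: page only when both a long window
# and a short window burn the error budget fast, so the problem is both
# sustained and still happening. Threshold of 14.4 is a common example
# value for a 1-hour/5-minute window pair, used here as an assumption.

SLO_TARGET = 0.99

def burn_rate(good: int, total: int) -> float:
    """How fast the error budget is burning: 1.0 = exactly on budget."""
    error_rate = (total - good) / total
    return error_rate / (1.0 - SLO_TARGET)

def should_page(long_window: tuple, short_window: tuple,
                threshold: float = 14.4) -> bool:
    """Page only if BOTH windows exceed the burn-rate threshold."""
    return (burn_rate(*long_window) >= threshold and
            burn_rate(*short_window) >= threshold)

# An hour at a 20% failure rate, confirmed by the last 5 minutes: page.
print(should_page(long_window=(8_000, 10_000), short_window=(400, 500)))  # True
```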
You will have happier engineers and a more productive team. You will be cool-headed during an incident because you know the difference between a cluster node going down and a Service Level Compliance violation. Always solve the Service Level Compliance problem first and then fix the other problems.
Where are the ingredients you promised? You said it will not break the bank, I am intrigued.
A Kubernetes cluster – Google Kubernetes Engine, Azure Kubernetes Service, etc.
ArgoCD – Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes.
After investigating an issue with Azure Stream Analytics, we discovered it cannot deserialise JSON objects that have property names differing only in case, e.g.
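The original payload is not reproduced here; a hypothetical example of the shape that triggers the failure is two properties whose names differ only in case:

```json
{
  "transactionId": "abc-123",
  "TransactionID": "abc-123"
}
```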
If you send the above payload to a Streaming Analytics Job, it will fail.
Source ‘<unknown_location>’ had 1 occurrences of kind ‘InputDeserializerError.InvalidData’ between processing times ‘2020-03-30T00:19:27.8689879Z’ and ‘2020-03-30T00:19:27.8689879Z’. Could not deserialize the input event(s) from resource ‘Partition: , Offset: , SequenceNumber: ’ as Json. Some possible reasons: 1) Malformed events 2) Input source configured with incorrect serialization format
We opened a ticket with Microsoft. This was the response.
"Thank you for being patience with us. I had further discussion with our ASA PG and here's our findings.
ASA unfortunately does not support case sensitive column. We understand it is possible for json documents to add to have two columns that differ only in case and that some libraries support it. However there hasn’t been a compelling use case to support it. We will update the documentation as well.
We are sorry for the inconvenience. If you have any questions or concerns, please feel free to reach out to me. I will be happy to assist you.”
Indeed other libraries do support this, such as PowerShell, C# and Python.
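For example, Python's standard `json` module keeps both keys without complaint, since JSON object member names are case sensitive:

```python
# Python's json module deserialises property names that differ only in case:
# both keys survive in the resulting dict.
import json

doc = json.loads('{"transactionId": "abc-123", "TransactionID": "abc-123"}')
print(sorted(doc.keys()))  # ['TransactionID', 'transactionId']
```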
A significant reason why Microsoft should support it is the Elastic Common Schema (ECS), a new specification that provides a consistent and customizable way to structure your data in Elasticsearch, facilitating the analysis of data from diverse sources. With ECS, analytics content such as dashboards and machine learning jobs can be applied more broadly, searches can be crafted more narrowly, and field names are easier to remember.
When introducing a new schema, there is always the question of how to deal with existing/custom data. Elastic have an ingenious way to solve this: all fields in ECS are lowercase, so your existing data is guaranteed not to conflict if you use uppercase field names.
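For illustration (the custom field name here is hypothetical), a document can mix lowercase ECS fields with capitalized custom fields and the two namespaces can never collide:

```json
{
  "source": { "ip": "10.0.0.1" },
  "CustomerRegion": "emea"
}
```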
When you are dealing with millions of events per day (in JSON format), you need a debugging tool to deal with events that do not behave as expected.
Recently we had an issue where an Azure Stream Analytics job was in a degraded state. A colleague eventually found the issue to be with the output of the job.
The error message was very misleading.
[11:36:35] Source 'EventHub' had 76 occurrences of kind 'InputDeserializerError.TypeConversionError' between processing times '2020-03-24T00:31:36.1109029Z' and '2020-03-24T00:36:35.9676583Z'. Could not deserialize the input event(s) from resource 'Partition: , Offset: , SequenceNumber: ' as Json. Some possible reasons: 1) Malformed events 2) Input source configured with incorrect serialization format\r\n"
The source of the issue was CosmosDB; we needed to increase the RUs. However, the error seemed to indicate a serialisation issue.
We developed a tool that could subscribe to events at exactly the same time as the error, using the sequence number and partition.
We also wanted to be able to use the tool for a large number of events, around 1 million per hour.
Please click the link to the EventHub .NET client. The tool is optimised to use as little memory as possible and to leverage asynchronous file writes for an optimal event subscription experience (a console app, of course).
We have purposely avoided the Newtonsoft library for the final file write to improve performance.
The output will be a JSON array of events.
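The tool itself is .NET and linked above; as a rough illustration of the technique only (not the tool's actual code), the trick is to stream the JSON array to disk one event at a time so that millions of events never need to sit in memory at once:

```python
# Illustrative sketch, not the tool's implementation: write a JSON array
# incrementally, serialising one event at a time instead of the whole batch.
import json
import os
import tempfile

def write_events_as_json_array(path, events):
    """Stream an iterable of events to disk as a single JSON array."""
    with open(path, "w") as f:
        f.write("[")
        for i, event in enumerate(events):
            if i:
                f.write(",")  # separator before every event after the first
            f.write(json.dumps(event))
        f.write("]")

# Usage: a generator means the events are never all in memory together.
path = os.path.join(tempfile.gettempdir(), "events.json")
write_events_as_json_array(path, ({"seq": n} for n in range(3)))
with open(path) as f:
    print(json.load(f))  # [{'seq': 0}, {'seq': 1}, {'seq': 2}]
```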
The next time you need to subscribe to Event Hubs to diagnose an issue with a particular event, I recommend using this tool to get the events you are interested in analysing.
When a product has proved to be a success and has just come out of an MVP (Minimum Viable Product) or MMP (Minimum Marketable Product) state, a lot of corners will usually have been cut in order to get the product out and act on the valuable feedback. So inevitably there will be technical debt to take care of.
What is important is having a technical vision that will reduce costs and deliver value and impact while remaining scalable, resilient and reliable, and which can then be communicated to all stakeholders.
A lot of cost savings can be made when scaling out by putting together a Cloud Architecture Roadmap. The roadmap can then be communicated to your stakeholders, development teams and, most importantly, finance. It will provide a high-level "map" of where you are now and where you want to be at some point in the future.
A roadmap is ever-changing, just like when my wife and I go travelling around the world. We will have a roadmap of where we want to go for a year but are open to making changes halfway through the trip, e.g. if an earthquake hits a country we planned to visit. The same is true in IT: sometimes budgets are cut, or a budget surplus needs to be consumed, and such events can affect your roadmap.
It is something that you want to review on a regular schedule. Most importantly you want to communicate the roadmap and get feedback from others.
Feedback from other engineers and stakeholders is crucial – they may spot something that you did not or provide some better alternative solutions.
The first stage is to decompose your ideas. Below is a list that helps get me started in the right direction. It is by no means an exhaustive list and will differ based on your industry.
Where apps are hosted
File Store
Block Store
Object Store
CDN
Message
Database
Cache
Once you have an idea of all your components, the next step is to break your roadmap down into milestones that will ultimately assist in reaching your final/target state. Which of course will not be final in a few years' time 😉 or even months!
Below is a link to a Google Slides presentation that you can use for your roadmap.
There are several decisions we make every day, some conscious and many subconscious. We have a bit more control over the conscious decisions we make in the workplace from an architecture perspective.
Whether you work in development, DevOps or site reliability, or as a technical product owner, architect or even a contractor/consultant, you will be contributing to significant engineering decisions.
What Database Technology should we use?
What Search Technology will we use that can scale and do we leverage eventual consistency?
What Container Orchestration or Microservice platform shall we use?
A decision made in 2016 may have been perfectly valid for the technology choices of that time. Fast forward to 2019 and, faced with the exact same decision, your solution may be entirely different.
This is absolutely normal, and this is why it is important to have a "journal" where you outline the key reasons/rationale for a significant architecture decision.
It lays the foundation to effectively communicate with stakeholders and to "sell" your decisions to others; even better, to collaborate with others in a manner that is constructive to evaluating feedback and adjusting key decisions.
I keep a journal of decisions and use a PowerShell-inspired naming convention of Verb-Noun. Secondly, I look at what is trending in the marketplace to use as a guide post. So for a logging/tracing/metrics stack, I might start off with reference materials.
This allows me to keep on track with what the industry is doing and forces me to keep up to date with best practices.
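For example (these entries are hypothetical), the journal ends up as a flat folder of Verb-Noun records:

```
decisions/
  Choose-MetricsTracingLoggingStack.md
  Choose-ContainerOrchestrator.md
  Adopt-GitOpsDeployments.md
```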
Below is a sample Decision Record that I use. I hope you find it useful. I often use them when onboarding consultants/contractors or new members of the team. It is a great way for them to gain insight into the past and where we are going.
In the next blog post, I will discuss formulating an Architecture Roadmap and how to effectively communicate your vision with key stakeholders. Until then, happy decisions and do not be surprised when you sometimes convince yourself out of a bad decision that you made 😉
Now…How do I tell my wife we should do this at home when buying the next sofa?
TITLE (Verb-Description-# e.g. Choose-MetricsTracingLoggingStack)
<what is the issue that we’re seeing that is motivating this decision or change.>
<what boundaries are in place e.g. cost, technology knowledge/resources at hand>
<what is the change/transformation that we’re actually proposing or doing.>
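A hypothetical filled-in record, following the template above (the scenario and any product choices are illustrative only), might look like this:

```
TITLE: Choose-MetricsTracingLoggingStack

Context: We have no central view of service health; each team ships logs
to a different place and there is no tracing at all.

Constraints: Small team, limited budget, existing Kubernetes estate.

Decision: Adopt a single open source observability stack, run it in-cluster,
and revisit the choice in 12 months as managed offerings mature.
```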