Site reliability engineering (SRE) is now a widely adopted discipline. This write-up discusses the monitoring and measurement needed to make services more reliable. The question here is: which SRE goals are observed to assess the outcomes?
As a preface to this explanation, it is essential to know the following:
1. What is SRE, and how is it different from traditional operations?
Please check the details here (SRE blog1), because just renaming the operations team doesn't make it an SRE team.
2. How are SRE and DevOps similar, and how do they differ?
Read it here (SRE blog2).
To gauge the impact SREs (site reliability engineers) have on the operational stability, user experience, and performance of the delivered services, we must first establish whether these aspects can be measured and quantified. Measuring requires proper monitoring. However, SRE practices talk about observability rather than monitoring. Observability is a more holistic approach to observing and measuring the critical KPIs of the services, and it entails monitoring, logging, and tracing.
The objective of SRE is to maintain and enhance the reliability of the services and associated products. SRE principles don't mandate 100% uptime; instead, they help decide and quantify service goals. These goals depend on various factors, chiefly the agreement with the customer on service performance and the desired level of user satisfaction.
The terms associated with these goals are SLA, SLI, SLO, and error budget.
- SLA (service level agreement): This is the service level agreed upon by the customer and the service provider. If the service provider cannot meet the agreed service levels, it is subject to the agreed penalties. SLAs give the consumer a fair idea of the reliability, performance, and functionality of the service in question.
- SLI (service level indicator): This is a quantitative measurement of a specific aspect of the service. SLIs should be the service characteristics that have the greatest impact on customer experience, so they are always product/service centric. One of the most popular ways to determine SLIs is the four golden signals: latency, traffic, errors, and saturation. An alternative is the USE method (utilization, saturation, and errors). There can be others, but the golden signals are the most common.
- SLO (service level objective): This is the target threshold for an SLI that SRE teams should maintain. If the service level falls below the SLO, the risk of breaching the SLA increases, and more effort must go into maintaining the reliability of the service. Conversely, if the service level is above the SLO, reliability is on track, and the SREs can dedicate more time to bug fixing and feature creation.
- Error budget: As used in this write-up, the error budget is the difference between the SLO and SLA levels. It is the margin for error available to the SRE team before the SLA is breached. In simple words, the SLO is the service level the team aims to maintain. If an SLO is breached, the team still has room to fix and restore the service before the SLA (the service level agreed with the customer) is breached.
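The relationship between these terms can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the `ServiceLevel` class, the function names, the request counts, and the 99% / 99.5% targets are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Availability targets as percentages (hypothetical structure)."""
    sla: float  # level agreed with the customer
    slo: float  # stricter internal objective


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: percentage of requests served successfully."""
    if total_requests == 0:
        return 100.0  # assumption: no traffic is treated as fully available
    return 100.0 * good_requests / total_requests


def assess(sli: float, level: ServiceLevel) -> str:
    """Classify a measured SLI against the SLO and the SLA."""
    if sli >= level.slo:
        return "meeting SLO"
    if sli >= level.sla:
        return "SLO breached, SLA intact"
    return "SLA breached"


# Illustrative numbers only.
level = ServiceLevel(sla=99.0, slo=99.5)
sli = availability_sli(good_requests=99_300, total_requests=100_000)
print(f"SLI = {sli:.2f}% -> {assess(sli, level)}")
```

With these assumed counts the SLI is 99.30%, which falls between the two targets: the SLO is breached, but the team still has room to recover before the SLA is breached.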
A real-life example of SLI/SLA/SLO and error budget
Consider a retail shopping web service on which customers place orders. The IT team that maintains this web service agrees with the business to provide it with 99% availability and latency under 100 milliseconds. So, in this case, there are two SLAs:
a) Availability: 99%
b) Latency: <100 milliseconds
The IT team will choose availability and latency as the SLIs and then set the SLOs. The SLOs must be stricter than the SLAs. For example, the availability SLO can be 99.5%, and the latency SLO can be 90 milliseconds. In this case, the error budgets for availability and latency are 0.5 percentage points and 10 milliseconds, respectively.
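To make the availability figures more tangible, the percentages can be converted into minutes of downtime. The sketch below uses the SLA and SLO values from the example; the 30-day measurement window and the helper name `allowed_downtime_minutes` are assumptions, since the example does not state a window.

```python
# Targets from the retail example above.
SLA_AVAILABILITY = 99.0   # percent
SLO_AVAILABILITY = 99.5   # percent
SLA_LATENCY_MS = 100
SLO_LATENCY_MS = 90

# Assumption: a 30-day window; the example does not specify one.
WINDOW_DAYS = 30
WINDOW_MINUTES = WINDOW_DAYS * 24 * 60


def allowed_downtime_minutes(availability_pct: float, window_minutes: int) -> float:
    """Downtime permitted in the window at a given availability target."""
    return window_minutes * (100.0 - availability_pct) / 100.0


# Error budget as defined in this write-up: the gap between SLO and SLA.
availability_budget_pct = SLO_AVAILABILITY - SLA_AVAILABILITY
latency_budget_ms = SLA_LATENCY_MS - SLO_LATENCY_MS

downtime_at_sla = allowed_downtime_minutes(SLA_AVAILABILITY, WINDOW_MINUTES)
downtime_at_slo = allowed_downtime_minutes(SLO_AVAILABILITY, WINDOW_MINUTES)

print(f"Availability budget: {availability_budget_pct} percentage points")
print(f"Latency budget: {latency_budget_ms} ms")
print(f"Downtime allowed at SLA: {downtime_at_sla:.0f} min per {WINDOW_DAYS} days")
print(f"Downtime allowed at SLO: {downtime_at_slo:.0f} min per {WINDOW_DAYS} days")
```

Under the 30-day assumption, 99% availability allows 432 minutes of downtime and 99.5% allows 216 minutes, so the gap between the SLO and the SLA corresponds to 216 minutes.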
How to choose the right SLOs
Choosing appropriate SLO levels is very important. Consider the following points while choosing SLOs:
- Always choose SLOs stricter than the SLAs; an SLO that is looser than its SLA gives no warning before a breach.
- Consider the historical performance over the last 3, 6, or 9 months. Current performance alone cannot be the selection criterion.
- Don't aim for ideal SLOs; be practical and take the service's performance and architecture into account. In most cases, an SLO should therefore not be 100%.
- Keep the number of SLOs small, and cover only the service's most important characteristics. SLOs on unimportant parameters can shift the team's focus toward hitting non-critical targets, which is undesirable.
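The first three points can be turned into a simple sanity check for a candidate availability SLO. The following is only a sketch under stated assumptions: the function `review_slo`, the monthly history values, the three-month minimum, and the 80% "months met" threshold are all hypothetical and should be adapted to the service at hand.

```python
from typing import List, Sequence

# Assumption: the candidate should have been met in at least 80% of past months.
MIN_MET_FRACTION = 0.8
# Assumption: at least 3 months of history are needed for a meaningful check.
MIN_HISTORY_MONTHS = 3


def review_slo(candidate: float, sla: float, monthly_history: Sequence[float]) -> List[str]:
    """Return a list of concerns about a candidate availability SLO (in percent)."""
    findings = []
    if candidate <= sla:
        findings.append("SLO is not stricter than the SLA")
    if candidate >= 100.0:
        findings.append("a 100% SLO is rarely practical")
    if len(monthly_history) < MIN_HISTORY_MONTHS:
        findings.append("fewer than 3 months of history")
    else:
        # Count the months in which the service would have met the candidate.
        met = sum(1 for month in monthly_history if month >= candidate)
        if met / len(monthly_history) < MIN_MET_FRACTION:
            findings.append("history suggests the target is not achievable")
    return findings


# Hypothetical monthly availability figures (percent) for the last six months.
history = [99.7, 99.6, 99.8, 99.4, 99.9, 99.7]

for candidate in (99.0, 99.5, 99.9):
    concerns = review_slo(candidate, sla=99.0, monthly_history=history)
    print(candidate, concerns or "no concerns")
```

With this made-up history, a 99.5% candidate raises no concerns, a 99.0% candidate fails because it is not stricter than the SLA, and a 99.9% candidate is flagged because the service reached it in only one of the six months.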