SLI, SLO, and SLA: Your Terminology Guide

Why?

Understanding and using SLI (Service Level Indicator), SLO (Service Level Objective), and SLA (Service Level Agreement) matters for established businesses and startups alike, whether they are developing new features or launching new applications, for several reasons:

Setting Goals and Expectations

SLI and SLO make it possible to define and measure clear performance and availability goals for a service. This matters to a business because it sets specific expectations about the quality of service provided. For example, by choosing service availability percentage or average response time as an SLI, a business can set the corresponding SLO at a level that meets customer or user needs.

Ensuring Quality and Reliability

Using SLI and SLO helps businesses ensure the high quality and reliability of their services or products. Having well-defined SLOs enables developers and engineers to aim for specific goals when developing new features or application updates.

Monitoring and Management

SLI and SLO provide the foundation for real-time performance monitoring of a service. This allows for timely detection and response to potential issues or failures, minimizing downtime and improving overall user experience.

Assessing the Effectiveness of New Features

When developing new features or launching new products, SLI and SLO can be used to assess their effectiveness. For instance, if a new feature impacts response time or error rates, analyzing changes in SLI and SLO can help evaluate the positive or negative impact on user experience.

Aligning with Customers and Partners

SLA, based on SLI and SLO, is an important tool for establishing agreements with customers and partners. It gives them confidence that the business is prepared to deliver the agreed level of service and will provide compensation if the agreed standards are not met.

What Are They?

The concepts of SLA (Service Level Agreement), SLO (Service Level Objective), and SLI (Service Level Indicator) can be visualized as a pyramid, with SLA at the top, representing the overarching and general agreement, while SLO and SLI are positioned below, refining and detailing this agreement.

This pyramid reflects the hierarchical structure and interrelationship between SLA, SLO, and SLI. SLA represents the overall commitment to achieving a certain level of service, SLO specifies this level with target metrics, and SLI provides data to assess the performance against these targets.

It's important to understand that each level of this pyramid plays its role in ensuring a quality and reliable service. While the SLA sets the foundational agreement, without sufficiently specific SLOs and accurate SLIs, fulfilling and evaluating the SLA can be challenging. Therefore, this entire hierarchy is important for businesses and the development of new features or application launches, as it provides a clear understanding of service requirements and enables monitoring in line with these requirements.

SLI

Let's start our discussion with SLI, which stands for Service Level Indicator.

Imagine a fairly simple service with a single method: search. This example is sufficient for our purposes, so let's focus on it.

So, we have a service and metrics associated with it. Essentially, we have data that can be measured and analyzed. Our task is to identify key indicators that will help us evaluate the service's performance, its response time to requests, and possibly the speed of processing these requests.

To apply these indicators in practice, it's important to determine which parameters are critically important for our service. For example, if it's a search engine, key indicators could be the response time to a user's query and the accuracy of the search results. Based on this, a monitoring system can be built to track these metrics in real-time.

Introducing SLI allows us not only to monitor the current state of the system but also to take steps to improve the quality of the service, based on specific data. This helps increase user satisfaction and makes the service more predictable and reliable.

The metrics most often used as Service Level Indicators (SLIs) are:

Service Availability. The percentage of time during which our application is accessible and operates without failures. A target for this indicator might be 99.99% uptime.
Response Time. How long the service takes to answer a request. Keeping it below an agreed maximum ensures that integrations working with our service also function quickly and efficiently.
Error Rate. The percentage of requests that end in an error, broken down by error type. For example, we might decide that no more than 1% of responses may carry a 500 status, while up to 40% may carry a 404 status.
Maximum Number of Requests per Second (max RPS). How many requests per second our service can handle without degrading performance.

With such metrics in place, we can not only monitor the state of our service in real time but also set clear goals for the development team to maintain and improve service quality. This creates a foundation for stable operation and increases user trust, as users see that the service fulfils its quality and availability promises.

Furthermore, these indicators allow us to quickly respond to emerging problems, prioritize work on errors and improvements, and analyze how changes in code or infrastructure affect the overall performance and quality of the service.
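
As a rough illustration, the indicators listed above can be derived from raw request records. The sketch below is only one possible approach: the Request record shape and the sample values are hypothetical, availability is approximated as the share of requests not answered with a 5xx status (an assumption, since the definition above is time-based), and the observed peak RPS is not the same thing as the maximum load the service can sustain.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record shape)."""
    timestamp: float   # seconds since the start of the observation window
    status: int        # HTTP status code returned to the caller
    duration_s: float  # time spent serving the request, in seconds


def error_rate(requests, lo=500, hi=599):
    """Share of requests whose status code falls in [lo, hi]."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if lo <= r.status <= hi)
    return failed / len(requests)


def availability(requests):
    """Request-based approximation: share of requests not answered with 5xx."""
    return 1.0 - error_rate(requests)


def peak_rps(requests):
    """Largest number of requests observed in any one-second bucket."""
    buckets = {}
    for r in requests:
        second = int(r.timestamp)
        buckets[second] = buckets.get(second, 0) + 1
    return max(buckets.values(), default=0)


# Hypothetical sample: five requests spread over two seconds.
sample = [
    Request(0.10, 200, 0.120),
    Request(0.40, 200, 0.310),
    Request(0.90, 500, 0.050),
    Request(1.20, 404, 0.030),
    Request(1.70, 200, 0.220),
]

print(f"5xx error rate: {error_rate(sample):.0%}")
print(f"4xx error rate: {error_rate(sample, 400, 499):.0%}")
print(f"availability:   {availability(sample):.0%}")
print(f"peak RPS:       {peak_rps(sample)}")
```

Keeping each indicator in its own small function makes it easy to track 4xx and 5xx errors separately, as the error rate item above suggests.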

SLO

With the indicators chosen, we refine our promise to the service's consumers by setting specific, numerically expressed goals, known as Service Level Objectives (SLOs). SLOs describe the targets we aim for, answering the question, "What goal are we pursuing?".

So, we already have some key indicators, and now our task is to match our customers' expectations with each of these indicators. Let's take our service with a single search method as an example and define the following SLOs:

RPS (Requests Per Second). Our service should handle no less than 300 requests per second. This ensures that the service can cope with high loads, providing efficiency and speed in processing requests.
Error Rate. We set an error rate goal at 3%, meaning 97% of requests should be successfully processed with a 200 response code, and only 3% may return with errors (4xx, 5xx codes). This ensures high reliability and quality of service for users.
Response Time. Setting a specific goal for response time helps ensure that the user experience will be fast and efficient. For example, we can aim for the average response time not to exceed 200 ms.
Availability. Availability refers to the percentage of time when the service is fully functional and available for use. The goal can be set, for example, at 99.95% availability.
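
The four objectives above can be written down as data and checked against measurements. This is a minimal sketch, not a prescribed format: the Slo class, the metric names, and the measured values are all hypothetical, and only the targets come from the list above.

```python
from dataclasses import dataclass
from typing import Callable


def at_least(measured, target):
    return measured >= target


def at_most(measured, target):
    return measured <= target


@dataclass(frozen=True)
class Slo:
    """One objective: a metric name, a target, and a comparison rule."""
    name: str
    target: float
    is_met: Callable[[float, float], bool]  # (measured, target) -> bool


# Targets taken from the example SLOs for the search service.
SLOS = [
    Slo("rps", 300, at_least),              # handle no less than 300 RPS
    Slo("error_rate", 0.03, at_most),       # at most 3% of requests fail
    Slo("avg_response_s", 0.200, at_most),  # average response time <= 200 ms
    Slo("availability", 0.9995, at_least),  # at least 99.95% availability
]


def evaluate(measured):
    """Return {metric name: True/False} for every SLO that has a measurement."""
    return {
        slo.name: slo.is_met(measured[slo.name], slo.target)
        for slo in SLOS
        if slo.name in measured
    }


# Hypothetical measurements for one reporting period.
measured = {
    "rps": 340,
    "error_rate": 0.021,
    "avg_response_s": 0.260,
    "availability": 0.9997,
}

for name, ok in evaluate(measured).items():
    print(f"{name:16} {'met' if ok else 'MISSED'}")
```

In this made-up period, only the response time objective is missed, which is exactly the kind of signal a team would use to prioritize work.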

Response time deserves a closer look. It is a critically important parameter, as it directly affects how users perceive the service. Response time is the period between when a user sends a request and when they receive a response from the service. For a search service, it is especially important, as users expect quick and relevant results. By setting a goal for response time, we commit to providing users with responses that are not only accurate but also timely, improving the overall quality of interaction with the service.

It's quite common to evaluate service performance using the average response time. However, the average is not the best metric for determining the "typical" response time because it doesn't account for how many users experience specific delays.

The difference between the average and percentiles arises from how they are calculated.

Average (or arithmetic mean) is calculated as the sum of all values divided by their count. However, the average is sensitive to outliers (abnormally high or low values). Even a few large outliers can skew the average, making it less representative. Take the following sample of 35 response times:

0.700, 0.720, 0.680, 0.660, 0.740, 0.750,

0.730, 0.670, 0.710, 0.200, 0.150, 0.300,

0.350, 0.400, 0.450, 0.500, 0.550, 0.600,

0.250, 0.320, 0.380, 0.420, 0.490, 0.530,

0.580, 0.620, 0.310, 0.370, 0.440, 0.510,

0.560, 0.610, 0.290, 0.340, 0.390

Average ≈ 0.493

Percentiles, on the other hand, divide ordered data into percentage groups. For example, the 50th percentile (median) divides the data such that 50% of values are above it and 50% are below it. Percentiles are less sensitive to outliers and allow for a more detailed understanding of data distribution. Here is the same sample sorted in ascending order:

0.150, 0.200, 0.250, 0.290, 0.300, 0.310, 0.320, 0.340,

0.350, 0.370, 0.380, 0.390, 0.400, 0.420, 0.440, 0.450,

0.490, 0.500, 0.510, 0.530, 0.550, 0.560, 0.580, 0.600,

0.610, 0.620, 0.660, 0.670, 0.680, 0.700, 0.710, 0.720,

0.730, 0.740, 0.750

50th percentile (median) = 0.500

For instance, the 99th percentile signifies the value below which 99% of all data points fall. Only 1% of data points have values exceeding this percentile. This helps understand how often extremely long response times occur, which can be crucial for defining guaranteed performance levels.
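
To make the comparison concrete, here is a small sketch that computes the mean, the median, and the 99th percentile for the sample above. The percentile helper uses the nearest-rank method, which is one of several common definitions and an assumption here; the single 5.0 outlier added at the end is hypothetical and only serves to show how the two statistics react.

```python
import math
import statistics

# The 35 response times from the sample above.
response_times = [
    0.700, 0.720, 0.680, 0.660, 0.740, 0.750,
    0.730, 0.670, 0.710, 0.200, 0.150, 0.300,
    0.350, 0.400, 0.450, 0.500, 0.550, 0.600,
    0.250, 0.320, 0.380, 0.420, 0.490, 0.530,
    0.580, 0.620, 0.310, 0.370, 0.440, 0.510,
    0.560, 0.610, 0.290, 0.340, 0.390,
]


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(label, values):
    print(
        f"{label:14} mean={statistics.mean(values):.3f} "
        f"p50={percentile(values, 50):.3f} "
        f"p99={percentile(values, 99):.3f}"
    )


summarize("original", response_times)

# Hypothetical: one very slow request joins the sample.
with_outlier = response_times + [5.0]
summarize("with outlier", with_outlier)
```

Running it shows the mean shifting noticeably when a single slow request is added, while the median stays where it was and the 99th percentile surfaces the slow request.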

To visually compare the average and percentiles, a data distribution plot can be constructed.

For example, when evaluating the performance of a search method, you can set the following SLO (Service Level Objective):

99th percentile response time: 99% of requests to the search method should be processed in less than 500 milliseconds.

This means that only 1% of requests may exceed a response time of 500 ms, which is a stricter and more informative performance indicator than simply the average.
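
Checking such an objective does not require computing the percentile itself: it is enough to count how many requests finished under the threshold. The following is a minimal sketch under that reading of the SLO; the function names and both sample windows are hypothetical.

```python
def fraction_within(latencies_ms, threshold_ms):
    """Share of requests that finished in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    fast = sum(1 for t in latencies_ms if t < threshold_ms)
    return fast / len(latencies_ms)


def meets_latency_slo(latencies_ms, threshold_ms=500, target=0.99):
    """True if at least `target` of the requests beat the threshold."""
    return fraction_within(latencies_ms, threshold_ms) >= target


# Hypothetical windows of 200 requests each (latencies in milliseconds).
good_window = [120] * 199 + [900]             # 1 slow request out of 200
bad_window = [120] * 197 + [900, 1500, 2000]  # 3 slow requests out of 200

for name, window in [("good", good_window), ("bad", bad_window)]:
    share = fraction_within(window, 500)
    verdict = "met" if meets_latency_slo(window) else "MISSED"
    print(f"{name}: {share:.1%} under 500 ms, SLO {verdict}")
```

Note that the average of the second window is still well below 500 ms, so a check based on the average alone would not flag it.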

This approach allows for consideration of various usage scenarios and ensures more stable and predictable service performance for users.

SLA

SLA (Service Level Agreement) is an agreement established between us (the service provider) and our clients or integrators. It defines our commitments and expectations regarding the performance and quality of service, and outlines the measures taken if these commitments are not met. An SLA names the specific metrics and performance indicators that must be achieved or maintained.

Examples of inclusions in an SLA:

Error rate: If the error rate in the service exceeds 10%, an action plan will be activated, including mobilization of the entire team to quickly address the issues.
Service availability: If service availability falls below 99.9%, the provider will compensate the client for financial losses at 150%.

These conditions establish transparent and clear expectations for all parties involved and motivate the service provider to maintain a high level of performance and availability. SLA often includes monitoring measures, reporting, and regular metric updates to ensure that the terms of the agreement are upheld throughout the contract term.
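
An availability percentage becomes easier to reason about when converted into allowed downtime. The sketch below does that arithmetic; the 30-day measurement period is an assumption (the agreement itself would define the period), and the function names are hypothetical.

```python
def allowed_downtime_minutes(availability_target, period_days=30):
    """Minutes of downtime that still satisfy the target over the period."""
    if not 0 <= availability_target <= 1:
        raise ValueError("availability_target must be between 0 and 1")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_target)


def sla_breached(downtime_minutes, availability_target, period_days=30):
    """True if the observed downtime exceeds what the target allows."""
    return downtime_minutes > allowed_downtime_minutes(availability_target, period_days)


# The availability levels mentioned in this guide, over an assumed 30-day period.
for target in (0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} availability allows about {minutes:.1f} minutes of downtime")

# Hypothetical month with 50 minutes of downtime, checked against 99.9%.
print("breached:", sla_breached(50, 0.999))
```

Over 30 days, 99.9% leaves roughly 43 minutes of downtime, so a single 50-minute outage would be enough to trigger the compensation clause in the example above.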

Conclusion

Studying the concepts of SLI, SLO, and SLA is fundamentally important for businesses, startups, and application development. These terms form a hierarchy, where SLA (Service Level Agreement) is the foundational agreement between the provider and the client, defining the terms of service provision. SLO (Service Level Objective) specifies performance and availability goals that must be achieved to fulfil the SLA. In turn, SLI (Service Level Indicator) represents specific metrics and performance indicators used to measure the level of service.

The significance of these concepts for businesses lies in the ability to establish clear expectations regarding the quality of service provided and ensure high levels of customer satisfaction. They also help optimize performance monitoring, enabling more efficient responses to potential issues. For developing new features or launching applications, SLI, SLO, and SLA are important tools for evaluating the effectiveness of changes and ensuring compliance with established quality standards. These concepts provide transparency, stability, and reliability in service delivery, fostering business growth and meeting customer needs effectively.