Service Level Objectives (SLOs) are targets for how often a service can fail, or otherwise not operate properly, before your users become meaningfully upset. An SLO thus specifies the level of reliability that your users expect of the service.
Usually, an SLO is expressed as a percentage: you measure the share of “good” events among all events, and the SLO is the target for what that percentage should be.
A good event can be a successful login, or getting the correct search results on the search results page (SRP) in less than 1 second.
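As a minimal sketch of the "good events over total events" idea, the Python below classifies events and compares the resulting percentage to a target. The `Event` type, its field names, and the 1-second threshold wiring are assumptions made for illustration; a real service would take these from its own request logs or metrics.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Event:
    """One user-facing request (hypothetical shape, for illustration only)."""
    succeeded: bool          # did the user get the correct result?
    latency_seconds: float   # how long the user waited


def is_good(event: Event, latency_threshold_seconds: float = 1.0) -> bool:
    # "Good" here means: correct result AND served in under the threshold.
    return event.succeeded and event.latency_seconds < latency_threshold_seconds


def good_event_percentage(events: Sequence[Event]) -> float:
    # Percentage of good events among all events.
    if not events:
        raise ValueError("no events to measure")
    good = sum(1 for e in events if is_good(e))
    return 100.0 * good / len(events)


def meets_slo(events: Sequence[Event], slo_target_percent: float) -> bool:
    # The SLO is met when the measured percentage reaches the target.
    return good_event_percentage(events) >= slo_target_percent


sample = [
    Event(succeeded=True, latency_seconds=0.2),
    Event(succeeded=True, latency_seconds=0.4),
    Event(succeeded=True, latency_seconds=1.5),   # correct but too slow
    Event(succeeded=False, latency_seconds=0.1),  # fast but wrong
]
print(good_event_percentage(sample))  # 50.0
print(meets_slo(sample, 99.0))        # False
```

Note that a slow-but-correct response counts as a bad event here, which reflects the point that "good" is defined from the user's perspective rather than from the server's.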
Our monitoring and logging do not decide our reliability; our users do.
- SLOs (and error budgets) increase both reliability and feature velocity over time. They also align incentives among previously warring factions (dev vs. ops vs. PM).
- SLOs (over time) give engineers a license to take more risks and to be subject to fewer launch constraints. There’s less bureaucracy to get in the way of a cool new launch.
- Reliability is a first-class feature of the product. In fact, it’s the most important feature. If the users get the idea that the product won’t reliably meet their needs (because it’s unavailable, serving errors, etc.), then they won’t trust it.
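The error budget mentioned in the list above can be made concrete with a little arithmetic: if the target is 99.9% good events, then 0.1% of events are allowed to be bad, and that allowance is the budget. The sketch below is one hedged way to compute how much of it is left; the function name and the sample numbers are hypothetical, not taken from any particular tool.

```python
def error_budget_remaining(total_events: int,
                           bad_events: int,
                           slo_target_percent: float) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been missed.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    # Number of bad events the SLO tolerates over this window.
    allowed_bad = total_events * (1 - slo_target_percent / 100)
    if allowed_bad <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1 - bad_events / allowed_bad


# Hypothetical window: 1,000,000 events, 250 of them bad, 99.9% target.
# The SLO tolerates about 1,000 bad events, so roughly 75% of the budget is left.
remaining = error_budget_remaining(1_000_000, 250, 99.9)
print(f"{remaining:.0%} of the error budget remains")
```

A team might treat a comfortably positive remainder as room to ship riskier changes and a spent budget as a signal to prioritize reliability work, though the exact policy is a choice each team makes for itself.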
SLOs give us the tools we need to measure the customer experience, and they give engineering the data needed to make informed decisions about where to put our effort.
Ultimately, SLOs are about happier users, happier engineers, happier product teams, and a happier business. This should always be the goal, rather than seeing how many nines you can append to the end of your SLO target.
A great book with everything you need to know about SLOs is Alex Hidalgo's "Implementing Service Level Objectives".