Service Level Agreements for Microservices
Understand what is needed to monitor and support applications built on a microservices architecture. By defining numerically what range of behavior is acceptable, we set service level objectives (SLOs). You can learn more about SLOs on Wikipedia. Examples of SLOs are that the service should have 99.9% availability over one year, or that the 95th percentile of response latency should be below 300 ms over a month. It is always better to keep a buffer between the advertised SLO and the point where things actually go wrong.
How a Microservices Architecture Affects Monitoring and Support
Naturally, each instance monitors its own state and emits metrics about its internal state to facilitate troubleshooting. In other words, each instance of each service is a white box for the owners of the service, but a black box for everyone else. We have not yet had a team that failed to meet its SLA, probably because we are explicit in advance about the expected performance of each service. We did have one service that was released before its SLA was written and that performed poorly; writing the SLA helped us prioritize the work needed to fix it and know when we were done. Most importantly, if the SLA of every service in a product is known, then an SLA can be defined more precisely for the product as a whole. Let's focus on a single service and treat all other services as external dependencies, even if they belong to the same team.
Maximum uptime is our goal. We follow practices that ensure high availability for all services, including redundancy in production, support for continuous delivery, and 24/7 support in the event of a problem.
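The SLO example above, and the idea of deriving a product-level figure from per-service figures, can be made concrete with a little arithmetic. The following is a minimal sketch, not a description of any team's actual tooling: the function names are hypothetical, and the composite calculation assumes that the product needs every listed service to be up and that the services fail independently, which real systems only approximate.

```python
from typing import Sequence

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability: float, period_seconds: float) -> float:
    """Downtime budget implied by an availability target over a period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_seconds


def serial_availability(availabilities: Sequence[float]) -> float:
    """Availability of a product that needs every listed service to be up.

    Assumption: failures are independent, so availabilities multiply.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# The 99.9%-per-year example from the text, expressed as a downtime budget.
budget_hours = allowed_downtime_seconds(0.999, SECONDS_PER_YEAR) / 3600
print(f"99.9% over one year allows about {budget_hours:.2f} hours of downtime")

# Hypothetical product built from three services that each promise 99.9%.
product = serial_availability([0.999, 0.999, 0.999])
print(f"Three 99.9% services in series give about {product:.4%}")
```

A 99.9% yearly target leaves roughly 8.76 hours of downtime. Chaining three such services already lowers the combined figure to about 99.70%, which is one reason to keep a buffer between the advertised SLO and the level a service can actually sustain.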
Defining availability purely as a number of “9s” is not very helpful. For some applications, especially internal ones, some expected downtime is acceptable. Our services are well instrumented for proactive monitoring, so they generate tons of data points every minute. The problem we faced is how to use all this data to decide whether each service is working as it should and, if not, what the problem is. We set ourselves a constraint: it should be possible to troubleshoot problems without SSHing into the servers our software runs on.
How can an API consumer know the specific service characteristics of an API and its operations? How can these characteristics, and the consequences of not meeting them, be defined measurably? Because both the consumers and the providers of a service have to meet these requirements, writing an SLA for a new service is quite simple. We trust our customers, but unexpected things happen sometimes. It's better to be defensive here.
Each team monitors its own service level indicators (SLIs). Suppose there are two SLIs: arrival rate (number of incoming messages per second) and latency (milliseconds to process each message). Each instance of our service therefore records and aggregates these two metrics.
The aggregated metrics can then be used to verify that both our service and our customers uphold their side of the agreement.
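As a rough illustration of how the two SLIs might be aggregated and checked, here is a minimal sketch. Everything in it is an assumption made for the example: the `InstanceSample` shape, the function names, the limits, and the reading that the customer's side of the agreement is a maximum arrival rate while the service's side is a 95th-percentile latency limit.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Union


@dataclass
class InstanceSample:
    """SLIs reported by one instance for one time window (hypothetical shape)."""
    messages: int              # messages received during the window
    latencies_ms: List[float]  # processing time of each message, in ms


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; pct is expected in (0, 100]."""
    if not values:
        raise ValueError("no values to aggregate")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_agreement(
    samples: Sequence[InstanceSample],
    window_seconds: float,
    max_rate_per_s: float,  # assumed customer-side limit
    max_p95_ms: float,      # assumed service-side limit
) -> Dict[str, Union[float, bool]]:
    """Aggregate SLIs across instances and compare them with the agreed limits."""
    total_messages = sum(s.messages for s in samples)
    # Pool the raw latencies: averaging per-instance percentiles would not
    # give the percentile of the service as a whole.
    all_latencies = [ms for s in samples for ms in s.latencies_ms]
    arrival_rate = total_messages / window_seconds
    p95 = percentile(all_latencies, 95)
    return {
        "arrival_rate_per_s": arrival_rate,
        "p95_latency_ms": p95,
        "customer_within_agreement": arrival_rate <= max_rate_per_s,
        "service_within_agreement": p95 <= max_p95_ms,
    }


# Made-up data: two instances observed over a 60-second window.
samples = [
    InstanceSample(messages=3, latencies_ms=[100.0, 120.0, 250.0]),
    InstanceSample(messages=2, latencies_ms=[90.0, 400.0]),
]
report = check_agreement(samples, window_seconds=60,
                         max_rate_per_s=1.0, max_p95_ms=300.0)
print(report)
```

With this made-up data the customer stays under the assumed rate limit, while the pooled 95th-percentile latency exceeds the assumed 300 ms limit, so the report flags the service side. In a real system the raw latencies would more likely be kept in histograms than in lists, but the comparison against the agreed limits would look the same.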