Docker is a tool that makes creating containers easy, but what is a container, you ask?
In the past, every service (e.g. a web server) ran on its own server (bare metal). Having one server per service is, as you can imagine, not very efficient, so people started running many services on the same server in parallel. However, that was problematic as well, because one misbehaving service could affect all the others.
Next, virtual servers (aka virtual machines) came to be. Virtual machines enabled us to isolate workloads from each other and run multiple things in parallel on the same hardware. This was much better than the initial state, but it still wasn't very efficient: each VM ran a whole operating system, and it was not very fast to create and start a VM.
The next step was to create an isolation layer inside the operating system that would separate different services so that they wouldn't interfere with each other, without requiring the whole stack of a VM. This technology was called containers. However, it was quite cumbersome to create a container until Docker came around and built easy-to-use tooling that enabled everybody to create a container - and the rest is history!
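To give an impression of how little ceremony is involved nowadays, here is a minimal sketch using Docker's official Python SDK (the `docker` package; it assumes a local Docker daemon is running):

```python
import docker  # pip install docker

# Connect to the local Docker daemon and run a throwaway container.
client = docker.from_env()
output = client.containers.run("alpine", "echo hello from a container", remove=True)
print(output.decode())  # -> hello from a container
```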
Cloudflare is a Web Application Firewall (WAF). Think of it as the firewall on your laptop or your router at home, but on steroids. Cloudflare doesn't just block requests to our servers; it also inspects what these requests are trying to do and, based on this analysis, takes one of a multitude of actions.
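As a toy illustration of the inspect-then-act idea (the patterns and actions below are made up; Cloudflare's real rule sets are vastly more sophisticated and constantly updated):

```python
import re

# A few illustrative inspection rules mapping suspicious patterns to actions.
RULES = [
    (re.compile(r"union\s+select|drop\s+table", re.IGNORECASE), "block"),  # SQL injection
    (re.compile(r"<script\b", re.IGNORECASE), "block"),                    # cross-site scripting
    (re.compile(r"\.\./"), "challenge"),                                   # path traversal
]

def inspect(request: str) -> str:
    """Return an action for a request: 'block', 'challenge', or 'allow'."""
    for pattern, action in RULES:
        if pattern.search(request):
            return action
    return "allow"

print(inspect("/search?q=1 UNION SELECT password FROM users"))  # -> block
print(inspect("/index.html"))                                   # -> allow
```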
Today I wanted to write about Prometheus. However, to understand what Prometheus is, you first need to understand what a Timeseries Database (TSDB) is.
 
A TSDB is a database optimized to store pairs of time and value, grouped into named series:
  
| Timestamp          | Series           | Value   |
| ------------------ | ---------------- | ------- |
| 15/02/2020 @ 10:00 | Some Series Name | 123'456 |
| 15/02/2020 @ 10:10 | Some Series Name | 123'865 |
| 15/02/2020 @ 10:20 | Some Series Name | 124'212 |
| 15/02/2020 @ 10:30 | Some Series Name | 124'745 |
 
As you can see, the basic concept of a TSDB is very simple. On top of this basic functionality, the different TSDBs offer their own special features.
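To make this concrete, here is a minimal sketch of the core idea in Python (illustrative only; the class and series names are made up, and real TSDBs add compression, retention, indexing, and much more):

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

class TinyTSDB:
    """Append-only store: one time-sorted list of (timestamp, value) per series."""

    def __init__(self):
        self.series = defaultdict(list)

    def append(self, name: str, timestamp: int, value: float):
        # Assumes samples arrive in time order, as scrapes usually do.
        self.series[name].append((timestamp, value))

    def query(self, name: str, start: int, end: int):
        """Return all samples of a series within [start, end]."""
        samples = self.series[name]
        lo = bisect_left(samples, (start,))
        hi = bisect_right(samples, (end, float("inf")))
        return samples[lo:hi]

db = TinyTSDB()
db.append("some_series_name", 1581760800, 123456)  # 15/02/2020 @ 10:00 UTC
db.append("some_series_name", 1581761400, 123865)  # 15/02/2020 @ 10:10 UTC
print(db.query("some_series_name", 1581760000, 1581762000))
```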
 
Even though timeseries databases have become much more popular over the last few years, they are nothing new. One of the first widely used TSDBs was RRDtool, which was created in 1999 and became the de facto standard for measuring networking data.
Terraform is an "infrastructure as code" tool. That means we define in a text file how the infrastructure on GCP (or any other cloud provider) should look. When Terraform runs, it parses this file, builds a model of how things should look, and compares it to what is actually running in GCP. If there are any differences, it changes the objects in GCP until they match the text file.
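The core compare-and-reconcile step can be sketched in a few lines (illustrative only; the resource names are made up, and real Terraform parses HCL files and talks to provider APIs):

```python
desired = {  # what the text file says should exist
    "db-users":  {"type": "postgres", "tier": "small"},
    "db-orders": {"type": "postgres", "tier": "medium"},
}
actual = {   # what is currently running in the cloud project
    "db-users":  {"type": "postgres", "tier": "small"},
    "db-legacy": {"type": "mysql", "tier": "small"},
}

def plan(desired: dict, actual: dict):
    """Compute the actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name} {spec}")
        elif actual[name] != spec:
            actions.append(f"update {name} -> {spec}")
    for name in actual:
        if name not in desired:
            actions.append(f"destroy {name}")
    return actions

for action in plan(desired, actual):
    print(action)
# create db-orders {'type': 'postgres', 'tier': 'medium'}
# destroy db-legacy
```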
This is important because it makes changes to our infrastructure idempotent and easily trackable (and thus revertible when there's a problem or mistake).
The idea of Terraform is that SRE provides the framework and tooling, and SWEs only interact with the simpler configuration files. For example, if you want a new PostgreSQL database, all you need to do is edit the configuration file for the databases, add a new entry, and run Terraform.
Kubernetes (k8s) is a container orchestration framework. It takes a Docker image, runs it, and makes it available to everybody. Kubernetes controls how many instances of an image should run in parallel, how many resources they can use (CPU, memory, etc.), and ensures that service A can talk to service B.
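In spirit, keeping the right number of instances running is a reconciliation loop. A toy sketch (illustrative only; a real Kubernetes controller works against the API server, and the image name is made up):

```python
import itertools

running = {}             # container id -> image; stands in for the container runtime
ids = itertools.count(1)

def reconcile(image: str, desired_replicas: int):
    """Start or stop containers until the actual count matches the desired one."""
    actual = [cid for cid, img in running.items() if img == image]
    for _ in range(desired_replicas - len(actual)):  # too few: start more
        running[next(ids)] = image
    for cid in actual[desired_replicas:]:            # too many: stop the extras
        del running[cid]

reconcile("webshop:v1", 3)
print(running)  # three replicas of webshop:v1
reconcile("webshop:v1", 1)
print(running)  # scaled down to one replica
```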
Prometheus is a monitoring system. The first stable release of Prometheus was in 2015, so it is a rather new monitoring system. Initially developed at SoundCloud based on the ideas of Google's Borgmon, it implements a quite different approach to monitoring compared to the more traditional systems:
Instead of having agents push data from the monitored object to a central server, the monitored object exposes its metrics on an HTTP path and Prometheus scrapes this data at a predefined interval (defaulting to every 15 seconds).
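For example, a Python service can expose such an HTTP path with the official `prometheus_client` library (a minimal sketch; the metric name and port are made up):

```python
import random
import time

from prometheus_client import Counter, start_http_server  # pip install prometheus-client

# A counter that the application increments as it does work.
REQUESTS = Counter("myapp_requests_total", "Total requests handled by myapp")

if __name__ == "__main__":
    start_http_server(8000)  # serves metrics at http://localhost:8000/metrics
    while True:              # simulate the application handling requests
        REQUESTS.inc()
        time.sleep(random.random())
```

Prometheus would then be configured to scrape localhost:8000 at its scrape interval.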
Prometheus stores this data in a Timeseries Database. This database can be queried via PromQL, a domain-specific query language.
Thanks to the Prometheus Timeseries Database, the collected data is multi-dimensional and can be sliced and diced at will with the help of PromQL.
Service Level Objectives (SLOs) are targets for how often you can fail or otherwise not operate properly and still ensure that your users aren't meaningfully upset. Thus, an SLO specifies the threshold of reliability that your users expect of the service.
Usually, an SLO is expressed as a percentage of "good" events among total events, and the SLO is the target for what that percentage should be.
A good event can be a successful login, or getting the correct search results on the SRP (search results page) in less than 1 second.
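A back-of-the-envelope sketch of that arithmetic (all numbers made up):

```python
slo_target = 0.999   # 99.9% of events must be good

good_events = 1_998_500
total_events = 2_000_000

sli = good_events / total_events              # the measured ratio of good events
error_budget = 1 - slo_target                 # fraction of events allowed to be bad
budget_used = (total_events - good_events) / (total_events * error_budget)

print(f"SLI: {sli:.4%}, error budget used: {budget_used:.0%}")
# SLI: 99.9250%, error budget used: 75%
```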

Our monitoring and logging do not decide our reliability; our users do.

  • SLOs (and error budgets) increase both reliability and feature velocity over time. They also align incentives among previously warring factions (dev vs. ops vs. PM).
  • SLOs (over time) give engineers a license to take more risks and to be subject to fewer launch constraints. There’s less bureaucracy to get in the way of a cool new launch.
  • Reliability is a first-class feature of the product. In fact, it’s the most important feature. If the users get the idea that the product won’t reliably meet their needs (because it’s unavailable, serving errors, etc.), then they won’t trust it.
 
SLOs provide us the tools we need to measure the customer experience, and for engineering they provide the data we need to make informed decisions about where to put our effort.
Ultimately, SLOs are about happier users, happier engineers, happier product teams, and a happier business. This should always be the goal — not to reach new heights of the number of nines you can append to the end of your SLO target.

A great book with everything you need to know about SLOs is Alex Hidalgo's "Implementing Service Level Objectives".
A single server can only process a finite amount of data. Once all* its capacity is used, performance begins to degrade until the whole server crashes or comes to a halt. Once you hit your current hardware's limits, you have two options:
  1. Vertical Scaling: This means you add more resources to the server. That can be more CPUs, more memory, etc. In the pre-cloud days, servers ran on specialized hardware where it was possible to add or remove memory and CPUs while the server was running. Nowadays you just change a setting on your VM.
  2. Horizontal Scaling: Instead of pumping up the existing hardware, you run more of the same. So instead of having all your requests land on one server, you distribute the load over multiple servers that do the same thing.
 
Both methods have their pros and cons. Vertical scaling is easy because it doesn't require any changes to the programs you are running, but there's a limit to how many CPUs or how much memory you can add to a server.
Horizontal scaling is "unlimited" - you can always add another server. However, your application must be able to run multiple instances at the same time without, for example, corrupting data. The management overhead is also bigger: you need some sort of coordination layer that distributes requests across your fleet of servers, etc.
With the ubiquity of the cloud, horizontal scaling has become the de facto standard for scaling a service and is one of the reasons for the triumph of microservices.
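The simplest possible coordination layer is a round-robin load balancer. A toy sketch (the addresses are made up; real load balancers also handle health checks, retries, and weighting):

```python
import itertools

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # the fleet of identical servers
next_server = itertools.cycle(SERVERS)

def route(request: str) -> str:
    """Pick the next server in the rotation for this request."""
    server = next(next_server)
    print(f"routing {request!r} to {server}")
    return server

for i in range(5):
    route(f"GET /search?page={i}")
# Requests land on .1, .2, .3, .1, .2 in turn; adding a server to SERVERS
# adds capacity without touching the application itself.
```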

* Computers behave similarly to roads: a server cannot be utilized at 100% and still be fast. As a rule of thumb, a server should not be utilized above 80%, to avoid resource congestion.