This year, SREcon was held in Dublin. I attended the following talks:
What is the meaning of SRE? The presenter suggested the following definition: "Using scientific principles to build things." An important part of scientific principles is measuring.
SRE is the art of measurably optimizing reliability vs. cost. To do this job, the engineer has the following options:
And any of these trade-offs adds complexity.
The following reliability patterns exist:
SRE is the trade-off between innovation and reliability:
=> Measure it with an error budget (other measures may be MTBF/MTTF, or "nines", e.g. 99% uptime).
SRE Team: a recipe
| Monitoring | System Architecture | Product Management |
| --- | --- | --- |
| Alerting | Distributed Algorithms | Data Science |
| Capacity Planning | Networking | Business Acumen |
| CI/CD & Rollouts | Operating Systems | UX Research |
This talk was about applying already-known concepts from complex machine engineering (aircraft, missiles) to software systems.
The problem is that our tools are 50-60 years old, but our technology is different today. This often results in broken or missing interaction between components, and our current tools don't allow us to focus on these interactions, which leads to unsafe requirements.
Change in the way we conceive of human error:
Link to the Lab description: BGP-Anycast-Workshop
The BGP and IPv6 material wasn't new to me, but the speakers talked a bit about the bare-metal deployment at Packet, and that was quite interesting.
Dan wrote his own application, skinny, to simulate a "hot potato" game. With this application he tried to teach himself distributed consensus (Paxos). The application is written in Go. For the workshop he set up a working environment with Strigo: https://app.strigo.io/event/2AKwJWJ7XELsCSyu8
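skinny's internals aren't covered here, but the acceptor rules at the heart of Paxos are compact enough to sketch. The following is my own minimal illustration in Go (not skinny's API) of the two acceptor phases:

```go
package main

import "fmt"

// Acceptor holds the state of one Paxos acceptor:
// the highest proposal number it has promised, and
// the proposal it has accepted so far (if any).
type Acceptor struct {
	promised int    // highest proposal number promised
	accepted int    // number of the accepted proposal, 0 if none
	value    string // the accepted value
}

// Prepare implements phase 1b: promise to ignore proposals
// numbered n or below, and report any previously accepted value
// so the proposer can re-propose it.
func (a *Acceptor) Prepare(n int) (ok bool, acceptedN int, acceptedV string) {
	if n <= a.promised {
		return false, 0, ""
	}
	a.promised = n
	return true, a.accepted, a.value
}

// Accept implements phase 2b: accept (n, v) unless a higher
// proposal has been promised in the meantime.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised = n
	a.accepted = n
	a.value = v
	return true
}

func main() {
	acc := &Acceptor{}
	ok, _, _ := acc.Prepare(1)
	fmt.Println(ok, acc.Accept(1, "potato")) // true true
	ok, _, _ = acc.Prepare(1)                // a stale proposer retries
	fmt.Println(ok)                          // false: 1 was already promised
}
```

A real deployment like the workshop's runs several of these acceptors; a value is chosen once a majority has accepted the same proposal.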
All relevant tools (including the deployment tools) are in his GitLab project: Skinny
The two guys from Google were talking about mitigating some typical outage scenarios. Here are some hints:
According to the Google engineers, Zero Touch Prod (ZTP) would prevent 13% of all their outages, and the investment in ZTP would cost less than those outages. These are the three areas of ZTP:
Every change in production must be either made by automation, prevalidated by software, or made via an audited break-glass mechanism (Seth Hattich)
This was a fun but strange talk. I don't really know how to summarize it, but I'll try anyway:
1) Test in production - beyond release
2) Progressive delivery - CD with fine-grained controls (canary releases, feature flags)
3) Error budget - an opportunity for learning
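The fine-grained controls in point 2 can be as simple as a percentage-based feature flag. Here is my own illustrative sketch (not from the talk) of a stable percentage rollout:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// enabled reports whether a feature flag is on for a given user.
// Hashing flag+user makes the decision deterministic, so the same
// users stay in the canary cohort as the percentage is ramped up.
func enabled(flag, userID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(flag + ":" + userID))
	return h.Sum32()%100 < percent
}

func main() {
	fmt.Println(enabled("new-checkout", "alice", 0))   // false: rolled out to no one
	fmt.Println(enabled("new-checkout", "alice", 100)) // true: rolled out to everyone
}
```

Ramping `percent` from 0 to 100 is a progressive delivery in miniature: test in production on a small cohort, watch the error budget, then widen or roll back.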
Three threats to reliability:
A "behavioural outage" was defined as a situation where the result of processing data leads to an unwanted condition. As an example, it was mentioned that Apple's Siri declared Bob Dylan dead (even though he was still alive).
=> Data complexity replaces code complexity.
Here's a list of talks I couldn't attend but would have liked to: