SREcon EMEA 2019 - Day 3

Here's a summary of the attended talks:

Building Resilience: How to Learn More from Incidents

This talk complemented the keynote from Nancy Leveson. It explained why it is important to learn from incidents and how to achieve this.

Why should we learn from incidents:

It should not happen again
Systems are not 100% reliable
- How complex systems fail
- Complex systems always run in degraded mode (failure is the norm, not an exception)
- failure is inherently built in
We work in (not on) the system
- human reactions matter
language matters

Resilience engineering - 4 traps

1) Attribution to human error

system design and context affect when and how errors happen
Do not stop investigating if you find a human error

2) Counterfactual language (talking about things that didn't happen)

"should have", "would have", "failed to", ...
Example: "The engineer did not check the validity"
We're talking about things that didn't happen instead of taking the time to understand how what happened, happened

3) Normative language

"inadequately", "carelessly", ...
Decisions of operators are judged on the basis of their outcome

4) Mechanistic reasoning

Human adaptive capacity is necessary to keep our systems up and running in the first place
Misconception: once we found the broken human, the failure will disappear
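
The counterfactual and normative phrases listed above can be searched for mechanically in a draft incident write-up. The following is only a minimal sketch, not something shown in the talk: the function name `flag_language` and the idea of scanning a draft are assumptions, and the phrase lists contain only the examples from these notes ("did not" is taken from the example sentence). A match is a prompt to rephrase, not proof of blame.

```python
import re

# Phrase lists copied from the notes above; "did not" comes from the
# example sentence. Extending them is an assumption, not part of the talk.
COUNTERFACTUAL = ["should have", "would have", "failed to", "did not"]
NORMATIVE = ["inadequately", "carelessly"]


def flag_language(text):
    """Return (kind, phrase, sentence) tuples for each flagged phrase."""
    findings = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    groups = (("counterfactual", COUNTERFACTUAL), ("normative", NORMATIVE))
    for sentence in sentences:
        lowered = sentence.lower()
        for kind, phrases in groups:
            for phrase in phrases:
                # Word boundaries avoid matching inside longer words.
                if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                    findings.append((kind, phrase, sentence))
    return findings


if __name__ == "__main__":
    draft = (
        "The engineer did not check the validity. "
        "The rollout was carelessly approved. "
        "The alert fired at 10:02 and the on-call restarted the service."
    )
    for kind, phrase, sentence in flag_language(draft):
        print(f"{kind}: '{phrase}' in: {sentence}")
```

The third sentence of the sample draft describes what actually happened and is not flagged, which is the style the talk recommends.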

Recommendations:

Run a facilitated post incident review:
- A meeting with incident participants
- ~60-90 min max
- Neutral facilitator
- prepare with one-to-one interviews (if necessary)
- Lots of incidents?
- Don't do it for all of them right away
- Pick interesting ones (not necessarily the big ones)
Ask better questions
- prefer "how?" over "why?"
- Each participant has a different viewpoint, ask about that
- Ask about what normally happens
- Example: Etsy Debriefing Guide
Ask how things work right
- How we recovered the systems
- What insights/tools/skills/people were involved
- How people knew what to do and what to decide
Keep review and planning meeting separate
- Hold a separate, smaller planning meeting (24-48 hours later)
- Allows soak time which results in better repairs

Human error is a symptom, not a cause

Summary: Learning from incidents

How Stripe Invests in Technical Infrastructure

How Stripe prioritizes its infrastructure investments.

Definition of technical infrastructure:

dev tools
data infrastructure
core libraries and frameworks
model training and evaluation
business critical tools/infrastructure

Reasoning

Forced <-> Discretionary
short term <-> long term

What to do?

Finish something useful! (reduce WIP)
Automate
Eliminate categories of problems

Prioritization:

Order by return on investment
Ask your users
long term vision

Convert unplanned work into planned work

Pushing through Friction

Pushing through Friction (Daniel Na)

What is friction?

Friction is all the work to be done to get there

Why does friction occur?

growth

The normalization of deviance

People don't care anymore
Orgs and processes incur friction slowly

How do we fix it?

Document single sources of truth and keep them updated
updating docs is part of the acceptance criteria for shipping new work
Solicit the "WTF" of new hires

Long term cultural behaviours

Address hard truth - kindly
Celebrate the glue work
- make glue work promotable work, better: make it mandatory to get promoted
make psychological safety the first principle

Individuals:

Develop your own sense of agency
Take the correct path, even if it is hard
Being a hero or an asshole doesn't scale
Have important discussions face to face
Get to know other people on other teams
New idea? Try once!

Perks and Pitfalls of Building a Remote First Team

Focus on:

Inclusion
Coverage
Pace

Perks	Pitfalls
No commute	Isolation
Flexible schedule	blurry work/life time
distraction free	blocked constantly

Making it better:

Develop regional autonomy (follow the sun)
build community outside your team
fight the hero culture
clarify responsibility

How?

Cross team
- Random connections, conferences, informal team chats
ignore things
- @people always
- set your working hours
- don't do direct messages, use channels
- specify urgency
Document more
- share the why
- brainstorming docs

Long term sustainable

Perks	Pitfalls
global company pool	lack of career growth
long term flexibility	second class citizen
	imposter syndrome

Making it better:

trust people, not character
Inclusion:
- Ideas come from everyone
- All voices are heard
celebrate the smallest of things
- celebrate and thanks channel
- overshare - email everyone
Be supportive and say it out loud
regular unstructured ideas talks
company wide scheduled remote only days

Manager perspective:

foster culture of: Trust, Autonomy, Accountability
Hiring is hard
You don't scale

How?

Clear policies & job descriptions
Team writes project documentation
Balance teams across time zones

Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures

SRE practices lead us to a high failure novelty rate.

Transparent response:

Make realtime shadowing easy
All decision making and the data behind it are made widely open

Incident simulation

wheel of misfortune
caveat: only as good as a human's understanding of the system
Really good at training new team members

Game days

start from hypothesis
induce real production failure
observe, recover, adapt

Turn rusty knobs

exercise failure recovery practices and software in production

Automated failure tests (chaos monkey)

focus on most routine failures: timeouts, connection failures, ...
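
The routine failures named above (timeouts, connection failures) can be injected in-process to check that calling code copes with them. This is only a minimal sketch under stated assumptions, not the Chaos Monkey tool itself and not something shown in the talk: `FaultInjector`, `call_with_retries` and `fetch_status` are hypothetical names, and the failure rates are arbitrary.

```python
import random


class FaultInjector:
    """Wraps calls and randomly raises routine failures (hypothetical helper)."""

    def __init__(self, timeout_rate=0.1, connection_error_rate=0.1, seed=None):
        self.timeout_rate = timeout_rate
        self.connection_error_rate = connection_error_rate
        # A seeded generator keeps a test run reproducible.
        self.rng = random.Random(seed)

    def call(self, func, *args, **kwargs):
        roll = self.rng.random()
        if roll < self.timeout_rate:
            raise TimeoutError("injected timeout")
        if roll < self.timeout_rate + self.connection_error_rate:
            raise ConnectionError("injected connection failure")
        return func(*args, **kwargs)


def call_with_retries(injector, func, attempts=3):
    """Code under test: retry on routine failures, re-raise after the last attempt."""
    last_error = None
    for _ in range(attempts):
        try:
            return injector.call(func)
        except (TimeoutError, ConnectionError) as error:
            last_error = error
    raise last_error


def fetch_status():
    return "ok"  # Stand-in for a real dependency call.


if __name__ == "__main__":
    injector = FaultInjector(timeout_rate=0.3, connection_error_rate=0.3, seed=42)
    outcomes = {"ok": 0, "gave_up": 0}
    for _ in range(1000):
        try:
            call_with_retries(injector, fetch_status)
            outcomes["ok"] += 1
        except (TimeoutError, ConnectionError):
            outcomes["gave_up"] += 1
    print(outcomes)
```

Running such a check continuously covers the routine cases, which leaves game days and simulations for the failures nobody predicted.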

Does this help with novel failures?

high trust teams - fast consensus
communication - calculated risk taking
creativity - empathy

Preparing with creative team events: Lego incident response

Hiring Great SREs

Brian from Twitter gives a nice overview of their hiring process and the incentives behind it. In the end it boils down to: "What do you need: the skill or the person?" The rest is best practices.

What's the eventual goal for the new employee?
How long do we expect it to take them to reach the goal?
Is this someone we want to work with for years?

What makes a great SRE?

broad/wide knowledge
strong feeling of ownership
...

Finding answers

specific skill set?
experience with specific tech stack?
what skill level is required?

The interview

have a strategy
use a rubric to evaluate the candidate
sell it
- yourself
- your org
- your company

SRE in the third age

Nice talk from Björn about his life in SRE.

Summary:

1st age: Google (convergent evolution; SRE in name only (SREino))
2nd age: SoundCloud (Engineering leadership role; SRE in spirit)
3rd age: No companies need SREs, but all need SRE (knowledge). Look out for the SRE mindset.