SREcon EMEA 2019 - Day 3
Here's a summary of the attended talks:
Building Resilience: How to Learn More from Incidents
This talk complemented the keynote by Nancy Leveson. It explained why it is important to learn from incidents and how to achieve this.
Why should we learn from incidents:
- It should not happen again
- Systems are not 100% reliable
- How complex systems fail
- Complex systems always run in degraded mode (failure is the norm, not an exception)
- Failure is inherently built in
- We work in (not on) the system
- language matters
Resilience engineering - 4 traps
1) Attribution to human error
- system design and context affect when and how errors happen
- Do not stop investigating if you find a human error
2) counterfactual language (talking about things that didn't happen)
- "should have", "would have", "failed to", ...
- Example: "The engineer did not check the validity"
- We end up talking about things that didn't happen instead of taking the time to understand how what happened actually happened
3) normative language
- "inadequately", "carelessly", ...
- Decisions of operators are judged on the basis of their outcome
4) Mechanistic reasoning
- Human adaptive capacity is necessary to keep our systems up and running in the first place
- Misconception: once we found the broken human, the failure will disappear
Recommendations:
- Run a facilitated post incident review:
- A meeting with incident participants
- ~60-90 min max
- Neutral facilitator
- prepare with one-to-one interviews (if necessary)
- Lots of incidents?
- Don't do it for all of them right away
- Pick interesting ones (not necessarily the big ones)
- Ask better questions
- prefer "how?" over "why?"
- Each participant has a different viewpoint, ask about that
- Ask about what normally happens
- Example: Etsy Debriefing Guide
- Ask how things work right
- How we recovered the systems
- What insights/tools/skills/people were involved
- How people knew what to do and what to decide
- Keep review and planning meetings separate
- Hold a separate, smaller planning meeting (24-48 hours later)
- Allows soak time which results in better repairs
Human error is a symptom, not a cause
Summary: Learning from incidents
How Stripe Invests in Technical Infrastructure
How Stripe prioritizes its infrastructure investments.
Definition of technical infrastructure:
- dev tools
- data infrastructure
- core libraries and frameworks
- model training and evaluation
- business critical tools/infrastructure
Reasoning
- Forced <-> Discretionary
- short term <-> long term
What to do?
- Finish something useful! (reduce WIP)
- Automate
- Eliminate categories of problems
Prioritization:
- Order by return on investment
- Ask your users
- long term vision
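Ordering by return on investment can be sketched in a few lines. The project names and the benefit and cost estimates below are made up for illustration and are not from the talk; ROI is taken here simply as estimated benefit divided by estimated cost.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """A candidate infrastructure investment (all fields are hypothetical estimates)."""
    name: str
    benefit: float  # e.g. engineer-weeks saved per year
    cost: float     # e.g. engineer-weeks needed to build it

    @property
    def roi(self) -> float:
        # Simplified ROI: benefit per unit of cost.
        return self.benefit / self.cost


def prioritize(projects):
    """Return the projects sorted by ROI, highest first."""
    return sorted(projects, key=lambda p: p.roi, reverse=True)


backlog = [
    Project("faster CI", benefit=40, cost=8),
    Project("log pipeline rewrite", benefit=30, cost=20),
    Project("automate cert renewal", benefit=12, cost=2),
]

for project in prioritize(backlog):
    print(f"{project.name}: ROI {project.roi:.1f}")
```

In practice the estimates are the hard part, which is where asking your users and the long term vision come in as correctives to a purely numeric ranking.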
Convert unplanned work into planned work
Pushing through Friction (Daniel Na)
What is friction?
- Friction is all the work to be done to get there
Why does friction occur?
The normalization of deviance
- People don't care anymore
- Orgs and processes incur friction slowly
How do we fix it?
- Document single sources of truth and keep them updated
- updating docs is part of the acceptance criteria for shipping new work
- Solicit the "WTF" of new hires
Long term cultural behaviours
- Address hard truth - kindly
- Celebrate the glue work
- make glue work promotable work, better: make it mandatory to get promoted
- make psychological safety the first principle
Individuals:
- Develop your own sense of agency
- Take the correct path, even if it is hard
- Being a hero or an asshole doesn't scale
- Have important discussions face to face
- Get to know other people on other teams
- New idea? Try once!
Perks and Pitfalls of Building a Remote First Team
Focus on:
| Perks | Pitfalls |
| --- | --- |
| No commute | Isolation |
| Flexible schedule | blurry work/life time |
| distraction free | blocked constantly |
Making it better:
- Develop regional autonomy (follow the sun)
- build community outside your team
- fight the hero culture
- clarify responsibility
How?
- Cross team
- Random connections, conferences, informal team chats
- ignore things
- @people always
- set your working hours
- don't do direct messages, use channels
- specify urgency
- Document more
- share the why
- brainstorming docs
Long term sustainable
| Perks | Pitfalls |
| --- | --- |
| global company pool | lack of career growth |
| long term flexibility | second class citizen |
| | imposter syndrome |
Making it better:
- trust people, not character
- Inclusion:
- Ideas come from everyone
- All voices are heard
- celebrate the smallest of things
- celebrate and thanks channel
- overshare - email everyone
- Be supportive and say it out loud
- regular unstructured ideas talks
- company wide scheduled remote only days
Manager perspective:
- foster culture of: Trust, Autonomy, Accountability
- Hiring is hard
- You don't scale
How?
- Clear policies & job descriptions
- Team writes project documentation
- Balance teams across time zones
Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures
SRE practices lead us to a high failure novelty rate.
Transparent response:
- Make realtime shadowing easy
- All decision making and the data behind it are made widely open
Incident simulation
- wheel of misfortune
- caveat: Only as good as the humans' understanding of the system
- Really good at training new team members
Game days
- start from hypothesis
- induce real production failure
- observe, recover, adapt
Turn rusty knobs
- exercise failure recovery practices and software in production
Automated failure tests (Chaos Monkey)
- focus on most routine failures: timeouts, connection failures, ...
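A minimal sketch of what an automated failure test for routine faults could look like, assuming a simple in-process fault injector rather than a real chaos tool. `FaultInjector`, `call_with_retries`, and `fetch_status` are hypothetical names invented for this example; they are not part of Chaos Monkey or of anything shown in the talk.

```python
import random


class FaultInjector:
    """Wraps a callable and randomly raises routine failures (hypothetical helper)."""

    def __init__(self, func, failure_rate, seed=None):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so test runs are repeatable
        self.injected = 0               # number of faults injected so far

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            # Pick one of the two routine failure types named in the notes.
            error_type = self.rng.choice([TimeoutError, ConnectionError])
            raise error_type("injected fault")
        return self.func(*args, **kwargs)


def call_with_retries(func, attempts=5):
    """Call func, retrying on timeouts and connection failures (attempts >= 1)."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except (TimeoutError, ConnectionError) as err:
            last_error = err
    raise last_error


def fetch_status():
    """Stand-in for a real dependency call."""
    return "ok"


flaky = FaultInjector(fetch_status, failure_rate=0.3, seed=42)
results = [call_with_retries(flaky, attempts=20) for _ in range(100)]
print(f"all calls succeeded: {all(r == 'ok' for r in results)}")
print(f"faults injected along the way: {flaky.injected}")
```

The point of the sketch is the shape of the test: inject the boring failures continuously and assert that the caller's recovery path (here, plain retries) keeps working.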
Does this help with novel failures?
- high trust teams - fast consensus
- communication - calculated risk taking
- creativity - empathy
Preparing with creative team events: Lego incident response
Hiring Great SREs
Brian from Twitter gave a nice overview of their hiring process and the incentives behind it. In the end it boils down to: "What do you need: the skill or the person?" The rest is best practices.
- What's the eventual goal for the new employee?
- How long do we expect it to take them to reach the goal?
- Is this someone we want to work with for years?
What makes a great SRE?
- broad knowledge
- strong feeling of ownership
- ...
Finding answers
- specific skill set?
- experience with specific tech stack?
- what skill level is required?
The interview
- have a strategy
- use a rubric to evaluate the candidate
- sell it
- yourself
- your org
- your company
SRE in the third age
Nice talk from Björn about his life in SRE.
Summary:
- 1st age: Google (convergent evolution; SRE in name only (SREino))
- 2nd age: Soundcloud (Engineering leadership role; SRE in spirit)
- 3rd age: No companies need SREs, but all need SRE (knowledge). Look out for the SRE mindset.
Talks I missed