SREcon EMEA 2019 - Day 3

Here's a summary of the talks I attended:

Building Resilience: How to Learn More from Incidents

This talk was an addition to the keynote from Nancy Leveson. It explained why it is important to learn from incidents and how to achieve this.

Why should we learn from incidents:

  • It should not happen again
  • Systems are not 100% reliable
    • How complex systems fail
    • Complex systems always run in degraded mode (failure is the norm, not an exception)
    • Failure is inherently built in
  • We work in (not on) the system
    • human reactions matter
  • language matters

Resilience engineering - 4 traps

1) Attribution to human error

  • System design and context affect when and how errors happen
  • Do not stop investigating if you find a human error

2) Counterfactual language (talking about things that didn't happen)

  • "should have", "would have", "failed to", ...
  • Example: "The engineer did not check the validity"
  • We're talking about things that didn't happen instead of taking the time to understand how what happened, happened

3) Normative language

  • "inadequately", "carelessly", ...
  • Decisions of operators are judged on the basis of their outcome

4) Mechanistic reasoning

  • Human adaptive capacity is necessary to keep our systems up and running in the first place
  • Misconception: once we found the broken human, the failure will disappear

Recommendations:

  • Run a facilitated post incident review:
    • A meeting with incident participants
    • ~60-90 min max
    • Neutral facilitator
    • prepare with one-to-one interviews (if necessary)
    • Lots of incidents?
      • Don't do it for all of them right away
      • Pick interesting ones (not necessarily the big ones)
  • Ask better questions
    • prefer "how?" over "why?"
    • Each participant has a different viewpoint, ask about that
    • Ask about what normally happens
    • Example: Etsy Debriefing Guide
  • Ask how things work right
    • How we recovered the systems
    • What insights/tools/skills/people were involved
    • How people knew what to do and what to decide
  • Keep review and planning meetings separate
    • Hold a separate, smaller planning meeting (24-48 hours later)
    • Allows soak time which results in better repairs

Human error is a symptom, not a cause

Summary: Learning from incidents

How Stripe Invests in Technical Infrastructure

How Stripe prioritizes its infrastructure investments.

Definition of technical infrastructure:

  • dev tools
  • data infrastructure
  • core libraries and frameworks
  • model training and evaluation
  • business critical tools/infrastructure

Reasoning

  • Forced <-> Discretionary
  • short term <-> long term

What to do?

  • Finish something useful! (reduce WIP)
  • Automate
  • Eliminate categories of problems

Prioritization:

  • Order by return on investment
  • Ask your users
  • long term vision
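
Ordering by return on investment can be sketched as below. This is only an illustration and not Stripe's actual method: the `Project` class, the benefit/cost units and the backlog entries are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class Project:
    # All fields are hypothetical; units could be engineer-weeks per year.
    name: str
    estimated_benefit: float
    estimated_cost: float

    @property
    def roi(self) -> float:
        # Return on investment: net gain relative to cost.
        return (self.estimated_benefit - self.estimated_cost) / self.estimated_cost


def prioritize(projects):
    """Return the projects sorted by ROI, highest first."""
    return sorted(projects, key=lambda p: p.roi, reverse=True)


# Made-up backlog, purely for illustration.
backlog = [
    Project("log pipeline rewrite", estimated_benefit=20, estimated_cost=20),
    Project("faster CI", estimated_benefit=40, estimated_cost=10),
    Project("new deploy tool", estimated_benefit=60, estimated_cost=30),
]

ranked = prioritize(backlog)
for project in ranked:
    print(f"{project.name}: ROI {project.roi:.1f}")
```

The estimates are the hard part; asking users and keeping the long term vision in mind are what feed (and correct) those numbers.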

Convert unplanned work into planned work

Pushing through Friction

Pushing through Friction (Daniel Na)

What is friction?

  • Friction is all the work to be done to get there

Why does friction occur?

  • growth

The normalization of deviance

  • People don't care anymore
  • Orgs and processes incur friction slowly

How do we fix it?

  • Document single sources of truth and keep them updated
  • updating docs is part of the acceptance criteria for shipping new work
  • Solicit the "WTF" of new hires

Long term cultural behaviours

  • Address hard truths - kindly
  • Celebrate the glue work
    • make glue work promotable work, better: make it mandatory to get promoted
  • make psychological safety the first principle

Individuals:

  • Develop your own sense of agency
  • Take the correct path, even if it is hard
  • Being a hero or an asshole doesn't scale
  • Have important discussions face to face
  • Get to know other people on other teams
  • New idea? Try once!

Perks and Pitfalls of Building a Remote First Team

Focus on:

  • Inclusion
  • Coverage
  • Pace

Perks             | Pitfalls
No commute        | Isolation
Flexible schedule | Blurry work/life time
Distraction free  | Blocked constantly

Making it better:

  • Develop regional autonomy (follow the sun)
  • build community outside your team
  • fight the hero culture
  • clarify responsibility

How?

  • Cross team
    • Random connections, conferences, informal team chats
  • ignore things
    • @people always
    • set your working hours
    • don't do direct messages, use channels
    • specify urgency
  • Document more
    • share the why
    • brainstorming docs

Long term sustainable

Perks                 | Pitfalls
Global company pool   | Lack of career growth
Long term flexibility | Second class citizen
                      | Imposter syndrome

Making it better:

  • trust people, not character
  • Inclusion:
    • Ideas come from everyone
    • All voices are heard
  • celebrate the smallest of things
    • celebrate and thanks channel
    • overshare - email everyone
  • Be supportive and say it out loud
  • regular unstructured ideas talks
  • company wide scheduled remote only days

Manager perspective:

  • foster culture of: Trust, Autonomy, Accountability
  • Hiring is hard
  • You don't scale

How?

  • Clear policies & job descriptions
  • Team writes project documentation
  • Balance teams across time zones

Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures

SRE practices lead us to a high failure novelty rate.

Transparent response:

  • Make realtime shadowing easy
  • All decision making, and the data behind it, is made widely open

Incident simulation

  • wheel of misfortune
  • caveat: Only as good as the human's understanding of the system
  • Really good at training new team members

Game days

  • start from hypothesis
  • induce real production failure
  • observe, recover, adapt

Turn rusty knobs

  • exercise failure recovery practices and software in production

Automated failure tests (chaos monkey)

  • focus on most routine failures: timeouts, connection failures, ...
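
A minimal sketch of such an automated failure test, assuming a wrapper that injects the routine failures (timeouts, connection failures) into a call. `FlakyClient`, `call_with_retries` and `fetch` are hypothetical names for illustration; this is not how Chaos Monkey itself works, which terminates real instances.

```python
import random


class FlakyClient:
    """Wraps a callable and randomly injects routine failures (hypothetical helper)."""

    def __init__(self, call, failure_rate, seed=0):
        self._call = call
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so test runs are repeatable

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self._failure_rate:
            # Pick one of the two routine failure modes to inject.
            error_type = self._rng.choice([TimeoutError, ConnectionError])
            raise error_type("injected failure")
        return self._call(*args, **kwargs)


def call_with_retries(func, attempts=3):
    """Code under test: retry on timeouts and connection failures."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
    raise last_error


def fetch():
    # Stand-in for a real remote call.
    return "ok"


always_up = FlakyClient(fetch, failure_rate=0.0)
always_down = FlakyClient(fetch, failure_rate=1.0)

healthy_result = call_with_retries(always_up)
try:
    call_with_retries(always_down)
    gave_up = False
except (TimeoutError, ConnectionError):
    gave_up = True
```

With a failure rate between 0 and 1 the same wrapper can be left in a test suite to check that retries and fallbacks keep working as the code changes.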

Does this help with novel failures?

  • high trust teams - fast consensus
  • communication - calculated risk taking
  • creativity - empathy

Preparing with creative team events: Lego incident response

Hiring Great SREs

Brian from Twitter gives a nice overview of their hiring process and the motivation behind it. In the end it boils down to: "What do you need: the skill or the person?" The rest is best practices.

  • What's the eventual goal for the new employee?
  • How long do we expect it to take them to reach the goal?
  • Is this someone we want to work with for years?

What makes a great SRE?

  • broad/wide knowledge
  • strong feeling of ownership
  • ...

Finding answers

  • specific skill set?
  • experience with specific tech stack?
  • what skill level is required?

The interview

  • have a strategy
  • use a rubric to evaluate the candidate
  • sell it
    • yourself
    • your org
    • your company

SRE in the third age

Nice talk from Björn about his life in SRE.

Summary:

  • 1st age: Google (convergent evolution; SRE in name only (SREino))
  • 2nd age: Soundcloud (Engineering leadership role; SRE in spirit)
  • 3rd age: No companies need SREs, but all need SRE (knowledge). Look out for the SRE mindset.

Talks I missed