Todo

Raw notes to move elsewhere:

Adaptive capacity
Blamelessness
Sense-making
Service Level Objectives
Service Level Indicators
Service Level Agreements
T-shaped skills
Generalists
Specialists
MTTx
MTBF
Psychological Safety
Learning Organizations
Incident Command System
Alert Fatigue
Observability
Redundancy
OODA loop
Second-loop learning
Chaos engineering
Game days
Tabletops
Runbooks
Incident Severity Classification
Root Cause Analysis
Human Factors Engineering
Cognitive Biases in Operations
Recovery-Oriented Computing
Microservices
Distributed systems
Degraded state
Technical debt
Site Reliability Engineering
DevOps
Change management
Error Budgets
Burnout
Slack
Failure Modes and Effects Analysis
Incident Archaeology
Disaster Recovery Planning
Five Whys
Complex systems
How to Measure Anything
Practice is critical
"Everyone's busy", but few on the priority projects
Generalists Specialists
Goals
Knowledge in the head and in the world
Maintenance
"Start where you stand" Emergency Response Triage training video
Stay in your lane
"the thing that drove circuit switching was not a technical requirement, the technical requirement followed from the business requirement"
"This might be a stupid question, but I might be missing some context here..." disarms people, puts them in the position of expert, less of a challenge and more of a catch me up thing
Retrospectives
Compliance
7 Habits of Highly Effective People
AI
Incident
Allspaw
Argyris
Second Loop Learning
Cascading Surprises, Accidentally Load Bearing
Chris Argyris
Continuous Deployment! How to make a conservative org. take the leap.
Conversational Capacity
Data Resilience/Reliability/Governance
David Woods Columbia profitability example, so what does the organization look like?
DDoS mitigations
Sidney Dekker quote on when to intervene => should most tech folks view incidents as high urgency?
Disaster Response
Documentation
Donald Schön
Error budgets - Charity Majors Honeycomb
Eventual consistency
Expertise
First 90 Days
Getting Things Done
Goal setting, personal and work
Graceful Extensibility
How to run emergency response for a temporary city
Howie guide
Human error rate for changes
Humble Inquiry
John Seddon
Kill IT with Fire
Leading Change from the Middle
LFI talks, e.g. Multi-party incidents
Monoliths
NUMMI plant
Nvidia https://arstechnica.com/information-technology/2022/03/cybercriminals-who-breached-nvidia-issue-one-of-the-most-unusual-demands-ever/
Oncall structure
Organizational Learning
Paradigm Shift: Circuits to Packets
Risk management approaches
Scaling
Situational leadership
Socratic Method as a way to encourage people to do things, but they need to take action themselves
Specialist titles vs Generalist titles
Speech Chain
SPOF
Streetlights and Shadows - guidance on when to use decision aids vs human deciders
The Coaching Habit
Thermocline of Truth
Thinking in Systems book
Timeoff as a resilience strategy
To Teach
Twitter 🤦‍♂️
Ukraine
W. Edwards Deming
Watermelon projects
Westrum model
What Works for Women at Work
Will Larson
1. rate of change matters, it could be a slow burn vs a sudden spike
2. underutilization is also a problem, if you are too careful of over utilization you will over pay for unused capacity
3 types of post-incident reviews: analysis, affected stakeholder reporting horizontally, stakeholder reporting vertically
A balancing triangle: quantify impact - mitigate - understand what happened
A consultancy pattern: "Heroic" programmer generalists -> intentionally a small team of effective generalists
A developer's expertise about their own code has a short shelf-life.
A huge Cassandra cluster incident within Fintech
A review facilitator
A team of generalists with diverse specialties is resilient
Aaron Halfaker Immune Response research
Abstracting away complexity makes some things easier/faster
Accountability to do better in the future can help with burnout, even if there's a short term pain
Accounting for costs: avoid over inflating costs
Adaptive Capacity is less of a fuel and more about stance
Adaptive Capacity relationship with addition/removal of people is non-linear
Agile Definition of Done, but applied to Incidents
Agree upon commonalities so people can act independently
AI for creation -> humans for curation
AI is an accelerant, but may not be an improvement
Alternatives to TTR
And (Inverse)
Annualized costs can be tricky
Another caution: humans have a non-linear response around risk. Making things safer can lead to more risky behavior. Perceived safety vs realized safety. “Dutch helmets” a pattern of increasing the risk to motivate more skill. Instead of encouraging helmets, encourage more safe bicycling skills.
Antifragility vs stability vs fragility - Taleb
Anti-fragility can strengthen the system upon restoration
API as communications channels, pros and cons
Are the CTOs reading the same stuff?
Are there areas of the system you are worried about?
Ask the seniors what the worst could be
Asking descriptive questions, how is better than why
Asking product for 9s doesn't work - SLOs??? Cost?
Asserting I'm adding a hypothesis to the list is different than narrowing the model
Atoms and Bits: copying is hard vs easy; implementing new is easy vs hard
Attributability of investment is important
Avoid engineer's distraction vs informing why
Avoid focus on implementing specific patterns when there are more burning issues
Avoid too much process in incident review, leave flexibility for different types of insights for near misses
AWS architecture patterns
B = MAP Behavior = Motivation * Ability * Prompt (BJ Fogg - Tiny Habits)
Backlog: "Is it okay to run the dishwasher when it's only a quarter full?"
Backlog: Cross-training skills within a team
Backlog: Observability and environment isolation
Backlog: Universal Design, Accessibility
Backlog: Automated alert creation
Backlog: Avoiding incidents by maximizing active knowledge. It's common that incidents happen when old untouched systems are being modified. Can we
Backlog: Ethics of balancing resolution with space for learning
Backlog: How do you know if your team is doing well AND productive?
Backlog: How does everyone do Security in your platform?
Backlog: How should people ask thought provoking questions which help driving the conversation while also avoid annoying people?
Backlog: How to prevent failures of omission/ ensure you're taking sufficient risk?
Backlog: Observability and Data
Backlog: Optimum investment, mapping resilience to business value, perhaps with SLOs
Backlog: Putting lots of ambitious goals on list, discovering you've moved from laminar to turbulent flow
Backlog: recovery: how to make sure you're not wasting "stimulus" by not building on the lessons it's teaching?
Backlog: Resilience against underresourcing.
Backlog: Resilient skillsets moving into the GPT-4 era?
Backlog: Simplicity: Subtraction
Backlog: Structured Delegation and Accountability as a way to establish organizational resilience
Backlog: Succession planning, avoiding human SPOFs
Backlog: What to do when everything is broken
Backseat Incident Commanders
Baader-Meinhof phenomenon
balancing dev / infra headcount
Balancing dimensions of resilience: technical/non-technical
bar raiser analysts
Behind Human Error
Being unaware of power dynamic can lead to the same words being taken differently
Benefits of Monorepos
Benefits of standards vs risk of monoculture
Better Standard Libraries, Better Languages
Beyond Culture
bias, decision aids, availability bias, streetlights and shadows
Blameful retros become an incident
Block of time for learning
Book recommendations from Resilience Coffee
Bounded contexts in domain driven design
Breakout sessions for different parts of a larger incident
Brené Brown - Expectations
Bug injection as a way to gain context on unfamiliar codebases
Build vs Buy
Building consensus for prioritizing paying down Tech debt
building resilient (virtual) communities: immune response, succession planning
Burnout frustrations
Burnt out people can get into a bunker mentality, do what they are told but don't drive forward
Can do tabletops
Can you achieve alignment while maintaining self direction?
Canaries
Carving up a hard problem can make things "bigger" but can also make things more brittle
Cassandra analytics
Celebrate the failures to encourage blamelessness
Centralized direction setting for developing decentralized resilience capabilities
Ceremonies, socio-technical systems, talking to people rather than bugfixes
Ch. 6 Seddon: Beyond Command and Control
Chasing down threads in an incident -> better observability
ChatGPT
Checklists, compliance, reporting
Chesterton's Fence before adding another layer of abstraction?
Clarify accountability
Clients paying for 2 9s, getting used to 5 9s, when it goes to 3 9s, they get upset
Code freeze after incident across ~6 weeks or so
Code freezes: what works and doesn't
Combining technologies for attribution is tricky
Comfort with idle
Communities of Practice
Communities that accept help are more resilient
Companies aren't learning from each other and making the same mistakes
Companies don't know what they know: unknown knowns
Compensating measures are important
Consensus on a small set of KPIs/SLOs can help focus a company
consequences when engineering makes mistakes for customers
Context matters, more context given to the responder helps
Control Points = Ability to handle Perturbation
Conway said the opposite: you make decisions around who will do what, e.g. interpret and action policy. Domains of responsibility closes off pathways to other decisions that might have been better. Consequence -> people will build within their loops
Core Ethical question for incident commanders
cost of 5 minute oncall
Costs of centralization
Creating an "Incident Vibe"
Credit the AI when you use it
Cross cutting concerns are difficult without clearly defined interfaces for both code and people
Cultural work is hard
Culture Map (mixed reviews)
Curiosity > frustration
Customers get used to your reliability even if it isn't what you state
cycle of expanding interventions, automation, not having all-hazard interventions
Dan Davies’s LYING FOR MONEY
Dashboard values are shorthand for a story -> what story do we share with management about what's important to us and our business, how do we represent that in the dashboards, numbers/status is important, but what matters is the story
Dashboards: Key Hole Property
Data chain of custody/provenance matters for cleanup
Data platform can be a wonderful place to surface the conversation
Data quality for making decisions has a lower standard
David Woods, Joint Cognitive Systems page 87, "Alarms and Directed Attention" Alarm is an agent trying to redirect my attention, how good is it at doing that? Medicine is an example
Dealing with Silos
dedication analysts
defensiveness can get people to shutdown and avoid new information
Designing Resilience in from the beginning
DevOps Tel Aviv video
Did these analogies work in the original domains?
Difficult to split line items
Digital Marketers do a great job about spend and return across the different ad markets, last click attribution vs combining attribution channels
Diminishing returns, significance of marketing, ideal customer base, they lose traceability in organic conversions, affiliates
DiRT test with multiple teams for an upcoming yearly throughput peak
Discoverability : Making complexity shallow
Discussing: (NEW) Cassandra, NewSQL
Discussing: (NEW) Cloud Migrations
Discussing: Financial resilience
Discussing: Nepal
Discussing: SawStop
Discussing: Automated alert management (as a resource for preventing Burnout/alert fatigue/line between good alerts and noise)
Discussing: When is resilience profitable?
distance is inverse similarity; is risk inverse adaptive capacity?
Divergence between commitments and expectations
Do you separate people who fix from those who create problems?
documentation
Does QA stop at release these days? Quality engineering?
Doing that ^ ahead of time helps
Don't (always) apply Manufacturing analogies to Knowledge Work
Doneness is org specific
Downtime budget
Dynamic teaming
Dynamically stable, but perceived as unstable
Ecological Interfaces
Ed Lau
Effort to move is a factor too
eight r’s at amazon. restart, remove,…
Eisenhower Matrix
Embedded design principle: be able to recover sensemaking from the design
embracing inconsistency
Employed partners in an uncorrelated industry
Empowering engineers is critical
Empowering the person who is paged to fix the source of the page rather than just mitigating it.
Encourage all factors/all hazard/all risk mitigations
Encouraging leadership from the team
Encouraging positive change > discouraging negative changes
Entire leadership chain is important, need to get buy in from middle management
Erika Rowland: We need an inverse round-up, a resilience engineering round-up of things not from resilience engineering proper.
Error budget
error budgets with SLOs as alternative to MTTR - define what you're measuring, from a product perspective
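A rough sketch of the error-budget arithmetic behind the note above, counting good and bad events from a product perspective. The function name, the parameters, and the example numbers are hypothetical; what counts as a "bad" event is left to whoever defines the SLI.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is a fraction such as 0.999. A result of 1.0 means nothing
    has been spent, 0.0 means the budget is exhausted, and a negative
    value means the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    # Number of bad events the SLO tolerates over this window.
    allowed_bad = (1 - slo_target) * total_events
    return 1 - bad_events / allowed_bad


# Hypothetical window: 1,000,000 requests, 250 of them failed, 99.9% SLO.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"error budget remaining: {remaining:.0%}")
```

Unlike a time-to-restore number, this depends only on what users experienced in the window, so a long but low-impact incident and a short but severe one are weighed by their actual effect.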
Escalation metrics, dividing things into critical/non-critical/informational
Escalation Path to Org Chart
Etsy three armed t-shirt for biggest oops https://www.ecommercebytes.com/2021/03/29/etsy-gives-award-to-coder-who-crashed-the-site-last-year/
Evaluators of burnt out people are also burnt out
Every company has a policy, many are just undocumented
“every solution contains the seeds of its undoing”
Everyone has a model, hypothesis generation, then aligning those models, separate from hypothesis testing
Examples of usage of RAG: Water https://onlinelibrary.wiley.com/doi/full/10.1111/wej.12539 Healthcare https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9635744/ A more general analysis of it: https://link.springer.com/chapter/10.1007/978-3-031-12547-8_4
Executive understanding of design patterns may be more impactful than PE understanding of design patterns
expectation of coverage seeking
External human factors are important
Extracting patterns and standardizing
Failure Mode Effects analysis
Failure mode effects sessions
Failure of Risk Management NASA Ski Hill Quote
Fall risk, Required RCA/C?A, perf, encoding of the ideas into the policy
Fault trees
Feature flags solve the problem of "gradual rollouts" - sometimes "hash(userId) % 100 < 10" is enough
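A minimal sketch of the hash-based rollout mentioned in the note above. The function names and the salt value are hypothetical. It uses SHA-256 instead of a generic hash() because Python's built-in hash() for strings is randomized per process, which would reshuffle users on every restart.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "example-flag") -> int:
    """Map a user to a stable bucket in the range 0..99."""
    # The salt (hypothetical: one per flag) keeps different flags from
    # always selecting the same users.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, percent: int, salt: str = "example-flag") -> bool:
    """True if the user falls inside the first `percent` buckets."""
    return rollout_bucket(user_id, salt) < percent


# Roughly 10% of users see the feature; the same users every time.
enabled = [u for u in (f"user-{i}" for i in range(1000)) if is_enabled(u, 10)]
print(f"{len(enabled)} of 1000 users enabled")
```

Because each user's bucket is fixed, raising the percentage only adds users; nobody who already has the feature loses it mid-rollout.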
FEMA has “All Hazard. All Risk” mitigations. In software, those can become automated later. These mitigations make failure cheap for the business. That can change the risk-reward trade-off which enables learning. The next stage of growth comes as that automation breaks and creates different complexity for the incidents.
ferd.ca quote - "It's all going to hell anyway, we can just control how fast it goes"
Finance tends to look at changes, increases, rather than ongoing costs
Finding a new job while burnt out is hard, and meanwhile the existing job keeps burning you out
Finding balance of product and engineering
FireHydrant
First 90 days
First rule of incidents: Stay calm
Followup - how to evaluate IC value to teams/process
Font incident
Fragmented information, no one is trying to gather overall picture
Framing of questions is important in analysis
Free tier as a way to simplify charge backs
from Aikido for Incidents
From Blameless to Sanctionless
From Waterfall to Lean/Continuous accounting
From whom and to whom matters
Game day/chaos testing can also train people
Game days while seniors are on vacation
Gender matters about "stupid question" framing, there's probably another way to frame it so that you own your expertise but drive to understand the context of the other
General Meeting Guide for Mixed Seniority meetings
Get agreement on what problem to solve
Get people talking to each other, the tools will follow
Getting buy in from above on tech debt
Getting Buy in from middle management, communicating insights
Glue work always helps the team
Goals give direction
Goals: Is it meaningful to set Resilience Goals?
Going back and fixing action items is hard to prioritize
Good Runbooks lead to stasis where people are comfortable with minimal pain
Good Strategy/Bad Strategy - Obstacles to overcome, What we're not going to do
Google SRE abandoned phone bridges early on -> using a collaborative document works well once you've had practice. Phone bridges are single threaded and people step on each other's words; any other method besides a phone bridge works wonders, but it takes practice
Graceful extensibility - managing adaptive capacity https://www.researchgate.net/publication/327427067_The_Theory_of_Graceful_Extensibility_Basic_rules_that_govern_adaptive_systems
Growth Engineering as Reliability Engineering
Guesstimate for estimating incident duration
Hardware approaches don't apply to software
Have a side gig/consulting company
Have a third party facilitate the incident analysis
Have comms drafted for common types of incidents
Have load that can be shed
Health vs. debugging metrics
Healthcare: COBRA, or purchase it for sidegig company, Healthcare.gov, state hc marketplaces, partner's employer coverage
Hedonistic tendency
Helping analysts to engineer
Helping engineers to understand the business
Helping teams get out of Alert fatigue
Here’s one way I’ve gotten us to talk about the organizers’ skills and interests — with a matrix https://docs.google.com/spreadsheets/d/17coR7yIoqVXFvA-Q0oiP7BU_14CiKuWWnxjPKd3j_sY/edit#gid=1902703774
Hidden Figures 10x engineer
Highly specialized workplaces can make it hard to find the right people, even if the symptoms are solvable
Hijacked retros : Someone shows up to a blameless retro with blame, power dynamics/punishment
Hindsight bias: potentially more toxic than blame
Hiring
Hollnagel's Resilience Analysis Grid https://erikhollnagel.com/ideas/resilience%20assessment%20grid.html
Hot potato the alert to other people to get context
How do I lower the risk of change -> make more smaller changes
How do organizations change toward DevOps?
How do we tell the story/visualize the incident?
How do you allocate on-call duties amongst teams/individuals?
How do you build buffer/ slack into your day? Do you just keep yourself/ your team fully subscribed and drop things if you need to reprioritize?
How do you know "who's the expert" in the org?
How do you know if an organization is resilient?
How do you know your services aren't overbuilt?
How do you measure productivity in Incident Management teams
How do you mid-day mental GC to prevent OOM'ing too early?
How does an organization learn from mistakes?
How much top down buyin for RE is there? Is it a factor when designing organizations?
How reliable do you want to spend?
How Slack uses Slack for Incidents
How Tenuous are things?
How the questions are asked matters
How to balance high bandwidth communication with above... use both!
How to build resilience in before there's too much demand
How to empower people who don't see the power they have?
How to Measure Anything - https://www.howtomeasureanything.com
How to measure Learning?
How to navigate too many displays, visual momentum. analysis of paper: https://resilienceroundup.com/issues/how-not-to-have-to-navigate-through-too-many-displays/
How to prepare for unanticipated external threats in a generic way
How to reduce noise by reviewing the alerts periodically
How well does our understanding match reality?
http://sixpack.seatgeek.com/
http://sunnyday.mit.edu/accidents/jsr-final.pdf
http://www.melconway.com/Home/Committees_Paper.html
https://about.sourcegraph.com/batch-changes
https://cate.blog/2021/11/29/5-signs-its-time-to-quit-your-job/
https://codeball.ai
https://comic.browserling.com/97
https://cse.umn.edu/umsec/events/code-freeze-2023-tech-resilience
https://danlebrero.com/2021/06/30/cto-dairy-lucky-lotto-chaos-engineering-for-teams/
https://dashbit.co/blog/kubernetes-and-the-erlang-vm-orchestration-on-the-large-and-the-small
https://docs.google.com/spreadsheets/d/1-5EGtpt6ZBE19ktle4lc577QnRCCRSXWiivojYoS4xA/edit?usp=sharing
https://dora.dev/devops-capabilities/cultural/generative-organizational-culture/
https://erikhollnagel.com/onewebmedia/RAG%20Outline%20V2.pdf
https://essenceofsoftware.com/
https://ferd.ca/complexity-has-to-live-somewhere.html
https://ferd.ca/notes/ Fred Hebert's notes about various things
https://ferd.ca/notes/paper-ecological-interfaces-a-technological-imperative-in-high-tech-systems.html
https://fourweekmba.com/scheins-model-of-organizational/
https://frameshiftconsulting.com/ally-skills-workshop/
https://gigamonkeys.com/flowers/
https://github.com/JensRantil/java-canary-tools
https://github.com/lorin/resilience-engineering
https://github.com/randsleadershipslack/employer-test
https://grenfellenquirer.blog/catastrophe-systemic-change-the-book/
https://journals.lww.com/transplantjournal/Fulltext/2007/12270/Probabilistic_Risk_Assessment_of_Accidental.12.aspx
https://kubevela.io
https://lethain.com/forty-year-career/
https://martinfowler.com/articles/measuring-developer-productivity-humans.html
https://medium.com/10x-curiosity/boundaries-of-failure-rasmussens-model-of-how-accidents-happen-58dc61eb1cf
https://mitpress.mit.edu/books/digital-apollo
https://ncase.me/polygons/
https://ndmc.pyd.org/
https://news.ycombinator.com/item?id=32196345
https://news.ycombinator.com/item?id=32319147
https://oam.dev/
https://onlinelibrary.wiley.com/doi/full/10.1111/psj.12212
https://pragprog.com/titles/atcrime/your-code-as-a-crime-scene/
https://pragprog.com/titles/ehxta/explore-it/
https://prometheus.io/docs/concepts/metric_types/#summary
https://queue.acm.org/detail.cfm?id=3096459
https://rands-leadership.slack.com/archives/C02J0KV3B55/p1657911889283029
https://rands-leadership.slack.com/archives/CCAMQC0H1/p1666456526936469
https://resilienceroundup.com
https://resilienceroundup.com/issues/four-concepts-for-resilience-and-the-implications-for-the-future-of-resilience-engineering/ Resilience Roundup: Four concepts for resilience and the implications for the future of resilience engineering (paper by David Woods; written by Thai Wood, Apr 3rd, 2021)
Erika Rowland: https://erikarow.land/notes/paper-four-concepts-resilience
https://resilienceroundup.com/issues/measuring-system-resilience-with-the-resilience-analysis-grid/
https://resilienceroundup.com/issues/the-role-of-software-in-spacecraft-accidents/
https://rls.social/@alper/109806169938390380
https://shermanonsoftware.com/2024/04/08/fixing-all-the-bugs-wont-solve-all-the-problems-demings-path-of-frustration/
https://spinnaker.io/
https://sre.google/sre-book/managing-incidents/
https://stayrelevant.globant.com/en/technology/agile-delivery/active-knowledge-in-software-development/
https://surfingcomplexity.blog/2025/02/01/youre-missing-your-near-misses/
https://twitter.com/allspaw/status/1177204840432361472
https://us.macmillan.com/books/9781250249869/subtract
https://wheeldecide.com
https://www.adaptivecapacitylabs.com/blog/2018/03/23/moving-past-shallow-incident-data/
https://www.amazon.com/Design-Implementation-FreeBSD-Operating-System/dp/0201702452
https://www.amazon.com/Driving-Technical-Change-Terrence-Ryan/dp/1934356603
https://www.blackhillsinfosec.com/projects/backdoorsandbreaches/
https://www.brendangregg.com/systems-performance-2nd-edition-book.html
https://www.capterra.com/glossary/hippo-highest-paid-persons-opinion-highest-paid-person-in-the-office/
https://www.getguesstimate.com
https://www.happy-or-not.com/en/
https://www.keyvalues.com
https://www.learningfromincidents.io
https://www.levels.fyi
https://www.penguinrandomhouse.com/books/303275/the-idea-factory-by-jon-gertner/
https://www.penguinrandomhouse.com/books/557044/palaces-for-the-people-by-eric-klinenberg/
https://www.sciencedirect.com/science/article/abs/pii/B9780444818621500923
https://www.simonandschuster.com/books/Lying-for-Money/Dan-Davies/9781982114947
https://www.vox.com/videos/23989817/madagascar-village-crater
https://www.youtube.com/watch?v=CbSiKAtO7Fk&list=PLQmwzq_GIU-idCnJNR4t_aKb0HDCOXfZ1&index=12
https://www.youtube.com/watch?v=cKurUbYvWLA
https://www.youtube.com/watch?v=CMR9z9Xr8GM
https://www.youtube.com/watch?v=gfINfi2K1lE
https://www.youtube.com/watch?v=GXxHiZvxRSE&t=205s
https://www.youtube.com/watch?v=LrK_1ePmz54
https://www.youtube.com/watch?v=rgV4HLSd1dk
https://www.youtube.com/watch?v=Zj48LExaY00
https://xkcd.com/2347/
Humans are eventually consistent
Humble Inquiry - Edgar Schein
Identify Early warning signs of burnout
Identify the main drivers of cost and attribute them, then support them with improving the driver
If your response rate is high, aim for the fences
Incident Commanders should kick the managers out if needed, but can fall back to leaning on the Leadership, deflect
Incident log: structured/unstructured text
Incident Story Time
Incident story time - tell a story, less of an analysis, more of a narrative
Incident story time as a relief valve for larger audiences
Include senior leadership earlier in the process rather than being surprised when they jump in last minute
Influencing without authority (contractors/employee mix)
Internal Product Market Fit
Interview to onsite stage at least once a year
Introducing blamelessness is an important step
Introducing complex topics to groups is hard, sometimes they get simplified and perpetuated
Introducing error budget can help a lot when introducing continuous delivery
Intros: Name, Location, Occupation, Ideas
Invest in All Hazard all Risk, and the return will be bigger
involvement in an incident review can be for learning
Ironies of Automation - Bainbridge
Is that method (multi-channel async communication) written up somewhere? We’ve sometimes spun up dedicated incident slack channels, but that has its own challenges
Is the world a more resilient place than it was a year ago? How have we improved our resilience over the last year?
It has to live somewhere - you cannot wish it away
Ivan's talk on Learning Products for different levels
Jez Humble Lean Enterprise - Cost of Delay/Duration
Job leads
Job search
Joint Cognitive Systems
jointly craft a story - "sense-making"
Just In Time delivery/Supply Chain and buffers for succession planning are related
Kelly Shortridge, Security Chaos Engineer : Resilience is the same thing you do when you make your code refactorable, understanding what you want to accomplish is important https://www.youtube.com/watch?v=AxqX9ovGViw
labelling things as an experiment can make starting easier but can make followthrough harder
lack of troubleshooting skills as industry technology shifts
Law of Stretched Systems: Every system is stretched to operate at its capacity. https://github.com/lorin/resilience-engineering/blob/master/laws.md#law-of-stretched-systems All systems are redlining.
Leadership is important - make sure people understand the broader context
Leadership Pipeline
Leading Geeks, Paul Glen: Why can't we do good estimates?
Learned resilience rather than designed resilience
Learning from Failures
Learning from Incidents conference
Learning incidents -> like learning a new language
Learning to use the tools as they are new
Let's remove staging environment! Does it add value?
LFI as a beginning of a movement
LFI debrief
Life insurance/LT Disability/etc.
Linguistics
Literate Programming
Ludic fallacy
Make all your production lines able to do slightly larger and slightly smaller vehicles, to account for cascading failures
Make it okay to make mistakes and learn
Make the work visible to make delegation worth celebrating
Making exceptions clear to others is hard, making them known to self is also hard
Making it okay to learn things
Managerial accounting separate from compliance
Many systems don't have a "100%"
Margin (vis a vis buffer) vs lean/ efficiency
Matt Davis talks on practicing incidents: https://www.sounding.com/2021/12/20/practice-of-practice-gamelan/ and https://emamo.com/event/developerweek-2022/r/speaker/matt-davis-2
Meaningful Metrics
Measuring Engineer hours spent on an incident
Mechanism of action of Conway's law: why is it that it manifests?
Meeting culture: lots of attendance, low participation/attention -> Ask what people are getting out of it, "no one can blame me for not going"
Metrics around TTR -> team cares around definition of done
Minor demo! https://github.com/JensRantil/conc
Minutes are more intuitive than 9s
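A small sketch of the conversion from "nines" to minutes, in case it helps make the note above concrete. The function name and the 30-day window are assumptions; pick whatever window the SLO actually uses.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * MINUTES_PER_DAY


# Compare common targets over an assumed 30-day window.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days -> {minutes:.1f} minutes of downtime")
```

Over 30 days, two nines allow about 432 minutes while five nines allow well under one minute, which is usually an easier conversation than debating the number of 9s.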
Mission and Burnout can be in conflict
monitoring/logging
Month's end can be busy, but then things go to quarterly, but then the cash flow is further from accrual accounting
Moving teams from Robustness to Resilience
multi-party dilemma talk from February LFI Conference (incidents with vendors) https://www.youtube.com/watch?v=CbSiKAtO7Fk&list=PLQmwzq_GIU-idCnJNR4t_aKb0HDCOXfZ1&index=12
Multi-tenant teams make attribution harder, technically, but then culturally and governance
Mushroom theory of management - Tracy Kidder's The Soul of a New Machine
Muting alarms in healthcare, e.g. Pulse Oximeter
Narrowing scope from all or nothing thinking to identify particular triggers for a particular situation
NASA LLIS
Nested transactions are hard
Net90 is nice for cash but can make accrual difficult
No context -> turn it off?
No more accidents needs to be partnered with safe disclosure
NoEstimates talk
Non Vacation - Vacation (Work Vacation)
Non-Infrastructure Resilience for Product
Nora Jones about her experience with Chaos tools e.g. chapter 9 in the Chaos Eng books
Normalization of Deviance
Not a SPOF but Correlated failures
notes
Nucor steel - No more accidents - Clayton Christensen
one more chapter: https://sre.google/workbook/incident-response/
One potential case for runbooks - multi-party incidents/vendors: contact/escalation/mitigation
One way to quantize a qualitative measure of resilience: https://erikhollnagel.com/onewebmedia/RAG_introduction.pdf
Only 4 adaptive load strategies
OODA loops
OpenCollective
Ops can get noisy
or _If You Can't Measure It, Maybe You Shouldn't_ https://www.amazon.com/You-Cant-Measure-Maybe-Shouldnt/dp/8269037729
Organizational change and adaptive capacity. Regulating pace of change in light of adaptive capacity
Organizational changes for resilience
Organizations with physical presence take on operational excellence more
Other ways to address not waking people up: alternate scheduling approaches: more people 8 hour shifts, follow the sun, etc.
Ownership of choosing and interfacing with dependencies
Packet Collision avoidance algorithms primes
Paradox of Tolerance extended to Blamelessness
Partnering with AI for its strengths while adapting around its weaknesses
Paying for unhappy path is hard when it is unlikely
PDF of Software Complexity presentation in #resilience-coffee https://rands-leadership.slack.com/archives/C02J0KV3B55/p1657911889283029
Peak Preparedness
People are used to "flagpole" escalation, even when it slows things down
People need to agree to be measured/accountable, or they won't be accountable
People reach for easy solutions at retros, but are uncomfortable with discussing larger problems
Perceived difficulty is more important than actual difficulty
Perception of Speed
Perception of stability
performative attendance is a problem, especially when execs are present.
Phoenix Project, Brent
Pickup game- dynamic teaming https://www.youtube.com/watch?v=3boKz0Exros
Pitfalls of utilization metrics:
Plan for degradable experiences
Plan for failures
Playing to the audience rather than the instigator
Playing to the strengths of contractors: short term deprecations/migrations
Position consulting as coaching rather than doing, have a primary contact, have a retainer rather than hourly, expectations around comms
Post-Incident Interviews gain lots of insights, but people were less interested in the results
Post-incident survey
Post-Product Market Fit Reliability matters a lot more
Predefined Grafana dashboard(s): https://grafana.com/grafana/dashboards/10849-cassandra-dashboard/
Probability near zero is hard to reason about
Process management rather than building tools
Process Tracing, Fault Trees, Failure Modes and Effects Analysis
Product may see reliability as good enough, so if engineers are waking up, you may not be able to turn off alarms
Product should own reliability
Product should shadow oncall as user feedback
Production is unique, software development is about reuse, the two don't align
Productivity as a value driver
Project into future, with discounted cash flows / NPV
Projects involving multiple silos, instead of improving communication, create temporary cross-functional teams
Promos are easier when not a cost center
Proposals/Backlog: Analysis for Resilience over time
psema.org
pushing responsibility down into the org
Put a cost on the incident review document, both the immediate cost, and the opportunity cost of a fix
QA as coach rather than gate
Qualitative Metrics can be helpful
Quality expectations can increase coupling
Quality is Free
r9y.dev - definition of terms
Railroad crossing safety video
Rands Thread on burnout https://rands-leadership.slack.com/archives/CUAAP1A3G/p1693487876705699
rate of growth may be greater than user growth
Recovery time is important, especially after incidents, need the space to learn and grow
Reducing risk enables learning
Reliability as Revenue driver in other industries, e.g. Insurance
Reliability is easier to frame as a functional requirement: I want to use it when I want to use it
Reporting out metrics broadly about progress to encourage discussion
Requisite Variety http://pespmc1.vub.ac.be/REQVAR.html
Resilience is a verb not a noun https://www.researchgate.net/publication/329035477_Resilience_is_a_Verb
Resilience planning for transitioning in and out of jobs? (Aka how to smoothly transition healthcare/ emergency funds/ etc)
Resilience Systems before you need them
Resilience Theater
Resilience was a proxy for engineering retention, so that's less of a concern now
Resilient but unreliable distributed systems vs Less resilient reliable centralized systems
Resilient energy storage
Retrospective pitfall: Ski patrol cutting down trees that skiers crash into
Reward the right things
Rewarding Subject Matter Experts siloing knowledge is an antipattern
Right message/context to the right people at the right time <- hard
Risk management approaches
Risks have to go somewhere
Role of rules in incidents
Role of Software in Spacecraft Accidents
Ron Burt: Weak Ties/Structural Holes
Rotating team facilitation
Rules in Incidents
Ruling the Waves - History book on cycles from Piracy, Telegraph, Radio, TV, Internet
Runbook/Manual steps in prose as literate programming toward automation
Runbooks aren't usually used because of ambiguity
Runbooks should have an expiration date, how are you going to avoid needing it in the future?
Runbooks vs
Sabine Hossenfelder: Backreaction: Why does science news suck so much? (http://backreaction.blogspot.com/2022/06/why-does-science-news-suck-so-much.html?m=1)
Sacrifice decisions as a method to shed load
Sacrifice decisions/judgements
Safety II professionals: How resilience engineering can transform safety practice by Provan, Woods, Dekker, Rae comes to mind. Thai wrote a bit about it from a software perspective here: https://resilienceroundup.com/issues/safety-ii-professionals-how-resilience-engineering-can-transform-safety-practice/
SAVINGS
scaling up
SEBoK
Second Loop Learning - Argyris
Security industry has struck a good balance between sharing findings and privacy
Self awareness of why you do the things that you do
sensemaking is easier for generalists, specialists assume the sensemaking
Sensemaking: preparing for tail events https://ferd.ca/negotiable-abstractions.html
Separate the commander and the review facilitator: https://howie-guide.pagerduty.com/
Setting things up so that they benefit you either way events unfold
Shallow data (summary stats) really fail in complex scenarios
Share the theory for the questions to get other people to ask the questions too
Shared goals allow easier coordination.
Sharp edge of system gets harder
Sharp end teams focus more on sharp end things; blunt end people need retros, too. Reinforcing vs redesigning structure when they do get involved.
Shielding vs delegating, how to empower people; People should move on from roles in ~3 years, how to do that
Shifting from doing to applying the theory
Shouldn't be...
Situational leadership
Slack allows for change
SLOs may be able to relieve that pressure
SLOs measure things from the customer perspective, S stands for service in Customer Service, not "software listening over the network" service.
Software Development as Shared Vision
Some documentation for newcomers
Some people get annoyed by the slack messages as well, has anyone experienced that? This is challenging… Some companies can't decide Slack vs Zoom. Moved toward Slack because of availability of access, persistence of information, didn't need rebriefing
Someone directly involved in the response and mitigation of the incident
Speak in reverse seniority order
Specialist vs SPOF distinction
Specialists and Generalists, T-shaped people, diverse teams to respond to unexpected changes
SRE as Product
SRE Book: Chubby Lock Service introduced downtime
Staff Engineer Book Club (and developing Staff Engs in general)
Staff Engineer's Path
Startup experiences
Steve Yegge
Stop Blaming, Retributive justice: people stop reporting things
Stop Work Authority Card from modernagile.org
Story + Data paired is key
Streetlights and Shadows (book): topic of decision aids
Succession planning/pair programming/documentation
Sufficient capacity to handle the thing you're dealing with
Super connectors
Supporting collective memory of incidents
Surveys: comfortable with oncall
Swarming to support the team collectively
T-shaped engineer
Tangential: There is this number I've heard a long time ago around how "X% of Google's source code is being modified every month". IIRC, it was fairly high. I remember thinking "ah, that means that they have active knowledge about most of their systems, which is good!".
Tanya Reilly Talk: Acknowledge problems but don't fix them
Team Skill Maps
Tech people aren't Vulcans but unexamined emotional blobs
TED Talk: how to turn a group of strangers into a team “Pickup game and dynamic teaming” youtube
TEK: Traditional Ecological Knowledge
Temporary teams -> short lived solution, has its challenges
Terms like "Disaster Recovery/Business Continuity" may be an easier sell than Resilience Engineering: take the other person's perspective and use their terms to describe/“market” the issue, e.g. Disaster Recovery (external term) instead of Resilience Engineering (internal term)
Terraform to create the hosts, Ansible to configure them, Kubernetes is running on them, Helm is configuring K8s, KubeVela is generating Helm?
The ? of Tennis - nothing wrong with observing the process to identify improvements, what's wrong is feeling bad about the current state
The "incident responder team" rather than the "day to day team" - if you were involved, you were on the team
The Failure of Risk Management
The Four Agreements
The map is not the territory
The questions change in incident reviews as people navigate the blamelessness maturity model
The right coverage won't be found, so accept adapting after incidents
The right tools matter
The skill is in the tradeoffs
The system has told us what it needs but we are justifying things with best practices
The testing pyramid, including post-release/deploy: https://rands-leadership.slack.com/archives/C02J0KV3B55/p1657911889283029 (page 30)
The Three Boxes of Life
The variety of responses are a feature
There are actions we can take to reduce couplings
There needs to be a way to break the glass
There's also a piece of sensemaking there too: the solution may have been beautiful and perfect for the initial situation, but didn't have the freedom (or understanding) to shift as the world changed.
Thermocline of Truth https://brucefwebster.com/2008/04/15/the-wetware-crisis-the-themocline-of-truth/
Thinking of incidents differently: Beyond MTTR
Would this issue need to be fixed from higher up, as a culture? Not likely to be fully fixed in most companies, IMO.
This topic came out of https://stayrelevant.globant.com/en/technology/agile-organizations/active-knowledge-in-software-development/.
Threshold for what is a big deal matters for managing level of involvement in analysis
Throwing something at the wall to see if it will stick is an incident
Tight coupling between programmers and the code and services they write.
Tight feedback loops -> resolved problems; cross-team boundaries -> slower feedback, interfaces around those feedback mechanisms
Time to convert shadows into primary rotation proportional to cognitive load
Time to figure out the plan > time to implement the plan
Time zones are hard
Timeboxing as a way to solve (part of) burnout
Times of stability -> specialists can yield higher results, times of change -> generalists can be more flexible. Understanding the macro environment is critical as it can be
To know what 100% is, you need the customer perspective
Too clever for their own good
Too much automation can lead to alert fatigue
Too much context can also hurt
Top down vs bottom up metrics
Treating Resilience as a Product
Tricks of the Trade
Tricks of the Trade - how to effectively ask sociological research questions - avoid Why questions
Trust and Delegation, EAs trading responsibilities as signal for adaptive extensibility in leadership/polycentric governance
Trust and multi-party incidents
Trust but verify
Two elements of profit: revenue - cost, so when is reliability about increasing revenue vs reducing costs?
Understand the internals of contract vendors
Understand the power dynamics of your position, educate those who are violating that
Unemployment
Untangling types of incident reviews
Valuing Resilience is separate from being Resilient
Visualizations for concepts are important for visual learners
Waffle House index for hurricanes
We
we don't know what the problems are "unknown unknowns" so an "optimal" technician is something we can't know.
We don't see what the burnt-out people are balancing
What are some ways to measure resiliency
What are the things that can be dropped under duress?
What can we take away? (Rather than add)
What happens to Organizational Resilience when selecting for LeetCode?
What happens when one team is overtasked and others are undertasked?
What happens when we go beyond MTTR
What is the goal in setting goals?
What is the market for resilience?
What is your relationship with failure?
What promises have we made to customers that make us the most money?
What signal is indicating instability?
What’s the typical set up for people in configuring Kubernetes?
When complexity is abstracted away even for operators, then things are complicated
When human intervention is needed to respond
When is it going to fail? Hard to know
When is reliability NOT profitable? lots of examples
When is reliability profitable?
When the team doesn't prioritize the permanent fix, is that OK?
when unanticipated things happen
When we can't say names in an attempt to be blameless, it's a sign of blameful culture, e.g. "the CTO is a jerk"
When we jump in and fix things we take away Product's ownership
When/why is an incident done?
Where are all the _simple_ tools for deployment strategies? Blue-green, canary etc.
Where do incident review invites go?
Who did you design the system to be maintained by?
Who has ultimate responsibility for reliability? Some organizations haven't defined whether it is engineering or product
Who is hiring?
Who is responsible for connecting the company? Hiring/Development/Execs?
Who is the customer of resilience?
"Why" questions can force rationalization, but "how" questions defuse defensiveness
Work after incidents can disincentivize incidents
Working backward from listening to the system
Working cross-functionally
Writing a runbook can be a stepping stone toward automated remediation
Writing docs for self can help reduce coupling
Writing things down helps structure our thinking
Written vs unwritten rules, Schein's culture model
YAGNI: You ain't gonna need it
You don't have to account for all feedback, get buy in from your manager on that