Todo
Appearance
Raw notes to move elsewhere:
- Adaptive capacity
- Blamelessness
- Sense-making
- Service Level Objectives
- Service Level Indicators
- Service Level Agreements
- T-shaped skills
- Generalists
- Specialists
- MTTx
- MTBF
- Psychological Safety
- Learning Organizations
- Incident Command System
- Alert Fatigue
- Observability
- Redundancy
- OODA loop
- Second-loop learning
- Chaos engineering
- Game days
- Tabletops
- Runbooks
- Incident Severity Classification
- Root Cause Analysis
- Human Factors Engineering
- Cognitive Biases in Operations
- Recovery-Oriented Computing
- Microservices
- Distributed systems
- Degraded state
- Technical debt
- Site Reliability Engineering
- DevOps
- Change management
- Error Budgets
- Burnout
- Slack
- Failure Modes and Effects Analysis
- Incident Archaeology
- Disaster Recovery Planning
- Five Whys
- Complex systems
- How to Measure Anything
- Practice is critical
- "Everyone's busy", but few on the priority projects
- Generalists Specialists
- Goals
- Knowledge in the head and in the world
- Maintenance
- "Start where you stand" Emergency Response Triage training video
- Stay in your lane
- "the thing that drove circuit switching was not a technical requirement, the technical requirement followed from the business requirement"
- "This might be a stupid question, but I might be missing some context here..." disarms people, puts them in the position of expert, less of a challenge and more of a catch me up thing
- Retrospectives
- Compliance
- 7 Habits of Highly Effective People
- AI
- Incident
- Allspaw
- Argyris
- Second Loop Learning
- Cascading Surprises, Accidentally Load Bearing
- Chris Argyris
- Continuous Deployment! How to make a conservative org. take the leap.
- Conversational Capacity
- Data Resilience/Reliability/Governance
- David Woods Columbia profitability example, so what does the organization look like?
- DDoS mitigations
- Sidney Dekker quote on when to intervene => should most tech folks view incidents as high urgency?
- Disaster Response
- Documentation
- Donald Schön
- Error budgets - Charity Majors Honeycomb
- Eventual consistency
- Expertise
- First 90 Days
- Getting Things Done
- Goal setting, personal and work
- Graceful Extensibility
- How to run emergency response for a temporary city
- Howie guide
- Human error rate for changes
- Humble Inquiry
- John Seddon
- Kill IT with Fire
- Leading Change from the Middle
- LFI talks, e.g. Multi-party incidents
- Monoliths
- NUMMI plant
- Nvidia https://arstechnica.com/information-technology/2022/03/cybercriminals-who-breached-nvidia-issue-one-of-the-most-unusual-demands-ever/
- Oncall structure
- Organizational Learning
- Paradigm Shift: Circuits to Packets
- Risk management approaches
- Scaling
- Situational leadership
- Socratic Method as a way to encourage people to do things, but they need to take action themselves
- Specialist titles vs Generalist titles
- Speech Chain
- SPOF
- Streetlights and Shadows - guidance on when to use decision aids vs human deciders
- The Coaching Habit
- Thermocline of Truth
- Thinking in Systems book
- Timeoff as a resilience strategy
- To Teach
- Twitter 🤦♂️
- Ukraine
- W. Edwards Deming
- Watermelon projects
- Westrum model
- What Works for Women at Work
- Will Larson
- 1. rate of change matters, it could be a slow burn vs a sudden spike
- 2. underutilization is also a problem, if you are too careful of over utilization you will over pay for unused capacity
- 3 types of post-incident reviews: analysis, affected stakeholder reporting horizontally, stakeholder reporting vertically
- A balancing triangle: quantify impact - mitigate - understand what happened
- A consultancy pattern: "Heroic" programmer generalists -> intentionally a small team of effective generalists
- A developer's expertise about their own code has a short shelf-life.
- A huge Cassandra cluster incident within Fintech
- A review facilitator
- A team of generalists with diverse specialties is resilient
- Aaron Halfaker Immune Response research
- Abstracting away complexity makes some things easier/faster
- Accountability to do better in the future can help with burnout, even if there's a short term pain
- Accounting for costs: avoid over inflating costs
- Adaptive Capacity is less of a fuel and more about stance
- Adaptive Capacity relationship with addition/removal of people is non-linear
- Agile Definition of Done but applied to Incidents
- Agree upon commonalities so people can act independently
- AI for creation -> humans for curation
- AI is an accelerant, but may not be an improvement
- Alternatives to TTR
- And (Inverse)
- Annualized costs can be tricky
- Another caution: humans have a non-linear response around risk. Making things safer can lead to more risky behavior. Perceived safety vs realized safety. “Dutch helmets”: a pattern of increasing the risk to motivate more skill. Instead of encouraging helmets, encourage safer bicycling skills.
- Anti-fragility vs stability vs fragility - Taleb
- Anti-fragility can strengthen the system upon restoration
- API as communications channels, pros and cons
- Are the CTOs reading the same stuff?
- Are there areas of the system you are worried about?
- Ask the seniors what the worst could be
- Asking descriptive questions, how is better than why
- Asking product for 9s doesn't work - SLOs??? Cost?
- Asserting I'm adding a hypothesis to the list is different than narrowing the model
- Atoms and Bits: copying is hard vs easy; implementing new is easy vs hard
- Attributability of investment is important
- Avoid engineer's distraction vs informing why
- Avoid focus on implementing specific patterns when there are more burning issues
- Avoid too much process in incident review, leave flexibility for different types of insights for near misses
- AWS architecture patterns
- B = MAP Behavior = Motivation * Ability * Prompt (BJ Fogg - Tiny Habits)
- Backlog: "Is it okay to run the dishwasher when it's only a quarter full?"
- Backlog: Cross-training skills within a team
- Backlog: Observability and environment isolation
- Backlog: Universal Design, Accessibility
- Backlog: Automated alert creation
- Backlog: Avoiding incidents by maximizing active knowledge. It's common that incidents happen when old untouched systems are being modified. Can we
- Backlog: Ethics of balancing resolution with space for learning
- Backlog: How do you know if your team is doing well AND productive?
- Backlog: How does everyone do Security in your platform?
- Backlog: How should people ask thought-provoking questions that help drive the conversation while also avoiding annoying people?
- Backlog: How to prevent failures of omission/ ensure you're taking sufficient risk?
- Backlog: Observability and Data
- Backlog: Optimum investment, mapping resilience to business value, perhaps with SLOs
- Backlog: Putting lots of ambitious goals on list, discovering you've moved from laminar to turbulent flow
- Backlog: recovery: how to make sure you're not wasting "stimulus" by not building on the lessons it's teaching?
- Backlog: Resilience against underresourcing.
- Backlog: Resilient skillsets moving into the GPT-4 era?
- Backlog: Simplicity: Subtraction
- Backlog: Structured Delegation and Accountability as a way to establish organizational resilience
- Backlog: Succession planning, avoiding human SPOFs
- Backlog: What to do when everything is broken
- Backseat Incident Commanders
- Baader-Meinhof phenomenon
- balancing dev / infra headcount
- Balancing dimensions of resilience: technical/non-technical
- bar raiser analysts
- Behind Human Error
- Being unaware of power dynamic can lead to the same words being taken differently
- Benefits of Monorepos
- Benefits of standards vs risk of monoculture
- Better Standard Libraries, Better Languages
- Beyond Culture
- bias, decision aids, availability bias, streetlights and shadows
- Blameful retros become an incident
- Block of time for learning
- Book recommendations from Resilience Coffee
- Bounded contexts in domain driven design
- Breakout sessions for different parts of a larger incident
- Brené Brown - Expectations
- Bug injection as a way to gain context on unfamiliar codebases
- Build vs Buy
- Building consensus for prioritizing paying down Tech debt
- building resilient (virtual) communities: immune response, succession planning
- Burnout frustrations
- Burnt out people can get into a bunker mentality, do what they are told but don't drive forward
- Can do tabletops
- Can you achieve alignment while maintaining self direction?
- Canaries
- Carving up a hard problem can make things "bigger" but can also make things more brittle
- Cassandra analytics
- Celebrate the failures to encourage blamelessness
- Centralized direction setting for developing decentralized resilience capabilities
- Ceremonies, socio-technical systems, talking to people rather than bugfixes
- Ch. 6 Seddon: Beyond Command and Control
- Chasing down threads in an incident -> better observability
- ChatGPT
- Checklists, compliance, reporting
- Chesterton's Fence before adding another layer of abstraction?
- Clarify accountability
- Clients paying for 2 9s, getting used to 5 9s, when it goes to 3 9s, they get upset
- Code freeze after incident across ~6 weeks or so
- Code freezes: what works and doesn't
- Combining technologies for attribution is tricky
- Comfort with idle
- Communities of Practice
- Communities that accept help are more resilient
- Companies aren't learning from each other and making the same mistakes
- Companies don't know what they know: unknown knowns
- Compensating measures are important
- Consensus on a small set of KPIs/SLOs can help focus a company
- consequences when engineering makes mistakes for customers
- Context matters, more context given to the responder helps
- Control Points = Ability to handle Perturbation
- Conway said the opposite: you make decisions around who will do what, e.g. interpret and action policy. Domains of responsibility close off pathways to other decisions that might have been better. Consequence -> people will build within their loops
- Core Ethical question for incident commanders
- cost of 5 minute oncall
- Costs of centralization
- Creating an "Incident Vibe"
- Credit the AI when you use it
- Cross cutting concerns are difficult without clearly defined interfaces for both code and people
- Cultural work is hard
- Culture Map (mixed reviews)
- Curiosity > frustration
- Customers get used to your reliability even if it isn't what you state
- cycle of expanding interventions, automation, not having all hazard interventions
- Dan Davies’s LYING FOR MONEY
- Dashboard values are shorthand for a story -> what story do we share with management about what's important to us and our business, how do we represent that in the dashboards, numbers/status is important, but what matters is the story
- Dashboards: Key Hole Property
- Data chain of custody/provenance matters for cleanup
- Data platform can be a wonderful place to surface the conversation
- Data quality for making decisions has a lower standard
- David Woods, Joint Cognitive Systems page 87, "Alarms and Directed Attention" Alarm is an agent trying to redirect my attention, how good is it at doing that? Medicine is an example
- Dealing with Silos
- dedication analysts
- defensiveness can get people to shutdown and avoid new information
- Designing Resilience in from the beginning
- DevOps Tel Aviv video
- Did these analogies work in the original domains?
- Difficult to split line items
- Digital Marketers do a great job about spend and return across the different ad markets, last click attribution vs combining attribution channels
- Diminishing returns, significance of marketing, ideal customer base, they lose traceability in organic conversions, affiliates
- DiRT test with multiple teams for an upcoming yearly throughput peak
- Discoverability : Making complexity shallow
- Discussing: (NEW) Cassandra, NewSQL
- Discussing: (NEW) Cloud Migrations
- Discussing: Financial resilience
- Discussing: Nepal
- Discussing: SawStop
- Discussing: Automated alert management (as a resource for preventing Burnout/alert fatigue/line between good alerts and noise)
- Discussing: When is resilience profitable?
- distance is inverse similarity; is risk inverse adaptive capacity?
- Divergence between commitments and expectations
- Do you separate people who fix from those who create problems?
- documentation
- Does QA stop at release these days? Quality engineering?
- Doing that ^ ahead of time helps
- Don't (always) apply Manufacturing analogies to Knowledge Work
- Doneness is org specific
- Downtime budget
- Dynamic teaming
- Dynamically stable, but perceived as unstable
- Ecological Interfaces
- Ed Lau
- Effort to move is a factor too
- eight r’s at amazon. restart, remove,…
- Eisenhower Matrix
- Embedded design principle: be able to recover sensemaking from the design
- embracing inconsistency
- Employed partners in uncorrelated industry
- Empowering engineers is critical
- Empowering the person who is paged to fix the source of the page rather than just mitigating it.
- Encourage all factors/all hazard/all risk mitigations
- Encouraging leadership from the team
- Encouraging positive change > discouraging negative changes
- Entire leadership chain is important, need to get buy in from middle management
- Erika Rowland: We need an inverse round-up, a resilience engineering round-up of things not from resilience engineering proper.
- Error budget
- error budgets with SLOs as alternative to MTTR - define what you're measuring, from a product perspective
- Escalation metrics, dividing things into critical/non-critical/informational
- Escalation Path to Org Chart
- Etsy three armed t-shirt for biggest oops https://www.ecommercebytes.com/2021/03/29/etsy-gives-award-to-coder-who-crashed-the-site-last-year/
- Evaluators of burnt-out people are also burnt out
- Every company has a policy, many are just undocumented
- “every solution contains the seeds of its undoing”
- Everyone has a model, hypothesis generation, then aligning those models, separate from hypothesis testing
- Examples of usage of RAG: Water: https://onlinelibrary.wiley.com/doi/full/10.1111/wej.12539 ; Healthcare: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9635744/ ; a more general analysis of it: https://link.springer.com/chapter/10.1007/978-3-031-12547-8_4
- Executive understanding of design patterns may be more impactful than PE understanding of design patterns
- expectation of coverage seeking
- External human factors are important
- Extracting patterns and standardizing
- Failure Mode Effects analysis
- Failure mode effects sessions
- Failure of Risk Management NASA Ski Hill Quote
- Fall risk, Required RCA/C?A, perf, encoding of the ideas into the policy
- Fault trees
- Feature flags solve the problem of "gradual rollouts" - sometimes "hash(userId) % 100 < 10" is enough
- FEMA has “All Hazard. All Risk” mitigations. In software, those can become automated later. These mitigations make failure cheap for the business. That can change the risk-reward trade-off which enables learning. The next stage of growth comes as that automation breaks and creates different complexity for the incidents.
- ferd.ca quote - "It's all going to hell anyway, we can just control how fast it goes"
- Finance tends to look at changes, increases, rather than ongoing costs
- Finding a new job while burnt out is hard, while an existing job is burning you out
- Finding balance of product and engineering
- FireHydrant
- First 90 days
- First rule of incidents: Stay calm
- Followup - how to evaluate IC value to teams/process
- Font incident
- Fragmented information, no one is trying to gather overall picture
- Framing of questions is important in analysis
- Free tier as a way to simplify charge backs
- from Aikido for Incidents
- From Blameless to Sanctionless
- From Waterfall to Lean/Continuous accounting
- From whom and to whom matters
- Game day/chaos testing can also train people
- Game days while seniors are on vacation
- Gender matters about "stupid question" framing, there's probably another way to frame it so that you own your expertise but drive to understand the context of the other
- General Meeting Guide for Mixed Seniority meetings
- Get agreement on what problem to solve
- Get people talking to each other, the tools will follow
- Getting buy in from above on tech debt
- Getting Buy in from middle management, communicating insights
- Glue work always helps the team
- Goals give direction
- Goals: Is it meaningful to set Resilience Goals?
- Going back and fixing action items is hard to prioritize
- Good Runbooks lead to stasis where people are comfortable with minimal pain
- Good Strategy/Bad Strategy - Obstacles to overcome, What we're not going to do
- Google SRE abandoned Phone bridges early on -> using a collaborative document works well once you've had practice. Phone bridges are single threaded, stepping on each other's words, any other method besides phone bridge works wonders, but it takes practice
- Graceful extensibility - managing adaptive capacity https://www.researchgate.net/publication/327427067_The_Theory_of_Graceful_Extensibility_Basic_rules_that_govern_adaptive_systems
- Growth Engineering as Reliability Engineering
- Guesstimate for estimating incident duration
- Hardware approaches don't apply to software
- Have a side gig/consulting company
- Have a third party facilitate the incident analysis
- Have comms drafted for common types of incidents
- Have load that can be shed
- Health vs. debugging metrics
- Healthcare: COBRA, or purchase it for sidegig company, Healthcare.gov, state hc marketplaces, partner's employer coverage
- Hedonistic tendency
- Helping analysts to engineer
- Helping engineers to understand the business
- Helping teams get out of Alert fatigue
- Here’s one way I’ve gotten us to talk about the organizers’ skills and interests — with a matrix https://docs.google.com/spreadsheets/d/17coR7yIoqVXFvA-Q0oiP7BU_14CiKuWWnxjPKd3j_sY/edit#gid=1902703774
- Hidden Figures 10x engineer
- Highly specialized workplaces can make it hard to find the right people, even if the symptoms are solvable
- Hijacked retros : Someone shows up to a blameless retro with blame, power dynamics/punishment
- Hindsight bias: potentially more toxic than blame
- Hiring
- Hollnagel's Resilience Analysis Grid https://erikhollnagel.com/ideas/resilience%20assessment%20grid.html
- Hot potato the alert to other people to get context
- How do I lower the risk of change -> make more smaller changes
- How do organizations change toward DevOps?
- How do we tell the story/visualize the incident?
- How do you allocate on-call duties amongst teams/individuals?
- How do you build buffer/ slack into your day? Do you just keep yourself/ your team fully subscribed and drop things if you need to reprioritize?
- How do you know "who's the expert" in the org?
- How do you know if an organization is resilient?
- How do you know your services aren't overbuilt?
- How do you measure productivity in Incident Management teams
- How do you mid-day mental GC to prevent OOM'ing too early?
- How does an organization learn from mistakes?
- How much top down buyin for RE is there? Is it a factor when designing organizations?
- How reliable do you want to spend?
- How Slack uses Slack for Incidents
- How Tenuous are things?
- How the questions are asked matters
- How to balance high bandwidth communication with above... use both!
- How to build resilience in before there's too much demand
- How to empower people who don't see the power they have?
- How to Measure Anything - https://www.howtomeasureanything.com
- How to measure Learning?
- How to navigate too many displays, visual momentum. analysis of paper: https://resilienceroundup.com/issues/how-not-to-have-to-navigate-through-too-many-displays/
- How to prepare for unanticipated external threats in a generic way
- How to reduce noise by reviewing the alerts periodically
- How well does our understanding match reality?
- http://sixpack.seatgeek.com/
- http://sunnyday.mit.edu/accidents/jsr-final.pdf
- http://www.melconway.com/Home/Committees_Paper.html
- https://about.sourcegraph.com/batch-changes
- https://cate.blog/2021/11/29/5-signs-its-time-to-quit-your-job/
- https://codeball.ai
- https://comic.browserling.com/97
- https://cse.umn.edu/umsec/events/code-freeze-2023-tech-resilience
- https://danlebrero.com/2021/06/30/cto-dairy-lucky-lotto-chaos-engineering-for-teams/
- https://dashbit.co/blog/kubernetes-and-the-erlang-vm-orchestration-on-the-large-and-the-small
- https://docs.google.com/spreadsheets/d/1-5EGtpt6ZBE19ktle4lc577QnRCCRSXWiivojYoS4xA/edit?usp=sharing
- https://dora.dev/devops-capabilities/cultural/generative-organizational-culture/
- https://erikhollnagel.com/onewebmedia/RAG%20Outline%20V2.pdf
- https://essenceofsoftware.com/
- https://ferd.ca/complexity-has-to-live-somewhere.html
- https://ferd.ca/notes/ Fred Hebert's notes about various things
- https://ferd.ca/notes/paper-ecological-interfaces-a-technological-imperative-in-high-tech-systems.html
- https://fourweekmba.com/scheins-model-of-organizational/
- https://frameshiftconsulting.com/ally-skills-workshop/
- https://gigamonkeys.com/flowers/
- https://github.com/JensRantil/java-canary-tools
- https://github.com/lorin/resilience-engineering
- https://github.com/randsleadershipslack/employer-test
- https://grenfellenquirer.blog/catastrophe-systemic-change-the-book/
- https://journals.lww.com/transplantjournal/Fulltext/2007/12270/Probabilistic_Risk_Assessment_of_Accidental.12.aspx
- https://kubevela.io
- https://lethain.com/forty-year-career/
- https://martinfowler.com/articles/measuring-developer-productivity-humans.html
- https://medium.com/10x-curiosity/boundaries-of-failure-rasmussens-model-of-how-accidents-happen-58dc61eb1cf
- https://mitpress.mit.edu/books/digital-apollo
- https://ncase.me/polygons/
- https://ndmc.pyd.org/
- https://news.ycombinator.com/item?id=32196345
- https://news.ycombinator.com/item?id=32319147
- https://oam.dev/
- https://onlinelibrary.wiley.com/doi/full/10.1111/psj.12212
- https://pragprog.com/titles/atcrime/your-code-as-a-crime-scene/
- https://pragprog.com/titles/ehxta/explore-it/
- https://prometheus.io/docs/concepts/metric_types/#summary
- https://queue.acm.org/detail.cfm?id=3096459
- https://rands-leadership.slack.com/archives/C02J0KV3B55/p1657911889283029
- https://rands-leadership.slack.com/archives/CCAMQC0H1/p1666456526936469
- https://resilienceroundup.com
- https://resilienceroundup.com/issues/four-concepts-for-resilience-and-the-implications-for-the-future-of-resilience-engineering/ Resilience Roundup: "Four concepts for resilience and the implications for the future of resilience engineering", a paper by David Woods, written up by Thai Wood, Apr 3rd, 2021. Erika Rowland's notes on the same paper: https://erikarow.land/notes/paper-four-concepts-resilience
- https://resilienceroundup.com/issues/measuring-system-resilience-with-the-resilience-analysis-grid/
- https://resilienceroundup.com/issues/the-role-of-software-in-spacecraft-accidents/
- https://rls.social/@alper/109806169938390380
- https://shermanonsoftware.com/2024/04/08/fixing-all-the-bugs-wont-solve-all-the-problems-demings-path-of-frustration/
- https://spinnaker.io/
- https://sre.google/sre-book/managing-incidents/
- https://stayrelevant.globant.com/en/technology/agile-delivery/active-knowledge-in-software-development/
- https://surfingcomplexity.blog/2025/02/01/youre-missing-your-near-misses/
- https://twitter.com/allspaw/status/1177204840432361472
- https://us.macmillan.com/books/9781250249869/subtract
- https://wheeldecide.com
- https://www.adaptivecapacitylabs.com/blog/2018/03/23/moving-past-shallow-incident-data/
- https://www.amazon.com/Design-Implementation-FreeBSD-Operating-System/dp/0201702452
- https://www.amazon.com/Driving-Technical-Change-Terrence-Ryan/dp/1934356603
- https://www.blackhillsinfosec.com/projects/backdoorsandbreaches/
- https://www.brendangregg.com/systems-performance-2nd-edition-book.html
- https://www.capterra.com/glossary/hippo-highest-paid-persons-opinion-highest-paid-person-in-the-office/
- https://www.getguesstimate.com
- https://www.happy-or-not.com/en/
- https://www.keyvalues.com
- https://www.learningfromincidents.io
- https://www.levels.fyi
- https://www.penguinrandomhouse.com/books/303275/the-idea-factory-by-jon-gertner/
- https://www.penguinrandomhouse.com/books/557044/palaces-for-the-people-by-eric-klinenberg/
- https://www.sciencedirect.com/science/article/abs/pii/B9780444818621500923
- https://www.simonandschuster.com/books/Lying-for-Money/Dan-Davies/9781982114947
- https://www.vox.com/videos/23989817/madagascar-village-crater
- https://www.youtube.com/watch?v=CbSiKAtO7Fk&list=PLQmwzq_GIU-idCnJNR4t_aKb0HDCOXfZ1&index=12
- https://www.youtube.com/watch?v=cKurUbYvWLA
- https://www.youtube.com/watch?v=CMR9z9Xr8GM
- https://www.youtube.com/watch?v=gfINfi2K1lE
- https://www.youtube.com/watch?v=GXxHiZvxRSE&t=205s
- https://www.youtube.com/watch?v=LrK_1ePmz54
- https://www.youtube.com/watch?v=rgV4HLSd1dk
- https://www.youtube.com/watch?v=Zj48LExaY00
- https://xkcd.com/2347/
- Humans are eventually consistent
- Humble Inquiry - Edgar Schein
- Identify Early warning signs of burnout
- Identify the main drivers of cost and attribute them, then support them with improving the driver
- If your response rate is high, aim for the fences
- Incident Commanders should kick the managers out if needed, but can fall back to leaning on the Leadership, deflect
- Incident log: structured/unstructured text
- Incident Story Time
- Incident story time - tell a story, less of an analysis, more of a narrative
- Incident story time as a relief valve for larger audiences
- Include senior leadership earlier in the process rather than being surprised when they jump in last minute
- Influencing without authority (contractors/employee mix)
- Internal Product Market Fit
- Interview to onsite stage at least once a year
- Introducing blamelessness is an important step
- Introducing complex topics to groups is hard, sometimes they get simplified and perpetuated
- Introducing error budget can help a lot when introducing continuous delivery
- Intros: Name, Location, Occupation, Ideas
- Invest in All Hazard all Risk, and the return will be bigger
- involvement in an incident review can be for learning
- Ironies of Automation - Bainbridge
- Is that method (multi-channel async communication) written up somewhere? We’ve sometimes spun up dedicated incident slack channels, but that has its own challenges
- Is the world a more resilient place than it was a year ago? How have we improved our resilience over the last year?
- It has to live somewhere - you cannot wish it away
- Ivan's talk on Learning Products for different levels
- Jez Humble Lean Enterprise - Cost of Delay/Duration
- Job leads
- Job search
- Joint Cognitive Systems
- jointly craft a story - "sense-making"
- Just In Time delivery/Supply Chain and buffers for succession planning are related
- Kelly Shortridge, Security Chaos Engineer : Resilience is the same thing you do when you make your code refactorable, understanding what you want to accomplish is important https://www.youtube.com/watch?v=AxqX9ovGViw
- labelling things as an experiment can make starting easier but can make followthrough harder
- lack of troubleshooting skills as industry technology shifts
- Law of Stretched Systems: Every system is stretched to operate at its capacity. https://github.com/lorin/resilience-engineering/blob/master/laws.md#law-of-stretched-systems All systems are redlining.
- Leadership is important - make sure people understand the broader context
- Leadership Pipeline
- Leading Geeks, Paul Glen: Why can't we do good estimates?
- Learned resilience rather than designed resilience
- Learning from Failures
- Learning from Incidents conference
- Learning incidents -> like learning a new language
- Learning to use the tools as they are new
- Let's remove staging environment! Does it add value?
- LFI as a beginning of a movement
- LFI debrief
- Life insurance/LT Disability/etc.
- Linguistics
- Literate Programming
- Ludic fallacy
- Make all your production lines able to do slightly larger and slightly smaller vehicles, to account for cascading failures
- Make it okay to make mistakes and learn
- Make the work visible to make delegation worth celebrating
- Making exceptions clear to others is hard, making them known to self is also hard
- Making it okay to learn things
- Managerial accounting separate from compliance
- Many systems don't have a "100%"
- Margin (vis a vis buffer) vs lean/ efficiency
- Matt Davis talks on practicing incidents: https://www.sounding.com/2021/12/20/practice-of-practice-gamelan/ and https://emamo.com/event/developerweek-2022/r/speaker/matt-davis-2
- Meaningful Metrics
- Measuring Engineer hours spent on an incident
- Mechanism of action of Conway's law: why is it that it manifests?
- Meeting culture: lots of attendance, low participation/attention -> Ask what people are getting out of it, "no one can blame me for not going"
- Metrics around TTR -> team cares around definition of done
- Minor demo! https://github.com/JensRantil/conc
- Minutes are more intuitive than 9s
- Mission and Burnout can be in conflict
- monitoring/logging
- Month's end can be busy, but then things go to quarterly, but then the cash flow is further from accrual accounting
- Moving teams from Robustness to Resilience
- multi-party dilemma talk from February LFI Conference (incidents with vendors) https://www.youtube.com/watch?v=CbSiKAtO7Fk&list=PLQmwzq_GIU-idCnJNR4t_aKb0HDCOXfZ1&index=12
- Multi-tenant teams make attribution harder, technically, but then culturally and governance
- Mushroom theory of management - Tracy Kidder's The Soul of a New Machine
- Muting alarms in healthcare, e.g. Pulse Oximeter
- Narrowing scope from all or nothing thinking to identify particular triggers for a particular situation
- NASA LLIS
- Nested transactions are hard
- Net90 is nice for cash but can make accrual difficult
- No context -> turn it off?
- No more accidents needs to be partnered with safe disclosure
- NoEstimates talk
- Non Vacation - Vacation (Work Vacation)
- Non-Infrastructure Resilience for Product
- Nora Jones about her experience with Chaos tools e.g. chapter 9 in the Chaos Eng books
- Normalization of Deviance
- Not a SPOF but Correlated failures
- notes
- Nucor steel - No more accidents - Clayton Christensen
- one more chapter: https://sre.google/workbook/incident-response/
- One potential case for runbooks - multi-party incidents/vendors: contact/escalation/mitigation
- One way to quantize a qualitative measure of resilience: https://erikhollnagel.com/onewebmedia/RAG_introduction.pdf
- Only 4 adaptive load strategies
- OODA loops
- OpenCollective
- Ops can get noisy
- or _If You Can't Measure It, Maybe You Shouldn't_ https://www.amazon.com/You-Cant-Measure-Maybe-Shouldnt/dp/8269037729
- Organizational change and adaptive capacity. Regulating pace of change in light of adaptive capacity
- Organizational changes for resilience
- Organizations with physical presence take on operational excellence more
- Other ways to address not waking people up: alternate scheduling approaches: more people 8 hour shifts, follow the sun, etc.
- Ownership of choosing and interfacing with dependencies
- Packet Collision avoidance algorithms primes
- Paradox of Tolerance extended to Blamelessness
- Partnering with AI for its strengths while adapting around its weaknesses
- Paying for unhappy path is hard when it is unlikely
- PDF of Software Complexity presentation in #resilience-coffee https://rands-leadership.slack.com/archives/C02J0KV3B55/p1657911889283029
- Peak Preparedness
- People are used to "flagpole" escalation, even when it slows things down
- People need to agree to be measured/accountable, or they won't be accountable
- People reach for easy solutions at retros, but are uncomfortable with discussing larger problems
- Perceived difficulty is more important than actual difficulty
- Perception of Speed
- Perception of stability
- performative attendance is a problem, especially when execs are present.
- Phoenix Project, Brent
- Pickup game- dynamic teaming https://www.youtube.com/watch?v=3boKz0Exros
- Pitfalls of utilization metrics:
- Plan for degradable experiences
- Plan for failures
- Playing to the audience rather than the instigator
- Playing to the strengths of contractors: short term deprecations/migrations
- Position consulting as coaching rather than doing, have a primary contact, have a retainer rather than hourly, expectations around comms
- Post-Incident Interviews gain lots of insights, but people were less interested in the results
- Post-incident survey
- Post-Product Market Fit Reliability matters a lot more
- Predefined Grafana dashboard(s): https://grafana.com/grafana/dashboards/10849-cassandra-dashboard/
- Probability near zero is hard to reason about
- Process management rather than building tools
- Process Tracing, Fault Trees, Failure Modes and Effects Analysis
- Product may see reliability as good enough, so if engineers are waking up, you may not be able to turn off alarms
- Product should own reliability
- Product should shadow oncall as user feedback
- Production is unique, software development is about reuse, the two don't align
- Productivity as a value driver
- Project into future, with discounted cash flows / NPV
- For projects involving multiple silos, instead of improving communication, create temporary cross-functional teams
- Promos are easier when not a cost center
- Proposals/Backlog: Analysis for Resilience over time
- psema.org
- pushing responsibility down into the org
- Put a cost on the incident review document, both the immediate cost, and the opportunity cost of a fix
- QA as coach rather than gate
- Qualitative Metrics can be helpful
- Quality expectations can increase coupling
- Quality is Free
- r9y.dev - definition of terms
- Railroad crossing safety video
- Rands Thread on burnout https://rands-leadership.slack.com/archives/CUAAP1A3G/p1693487876705699
- rate of growth may be greater than user growth
- Recovery time is important, especially after incidents, need the space to learn and grow
- Reducing risk enables learning
- Reliability as Revenue driver in other industries, e.g. Insurance
- Reliability is easier to frame as a functional requirement: I want to use it when I want to use it
- Reporting out metrics broadly about progress to encourage discussion
- Requisite Variety http://pespmc1.vub.ac.be/REQVAR.html
- Resilience is a verb not a noun https://www.researchgate.net/publication/329035477_Resilience_is_a_Verb
- Resilience planning for transitioning in and out of jobs? (Aka how to smoothly transition healthcare/ emergency funds/ etc)
- Resilience Systems before you need them
- Resilience Theater
- Resilience was a proxy for engineering retention, so that's less of a concern now
- Resilient but unreliable distributed systems vs Less resilient reliable centralized systems
- Resilient energy storage
- Retrospective pitfall: Ski patrol cutting down trees that skiers crash into
- Reward the right things
- Rewarding Subject Matter Experts siloing knowledge is an antipattern
- Right message/context to the right people at the right time <- hard
- Risk management approaches
- Risks have to go somewhere
- Role of rules in incidents
- Role of Software in Spacecraft Accidents
- Ron Burt: Weak Ties/Structural Holes
- Rotating team facilitation
- Rules in Incidents
- Ruling the Waves - History book on cycles from Piracy, Telegraph, Radio, TV, Internet
- Runbook/Manual steps in prose as literate programming toward automation
- Runbooks aren't usually used because of ambiguity
- Runbooks should have an expiration date, how are you going to avoid needing it in the future?
- Runbooks vs
- Sabine Hossenfelder: Backreaction: Why does science news suck so much? (http://backreaction.blogspot.com/2022/06/why-does-science-news-suck-so-much.html?m=1)
- Sacrifice decisions as a method to shed load
- Sacrifice decisions/judgements
- Safety II professionals: How resilience engineering can transform safety practice by Provan, Woods, Dekker, Rae comes to mind. Thai wrote a bit about it from a software perspective here: https://resilienceroundup.com/issues/safety-ii-professionals-how-resilience-engineering-can-transform-safety-practice/
- SAVINGS
- scaling up
- SEBoK
- Second Loop Learning - Argyris
- Security industry has done a good balance of sharing findings with privacy
- Self awareness of why you do the things that you do
- sensemaking is easier for generalists, specialists assume the sensemaking
- Sensemaking: preparing for tail events https://ferd.ca/negotiable-abstractions.html
- Separate the commander and the review facilitator: https://howie-guide.pagerduty.com/
- Setting things up so that they benefit you either way events unfold
- Shallow data (summary stats) really fail in complex scenarios
- Share the theory for the questions to get other people to ask the questions too
- Shared goals allow easier coordination.
- Sharp edge of system gets harder
- Sharp-end teams focus more on sharp-end things; blunt-end people need retros, too. Reinforcing vs redesigning structure when they do get involved.
- Shielding vs delegating, how to empower people; People should move on from roles in ~3 years, how to do that
- Shifting from doing to applying the theory
- Shouldn't be...
- Situational leadership
- Slack allows for change
- SLOs may be able to relieve that pressure
- SLOs measure things from the customer perspective; the S stands for service as in Customer Service, not "software listening over the network" service.
- Software Development as Shared Vision
- Some documentation for newcomers
- Some people get annoyed by the slack messages as well, has anyone experienced that? This is challenging… Some companies can't decide Slack vs Zoom. Moved toward Slack because of availability of access, persistence of information, didn't need rebriefing
- Someone directly involved in the response and mitigation of the incident
- Speak in reverse seniority order
- Specialist vs SPOF distinction
- Specialists and Generalists, T-shaped people, diverse teams to respond to unexpected changes
- SRE as Product
- SRE Book: Chubby Lock Service introduced downtime
- Staff Engineer Book Club (and developing Staff Engs in general)
- Staff Engineer's Path
- Startup experiences
- Steve Yegge
- Stop Blaming, Retributive justice: people stop reporting things
- Stop Work Authority Card from modernagile.org
- Story + Data paired is key
- Streetlights and Shadows (book): topic of decision aids
- Succession planning/pair programming/documentation
- Sufficient capacity to handle the thing you're dealing with
- Super connectors
- Supporting collective memory of incidents
- Surveys: comfortable with oncall
- Swarming to support the team collectively
- T-shaped engineer
- Tangential: There is this number I've heard a long time ago around how "X% of Google's source code is being modified every month". IIRC, it was fairly high. I remember thinking "ah, that means that they have active knowledge about most of their systems, which is good!".
- Tanya Reilly Talk: Acknowledge problems but don't fix them
- Team Skill Maps
- Tech people aren't Vulcans but unexamined emotional blobs
- TED Talk: How to turn a group of strangers into a team (“Pickup game and dynamic teaming”), on YouTube
- TEK: Traditional Ecological Knowledge
- Temporary teams -> short lived solution, has its challenges
- Terms like "Disaster Recovery/Business Continuity" may be an easier sell than Resilience Engineering: take the other person's perspective and use their terms to describe/“market” the issue, e.g. Disaster Recovery (external term) instead of Resilience Engineering (internal term)
- Terraform to create the hosts, Ansible to configure them, Kubernetes is running on them, Helm is configuring K8s, KubeVela is generating Helm?
- The ? of Tennis - nothing wrong with observing the process to identify improvement; what's wrong is feeling bad about the current state
- The "incident responder team" rather than the "day to day team" - if you were involved, you were on the team
- The Failure of Risk Management
- The Four Agreements
- The map is not the territory
- The questions change in incident reviews as people navigate the blamelessness maturity model
- The right coverage won't be found, so accept adapting after incidents
- The right tools matter
- The skill is in the tradeoffs
- The system has told us what it needs but we are justifying things with best practices
- The testing pyramid, including post-release/deploy: https://rands-leadership.slack.com/archives/C02J0KV3B55/p1657911889283029 (page 30)
- The Three Boxes of Life
- The variety of responses are a feature
- There are actions we can take to reduce couplings
- There needs to be a way to break the glass
- There's also a piece of sensemaking there too: the solution may have been beautiful and perfect for the initial situation, but didn't have the freedom (or understanding) to shift as the world changed.
- Thermocline of Truth https://brucefwebster.com/2008/04/15/the-wetware-crisis-the-themocline-of-truth/
- Thinking of incidents differently: Beyond MTTR
- This issue would need to be fixed from higher up, as a culture? It likely can't be fully fixed at most companies, IMO.
- This topic came out of https://stayrelevant.globant.com/en/technology/agile-organizations/active-knowledge-in-software-development/.
- Threshold for what is a big deal matters for managing level of involvement in analysis
- Throwing something at the wall to see if it will stick is an incident
- Tight coupling between programmers and the code and services they write.
- Tight feedback loops -> resolved problems; cross-team boundaries -> slower feedback, interfaces around those feedback mechanisms
- Time to convert shadows into primary rotation proportional to cognitive load
- Time to figure out the plan > time to implement the plan
- Time zones are hard
- Timeboxing as a way to solve (part of) burnout
- Times of stability -> specialists can yield higher results, times of change -> generalists can be more flexible. Understanding the macro environment is critical as it can be
- To know what 100% is, you need the customer perspective
- Too clever for their own good
- Too much automation can lead to alert fatigue
- Too much context can also hurt
- Top down vs bottom up metrics
- Treating Resilience as a Product
- Tricks of the Trade
- Tricks of the Trade - how to effectively ask sociological research questions - avoid Why questions
- Trust and Delegation, EAs trading responsibilities as signal for adaptive extensibility in leadership/polycentric governance
- Trust and multi-party incidents
- Trust but verify
- Two elements of profit: revenue - cost, so when is reliability about increasing revenue or reducing costs?
- Understand the internals of contract vendors
- Understand the power dynamics of your position, educate those who are violating that
- Unemployment
- Untangling types of incident reviews
- Valuing Resilience is separate from being Resilient
- Visualizations for concepts are important for visual learners
- Waffle House index for hurricanes
- We
- We don't know what the problems are ("unknown unknowns"), so an "optimal" technician is something we can't know.
- We don't see what the burnt-out people are balancing
- What are some ways to measure resiliency
- What are the things that can be dropped under duress?
- What can we take away? (Rather than add)
- What happens to Organizational Resilience when selecting for Leet Code?
- What happens when one team is overtasked and others are undertasked?
- What happens when we go beyond MTTR
- What is the goal in setting goals?
- What is the market for resilience?
- What is your relationship with failure?
- What promises have we made to customers that make us the most money?
- What signal is indicating instability?
- What’s the typical setup for people configuring Kubernetes?
- When complexity is abstracted away even for operators, then things are complicated
- When human intervention is needed to respond
- When is it going to fail? Hard to know
- When is reliability NOT profitable? lots of examples
- When is reliability profitable?
- When the team doesn't prioritize the permanent fix, is that OK?
- when unanticipated things happen
- When we can't say names in an attempt to be blameless, it's a sign of blameful culture, e.g. "the CTO is a jerk"
- When we jump in and fix things we take away Product's ownership
- When/why is an incident done?
- Where are all the _simple_ tools for deployment strategies? Blue-green, canary etc.
- Where do incident review invites go?
- Who did you design the system to be maintained by?
- Who has ultimate responsibility for reliability? Some organizations haven't defined whether it is engineering or product
- Who is hiring?
- Who is responsible for connecting the company? Hiring/Development/Execs?
- Who is the customer of resilience?
- "Why" questions can force rationalization, but "how" questions defuse defensiveness
- Work after incidents can disincentivize incidents
- Working backward from listening to the system
- Working cross-functionally
- Writing a runbook can be a stepping stone toward automated remediation
- Writing docs for self can help reduce coupling
- Writing things down helps structure our thinking
- Written vs unwritten rules, Schein's culture model
- YAGNI: You ain't gonna need it
- You don't have to account for all feedback, get buy in from your manager on that