software fault injection

  • Lukas Pirl, Daniel Richter, Arne Boockmeyer and Andreas Polze
  • Seminar on Embedded Operating Systems WiSe20
  • Operating Systems & Middleware Group
  • Hasso Plattner Institute at the University of Potsdam, Germany
1

fault-tolerant systems do fail

  • 2.5h Facebook outage 2010

    • “friendly” DDoS due to wrong configuration value
  • 8h Azure outage 2012

    • leap day bug in SSL certificate generation
  • 4.5h Amazon S3 outage 2017

    • typo in manual command took “too many” servers down
2

threats

  • fault (Fehlerursache)
    • adjudged or hypothesized error cause
      • in software: bugs/defects
    • might activate an error
  • error (Fehlerzustand)
    • incorrect system state
    • might propagate to a failure
  • failure (Ausfall)
    • deviation from specification
    • might appear as a fault to related systems
3

threats

  • single component view

    // a directed graph is used to allow finer control over the layout
digraph G {
  charset = "utf-8"
  rankdir = LR
  forcelabels = true
  compound = true

  node [
    shape = box
    label = ""
    margin = 0
  ]

  outer_fault_node [
    label="fault"
  ]

  outer_error_node [
    label="error"
  ]

  outer_failure_node [
    label="failure"
  ]

  outer_fault_node -> outer_error_node [
    label = "activates"
  ]

  outer_error_node -> outer_failure_node [
    label = "propagates to"
  ]

}
  • systems of systems view

    // a directed graph is used to allow finer control over the layout
digraph G {
  charset = "utf-8"
  rankdir = LR
  forcelabels = true
  compound = true

  node [
    shape = box
    label = ""
    margin = 0
  ]

  subgraph cluster_outer_fault {
    label = "fault"

    edge [
      fontsize = 8
      color = "#757575"
      fontcolor = "#757575"
    ]

    node [
      height = .3
      fontcolor = "#757575"
      shape = none
    ]

    subgraph cluster_inner_fault {
      label = "fault"
      fontsize = 8
      color = "#757575"
      fontcolor = "#757575"
      inner_fault_node [
        shape = box
        color = "#757575"
        fontcolor = "#757575"
        label = "..."
      ]

    }

    subgraph cluster_inner_error {
      label = "error"
      fontsize = 8
      color = "#757575"
      fontcolor = "#757575"
      inner_error_node
    }

    subgraph cluster_inner_failure {
      label = "failure"
      fontsize = 8
      color = "#757575"
      fontcolor = "#757575"
      inner_failure_node
    }


    inner_fault_node -> inner_error_node [
      label = "activates"
      ltail = cluster_inner_fault
      lhead = cluster_inner_error
    ]

    inner_error_node -> inner_failure_node [
      label = "propagates to"
      ltail = cluster_inner_error
      lhead = cluster_inner_failure
    ]

  }

  outer_error_node [
    label="error"
  ]

  outer_failure_node [
    label="failure"
  ]

  inner_failure_node -> outer_error_node [
    label = "activates"
    ltail = cluster_outer_fault
  ]

  outer_error_node -> outer_failure_node [
    label = "propagates to"
  ]

}
4

fault activation

  • fault activation in software is highly dependent on the environment
    • hardware dependability
    • feature interaction
    • third party components
      • e.g., libraries
    • related services
      • e.g., remote APIs
    • user interaction
      • e.g., data input
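To make the dependence on activation conditions concrete, a minimal Python sketch (entirely hypothetical code): the defect is dormant for most inputs, and only a particular input activates it into an error that then propagates to the caller as a failure.

  # hypothetical sketch: a defect (fault) that only specific input activates
  def average(values):
      total = sum(values)
      return total // len(values)    # fault: integer instead of true division

  print(average([2, 4, 6]))    # 4   -- fault present but not activated
  print(average([1, 2]))       # 1   -- activated: error (should be 1.5),
                               #        propagates to the caller as a failure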
5

dependability evaluation

  • two classes of approaches
    • formal verification
      • prove software correct
      • requires formal specification
        • for all inputs (incl. environment states) → combinatorial explosion
    • testing
      • prove software wrong
      • discover bugs during runtime
      • requires a fault model
6

formal verification

  • increasingly hard
    • increasing complexity
      • higher technology stacks, tool chain (e.g., compilers), composition, …
    • resource constraints
      • requirements change due to agile development, time-to-market pressure, …
  • attractive for model-driven development
    • i.e., model is specification, transformation is formally verified
  • usually makes strong assumptions
    • e.g., assume correct hardware for verification of seL4 microkernel
  • formally verified code might still not meet intentions
    • e.g., 802.11i/WPA2 vulnerabilities despite (partial) formal verification
7

testing

  • widely adopted
  • best practice
  • extensive
    • unit testing, integration testing, regression testing, …
  • but: developers/testers might be biased
    • tests are expected to succeed
      • code is crafted to satisfy tests (TDD)
      • xor
      • tests are crafted to test code (non-TDD)
  • → usually “testing in success space”
8

fault model

  • set of faults assumed to occur
    • hardware faults
      • relatively established fault model
        • bit flips: single xor multi
        • stuck-at faults: a bit permanently set to 1 (stuck-at-1) xor 0 (stuck-at-0)
        • bridging faults: two signals are connected although they shouldn’t be
        • delay faults: the delay of a path exceeds the clock period
    • software faults
      • no commonly established fault model
        • timing / omission
        • computing
        • crash
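The hardware fault models listed above are simple enough to emulate in software; a small Python sketch (illustrative only):

  # emulating two classic hardware fault models on a single byte
  def flip_bit(value, bit):
      # (single) bit flip: invert one bit position
      return value ^ (1 << bit)

  def stuck_at(value, bit, stuck_at_one):
      # stuck-at fault: force one bit permanently to 1 or 0
      return value | (1 << bit) if stuck_at_one else value & ~(1 << bit)

  assert flip_bit(0b00001010, bit=0) == 0b00001011
  assert stuck_at(0b00001010, bit=1, stuck_at_one=False) == 0b00001000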
9

fault injection

  • fault injection ⊂ testing ¹
  • experimental dependability assessment
    • idea: lower complexity
      • compared, e.g., to formal verification
  • concept
    1. forcefully activate (i.e., “inject”) faults
      • or, forcefully introduce errors
    2. assess delivered quality of service

¹ no widely-accepted definition to differentiate between the two
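As an illustration of the two steps, a minimal Python sketch (all names hypothetical): a fault is injected by making a dependency fail sporadically, and the delivered quality of service is then assessed under that faultload.

  import random

  def lookup(key, injected_fail_probability=0.0):
      # 1. forcefully activate a fault: an injected omission/timeout fault
      if random.random() < injected_fail_probability:
          raise TimeoutError("injected fault")
      return f"value-for-{key}"

  def service(key):
      # system under consideration: degrades to a cached default on errors
      try:
          return lookup(key, injected_fail_probability=0.3)
      except TimeoutError:
          return "cached-default"

  # 2. assess the delivered quality of service under the injected faults
  degraded = sum(service("x") == "cached-default" for _ in range(1000))
  print(f"degraded (fallback) responses under injected faults: {degraded}/1000")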

10

history

  • not definitive, but to give an idea:
    • ~1969 hardware fault injection at IBM
      • simulated to evaluate integrity of logic units during design
        • faults: stuck transistors, open/shorted diodes
    • 1970+ A. Avižienis: early theory on faults
      • coined “fault tolerance”, classification, modeling, …
        • wanted operating system support for fault-tolerant hardware
  • M. Ball and F. Hardie, “Effects and detection of intermittent failures in digital systems,” in Proceedings of the November 18–20, 1969, Fall Joint Computer Conference, 1969, pp. 329–335.
  • A. Avizienis and D. A. Rennels, “Fault-tolerance experiments with the JPL STAR computer,” 1972.
11

software fault injection

  • implemented in software and targeting software ¹
    • != hardware-implemented fault injection (HWIFI)
      • targeting hardware, e.g., exposure to increased radiation
    • != software-implemented fault injection (SWIFI)
      • targeting hardware, e.g., flipping of bits in memory
  • requires
    • faultload
      • which faults (from fault model) to inject when and where (depends on operational profile)
    • workload
      • for realistic fault activation and error propagation

¹ no widely-accepted definition here; this is what I think makes sense; feel free to question and have your own view
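Faultload and workload might, for instance, be expressed as plain data and replayed by a driver; a hypothetical Python sketch:

  # hypothetical faultload: which fault (from the fault model) to inject,
  # when (relative to the start of the run) and where
  faultload = [
      {"after_s": 10, "where": "database-connection", "fault": "omission"},
      {"after_s": 45, "where": "worker-node-2",       "fault": "crash"},
      {"after_s": 90, "where": "api-gateway",         "fault": "timing", "delay_ms": 500},
  ]

  # hypothetical workload: keeps the system busy so that injected faults are
  # actually activated and resulting errors can propagate
  workload = [{"op": "read", "key": f"item-{i % 100}"} for i in range(10_000)]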

12

typical objectives

  • find “dependability bottlenecks” / single points of failure
  • assess quality of service in presence of faults
    • performance degradation
      • e.g., bandwidth, latency
    • dependability attributes
      • availability, reliability, safety, security, integrity, maintainability
  • assess specific fault tolerance mechanisms
    • e.g., efficiency, effectiveness
  • determine coverage of error detection and recovery
13

typical “meta objectives”

  • experiences & confidence regarding dependability
    • e.g., developers, testers, operators, architects, best-practices, documentation
  • bug fixes for fault tolerance mechanisms
  • well-tested and -understood fault tolerance mechanisms
  • measurements
    • only objective measures allow comparisons between different systems
      • thus allow judging improvement or regression between different versions
14

implementation

injection trigger

  • time-based
    • absolute xor relative
      • e.g., absolute time of day, relative to run time
    • one-time vs. periodic vs. sporadic
      • e.g., fixed rate, between a minimum and a maximum rate
  • location-based
    • depends on system under consideration and level of abstraction
      • e.g., on access of specific memory areas, specific nodes
  • execution-driven
    • based on control flow during runtime
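Rough Python sketches of the trigger kinds (names and thresholds are illustrative):

  import random, time

  RUN_STARTED = time.monotonic()

  def time_based(after_s=30):
      # time-based, relative, one-time: fires once run time exceeds a threshold
      return time.monotonic() - RUN_STARTED > after_s

  def sporadic(rate=0.01):
      # time-based, sporadic: fires with a fixed probability per invocation
      return random.random() < rate

  def location_based(address, watched=range(0x1000, 0x2000)):
      # location-based: fires on access to specific locations / components
      return address in watched

  def execution_driven(call_depth, threshold=10):
      # execution-driven: fires depending on the control flow observed at
      # runtime, e.g., only in unusually deep call chains
      return call_depth > threshold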
16

execution state during injection

  • prior to execution
    • e.g., code mutation, environment state, infrastructure
  • during runtime
    • at library load time
    • software traps
    • hardware traps
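For “during runtime, at library load time”, an interpreted language makes this very direct: the function is swapped right after the library has been loaded. A hedged Python sketch (the targeted host name is made up):

  # runtime injection at library load time: replace a function after import
  import socket

  _original_create_connection = socket.create_connection

  def _create_connection_with_fault(address, *args, **kwargs):
      host, _port = address
      if host == "payments.example.com":    # hypothetical injection target
          raise ConnectionRefusedError("injected fault")
      return _original_create_connection(address, *args, **kwargs)

  socket.create_connection = _create_connection_with_fault
  # code that runs afterwards now hits the injected fault for the targeted host

For native code, a comparable effect is commonly achieved with LD_PRELOAD-style interposition.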
17

target artifact

  • source code
    • e.g., change control flow, add sleeps
  • intermediate code representation
    • e.g., change operators or constants in bytecode
  • binary representation
    • e.g., bit flips
  • state
    • e.g., memory/storage modifications, edge-case states of environment
  • environmental behavior
    • e.g., clock drift, node crashes, misbehaving hardware, related APIs’ behavior
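For the “source code” and “intermediate code representation” targets, mutations can be applied automatically; a small Python sketch that swaps an operator in a snippet of source code:

  # mutate a target artifact: swap "+" for "-" in a piece of source code
  import ast

  class SwapAddForSub(ast.NodeTransformer):
      def visit_BinOp(self, node):
          self.generic_visit(node)
          if isinstance(node.op, ast.Add):
              node.op = ast.Sub()
          return node

  source = "def checksum(a, b):\n    return a + b\n"
  mutated = SwapAddForSub().visit(ast.parse(source))
  ast.fix_missing_locations(mutated)

  namespace = {}
  exec(compile(mutated, "<mutated>", "exec"), namespace)
  print(namespace["checksum"](2, 3))    # 5 without the mutation, -1 with it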
18

characteristics of different methods

different approaches have different advantages and disadvantages, e.g.:

                    Hardware                          Software
                    with contact     without contact  with contact          without contact
cost                high             high             low                   low
perturbation        none             none             low                   high
risk of damage      high             low              none                  none
time resolution     high             high             high                  low
injection points    chip pin         chip internal    memory, software      memory, IO controller
controllability     high             low              high                  high
trigger             yes              no               yes                   yes
repeatability       high             low              high                  high

M.-C. Hsueh, T. K. Tsai, and R. K. Iyer, “Fault injection techniques and tools,” Computer, vol. 30, no. 4, pp. 75–82, Apr. 1997.

19

example injection targets for applications

  • black-box

    digraph G {
  charset = "utf-8"
  rankdir = LR
  forcelabels = true
  splines = line

  node [shape=point width=0 label=""]
  comm1
  comm2

  node [shape=box width=""]
  sut [label="application under consideration" ordering=out]

  node [peripheries=2]
  hw [label="hardware"]
  sw1 [label="directly interacting\nsoftware"]
  sw2 [label="indirectly interacting\nsoftware"
       color=grey60
       fontcolor=grey60]

  node [shape=ellipse style=dashed constraint=false peripheries=1]
  sfi [label="software fault injection"]

  hw -> comm1 [dir=back]
  comm1-> sut
  sut -> comm2 [dir=back]
  comm2 -> sw1
  sw1 -> sw2 [dir=both color=grey60 fontcolor=grey60]

  // dummy for placement
  hw -> sfi [style=invis weight=0]

  edge [style=dashed constraint=false arrowhead=open minlen=2]
  sfi -> hw
  sfi -> comm1
  sfi -> sut
  sfi -> comm2
  sfi -> sw1
}
    • less intrusiveness, less interference with results, less coupling, …
  • white-box

    digraph G {
  charset = "utf-8"
  rankdir = LR
  forcelabels = true
  compound = true
  splines = line

  ext [label="interacting\nsoftware" shape=box color=grey60 fontcolor=grey60]

  subgraph cluster_outer {

    label = "application under consideration"

    src [label="source\ncode" shape=note]

    subgraph cluster_inner {
      label = "in execution"
      node [shape=box]
      bin [label="binary code"]
      data [label="data"]
      bin -> data [dir=both]
    }

    node [shape=ellipse]
    comp [label="compiling,\nlinking, ..."]
    comm [label="system calls,\nIPC, RPC, ..."]

  }

  src -> comp -> bin
  data -> comm [label="" ltail=cluster_inner dir=both]
  comm -> ext [color=grey60 dir=both]

  node [style=dashed constraint=false]
  sfi [label="software fault injection"]

  // dummy for placement
  comp -> sfi [style=invis weight=0]

  edge [style=dashed constraint=false arrowhead=open minlen=2]
  sfi -> src [headport=se]
  sfi -> comp
  sfi -> bin
  sfi -> data
  sfi -> comm

}
    • possibly more insights, higher performance, easier to debug, …
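As a rough contrast in code: black-box injection treats the application as an opaque process and only acts at its boundary, while white-box injection reaches into its artifacts as in the earlier sketches (load-time patching, source mutation). A minimal black-box example on POSIX (“sleep 60” stands in for the application under consideration):

  import signal, subprocess, time

  # black-box: inject a crash fault at the process boundary, without any
  # knowledge of (or changes to) the application's internals
  proc = subprocess.Popen(["sleep", "60"])    # placeholder application
  time.sleep(1)                               # let the workload run briefly
  proc.send_signal(signal.SIGKILL)            # injected crash fault
  print("exit status:", proc.wait())          # assess how the rest of the system copes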
20

example injection targets for operating systems

_images/os-technology-stack.svg
21

example injection targets for operating systems

_images/os-technology-stack-fi-between-layers.svg
22

example injection targets for operating systems

_images/os-technology-stack-fi-all.svg
23

adoption

  • long-established for hardware testing
  • partly adopted for software testing
    • missing accessibility?
      • e.g., tools not public, no documentation
    • tools too specialized?
      • e.g., on certain programming languages or APIs
    • available information too scattered?
      • e.g., research prototypes, products, open-source projects
    • available information too heterogeneous?
      • e.g., inconsistent wording makes it hard to find things
    • missing automation?
      • e.g., in comparison to unit testing
24

FIDD: fault-injection-driven development

  • incorporate software fault injection in development practices
    • in analogy and in addition to test-driven development
_images/fidd.svg
  • case study on OpenStack (IaaS framework)

Lena Feinbube

25

success stories

  • Linux kernel
    • e.g., through syscall fuzzing
  • ISO 26262 (Road vehicles – Functional safety) recommends fault injection
  • software fault injection in production
    • Etsy (e-commerce)
    • Netflix
      • Chaos Monkey
        • terminates AWS EC2 instances
          • in AWS Auto Scaling Groups
        • during business hours only
          • staff is watching and can react quickly
  • chaos engineering offered as a service by major Cloud providers
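The Chaos Monkey mechanism described above can be sketched in a few lines; this is not Netflix’s implementation, just an illustrative Python script, assuming configured boto3 credentials and a hypothetical group name:

  import datetime
  import random
  import boto3

  def unleash(group_name="my-asg"):    # group name is hypothetical
      now = datetime.datetime.now()
      if now.weekday() >= 5 or not 9 <= now.hour < 17:
          return    # business hours only: staff is watching and can react
      autoscaling = boto3.client("autoscaling")
      groups = autoscaling.describe_auto_scaling_groups(
          AutoScalingGroupNames=[group_name])["AutoScalingGroups"]
      instances = [i["InstanceId"] for g in groups for i in g["Instances"]]
      if instances:
          victim = random.choice(instances)
          boto3.client("ec2").terminate_instances(InstanceIds=[victim])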
26

software fault injection in production

  • pro
    • staging environments inherently different from the production environment
      • likely to have an influence on results
    • less uncertainty
      • since results are obtained in the production environment itself
    • failures happen when staff is prepared
    • proven concept
      • e.g., in fire departments
    • awareness / critical analysis of own production system
27

software fault injection in production

  • con
    • risk
      • losing data
      • frustrated customers
      • reputation
      • economic damage
    • testing is in place anyway
    • missing awareness?
    • lack of expertise?
    • unpredictable legacy systems?
28

conclusion

  • specifically for the system under consideration:
    • What to inject? → fault model
      • bug trackers, vulnerability databases and failure reports can give inspiration
    • When to inject? → trigger
      • likely chosen according to the workload when injecting during runtime
    • Where to inject? → dependability model
      • know which faults should be tolerated, since there is usually not much gain from injecting non-tolerated faults
  • have a clear scope
    • considering all faults in all locations at all times in all layers of the technology stack is unrealistic
29