software fault injection

  • Lukas Pirl, Daniel Richter, Arne Boockmeyer and Andreas Polze
  • Seminar on Embedded Operating Systems WiSe20
  • Operating Systems & Middleware Group
  • Hasso Plattner Institute at the University of Potsdam, Germany
1

fault-tolerant systems do fail

  • 2.5h Facebook outage 2010

    • “friendly” DDoS due to wrong configuration value
  • 8h Azure outage 2012

    • leap day bug in SSL certificate generation
  • 4.5h Amazon S3 outage 2017

    • typo in manual command took “too many” servers down
2

threats

  • fault (Fehlerursache)
    • adjudged or hypothesized error cause
      • in software: bugs/defects
    • might activate an error
  • error (Fehlerzustand)
    • incorrect system state
    • might propagate to a failure
  • failure (Ausfall)
    • deviation from specification
    • might appear as a fault to related systems
3

threats

  • single component view

    // a directed graph is used to allow finer control over the layout
digraph G {
  charset = "utf-8"
  rankdir = LR
  forcelabels = true
  compound = true

  node [
    shape = box
    label = ""
    margin = 0
  ]

  outer_fault_node [
    label="fault"
  ]

  outer_error_node [
    label="error"
  ]

  outer_failure_node [
    label="failure"
  ]

  outer_fault_node -> outer_error_node [
    label = "activates"
  ]

  outer_error_node -> outer_failure_node [
    label = "propagates to"
  ]

}
  • systems of systems view

    // a directed graph is used to allow finer control over the layout
digraph G {
  charset = "utf-8"
  rankdir = LR
  forcelabels = true
  compound = true

  node [
    shape = box
    label = ""
    margin = 0
  ]

  subgraph cluster_outer_fault {
    label = "fault"

    edge [
      fontsize = 8
      color = "#757575"
      fontcolor = "#757575"
    ]

    node [
      height = .3
      fontcolor = "#757575"
      shape = none
    ]

    subgraph cluster_inner_fault {
      label = "fault"
      fontsize = 8
      color = "#757575"
      fontcolor = "#757575"
      inner_fault_node [
        shape = box
        color = "#757575"
        fontcolor = "#757575"
        label = "..."
      ]

    }

    subgraph cluster_inner_error {
      label = "error"
      fontsize = 8
      color = "#757575"
      fontcolor = "#757575"
      inner_error_node
    }

    subgraph cluster_inner_failure {
      label = "failure"
      fontsize = 8
      color = "#757575"
      fontcolor = "#757575"
      inner_failure_node
    }


    inner_fault_node -> inner_error_node [
      label = "activates"
      ltail = cluster_inner_fault
      lhead = cluster_inner_error
    ]

    inner_error_node -> inner_failure_node [
      label = "propagates to"
      ltail = cluster_inner_error
      lhead = cluster_inner_failure
    ]

  }

  outer_error_node [
    label="error"
  ]

  outer_failure_node [
    label="failure"
  ]

  inner_failure_node -> outer_error_node [
    label = "activates"
    ltail = cluster_outer_fault
  ]

  outer_error_node -> outer_failure_node [
    label = "propagates to"
  ]

}
4

fault activation

  • fault activation in software is highly dependent on the environment
    • hardware dependability
    • feature interaction
    • third party components
      • e.g., libraries
    • related services
      • e.g., remote APIs
    • user interaction
      • e.g., data input
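To make the dependence on activation conditions concrete, a minimal Python sketch (entirely hypothetical code): the defect is dormant for most inputs, and only a particular input activates it into an error that then propagates to the caller as a failure.

  # hypothetical sketch: a defect (fault) that only specific input activates
  def average(values):
      total = sum(values)
      return total // len(values)    # fault: integer instead of true division

  print(average([2, 4, 6]))    # 4   -- fault present but not activated
  print(average([1, 2]))       # 1   -- activated: error (should be 1.5),
                               #        propagates to the caller as a failure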
5

dependability evaluation

  • two classes of approaches
    • formal verification
      • prove software correct
      • requires formal specification
        • for all inputs (incl. environment states) → combinatorial explosion
    • testing
      • prove software wrong
      • discover bugs during runtime
      • requires a fault model
6

formal verification

  • increasingly hard
    • increasing complexity
      • higher technology stacks, tool chain (e.g., compilers), composition, …
    • resource constraints
      • requirements change due to agile development, time-to-market pressure, …
  • attractive for model-driven development
    • i.e., model is specification, transformation is formally verified
  • usually makes strong assumptions
    • e.g., assume correct hardware for verification of seL4 microkernel
  • formally verified code might still not meet intentions
    • e.g., 802.11i/WPA2 vulnerabilities despite (partial) formal verification
7

testing

  • widely adopted
  • best practice
  • extensive
    • unit testing, integration testing, regression testing, …
  • but: developers/testers might be biased
    • tests are expected to succeed
      • code is crafted to satisfy tests (TDD)
      • xor
      • tests are crafted to test code (non-TDD)
  • → usually “testing in success space”
8

fault model

  • set of faults assumed to occur
    • hardware faults
      • relatively established fault model
        • bit flips: single xor multi
        • stuck-at faults: a bit permanently set to 1 (stuck-at-1) xor 0 (stuck-at-0)
        • bridging faults: two signals are connected although they shouldn’t be
        • delay faults: the delay of a path exceeds the clock period
    • software faults
      • no commonly established fault model
        • timing / omission
        • computing
        • crash
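The hardware fault models listed above are simple enough to emulate in software; a small Python sketch (illustrative only):

  # emulating two classic hardware fault models on a single byte
  def flip_bit(value, bit):
      # (single) bit flip: invert one bit position
      return value ^ (1 << bit)

  def stuck_at(value, bit, stuck_at_one):
      # stuck-at fault: force one bit permanently to 1 or 0
      return value | (1 << bit) if stuck_at_one else value & ~(1 << bit)

  assert flip_bit(0b00001010, bit=0) == 0b00001011
  assert stuck_at(0b00001010, bit=1, stuck_at_one=False) == 0b00001000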
9

fault injection

  • fault injection ⊂ testing ¹
  • experimental dependability assessment
    • idea: lower complexity
      • compared, e.g., to formal verification
  • concept
    1. forcefully activate (i.e., “inject”) faults
      • or, forcefully introduce errors
    2. assess delivered quality of service

¹ no widely-accepted definition to differentiate between the two
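As an illustration of the two steps, a minimal Python sketch (all names hypothetical): a fault is injected by making a dependency fail sporadically, and the delivered quality of service is then assessed under that faultload.

  import random

  def lookup(key, injected_fail_probability=0.0):
      # 1. forcefully activate a fault: an injected omission/timeout fault
      if random.random() < injected_fail_probability:
          raise TimeoutError("injected fault")
      return f"value-for-{key}"

  def service(key):
      # system under consideration: degrades to a cached default on errors
      try:
          return lookup(key, injected_fail_probability=0.3)
      except TimeoutError:
          return "cached-default"

  # 2. assess the delivered quality of service under the injected faults
  degraded = sum(service("x") == "cached-default" for _ in range(1000))
  print(f"degraded (fallback) responses under injected faults: {degraded}/1000")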

10

history

  • not definitive, but to give an idea:
    • ~1969 hardware fault injection at IBM
      • simulated to evaluate integrity of logic units during design
        • faults: stuck transistors, open/shorted diodes
    • 1970+ A. Avižienis: early theory on faults
      • coined “fault tolerance”, classification, modeling, …
        • wanted operating system support for fault-tolerant hardware
  • M. Ball and F. Hardie, “Effects and detection of intermittent failures in digital systems,” in Proceedings of the November 18–20, 1969, Fall Joint Computer Conference, 1969, pp. 329–335.
  • A. Avizienis and D. A. Rennels, “Fault-tolerance experiments with the JPL STAR computer,” 1972.
11

software fault injection

  • implemented in software and targeting software ¹
    • != hardware-implemented fault injection (HWIFI)
      • targeting hardware, e.g., exposure to increased radiation
    • != software-implemented fault injection (SWIFI)
      • targeting hardware, e.g., flipping of bits in memory
  • requires
    • faultload
      • which faults (from fault model) to inject when and where (depends on operational profile)
    • workload
      • for realistic fault activation and error propagation

¹ no widely-accepted definition here; this is what I think makes sense; feel free to question and have your own view
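Faultload and workload might, for instance, be expressed as plain data and replayed by a driver; a hypothetical Python sketch:

  # hypothetical faultload: which fault (from the fault model) to inject,
  # when (relative to the start of the run) and where
  faultload = [
      {"after_s": 10, "where": "database-connection", "fault": "omission"},
      {"after_s": 45, "where": "worker-node-2",       "fault": "crash"},
      {"after_s": 90, "where": "api-gateway",         "fault": "timing", "delay_ms": 500},
  ]

  # hypothetical workload: keeps the system busy so that injected faults are
  # actually activated and resulting errors can propagate
  workload = [{"op": "read", "key": f"item-{i % 100}"} for i in range(10_000)]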

12

typical objectives

  • find “dependability bottlenecks” / single points of failure
  • assess quality of service in presence of faults
    • performance degradation
      • e.g., bandwidth, latency
    • dependability attributes
      • availability, reliability, safety, security, integrity, maintainability
  • assess specific fault tolerance mechanisms
    • e.g., efficiency, effectiveness
  • determine coverage of error detection and recovery
13

typical “meta objectives”

  • experiences & confidence regarding dependability
    • e.g., developers, testers, operators, architects, best-practices, documentation
  • bug fixes for fault tolerance mechanisms
  • well-tested and -understood fault tolerance mechanisms
  • measurements
    • only objective measures allow comparisons between different systems
      • thus allow judging improvement or regression between different versions
14

implementation

injection trigger

  • time-based
    • absolute xor relative
      • e.g., absolute time of day, relative to run time
    • one-time vs. periodic vs. sporadic
      • e.g., fixed rate, between a minimum and a maximum rate
  • location-based
    • depends on system under consideration and level of abstraction
      • e.g., on access of specific memory areas, specific nodes
  • execution-driven
    • based on control flow during runtime
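Rough Python sketches of the trigger kinds (names and thresholds are illustrative):

  import random, time

  RUN_STARTED = time.monotonic()

  def time_based(after_s=30):
      # time-based, relative, one-time: fires once run time exceeds a threshold
      return time.monotonic() - RUN_STARTED > after_s

  def sporadic(rate=0.01):
      # time-based, sporadic: fires with a fixed probability per invocation
      return random.random() < rate

  def location_based(address, watched=range(0x1000, 0x2000)):
      # location-based: fires on access to specific locations / components
      return address in watched

  def execution_driven(call_depth, threshold=10):
      # execution-driven: fires depending on the control flow observed at
      # runtime, e.g., only in unusually deep call chains
      return call_depth > threshold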
16

execution state during injection

  • prior to execution
    • e.g., code mutation, environment state, infrastructure
  • during runtime
    • at library load time
    • software traps
    • hardware traps
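For “during runtime, at library load time”, an interpreted language makes this very direct: the function is swapped right after the library has been loaded. A hedged Python sketch (the targeted host name is made up):

  # runtime injection at library load time: replace a function after import
  import socket

  _original_create_connection = socket.create_connection

  def _create_connection_with_fault(address, *args, **kwargs):
      host, _port = address
      if host == "payments.example.com":    # hypothetical injection target
          raise ConnectionRefusedError("injected fault")
      return _original_create_connection(address, *args, **kwargs)

  socket.create_connection = _create_connection_with_fault
  # code that runs afterwards now hits the injected fault for the targeted host

For native code, a comparable effect is commonly achieved with LD_PRELOAD-style interposition.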
17

target artifact

  • source code
    • e.g., change control flow, add sleeps
  • intermediate code representation
    • e.g., change operators or constants in bytecode
  • binary representation
    • e.g., bit flips
  • state
    • e.g., memory/storage modifications, edge-case states of environment
  • environmental behavior
    • e.g., clock drift, node crashes, misbehaving hardware, related APIs’ behavior
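For the “source code” and “intermediate code representation” targets, mutations can be applied automatically; a small Python sketch that swaps an operator in a snippet of source code:

  # mutate a target artifact: swap "+" for "-" in a piece of source code
  import ast

  class SwapAddForSub(ast.NodeTransformer):
      def visit_BinOp(self, node):
          self.generic_visit(node)
          if isinstance(node.op, ast.Add):
              node.op = ast.Sub()
          return node

  source = "def checksum(a, b):\n    return a + b\n"
  mutated = SwapAddForSub().visit(ast.parse(source))
  ast.fix_missing_locations(mutated)

  namespace = {}
  exec(compile(mutated, "<mutated>", "exec"), namespace)
  print(namespace["checksum"](2, 3))    # 5 without the mutation, -1 with it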
18

characteristics of different methods

different approaches have different advantages and disadvantages, e.g.:

                    Hardware                          Software
                    with contact     without contact  with contact          without contact
cost                high             high             low                   low
perturbation        none             none             low                   high
risk of damage      high             low              none                  none
time resolution     high             high             high                  low
injection points    chip pin         chip internal    memory, software      memory, IO controller
controllability     high             low              high                  high
trigger             yes              no               yes                   yes
repeatability       high             low              high                  high

M.-C. Hsueh, T. K. Tsai, and R. K. Iyer, “Fault injection techniques and tools,” Computer, vol. 30, no. 4, pp. 75–82, Apr. 1997.

19

example injection targets for applications

  • black-box

    digraph G {
  charset = "utf-8"
  rankdir = LR
  forcelabels = true
  splines = line

  node [shape=point width=0 label=""]
  comm1
  comm2

  node [shape=box width=""]
  sut [label="application under consideration" ordering=out]

  node [peripheries=2]
  hw [label="hardware"]
  sw1 [label="directly interacting\nsoftware"]
  sw2 [label="indirectly interacting\nsoftware"
       color=grey60
       fontcolor=grey60]

  node [shape=ellipse style=dashed constraint=false peripheries=1]
  sfi [label="software fault injection"]

  hw -> comm1 [dir=back]
  comm1-> sut
  sut -> comm2 [dir=back]
  comm2 -> sw1
  sw1 -> sw2 [dir=both color=grey60 fontcolor=grey60]

  // dummy for placement
  hw -> sfi [style=invis weight=0]

  edge [style=dashed constraint=false arrowhead=open minlen=2]
  sfi -> hw
  sfi -> comm1
  sfi -> sut
  sfi -> comm2
  sfi -> sw1
}
    • less intrusiveness, less interference with results, less coupling, …
  • white-box

    digraph G {
  charset = "utf-8"
  rankdir = LR
  forcelabels = true
  compound = true
  splines = line

  ext [label="interacting\nsoftware" shape=box color=grey60 fontcolor=grey60]

  subgraph cluster_outer {

    label = "application under consideration"

    src [label="source\ncode" shape=note]

    subgraph cluster_inner {
      label = "in execution"
      node [shape=box]
      bin [label="binary code"]
      data [label="data"]
      bin -> data [dir=both]
    }

    node [shape=ellipse]
    comp [label="compiling,\nlinking, ..."]
    comm [label="system calls,\nIPC, RPC, ..."]

  }

  src -> comp -> bin
  data -> comm [label="" ltail=cluster_inner dir=both]
  comm -> ext [color=grey60 dir=both]

  node [style=dashed constraint=false]
  sfi [label="software fault injection"]

  // dummy for placement
  comp -> sfi [style=invis weight=0]

  edge [style=dashed constraint=false arrowhead=open minlen=2]
  sfi -> src [headport=se]
  sfi -> comp
  sfi -> bin
  sfi -> data
  sfi -> comm

}
    • possibly more insights, higher performance, easier to debug, …
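As a rough contrast in code: black-box injection treats the application as an opaque process and only acts at its boundary, while white-box injection reaches into its artifacts as in the earlier sketches (load-time patching, source mutation). A minimal black-box example on POSIX (“sleep 60” stands in for the application under consideration):

  import signal, subprocess, time

  # black-box: inject a crash fault at the process boundary, without any
  # knowledge of (or changes to) the application's internals
  proc = subprocess.Popen(["sleep", "60"])    # placeholder application
  time.sleep(1)                               # let the workload run briefly
  proc.send_signal(signal.SIGKILL)            # injected crash fault
  print("exit status:", proc.wait())          # assess how the rest of the system copes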
20

example injection targets for operating systems

_images/os-technology-stack.svg
21

example injection targets for operating systems

_images/os-technology-stack-fi-between-layers.svg
22

example injection targets for operating systems

_images/os-technology-stack-fi-all.svg
23

adoption

  • long-established for hardware testing
  • partly adopted for software testing
    • missing accessibility?
      • e.g., tools not public, no documentation
    • tools too specialized?
      • e.g., on certain programming languages or APIs
    • available information too scattered?
      • e.g., research prototypes, products, open-source projects
    • available information too heterogeneous?
      • e.g., inconsistent wording makes it hard to find things
    • missing automation?
      • e.g., in comparison to unit testing
24

FIDD: fault-injection-driven development

  • incorporate software fault injection in development practices
    • in analogy and in addition to test-driven development
_images/fidd.svg
  • case study on OpenStack (IaaS framework)

Lena Feinbube

25

success stories

  • Linux kernel
    • e.g., through syscall fuzzing
  • ISO 26262 (Road vehicles – Functional safety) recommends fault injection
  • software fault injection in production
    • Etsy (e-commerce)
    • Netflix
      • Chaos Monkey
        • terminates AWS EC2 instances
          • in AWS Auto Scaling Groups
        • during business hours only
          • staff is watching and can react quickly
  • chaos engineering offered as a service by major Cloud providers
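The Chaos Monkey mechanism described above can be sketched in a few lines; this is not Netflix’s implementation, just an illustrative Python script, assuming configured boto3 credentials and a hypothetical group name:

  import datetime
  import random
  import boto3

  def unleash(group_name="my-asg"):    # group name is hypothetical
      now = datetime.datetime.now()
      if now.weekday() >= 5 or not 9 <= now.hour < 17:
          return    # business hours only: staff is watching and can react
      autoscaling = boto3.client("autoscaling")
      groups = autoscaling.describe_auto_scaling_groups(
          AutoScalingGroupNames=[group_name])["AutoScalingGroups"]
      instances = [i["InstanceId"] for g in groups for i in g["Instances"]]
      if instances:
          victim = random.choice(instances)
          boto3.client("ec2").terminate_instances(InstanceIds=[victim])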
26

software fault injection in production

  • pro
    • staging environments inherently different from the production environment
      • likely to have an influence on results
    • less uncertainty
      • since results are obtained in the production environment itself
    • failures happen when staff is prepared
    • proven concept
      • e.g., in fire departments
    • awareness / critical analysis of own production system
27

software fault injection in production

  • con
    • risk
      • losing data
      • frustrated customers
      • reputation
      • economic damage
    • testing is in place anyway
    • missing awareness?
    • lack of expertise?
    • unpredictable legacy systems?
28

conclusion

  • specifically for the system under consideration:
    • What to inject? → fault model
      • bug trackers, vulnerability databases and failure reports can give inspiration
    • When to inject? → trigger
      • likely chosen according to the workload when injecting during runtime
    • Where to inject? → dependability model
      • know which faults should be tolerated, since there is usually not much gain from injecting non-tolerated faults
  • have a clear scope
    • considering all faults in all locations at all times in all layers of the technology stack is unrealistic
29