software fault injection tools

software fault injection in theory

↻ recap

  • fault injection ⊂ testing
  • implemented in software and targeting software
  • as dependability evaluation
    • lower complexity
      • compared, e.g., to formal methods
    • to overcome developers’ biases
      • e.g., prove wrong instead of correct
  • assessments in the presence of forcefully activated faults
    • e.g., quality of service, fault tolerance mechanisms
2

software fault injection in practice

↻ recap

  • specifically for the system under consideration:
    • What to inject? → fault model
      • bug trackers, vulnerability databases and failure reports can give inspiration
    • When to inject? → trigger
      • likely to be chosen according to workload, when injecting during runtime
    • Where to inject? → dependability model
      • know which faults should be tolerated, since there is usually not much gain from injecting non-tolerated faults
  • have a clear scope
    • considering all faults in all locations at all times in all layers of the technology stack is unrealistic
3

anecdote: how fuzzing was born

  • 1988: Prof. Barton Miller, University of Wisconsin
    • had dial-up connection to campus computer
    • thunderstorm caused noise in phone line
    • random characters crashed UNIX applications
    • → let students write a random character generator
      • test as many UNIX utilities for robustness as possible
    • tool called “fuzz”
https://www.goodfreephotos.com/albums/weather/lightning-out-of-the-skies.jpg
4

anecdote: how fuzzing was born (2)

_images/Miller90Empirical-table2.png
5

fuzzing / fuzz testing

  • fault model: unexpected/erroneous/malicious input values
  • can be considered a form of software fault injection
    • pro: tries to prove software wrong & input values can cause errors (hence, faults)
    • con: similar to integration testing at interfaces & inputs are not considered faults
  • trade-offs
    • black vs. gray vs. white box
      • efficiency vs. generality
      • e.g., incorporate knowledge about internal sanity checks
    • coverage vs. required resources
      • e.g., required time to run experiments
      • Does specifying constraints on input values lower the coverage?
6

fuzzing / fuzz testing (2)

  • Where to get values from?

    • _images/dog-fuzzing.jpg
    • mutation-based

      • derive from given samples
    • generation-based

      • random
        • seeded random for higher reproducibility
      • based on model for input values
        • e.g., certain constraints
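The three sources of input values can be sketched as follows. This is a minimal illustration, not taken from any particular fuzzer: the function names, the single-bit mutation, and the "printable ASCII only" input model are assumptions chosen for brevity. `mutate` expects a non-empty sample.

```python
import random


def mutate(sample: bytes, rng: random.Random, flips: int = 1) -> bytes:
    """Mutation-based: derive a new input by flipping random bits of a given sample."""
    data = bytearray(sample)
    for _ in range(flips):
        position = rng.randrange(len(data))
        data[position] ^= 1 << rng.randrange(8)
    return bytes(data)


def generate(rng: random.Random, max_len: int = 16) -> bytes:
    """Generation-based: build an input from scratch, here purely random bytes."""
    length = rng.randrange(1, max_len + 1)
    return bytes(rng.randrange(256) for _ in range(length))


def generate_constrained(rng: random.Random, max_len: int = 16) -> bytes:
    """Generation based on a simple input model: printable ASCII only (assumed constraint)."""
    length = rng.randrange(1, max_len + 1)
    return bytes(rng.randrange(0x20, 0x7F) for _ in range(length))
```

Passing an explicitly seeded `random.Random(seed)` makes every generated input reproducible, so a crashing input can be regenerated from the seed alone.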
7

fault injection tools by category

digraph device_stack { layout=neato node [shape=record style=filled fontname="Ubuntu,sans" fontsize=40 fixedsize=true width=10 height=1 pin=true fillcolor=none] distributed [pos="0,5" label="distributed application"] application [pos="0,4" label="application"] code [pos="0,3" label="code"] os [pos="0,2" label="operating system"] hardware [pos="0,1" label="hardware"] node [shape=box style=invis height=0 width=1] edge [penwidth=10] arrow_top [pos="6,4.75"] arrow_bottom [pos="6,1"] arrow_bottom -> arrow_top }

note: this is our agenda, not a definitive list

8

fault injection close to hardware

digraph device_stack { layout=neato node [shape=record style=filled fontname="Ubuntu,sans" fontsize=40 fixedsize=true width=10 height=1 pin=true fillcolor=none] distributed [pos="0,5" label="distributed application"] application [pos="0,4" label="application"] code [pos="0,3" label="code"] os [pos="0,2" label="operating system"] hardware [pos="0,1" label="hardware" penwidth=10 color="#077cc0"] }

9

FTAPE

  • Fault Tolerance and Performance Evaluator
  • software emulates hardware faults
    • e.g., flip a bit to emulate faults caused by radiation
    • i.e., software-implemented fault injection
_images/Tsai95FTAPE-figure1.png
  • workload generator
    • i.e., dummy CPU, memory, IO operations
  • platform-specific implementation
    • Tandem Integrity S2
_images/Tsai95FTAPE-figure2.png
  • evaluation based on performance degradation
    • time_{without faults} / time_{with faults} - 1
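The bit flip mentioned above as an emulation of radiation-induced faults can be sketched in a few lines. This is only an illustration on a Python `bytearray`, which stands in for a memory region. FTAPE itself works on the real memory, CPU and IO of the target platform, and `flip_bit` is a hypothetical helper name.

```python
def flip_bit(memory: bytearray, byte_index: int, bit_index: int) -> None:
    """Emulate a single-bit fault by inverting one bit in place."""
    memory[byte_index] ^= 1 << bit_index


memory = bytearray(b"\x00\x0f\xff")  # stand-in for a memory region
flip_bit(memory, byte_index=1, bit_index=7)  # 0x0f becomes 0x8f
```

Flipping the same bit a second time restores the original value, which is convenient for emulating transient faults.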

T. Tsai and R. Iyer, “FTAPE-A fault injection tool to measure fault tolerance,” in 10th Computing in Aerospace Conference , 1995, p. 1041.

10

FTAPE (2)

_images/Tsai95FTAPE-table1.png
  • composition: how workload is composed, time-wise
    • e.g., 2/1/1: 50% of time CPU stress, 25% of time memory stress, 25% of time IO stress
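The arithmetic behind a composition such as 2/1/1 can be sketched as below. The helper name `composition_shares` and the dictionary keys are assumptions; the CPU/memory/IO order follows the example above.

```python
from fractions import Fraction


def composition_shares(composition: str) -> dict:
    """Turn a 'cpu/memory/io' ratio string into the fraction of time per stress type."""
    cpu, memory, io = (int(part) for part in composition.split("/"))
    total = cpu + memory + io
    return {
        "cpu": Fraction(cpu, total),
        "memory": Fraction(memory, total),
        "io": Fraction(io, total),
    }


shares = composition_shares("2/1/1")  # cpu 1/2, memory 1/4, io 1/4
```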
11

NFTAPE

  • problem: existing tools too specialized (e.g., on fault model, platform)
  • approach: common control mechanism for multiple …
    • … fault models (e.g., bit flips in registers and memory, communication, IO)
    • … fault triggers (e.g., path-based, time-based, event-based)
    • … fault targets (e.g., hardware communication interfaces, MPI applications)
    • … reporting methods (e.g., dump memory, but: detail vs. intrusiveness)
  • two case studies
    • injection in physical layer of Myrinet LAN
    • debugger-based injection in space imaging application
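The idea of one control mechanism driving interchangeable triggers, fault models and reporting can be sketched as follows. This is not NFTAPE's actual architecture or API. The `Controller` class, the dictionary-based target state and all function names are hypothetical.

```python
from typing import Callable, List, Tuple

Trigger = Callable[[dict], bool]   # decides when to inject
Injector = Callable[[dict], None]  # applies one fault model to the target state


def time_trigger(at_step: int) -> Trigger:
    """Time-based trigger: fires at a given step."""
    return lambda state: state["step"] == at_step


def event_trigger(name: str) -> Trigger:
    """Event-based trigger: fires when the target reports a given event."""
    return lambda state: state.get("event") == name


def flip_register_bit(register: str, bit: int) -> Injector:
    """Fault model: single bit flip in a (simulated) register."""
    def inject(state: dict) -> None:
        state["registers"][register] ^= 1 << bit
    return inject


class Controller:
    """Common control loop: any trigger can be combined with any injector."""

    def __init__(self) -> None:
        self.campaign: List[Tuple[Trigger, Injector, str]] = []
        self.log: List[Tuple[int, str]] = []  # minimal reporting

    def add(self, trigger: Trigger, injector: Injector, label: str) -> None:
        self.campaign.append((trigger, injector, label))

    def step(self, state: dict) -> None:
        for trigger, injector, label in self.campaign:
            if trigger(state):
                injector(state)
                self.log.append((state["step"], label))


controller = Controller()
controller.add(time_trigger(3), flip_register_bit("r0", 0), "bit flip r0")
state = {"step": 0, "registers": {"r0": 0}}
for step in range(5):
    state["step"] = step
    controller.step(state)
```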

D. T. Stott, B. Floering, D. Burke, Z. Kalbarczyk, and R. K. Iyer, “NFTAPE: a framework for assessing dependability in distributed systems with lightweight fault injectors,” in Proceedings IEEE International Computer Performance and Dependability Symposium. IPDS 2000 , 2000, pp. 91–100.

12

NFTAPE (2)

_images/Stott00NFTAPE-figure1.png
13

MEFISTO

  • fault injection in VHDL models

    • early in design process
    • in simulation of VHDL models
    • simulator supports special fault injection commands
      • requires no modification of VHDL models
  • error model very hardware-specific

    • e.g., error on port during execution of instruction
    • e.g., address bus error during fetch
    • e.g., data bus error during read or write
_images/Jenn95Fault-figure4.png

E. Jenn, J. Arlat, M. Rimen, J. Ohlsson, and J. Karlsson, “Fault injection into VHDL models: the MEFISTO tool,” in Predictably Dependable Computing Systems , Springer, 1995, pp. 329–346.

14

GPU-Qin

  • GPUs face increased dependability demands
    • esp. when used as general purpose accelerator
  • fault model: transient single-bit faults in functional units
    • i.e., arithmetic logic unit, load-store unit
  • inject into assembly code
    • in contrast to, e.g., injection at register-transfer level or in microarchitecture simulation
  • counter state space explosion by grouping similar threads
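Grouping shrinks the injection space because only one representative per group needs to be targeted. The sketch below assumes, for illustration only, that threads count as similar when they execute the same number of dynamic instructions. The function names and the sample data are hypothetical.

```python
import random
from collections import defaultdict


def group_threads(instruction_counts: dict) -> dict:
    """Group thread ids by their dynamic instruction count (assumed similarity criterion)."""
    groups = defaultdict(list)
    for thread_id, count in instruction_counts.items():
        groups[count].append(thread_id)
    return dict(groups)


def pick_representatives(groups: dict, rng: random.Random) -> dict:
    """Choose one thread per group as injection target."""
    return {count: rng.choice(thread_ids) for count, thread_ids in sorted(groups.items())}


counts = {0: 100, 1: 100, 2: 250, 3: 100}  # thread id -> instructions executed
groups = group_threads(counts)
representatives = pick_representatives(groups, random.Random(0))
```

With four threads in two groups, only two threads are injected into instead of four. On a GPU kernel with thousands of threads the reduction is correspondingly larger.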

B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi, “GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications,” in Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on , 2014, pp. 221–230.

15

GPU-Qin (2)

_images/Fang14Gpu-figure4.png _images/Fang14Gpu-figure7.png
16

software fault injection assessing operating systems

digraph device_stack { layout=neato node [shape=record style=filled fontname="Ubuntu,sans" fontsize=40 fixedsize=true width=10 height=1 pin=true fillcolor=none] distributed [pos="0,5" label="distributed application"] application [pos="0,4" label="application"] code [pos="0,3" label="code"] hardware [pos="0,1" label="hardware"] os [pos="0,2" label="operating system" zindex=10 penwidth=10 color="#077cc0"] }

17

FINE

_images/Kao93FINE-figure3.png


  • Fault Injection and Monitoring Environment
  • study fault propagation in UNIX systems

    • hardware faults
      • e.g., memory corruption, wrong computation, wrong control flow
    • software faults
      • e.g., uninitialized variables, wrong assignments, wrong condition checks

W.-I. Kao, R. K. Iyer, and D. Tang, “FINE: A fault injection and monitoring environment for tracing the UNIX system behavior under faults,” IEEE Transactions on Software Engineering , vol. 19, no. 11, pp. 1105–1118, Nov. 1993.

18

FINE (2)

  • findings
    • memory and software faults tend to …
      • … have higher latency until activation
      • … cause lower performance loss
    • CPU and bus faults tend to …
      • … have lower latency until activation
      • … cause higher performance loss
19

What happens here?

...
typedef void (*FN) ();

int main(int argc, char **argv) {
  unsigned char fn_data[NBYTES];        /* holds garbage program */
  FN            fn_ptr = (FN) &fn_data; /* declares pointer to data as function */

  mprotect(...); /* unsets no-execute bit on memory area */
  srand(SEED);
  while(1) {
    for (int i=0; i < NBYTES; i++)
      fn_data[i] = rand() & 0xFF; /* ``& 0xFF``: int to byte */
    fn_ptr();
  }
}
  • see next slide…

source excerpt from https://github.com/28mm/Crashme– , modified

20

crashme

  • execute randomly generated bytes as procedure
    • test operating system stability
    • i.e., a sort of fuzzing
  • first implementation 1996 by George J. Carrette
  • used for testing the Linux kernel
    • earliest entry ~ late 1996 (~2.0.20) in Linux kernel mailing list
  • very simplistic
    • trigger: manual during runtime
    • failure model: usually aims for crash faults
    • analysis: provides no programmatic detection, monitoring, tracing, etc.

G. J. Carrette, “Crashme,” 1996 [Online]. Available: http://people.delphiforums.com/gjc/crashme.html

21

iofuzz

  • write random bytes to IO ports
  • case study: security failures for all hypervisors tested:

    • _images/Ormandy07Empirical-table2.png
      • full: hypervisor can be compromised; partial: e.g., information disclosure, unauthorized resource-allocation; minor: e.g., hypervisor crash
  • similar tools still used to continuously test hypervisor robustness
    • e.g., KVM, Qemu
22

Trinity

  • Linux system call fuzzing

    • large surface
      • almost 400 system calls in 4.17
    • reaching into privileged mode
  • assess robustness of kernel

  • gray-box testing

    • considers data types and values/ranges system calls expect
      • via annotations
      • to pass argument checks at beginning of procedures
      • increases efficiency
  • found and still finds bugs

    _images/trinity-mailing-list-search-hits.svg

https://github.com/kernelslacker/trinity
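The efficiency gain of gray-box argument generation can be illustrated with a small sketch. Trinity itself is written in C. The annotation table, the value pools and the function names below are hypothetical and only mimic the idea of annotating each argument with its expected kind. No system call is actually issued.

```python
import random

# Hypothetical per-argument generators, keyed by the annotated argument kind.
ARG_GENERATORS = {
    "fd": lambda rng, open_fds: rng.choice(open_fds),           # a descriptor that exists
    "offset": lambda rng, open_fds: rng.choice([-1, 0, 1, 2**40]),
    "whence": lambda rng, open_fds: rng.choice([0, 1, 2]),      # SEEK_SET, SEEK_CUR, SEEK_END
    "len": lambda rng, open_fds: rng.choice([0, 1, 4096, 2**31 - 1]),
}

# Hypothetical annotations: system call name -> kinds of its arguments.
SYSCALLS = {
    "lseek": ["fd", "offset", "whence"],
    "ftruncate": ["fd", "len"],
}


def blackbox_args(rng: random.Random, count: int) -> list:
    """Black box: random 64-bit values, mostly rejected by the first argument check."""
    return [rng.getrandbits(64) for _ in range(count)]


def graybox_args(rng: random.Random, name: str, open_fds: list) -> list:
    """Gray box: plausible values per annotated kind, so calls get past the initial checks."""
    return [ARG_GENERATORS[kind](rng, open_fds) for kind in SYSCALLS[name]]
```

A random 64-bit value is almost never a valid file descriptor, so black-box calls tend to fail early with an error code and never exercise the deeper kernel code paths.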

23

software fault injection in managed code

digraph device_stack { layout=neato node [shape=record style=filled fontname="Ubuntu,sans" fontsize=40 fixedsize=true width=10 height=1 pin=true fillcolor=none] distributed [pos="0,5" label="distributed application"] application [pos="0,4" label="application"] os [pos="0,2" label="operating system"] hardware [pos="0,1" label="hardware"] code [pos="0,3" label="code" penwidth=10 color="#077cc0"] }

24

software fault injection in managed code (2)

  • leverage advanced instrumentation and reflection features
    • both before runtime and at runtime
  • targeting applications
    • e.g., code for exception handling
    • e.g., tolerance to computation faults
    • e.g., effectiveness of test suite
      • i.e., “Do injected faults cause a test to fail?”
  • approaches
    • dependency injection: use stub/mock objects for testing
    • runtime instrumentation: intercept and modify calls directly
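Both approaches can be contrasted in a short sketch. Python stands in for a managed runtime here. The classes `Storage`, `FailingStorage` and `Service` are hypothetical, while `unittest.mock.patch.object` is part of the Python standard library.

```python
from unittest import mock


class Storage:
    def read(self, key: str) -> str:
        return "value-for-" + key


class FailingStorage:
    """Stub that raises an injected fault instead of reading."""

    def read(self, key: str) -> str:
        raise OSError("injected read failure")


class Service:
    def __init__(self, storage) -> None:
        self.storage = storage

    def lookup(self, key: str) -> str:
        try:
            return self.storage.read(key)
        except OSError:
            return "fallback"  # the exception handling under test


# 1) dependency injection: hand the faulty stub to the service
via_stub = Service(FailingStorage()).lookup("a")

# 2) runtime instrumentation: intercept the call on the real class
service = Service(Storage())
with mock.patch.object(Storage, "read", side_effect=OSError("injected")):
    via_patch = service.lookup("a")
after_patch = service.lookup("a")  # original behavior is restored
```

Dependency injection requires the target to accept its collaborators from outside. Runtime instrumentation also works on code that was not designed for testing.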
25

TestAPI

  • testing framework for .NET
  • fault model: e.g., throwing exceptions, altering return values or the state of the environment
using System;
class MyApplication {
  static void Main(string[] args) {
    int a = 2;
    int b = 3;
    for (int i = 0; i < 10; i++) {
        Console.WriteLine("{0}) {1} + {2} = {3}", i, a, b, Sum(a, b));
    }
  }
  private static int Sum(int a, int b) {
    return a + b; /* e.g., fault injection happens here: return -10 every 7th run */
  }
}
  • facilities for error/failure detection
    • e.g., for deep object comparison, string generation with constraints
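The fault from the comment in the listing above, returning -10 on every 7th call, can be sketched as a decorator. This is a Python analogue for illustration and not TestAPI's .NET API. The names `inject_return` and `add` are hypothetical.

```python
import functools


def inject_return(value, every: int):
    """Replace the return value of every `every`-th call with `value`."""
    def decorator(func):
        calls = 0

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal calls
            calls += 1
            if calls % every == 0:
                return value  # injected fault: the real function is skipped
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_return(value=-10, every=7)
def add(a: int, b: int) -> int:
    return a + b


results = [add(2, 3) for _ in range(10)]
```

Of the ten calls, only the seventh returns -10. All other calls return the correct sum 5.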
26

Jaca



  • targets bytecode of Java applications
  • uses objects’ public interfaces
    • i.e., attributes, method parameters, return values
  • based on language-independent patterns
    • i.e., propose patterns of (different aspects of) a software fault injection tool
_images/Martins02Jaca-figure2.1.png
  • uses reflection for triggering and analysis/monitoring

E. Martins, C. M. Rubira, and N. G. Leme, “Jaca: A reflective fault injection tool based on patterns,” in Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on , 2002, pp. 483–487.

27

Byteman

  • by JBoss
  • promises to ease tracing, monitoring and testing
  • injects code
    • e.g., print statements, testing tests, modifying private state
RULE trace Object.finalize at initial call
CLASS ^java.lang.Object
METHOD finalize
IF NOT callerEquals("finalize")
DO System.out.println("Finalizing " + $0)
ENDRULE
  • triggers for code injection
28

software fault injection at application level

digraph device_stack { layout=neato node [shape=record style=filled fontname="Ubuntu,sans" fontsize=40 fixedsize=true width=10 height=1 pin=true fillcolor=none] distributed [pos="0,5" label="distributed application"] code [pos="0,3" label="code"] os [pos="0,2" label="operating system"] hardware [pos="0,1" label="hardware"] application [pos="0,4" label="application" penwidth=10 color="#077cc0"] }

29

Ballista

  • motivation: use off-the-shelf components for mission-critical systems

  • assess robustness via component interfaces

    • test valid and “exceptional” inputs, exhaustively

      • i.e., fuzzing

      • _images/Kropp98Automated-figure1.png
      • _images/Kropp98Automated-figure2.png
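Exhaustive testing over valid and exceptional values amounts to a Cartesian product of per-type value pools. The sketch below is a simplified Python illustration and not Ballista's implementation. The value pool, the function `divide` and the outcome labels are assumptions.

```python
import itertools

# Hypothetical pool of valid and exceptional test values per parameter type.
TEST_VALUES = {
    "int": [0, 1, -1, 2**31 - 1, -2**31],
}


def divide(a: int, b: int) -> int:
    """Stand-in for a component under test."""
    return a // b


def robustness_test(func, param_types: list) -> dict:
    """Call func with every combination of test values and record the outcome."""
    outcomes = {}
    pools = [TEST_VALUES[param_type] for param_type in param_types]
    for args in itertools.product(*pools):
        try:
            func(*args)
            outcomes[args] = "pass"
        except Exception as exc:
            outcomes[args] = type(exc).__name__
    return outcomes


outcomes = robustness_test(divide, ["int", "int"])
failures = {args: result for args, result in outcomes.items() if result != "pass"}
```

With five values per parameter, the two-parameter function is called 25 times. The five combinations with a zero divisor are reported as `ZeroDivisionError`.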

N. P. Kropp, P. J. Koopman, and D. P. Siewiorek, “Automated robustness testing of off-the-shelf software components,” in Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224) , 1998, pp. 230–239.

30

Ballista (2)

  • case study: assess POSIX interface on 10 different UNIX systems
_images/Kropp98Automated-table2.png

catastrophic failure: crash requiring manual reboot

31

FERRARI

  • Flexible Software-Based Fault and Error Injection Tool

  • transient and permanent faults

    • e.g., execution of wrong or additional instructions
    • e.g., fetching operands from wrong address
  • _images/Kanawati95FERRARI-figure2.png

G. A. Kanawati, N. A. Kanawati, and J. A. Abraham, “FERRARI: a flexible software-based fault and error injection system,” IEEE Transactions on Computers , vol. 44, no. 2, pp. 248–260, Feb. 1995.

32

FERRARI (2)

_images/Kanawati95FERRARI-figure3.png _images/Kanawati95FERRARI-figure10.png _images/Kanawati95FERRARI-table2.png
33

LFI – Library Fault Injector

  • function calls intercepted using LD_PRELOAD for fault injection
    • i.e., fault injection library is called instead of the actual library
  • programmatic generation of either exhaustive or random fault injection scenarios
    1. disassembles binaries of libraries
    2. determines return codes
      • by determining control flows and how static return values propagate back
      • also analyzes writes to variables, when return code stored therein
  • trigger: e.g., nth call, specific call stack entries
  • fault model: incomplete return code handling
  • analysis: no measurements, but detailed logs and “replay” scripts
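Scenario generation and the nth-call trigger can be sketched as follows. This is not LFI's code or scenario format. The fault profile `PROFILE` is a hand-written stand-in for what the binary analysis would produce, and the wrapper only emulates in Python what LD_PRELOAD interception achieves for native libraries.

```python
import random

# Hypothetical fault profile: library function -> error return codes.
PROFILE = {"read": [-1], "malloc": [0], "close": [-1]}


def exhaustive_scenarios(profile: dict, max_call: int) -> list:
    """Every (function, error code, nth call) combination."""
    return [
        (function, code, nth)
        for function, codes in profile.items()
        for code in codes
        for nth in range(1, max_call + 1)
    ]


def random_scenarios(profile: dict, max_call: int, count: int, seed: int) -> list:
    """A reproducible random subset of the exhaustive scenarios."""
    return random.Random(seed).sample(exhaustive_scenarios(profile, max_call), count)


def intercept(real_function, error_code, nth_call: int):
    """Wrapper called instead of the real function; fails on the nth call only."""
    calls = 0

    def wrapper(*args, **kwargs):
        nonlocal calls
        calls += 1
        if calls == nth_call:
            return error_code  # injected fault
        return real_function(*args, **kwargs)

    return wrapper


fake_read = intercept(lambda: 42, error_code=-1, nth_call=2)
observed = [fake_read() for _ in range(3)]
```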

P. D. Marinescu and G. Candea, “LFI: A practical and general library-level fault injector,” in Dependable Systems & Networks, 2009. DSN’09. IEEE/IFIP International Conference on , 2009, pp. 379–388 [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5270313

34

LFI – Library Fault Injector (2)

  • _images/Marinescu09LFI-figure1.png
  • _images/Marinescu09LFI-figure3.png
35

Hovac

  • assess fault tolerance regarding library failures

  • fault model from Common Weakness Enumeration (CWE) database

  • _images/Herscheid15Hovac-figure1.png

L. Herscheid, D. Richter, and A. Polze, “Hovac: A configurable fault injection framework for benchmarking the dependability of C/C++ applications,” in Software Quality, Reliability and Security (QRS), 2015 IEEE International Conference on , 2015, pp. 1–10.

36

software fault injection in distributed systems

digraph device_stack { layout=neato node [shape=record style=filled fontname="Ubuntu,sans" fontsize=40 fixedsize=true width=10 height=1 pin=true fillcolor=none] application [pos="0,4" label="application"] code [pos="0,3" label="code"] os [pos="0,2" label="operating system"] hardware [pos="0,1" label="hardware"] distributed [pos="0,5" label="distributed application" penwidth=10 color="#077cc0"] }

37

ORCHESTRA





  • focused on distributed protocols

    • have a large state space
  • protocol fault injection (PFI) layer inserted below target protocol

    • message filtering, manipulation and injection
  • _images/Dawson96ORCHESTRA-figure1.png
  • implemented for Solaris and real-time Mach

  • used to assess membership protocols
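A layer that sits below the target protocol and filters, manipulates or injects messages can be sketched as below. This is a simplified illustration and not ORCHESTRA's implementation. The message format, the class `PFILayer` and the script function are hypothetical.

```python
class PFILayer:
    """Sits between the target protocol and the layer below it."""

    def __init__(self, lower_send, script) -> None:
        self.lower_send = lower_send  # send function of the layer below
        self.script = script          # maps one message to a list of messages

    def send(self, message: dict) -> None:
        for outgoing in self.script(message):
            self.lower_send(outgoing)


def membership_fault_script(message: dict) -> list:
    if message["type"] == "heartbeat":
        return []                                # filtering: drop the message
    if message["type"] == "join":
        return [dict(message, member="ghost")]   # manipulation: alter a field
    if message["type"] == "leave":
        return [message, dict(message)]          # injection: send a duplicate
    return [message]                             # everything else passes through


wire = []  # stand-in for the network
layer = PFILayer(wire.append, membership_fault_script)
layer.send({"type": "heartbeat", "member": "a"})
layer.send({"type": "join", "member": "b"})
layer.send({"type": "leave", "member": "c"})
layer.send({"type": "data", "member": "d"})
```

The target protocol only needs to call `layer.send` instead of the original send function, so its own code stays unmodified.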

S. Dawson, F. Jahanian, and T. Mitton, “ORCHESTRA: A fault injection environment for distributed systems,” University of Michigan, Ann Arbor, MI, Tech. Rep., 1996.

38

FATE / DESTINI

  • tests the dependability of cloud systems

    • goals: formality, verifiability, exhaustiveness
  • FATE: failure injection service

    • _images/Gunawi11FATE-figure2.png
    • insert failure surfaces into target system

      • which are then controlled by a failure server
    • optimization by prioritizing dependent failures

  • DESTINI: declaration of desired recovery behavior

    • uses Datalog language
    • e.g., existence of replica (rack-aware), timings
  • exemplified using HDFS

H. S. Gunawi et al. , “FATE and DESTINI: A framework for cloud recovery testing,” in Proceedings of NSDI’11: 8th USENIX Symposium on Networked Systems Design and Implementation , 2011, p. 239.

39

Chaos Monkey

  • terminates AWS EC2 instances
    • in AWS Auto Scaling Groups
    • with certain constraints
      • e.g., maximum rate, maximum percentage
    • fault model: crash faults of nodes / virtual machines
  • used by Netflix in production
    • during business hours only
      • staff is watching and can react quickly
    • increases confidence regarding fault tolerance of own system
    • keeps constant awareness of fault tolerance
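The selection logic under such constraints can be sketched as follows. This is not Chaos Monkey's code and uses no AWS API. The business hours (Monday to Friday, 9:00 to 17:00), the function names and the instance ids are assumptions, and the actual termination call is left out.

```python
import datetime
import math
import random


def in_business_hours(now: datetime.datetime) -> bool:
    """Assumed business hours: Monday to Friday, 9:00 to 17:00."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victims(instances: list, max_percentage: float,
                 now: datetime.datetime, rng: random.Random) -> list:
    """Select instances to terminate without exceeding the maximum percentage."""
    if not in_business_hours(now):
        return []  # nobody is watching, so do nothing
    limit = math.floor(len(instances) * max_percentage / 100)
    return rng.sample(instances, limit)


group = ["i-%02d" % n for n in range(10)]  # hypothetical instance ids of one group
weekday = pick_victims(group, 20, datetime.datetime(2024, 1, 3, 10, 0), random.Random(0))
weekend = pick_victims(group, 20, datetime.datetime(2024, 1, 6, 10, 0), random.Random(0))
```

With ten instances and a maximum of 20 percent, at most two instances are selected on a weekday morning and none on a weekend.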
40

this was just an excerpt…

digraph device_stack { layout=neato node [shape=record style=filled fontname="Ubuntu,sans" fontsize=40 fixedsize=true width=10 height=1 pin=true fillcolor=none] distributed [pos="0,5" label="distributed application"] application [pos="0,4" label="application"] code [pos="0,3" label="code"] os [pos="0,2" label="operating system"] hardware [pos="0,1" label="hardware"] }

41