conductorone@engineering ~
$ cd /engineering && cat ./fault-injection-for-complete-branch-coverage.md
Formal Methods Engineering

> Fault Injection for Complete Branch Coverage

/images/author-rch.png
5 min read
share:
width:

Error handling branches – timeouts, malformed responses, rate limits, partial failures – are invisible to valid inputs. The connector code that handles a 503 or a half-written JSON response never fires when the mock server is behaving. And that’s where the bugs hide.

This is part of a series on formally verifying identity connectors. The coverage-guided verification reaches most branches through normal input variation. Fault injection reaches the rest.


Ten Fault Types

The system defines ten categories of injectable failure:

const (
    FaultTimeout            = "timeout"              // Block until context cancels
    FaultServerError        = "server_error"         // HTTP 500
    FaultBadGateway         = "bad_gateway"          // HTTP 502
    FaultServiceUnavailable = "service_unavailable"  // HTTP 503
    FaultMalformedJSON      = "malformed_json"       // Invalid JSON body
    FaultEmptyBody          = "empty_body"           // 200 OK, no content
    FaultRateLimit          = "rate_limit"           // HTTP 429 + Retry-After
    FaultConnectionReset    = "connection_reset"     // TCP RST
    FaultSlowResponse       = "slow_response"        // Delayed response
    FaultPartialResponse    = "partial_response"     // Truncated JSON
)

These aren’t arbitrary. Each corresponds to a failure mode that real APIs produce and that connectors must handle. A 429 from Okta’s rate limiter. A 503 from AWS during a service disruption. A connection reset from a load balancer timeout. Truncated JSON from a proxy that closed the connection early.


Configuration

Each fault is configured with targeting and scheduling:

type FaultConfig struct {
    Type          FaultType
    Probability   float64   // 0.0-1.0
    AfterRequests int       // Trigger after N successes
    ForRequests   int       // Apply for N requests, then stop
    Endpoints     []string  // Filter by URL path pattern
    DelayMs       int       // For slow_response
    CustomStatus  int       // Override HTTP status code
    CustomBody    string    // Override response body
}

AfterRequests is key. A connector might handle the first page of results correctly but fail on the third. Setting AfterRequests: 2 injects the fault after two successful requests, testing pagination error recovery specifically. ForRequests limits the fault duration – the system can verify that a connector recovers after a transient failure.

Endpoints targets faults at specific API paths. A connector talks to multiple endpoints – users, groups, roles, memberships. Injecting a fault on the groups endpoint while users works normally tests whether the connector handles partial API availability.


Breaking the Transport Layer

The simple faults are straightforward: return an error status code, return empty content, delay the response. The interesting ones break the transport layer itself.

Connection reset hijacks the TCP connection and forces a RST instead of a clean FIN:

case FaultConnectionReset:
    if hijacker, ok := w.(http.Hijacker); ok {
        conn, _, _ := hijacker.Hijack()
        if tcpConn, ok := conn.(*net.TCPConn); ok {
            tcpConn.SetLinger(0)  // Forces TCP RST
        }
        conn.Close()
    }

SetLinger(0) tells the kernel to send a RST segment instead of the normal FIN handshake. The connector sees an abrupt connection drop, not a graceful close. This is exactly what happens when a load balancer times out or a network partition heals ungracefully.

Partial response writes the beginning of a valid JSON response, then kills the connection:

case FaultPartialResponse:
    w.Header().Set("Content-Length", "1000")  // Claim more data coming
    w.Write([]byte(`{"data": [`))             // Start valid JSON
    if hijacker, ok := w.(http.Hijacker); ok {
        conn, _, _ := hijacker.Hijack()
        conn.Close()                           // Abrupt close
    }

The Content-Length header says 1000 bytes are coming. Only 11 arrive. The connector has to detect the truncation – by checking Content-Length against bytes received, by handling the read error, or by detecting invalid JSON. Each connector handles this differently. The fault injection tests whether it handles it at all.


Field-Level Mutations

Beyond transport-level faults, the system supports semantic mutations – modifying the content of otherwise valid responses:

type FaultKind struct {
    Name            string            `yaml:"name"`
    Status          int               `yaml:"status"`
    Body            string            `yaml:"body"`
    Headers         map[string]string `yaml:"headers"`
    DelayMs         int               `yaml:"delay_ms"`
    OmitFields      []string          `yaml:"omit_fields"`
    OverrideFields  map[string]any    `yaml:"override_fields"`
    InjectFields    map[string]any    `yaml:"inject_fields"`
    TruncateResults int               `yaml:"truncate_results"`
}

OmitFields removes JSON fields from the response. What happens when the API returns a user without an email field? Does the connector crash, skip the user, or sync it with a blank email?

OverrideFields replaces values. What if status comes back as an unexpected string? What if role is null instead of a string?

TruncateResults limits array sizes. A paginated response that normally returns 100 items returns 3. Does the connector still follow the pagination link?

InjectFields adds unexpected fields. APIs evolve. A new field appearing in the response shouldn’t break a connector that doesn’t expect it.

These mutations apply via a capture-modify-forward middleware:

func (fi *FaultInjector) FaultKindMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        capture := &responseCapture{ResponseWriter: w}
        next.ServeHTTP(capture, r)

        body, status := kind.ApplyToResponse(
            capture.body.Bytes(), capture.statusCode)
        w.WriteHeader(status)
        w.Write(body)
    })
}

The mock server generates a correct response. The middleware captures it, mutates the JSON, and forwards the mutated version. The connector sees a response that’s structurally valid but semantically wrong in a controlled way.


Combined Coverage

Normal input variation exercises the happy path and its boundary conditions. Fault injection exercises error handling, retry logic, and degraded-mode behavior. Between the two, the framework reaches branches that neither covers alone.

For connectors where the framework has source access, the combination gets branch coverage to 100%. Every if err != nil, every pagination check, every rate limit handler, every timeout path – exercised by some combination of input configuration and fault scenario.

The coverage predictor tracks which faults exercise which error-handling branches, using the same DFA-based prediction that guides normal input exploration. The system walks through fault scenarios the same way it walks through the input space: one fault type at a time, one endpoint at a time, Gray code traversal over the fault configuration space, bisecting to find which fault configurations trigger new branches.


Series

This is part of a series on formally verifying identity connectors:

  1. Six Shapes of Authorization
  2. Formally Verifying Two Hundred Identity Connectors
  3. One Mock Server, Twelve Protocols
  4. Every Branch Condition Compiles to a DFA
  5. Proving Equivalence with E-Graphs
  6. Fault Injection for Complete Branch Coverage (this post)