$ cd /engineering && cat ./fault-injection-for-complete-branch-coverage.md
Formal MethodsEngineering
> Fault Injection for Complete Branch Coverage
Robert Chiniquy
||5 min read
share:
width:
Error handling branches – timeouts, malformed responses, rate limits, partial failures – are invisible to valid inputs. The connector code that handles a 503 or a half-written JSON response never fires when the mock server is behaving. And that’s where the bugs hide.
This is part of a series on formally verifying identity connectors. The coverage-guided verification reaches most branches through normal input variation. Fault injection reaches the rest.
Ten Fault Types
The system defines ten categories of injectable failure:
These aren’t arbitrary. Each corresponds to a failure mode that real APIs produce and that connectors must handle. A 429 from Okta’s rate limiter. A 503 from AWS during a service disruption. A connection reset from a load balancer timeout. Truncated JSON from a proxy that closed the connection early.
Configuration
Each fault is configured with targeting and scheduling:
type FaultConfig struct {
Type FaultType
Probability float64 // 0.0-1.0
AfterRequests int // Trigger after N successes
ForRequests int // Apply for N requests, then stop
Endpoints []string // Filter by URL path pattern
DelayMs int // For slow_response
CustomStatus int // Override HTTP status code
CustomBody string // Override response body
}
AfterRequests is key. A connector might handle the first page of results correctly but fail on the third. Setting AfterRequests: 2 injects the fault after two successful requests, testing pagination error recovery specifically. ForRequests limits the fault duration – the system can verify that a connector recovers after a transient failure.
Endpoints targets faults at specific API paths. A connector talks to multiple endpoints – users, groups, roles, memberships. Injecting a fault on the groups endpoint while users works normally tests whether the connector handles partial API availability.
Breaking the Transport Layer
The simple faults are straightforward: return an error status code, return empty content, delay the response. The interesting ones break the transport layer itself.
Connection reset hijacks the TCP connection and forces a RST instead of a clean FIN:
case FaultConnectionReset:
if hijacker, ok := w.(http.Hijacker); ok {
conn, _, _ := hijacker.Hijack()
if tcpConn, ok := conn.(*net.TCPConn); ok {
tcpConn.SetLinger(0) // Forces TCP RST
}
conn.Close()
}
SetLinger(0) tells the kernel to send a RST segment instead of the normal FIN handshake. The connector sees an abrupt connection drop, not a graceful close. This is exactly what happens when a load balancer times out or a network partition heals ungracefully.
Partial response writes the beginning of a valid JSON response, then kills the connection:
case FaultPartialResponse:
w.Header().Set("Content-Length", "1000") // Claim more data coming
w.Write([]byte(`{"data": [`)) // Start valid JSON
if hijacker, ok := w.(http.Hijacker); ok {
conn, _, _ := hijacker.Hijack()
conn.Close() // Abrupt close
}
The Content-Length header says 1000 bytes are coming. Only 11 arrive. The connector has to detect the truncation – by checking Content-Length against bytes received, by handling the read error, or by detecting invalid JSON. Each connector handles this differently. The fault injection tests whether it handles it at all.
Field-Level Mutations
Beyond transport-level faults, the system supports semantic mutations – modifying the content of otherwise valid responses:
type FaultKind struct {
Name string `yaml:"name"`
Status int `yaml:"status"`
Body string `yaml:"body"`
Headers map[string]string `yaml:"headers"`
DelayMs int `yaml:"delay_ms"`
OmitFields []string `yaml:"omit_fields"`
OverrideFields map[string]any `yaml:"override_fields"`
InjectFields map[string]any `yaml:"inject_fields"`
TruncateResults int `yaml:"truncate_results"`
}
OmitFields removes JSON fields from the response. What happens when the API returns a user without an email field? Does the connector crash, skip the user, or sync it with a blank email?
OverrideFields replaces values. What if status comes back as an unexpected string? What if role is null instead of a string?
TruncateResults limits array sizes. A paginated response that normally returns 100 items returns 3. Does the connector still follow the pagination link?
InjectFields adds unexpected fields. APIs evolve. A new field appearing in the response shouldn’t break a connector that doesn’t expect it.
These mutations apply via a capture-modify-forward middleware:
The mock server generates a correct response. The middleware captures it, mutates the JSON, and forwards the mutated version. The connector sees a response that’s structurally valid but semantically wrong in a controlled way.
Combined Coverage
Normal input variation exercises the happy path and its boundary conditions. Fault injection exercises error handling, retry logic, and degraded-mode behavior. Between the two, the framework reaches branches that neither covers alone.
For connectors where the framework has source access, the combination gets branch coverage to 100%. Every if err != nil, every pagination check, every rate limit handler, every timeout path – exercised by some combination of input configuration and fault scenario.
The coverage predictor tracks which faults exercise which error-handling branches, using the same DFA-based prediction that guides normal input exploration. The system walks through fault scenarios the same way it walks through the input space: one fault type at a time, one endpoint at a time, Gray code traversal over the fault configuration space, bisecting to find which fault configurations trigger new branches.
Series
This is part of a series on formally verifying identity connectors: