Test Harness Methodology

This document explains how to design, run, and interpret experiments using the harness. It is intended for researchers, contributors, and any practitioner using the harness to generate evidence for their own compensating-control evaluations.

Goal

Produce a defensible, reproducible matrix of (scenario, control profile) → outcome records that quantifies which compensating controls actually mitigate which threats.
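
For concreteness, here is a minimal sketch of what a single outcome record might look like; the field names are illustrative assumptions, not the harness's actual schema:

# One outcome record per (scenario, control profile) cell.
# Field names are assumptions for illustration, not the harness's schema.
outcome_record = {
    "scenario": "05-eop-default-credential",
    "control_profile": "pam",
    "outcome": "BLOCKED_AUTH",  # one of the outcome classes defined below
    "timestamp": "2024-01-01T00:00:00Z",
}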

Experimental design

Variables

The independent variables are the attack scenario (01–05) and the control profile (baseline, ips, pam, segmentation, all); the dependent variable is the outcome classification recorded for each (scenario, control profile) cell.
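
A minimal sketch of the resulting full factorial design (25 cells per matrix run), using the profile names above; the scenario identifiers are abbreviated for illustration:

from itertools import product

SCENARIOS = ["01", "02", "03", "04", "05"]                    # attack scenarios
PROFILES = ["baseline", "ips", "pam", "segmentation", "all"]  # control profiles

# Full factorial design: every scenario runs under every profile.
cells = list(product(SCENARIOS, PROFILES))
assert len(cells) == 25
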
Hypotheses

The harness is designed to test specific hypotheses about which control profiles mitigate which scenarios. The expected matrix:

Scenario                 baseline    ips           pam            segmentation   all
01 ARP poisoning         REACHABLE   REACHABLE    REACHABLE      BLOCKED        BLOCKED
02 Firmware injection    SUCCESS     BLOCKED_NET  BLOCKED_AUTH   BLOCKED        BLOCKED
03 Cleartext HL7 sniff   SUCCESS     SUCCESS      SUCCESS        BLOCKED        BLOCKED
04 Protocol flood DoS    SUCCESS     MITIGATED    SUCCESS        BLOCKED        BLOCKED
05 Default-cred EoP      SUCCESS     BLOCKED_NET  BLOCKED_AUTH   BLOCKED        BLOCKED

Empirical results that diverge from this matrix are interesting and warrant investigation — they may indicate a control gap, a scenario implementation issue, or an emulator behaviour mismatch with real-world devices.

Running an experiment

Single-cell run

docker compose --profile pam up -d
docker compose exec attacker python runner.py 05-eop-default-credential.py --control-profile pam

Full matrix

The Makefile's matrix target (make matrix) executes all five scenarios across all five control profiles, restarting the harness between profiles to ensure a clean state. Results are written to results/run-<timestamp>.csv.

make matrix
# Equivalent to:
# for profile in baseline ips pam segmentation all; do
#     docker compose down
#     if [ "$profile" != "baseline" ]; then
#         docker compose --profile $profile up -d
#     else
#         docker compose up -d
#     fi
#     sleep 10  # wait for healthchecks
#     docker compose exec -T attacker python runner.py --all --control-profile $profile
# done
# python aggregate-results.py results/run-*.csv > results/matrix.csv
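
For reference, a hedged sketch of the kind of pivot aggregate-results.py performs; the actual script may differ, and the CSV column names here are assumptions:

import csv
import glob

PROFILES = ["baseline", "ips", "pam", "segmentation", "all"]

# Pivot per-run outcome records into a scenario x profile matrix.
# Column names are assumptions; the real aggregate-results.py may differ.
matrix = {}
for path in sorted(glob.glob("results/run-*.csv")):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            matrix.setdefault(row["scenario"], {})[row["control_profile"]] = row["outcome"]

print("scenario," + ",".join(PROFILES))
for scenario in sorted(matrix):
    print(scenario + "," + ",".join(matrix[scenario].get(p, "") for p in PROFILES))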

Repetition for variance

A single matrix run produces a snapshot. For results suitable for reporting, run the matrix at least three times and compute outcome consistency. Variability in the harness is expected to come primarily from timing-sensitive scenarios (DoS recovery window, HL7 capture window).
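
A minimal sketch of the consistency computation, assuming each run CSV carries scenario, control_profile, and outcome columns (the column names are assumptions):

import csv
import glob
from collections import defaultdict

# Collect the set of outcomes observed for each cell across all runs.
outcomes = defaultdict(set)
for path in glob.glob("results/run-*.csv"):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            outcomes[(row["scenario"], row["control_profile"])].add(row["outcome"])

# A cell is consistent when every run produced the same outcome.
for cell, observed in sorted(outcomes.items()):
    if len(observed) > 1:
        print(f"INCONSISTENT {cell}: {sorted(observed)}")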

Interpretation

Outcome classification

Each cell in the matrix is classified into one of the outcome classes used above:

  SUCCESS       the attack achieved its objective
  REACHABLE     the attacker obtained the network position the attack requires
  MITIGATED     the attack partially succeeded, with reduced impact
  BLOCKED_NET   the attack was stopped by a network-layer control
  BLOCKED_AUTH  the attack was stopped by an authentication control
  BLOCKED       the attack was stopped outright

Statistical considerations

For the binary classification of effective vs. not effective at the per-scenario level, a single matrix run is informative but not sufficient; repeated runs with consistent outcomes increase confidence. For publication or formal evidence, three to five matrix runs with fully consistent outcomes are the suggested floor.

Multi-scenario interactions

A control profile may mitigate scenario X as a side effect, even when it is not specifically designed for X. The matrix exposes such interactions: network segmentation, for example, mitigates scenario 01 (ARP poisoning), scenario 03 (HL7 sniff), and scenario 04 (DoS) simultaneously, because all three depend on Layer 2/3 attacker positioning. This is a feature of the analysis: it demonstrates that defence-in-depth at the network layer is broadly effective.

Limitations

The harness has known limitations that must be considered when interpreting results:

  1. Emulation fidelity. The pump emulator reproduces a specific set of constraints, not the full behaviour of any particular commercial device. A control that works against the emulator may face additional challenges against a real device with idiosyncratic behaviour.
  2. Scenario simplification. Each scenario is implemented in 50–150 lines of Python. Real attacks are typically more sophisticated and may chain multiple techniques. The harness scenarios test the basic control posture, not edge cases.
  3. Single-device focus. The harness emulates one infusion pump. Multi-device interactions (e.g., an attacker pivoting from one pump to a peer pump) are not modelled.
  4. No physical-attacker scenarios. The harness is a software environment. Pattern C MFA shim evaluation requires the hardware reference design (mfa-shim/).
  5. Production controls are richer than harness controls. A real Snort/Suricata IPS, real CyberArk PSM, or real Cisco TrustSec deployment will have more complex behaviours than the harness’s stub controls. The harness demonstrates the methodology and the control taxonomy; per-deployment control efficacy must be validated in deployment context.
  6. No false-positive measurement. The harness does not generate the volume of legitimate traffic needed to measure false-positive rates of the controls. Production deployments should evaluate FP rates against full clinical workloads.

Extending the harness

Adding a new scenario

  1. Create a new file under attacker/scenarios/.
  2. Implement run() -> dict returning the standard result schema (a skeleton sketch follows this list).
  3. Include STRIDE_HC and SCENARIO_NAME module-level constants.
  4. Test the scenario standalone.
  5. Update the expected-outcome matrix in this document.
  6. Submit a PR including matrix run results.
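
A skeleton for such a scenario module, assuming the record fields sketched earlier; the STRIDE category and outcome values are placeholders:

# attacker/scenarios/06-example-scenario.py
# Module-level constants are required by the runner (see steps above).

SCENARIO_NAME = "06 Example scenario"
STRIDE_HC = "Tampering"  # placeholder STRIDE category

def run() -> dict:
    # ... perform the attack steps against the target device ...
    # Classify the result into one of the harness's outcome classes.
    return {
        "scenario": SCENARIO_NAME,
        "outcome": "SUCCESS",  # or BLOCKED, BLOCKED_NET, BLOCKED_AUTH, MITIGATED
    }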

Adding a new device emulator

  1. Create a new top-level subdirectory (e.g., target-pacs/).
  2. Provide Dockerfile and emulator script following the structure of target-device/ (a minimal emulator sketch follows this list).
  3. Document the constraints reproduced.
  4. Update docker-compose.yml (optionally as a separate compose file if the architecture differs substantially).
  5. Add scenarios specific to the new emulator’s attack surfaces.
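
As a starting point, a minimal sketch of an emulator entrypoint in the spirit of target-device/; the port number and response behaviour are illustrative assumptions:

# target-pacs/emulator.py (sketch; port and responses are illustrative)
import socketserver

class EmulatorHandler(socketserver.BaseRequestHandler):
    def handle(self):
        self.request.recv(1024)  # read the client's request (ignored in this sketch)
        # Reproduce the target's constrained behaviour here: fixed responses,
        # missing authentication, cleartext protocol quirks, and so on.
        self.request.sendall(b"OK\r\n")

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", 10400), EmulatorHandler) as server:
        server.serve_forever()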

Adding a new control profile

  1. Create a new directory under controls/.
  2. Provide Dockerfile and implementation.
  3. Add a service to docker-compose.yml with profiles: ["new-control"].
  4. Document the expected outcome modifications in the matrix.

Reporting

Results from harness runs can be reported in: