Failure Geometry Demo

Structured failure analysis on the CARB reasoning dataset — no API key required.

Two lightweight baselines are used intentionally:

| Baseline | Strategy | Expected failure shape |
|---|---|---|
| `always_1` | Predict 1 for everything | Fails on all false-label items (systematic bias) |
| `keyword_heuristic` | Predict 0 when statement contains negation markers | Fails on affirmative-false and negated-true items |
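
To make the two failure shapes concrete, here is a minimal sketch of both baselines. The negation marker list, the example statements, and the `failures` helper are assumptions made for illustration; the demo's actual markers and the CARB items are not reproduced here. The sketch also assumes `keyword_heuristic` predicts 1 when no marker is found, which is what the "affirmative-false" failure shape implies.

```python
import re

# Hypothetical negation markers; the demo's real list is not given in this README.
NEGATION_PATTERN = re.compile(r"\b(not|no|never|none|cannot)\b|n't", re.IGNORECASE)


def always_1(statement: str) -> int:
    """Predict 1 (true) regardless of the input."""
    return 1


def keyword_heuristic(statement: str) -> int:
    """Predict 0 if a negation marker is present, otherwise 1 (assumed default)."""
    return 0 if NEGATION_PATTERN.search(statement) else 1


# Illustrative (statement, label) pairs, not taken from CARB.
items = [
    ("A square has four sides.", 1),            # affirmative, true
    ("A triangle has four sides.", 0),          # affirmative, false
    ("A square does not have five sides.", 1),  # negated, true
    ("A circle is not round.", 0),              # negated, false
]


def failures(predict, labelled_items):
    """Return the statements on which `predict` disagrees with the label."""
    return [text for text, label in labelled_items if predict(text) != label]


always_1_failures = failures(always_1, items)
keyword_failures = failures(keyword_heuristic, items)

print("always_1 fails on:", always_1_failures)
print("keyword_heuristic fails on:", keyword_failures)
```

On these four toy items, `always_1` misses both false-label statements, while `keyword_heuristic` misses the affirmative-false and the negated-true statement, matching the shapes listed in the table.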

Because the baselines fail differently, pooling their failures and clustering lets us ask:

Do clusters separate by reasoning type, by model identity, or both?

Mutual information (MI) between the cluster assignments and each label quantifies this. If MI(cluster, reasoning_type) is larger than MI(cluster, model_identity), that supports the hypothesis that failure structure is organised around reasoning difficulty rather than purely around which model ran.
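
The comparison can be sketched with a plain-Python MI estimate over pooled failure records. Everything in the toy data below is made up for illustration: the cluster ids, the reasoning type names (`negation`, `quantifier`), and the way failures are assigned to models are assumptions, not results from the demo. The `mutual_information` helper is a hypothetical name and uses the standard plug-in estimate from empirical counts, in bits.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of MI(X; Y) in bits from two aligned label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) simplifies to c * n / (count_x * count_y)
        mi += p_xy * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy pooled failures: one entry per failed item, across both baselines.
# All values are illustrative, not output of the actual pipeline.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
reasoning_types = ["negation"] * 4 + ["quantifier"] * 4
models = ["always_1", "keyword_heuristic"] * 4

mi_reasoning = mutual_information(clusters, reasoning_types)
mi_model = mutual_information(clusters, models)

print(f"MI(cluster, reasoning_type) = {mi_reasoning:.3f} bits")
print(f"MI(cluster, model_identity) = {mi_model:.3f} bits")
```

In this constructed case the clusters line up exactly with reasoning type and are independent of the model, so the first MI is 1 bit and the second is 0. Real failure data will usually fall somewhere in between, and small samples inflate plug-in MI estimates, so the gap between the two values matters more than either number alone.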

Part of the Obversary Studios evaluation systems research. Live pipeline version (HF Inference API): carb-observability-space