Failure Geometry Demo
Structured failure analysis on the CARB reasoning dataset — no API key required.
Two lightweight baselines are used intentionally:
| Baseline | Strategy | Expected failure shape |
|---|---|---|
| `always_1` | Predict 1 for everything | Fails on all false-label items (systematic bias) |
| `keyword_heuristic` | Predict 0 when statement contains negation markers | Fails on affirmative-false and negated-true items |
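The two baselines can be sketched in a few lines. This is a minimal illustration, not the demo's actual code: the negation marker list, the `failures` helper, and the record fields are assumptions introduced here, since the source does not specify them.

```python
import re

# Hypothetical negation markers; the demo's real list is not given in the source.
NEGATION_MARKERS = ("not", "no", "never", "none", "cannot")
_NEGATION_RE = re.compile(
    r"\b(?:" + "|".join(NEGATION_MARKERS) + r")\b|n't\b", re.IGNORECASE
)


def always_1(statement: str) -> int:
    """Predict 1 regardless of the input."""
    return 1


def keyword_heuristic(statement: str) -> int:
    """Predict 0 when the statement contains a negation marker, else 1."""
    return 0 if _NEGATION_RE.search(statement) else 1


def failures(items, predict, model_name):
    """Collect one record per misclassified (statement, label) pair."""
    return [
        {"model": model_name, "statement": s, "label": y, "prediction": predict(s)}
        for s, y in items
        if predict(s) != y
    ]


# Toy items for illustration only; these are not drawn from CARB.
toy_items = [
    ("The sky is green.", 0),   # affirmative, false
    ("Water is not dry.", 1),   # negated, true
    ("Cats are mammals.", 1),   # affirmative, true
    ("Fish do not swim.", 0),   # negated, false
]

pooled = failures(toy_items, always_1, "always_1") + failures(
    toy_items, keyword_heuristic, "keyword_heuristic"
)
```

On these toy items, `always_1` misses both false-label statements, while `keyword_heuristic` misses the affirmative-false and the negated-true statement, matching the failure shapes in the table.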
Because the baselines fail differently, pooling their failures and clustering lets us ask:
Do clusters separate by reasoning type, by model identity, or both?
Mutual information (MI) quantifies this. If MI(cluster, reasoning_type) is larger than
MI(cluster, model_identity), that supports the hypothesis that failure structure is
organised around reasoning difficulty rather than purely around which model ran.
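As a rough sketch of the comparison, the plug-in (empirical) estimate of mutual information between two categorical label sequences can be computed from counts alone. The function name and the toy labels below are assumptions for illustration; the demo may use a different estimator or a library routine.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) between two equal-length label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    # sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items()
    )


# Toy failure records: clusters align with reasoning type, not with model.
cluster = [0, 0, 1, 1]
reasoning_type = ["type_a", "type_a", "type_b", "type_b"]  # hypothetical names
model_identity = ["always_1", "keyword_heuristic", "always_1", "keyword_heuristic"]

mi_reasoning = mutual_information(cluster, reasoning_type)
mi_model = mutual_information(cluster, model_identity)
```

In this toy case the clusters coincide with reasoning type, so `mi_reasoning` equals ln 2 while `mi_model` is 0. With real failure records both values are typically non-zero, and small samples bias the plug-in estimate upward, so the gap between them matters more than either value alone.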
Part of the Obversary Studios evaluation systems research. Live pipeline version (HF Inference API): carb-observability-space
Failure records (sorted by cluster → reasoning type)