Failure Geometry Demo

Structured failure analysis on the CARB reasoning dataset — no API key required.

Two lightweight baselines are used intentionally:

| Baseline | Strategy | Expected failure shape |
| --- | --- | --- |
| always_1 | Predict 1 for everything | Fails on all false-label items (systematic bias) |
| keyword_heuristic | Predict 0 when statement contains negation markers | Fails on affirmative-false and negated-true items |
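
As a rough illustration of why the two baselines fail on different items, here is a minimal sketch. The negation marker list, the toy statements, and the `failures` helper are hypothetical; the demo's actual marker list and record schema are not given here. The sketch also assumes `keyword_heuristic` predicts 1 when no marker is found, which is what the expected failure shape implies.

```python
# Hypothetical marker list; the demo's real list is not specified.
NEGATION_WORDS = ("not", "no", "never")


def always_1(statement: str) -> int:
    """Predict 1 regardless of the input."""
    return 1


def keyword_heuristic(statement: str) -> int:
    """Predict 0 if a negation marker is present, else 1 (assumed default)."""
    tokens = statement.lower().replace(".", " ").replace(",", " ").split()
    has_negation = any(
        tok in NEGATION_WORDS or tok.endswith("n't") for tok in tokens
    )
    return 0 if has_negation else 1


def failures(items, predict, model_name):
    """Collect one record per item where the prediction differs from the label."""
    return [
        {"model": model_name, "statement": s, "label": y, "pred": predict(s)}
        for s, y in items
        if predict(s) != y
    ]


# Toy items (statement, label); not taken from CARB.
items = [
    ("Water is wet.", 1),      # affirmative-true
    ("Fire is cold.", 0),      # affirmative-false
    ("Fire is not cold.", 1),  # negated-true
    ("Water is not wet.", 0),  # negated-false
]

always_1_fails = failures(items, always_1, "always_1")
heuristic_fails = failures(items, keyword_heuristic, "keyword_heuristic")
pooled = always_1_fails + heuristic_fails

for record in pooled:
    print(record["model"], "|", record["statement"])
```

On these toy items, `always_1` misses both false-label statements, while `keyword_heuristic` misses the affirmative-false and the negated-true statement, so the pooled failure set mixes two distinct error patterns.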

Because the baselines fail differently, pooling their failures and clustering lets us ask:

Do clusters separate by reasoning type, by model identity, or both?

Mutual information (MI) quantifies this. If MI(cluster, reasoning_type) exceeds MI(cluster, model_identity), that supports the hypothesis that failure structure is organised around reasoning difficulty rather than purely around which model produced the failure.
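
The comparison can be sketched with a plain-Python MI estimate over discrete labels. This is an illustrative sketch, not necessarily how the demo computes it; the cluster assignments, reasoning-type names, and record layout below are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of MI (in nats) between two discrete label sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


# Hypothetical pooled failure records: cluster id, reasoning type, model.
clusters = [0, 0, 1, 1]
reasoning_type = ["negation", "negation", "causal", "causal"]
model_identity = ["always_1", "keyword_heuristic", "always_1", "keyword_heuristic"]

mi_reasoning = mutual_information(clusters, reasoning_type)
mi_model = mutual_information(clusters, model_identity)

print(f"MI(cluster, reasoning_type) = {mi_reasoning:.3f} nats")
print(f"MI(cluster, model_identity) = {mi_model:.3f} nats")
```

In this toy case the clusters line up exactly with reasoning type and are independent of model identity, so the first MI equals log 2 and the second is zero. Real failure data will fall somewhere between these extremes, and small samples inflate plug-in MI estimates, so the gap between the two values matters more than either value alone.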


Part of the Obversary Studios evaluation systems research. Live pipeline version (HF Inference API): carb-observability-space

Baselines to run

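Select one baseline to isolate a single failure geometry, or both to compare MI(cluster, model_identity). With only one baseline selected, model identity is constant across all failures, so that MI is zero.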


Failure records (sorted by cluster → reasoning type)