🛡️ Honesty Guard
The Problem
When an AI explains its decisions, it produces an attribution matrix — a grid of numbers showing how much each input contributed to each output.
A positive number means "this input pushed toward this output"; a negative number means "this input pushed away." When some entries are negative, opposing contributions can cancel each other, masking the real cause, so the model may be hiding its true reasoning.
Deceptive models exploit this: they produce explanations that look reasonable on the surface but contain hidden sign cancellations. Two large numbers of opposite sign can cancel to give a plausible-looking small number — but the real reasoning is buried in the cancellation.
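To make the cancellation concrete, here is a minimal numeric sketch (the matrix values are invented for illustration):

```python
import numpy as np

# Hypothetical 2x2 attribution matrix: rows = inputs, columns = outputs.
# Entries [0, 0] and [1, 0] are large and opposite in sign.
A = np.array([[ 5.1, 0.3],
              [-4.9, 0.2]])

# The net contribution to output 0 looks small and plausible
# (5.1 - 4.9, roughly 0.2)...
net = A[:, 0].sum()
print(net)

# ...but the negative entry reveals a push-away that the
# aggregate number conceals.
print((A < 0).any())  # True: a cancellation is present
```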
The Solution: Total Non-Negativity (TNN)
A matrix is totally non-negative if every minor (every sub-determinant) is ≥ 0. This is a strict geometric condition — the matrix must lie inside the "positive cone" of matrix space.
If ANY minor is negative, the explanation contains a hidden cancellation.
ASR(A) = |{(i, j) : aij ≥ 0}| / (m·n) — the fraction of entries of an m×n attribution matrix that are non-negative
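A minimal sketch of the sign-ratio check, assuming the attribution is a NumPy array (the function name `sign_ratio` is mine, not an established API):

```python
import numpy as np

def sign_ratio(A: np.ndarray) -> float:
    """Fraction of entries that are non-negative (the ASR score)."""
    return float((A >= 0).mean())

A = np.array([[1.0, -0.5],
              [2.0,  0.0]])
print(sign_ratio(A))  # 0.75: three of the four entries are >= 0
```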
Minor Check:
det(A[I,J]) ≥ 0 for all index subsets I, J
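The minor check can be sketched directly for small matrices by enumerating every square index subset; this brute-force version is exponential and meant only to illustrate the definition:

```python
from itertools import combinations
import numpy as np

def is_tnn(A: np.ndarray, tol: float = 1e-9) -> bool:
    """True iff every minor det(A[I, J]) is >= -tol (numerical slack)."""
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for I in combinations(range(m), k):
            for J in combinations(range(n), k):
                if np.linalg.det(A[np.ix_(I, J)]) < -tol:
                    return False  # a negative minor: hidden cancellation
    return True

print(is_tnn(np.array([[1.0, 1.0], [1.0, 2.0]])))  # True: det = 1
print(is_tnn(np.array([[1.0, 2.0], [3.0, 1.0]])))  # False: det = -5
```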
Positive Cone:
The set of all TNN matrices forms a closed cone in matrix space: it is closed under non-negative scaling (though not under addition, so the cone is not convex).
The attribution must lie inside this cone to be honest.
Think of it geometrically: all honest attributions live inside a specific region of matrix space. If the model's explanation lies outside that region, it's not being truthful about its reasoning.
How It Works — Step by Step
Code Example
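The original example for this section is missing; the sketch below combines the two checks above. The function name, threshold, and return format are illustrative, not the project's actual API:

```python
import numpy as np
from itertools import combinations

def check_explanation(A: np.ndarray, tol: float = 1e-9) -> dict:
    """Run both honesty checks: entry sign ratio (ASR) and the minor test."""
    m, n = A.shape
    minors = (
        np.linalg.det(A[np.ix_(I, J)])
        for k in range(1, min(m, n) + 1)
        for I in combinations(range(m), k)
        for J in combinations(range(n), k)
    )
    return {
        "asr": float((A >= 0).mean()),          # share of entries >= 0
        "tnn": all(d >= -tol for d in minors),  # no negative minor found
    }

# All entries are non-negative (ASR = 1.0), yet the single 2x2 minor
# is 3*1 - 4*2 = -5: a hidden cancellation the entry check misses.
report = check_explanation(np.array([[3.0, 4.0], [2.0, 1.0]]))
print(report)  # {'asr': 1.0, 'tnn': False}
```

Note how the two checks disagree here: the entry-level ASR score passes while the minor check fails, which is exactly the case the TNN condition exists to catch.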
What It Catches
Limitations
- Assumes you can extract meaningful attributions. For black-box models, attribution quality depends on the method used.
- Full TNN checking is combinatorially expensive: an n×n matrix has C(2n, n) − 1 square minors, so examining them all is infeasible for large matrices. We use probabilistic screening for efficiency.
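One way to implement probabilistic screening is to sample random square submatrices and test only their determinants. This is my sketch of the idea, not necessarily the project's method; note that index subsets must be kept sorted so each sampled determinant really is a minor:

```python
import numpy as np

def sample_minors_nonneg(A: np.ndarray, trials: int = 200,
                         tol: float = 1e-9, seed: int = 0) -> bool:
    """Randomized screen: test `trials` randomly chosen minors.

    One negative minor proves A is not TNN; finding none only makes
    TNN likely, never certain.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    for _ in range(trials):
        k = int(rng.integers(1, min(m, n) + 1))
        I = np.sort(rng.choice(m, size=k, replace=False))
        J = np.sort(rng.choice(n, size=k, replace=False))
        if np.linalg.det(A[np.ix_(I, J)]) < -tol:
            return False  # certain: a hidden cancellation exists
    return True  # probably TNN: no counterexample found

print(sample_minors_nonneg(np.array([[1.0, 2.0], [3.0, 1.0]])))
```

The trade-off is one-sided: a `False` result is a proof, while a `True` result is only evidence whose strength grows with the number of trials.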
- A passing score doesn't guarantee the model is honest — only that its explanations are geometrically consistent. The attributions themselves must be meaningful.