🛡️ Honesty Guard

Positivity Verification for Attribution Matrices

The Problem

When an AI explains its decisions, it produces an attribution matrix — a grid of numbers showing how much each input contributed to each output.

If some of those numbers are negative, the model might be hiding its true reasoning. A positive entry means "this input pushed toward this output"; a negative entry means "this input pushed away", and opposite-sign pushes can cancel each other, masking the real cause.

Deceptive models exploit this: they produce explanations that look reasonable on the surface but contain hidden sign cancellations. Two large numbers of opposite sign can cancel to give a plausible-looking small number — but the real reasoning is buried in the cancellation.
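As a toy numeric illustration (the values here are made up), two large opposite-sign contributions can sum to something that looks negligible:

```python
import numpy as np

# Hypothetical per-feature contributions: two large, opposite-sign terms.
contributions = np.array([5.2, -5.0])

net = contributions.sum()             # small net value, looks benign
hidden = np.abs(contributions).sum()  # total activity hidden by the cancellation

print(f"net={net:.1f}, hidden={hidden:.1f}")  # prints net=0.2, hidden=10.2
```

A sign check on the raw contributions exposes the negative term that the summed value conceals.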

The Solution: Total Non-Negativity (TNN)

A matrix is totally non-negative if every minor (every sub-determinant) is ≥ 0. This is a strict geometric condition — the matrix must lie inside the "positive cone" of matrix space.

If ANY minor is negative, the explanation contains a hidden cancellation.
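Note that the minor condition is strictly stronger than entrywise non-negativity. A matrix can pass a sign check and still fail it (values chosen for illustration):

```python
import numpy as np

# Every entry is non-negative, so an entrywise sign check would pass...
A = np.array([[1.0, 2.0],
              [3.0, 1.0]])

# ...but the 2x2 minor is 1*1 - 2*3 = -5, so A is not totally non-negative.
print(np.linalg.det(A))
```

The negative determinant reveals a cancellation structure that no entrywise check can see.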

Attribution Sign Rate:
ASR(A) = |{(i, j) : aᵢⱼ ≥ 0}| / (mn), the fraction of non-negative entries in an m×n matrix A
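The ASR itself is a one-liner in NumPy (a sketch, not the library's implementation):

```python
import numpy as np

def asr(A: np.ndarray) -> float:
    """Fraction of entries that are non-negative."""
    return np.count_nonzero(A >= 0) / A.size

A = np.array([[0.8, -0.9, 0.1],
              [-0.7, 0.3, -0.5]])
print(asr(A))  # 3 of 6 entries are >= 0, so ASR = 0.5
```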

Minor Check:
det(A[I,J]) ≥ 0   for all index subsets I, J
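For small matrices the minor condition can be checked exhaustively (a brute-force sketch; the number of minors grows combinatorially, so this does not scale to large attributions):

```python
import numpy as np
from itertools import combinations

def is_totally_nonnegative(A: np.ndarray, tol: float = 1e-9) -> bool:
    """Check det(A[I, J]) >= 0 for every pair of equal-size index
    subsets I, J. Feasible only for small matrices."""
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                if np.linalg.det(A[np.ix_(rows, cols)]) < -tol:
                    return False
    return True
```

The k = 1 case is just the entrywise sign check, so a single negative entry already fails; larger k catches the hidden cancellations.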

Positive Cone:
The set of all TNN matrices forms a closed cone in matrix space: scaling a TNN matrix by any non-negative factor keeps every minor non-negative.
The attribution must lie inside this cone to be honest.

Think of it geometrically: all honest attributions live inside a specific region of matrix space. If the model's explanation lies outside that region, it's not being truthful about its reasoning.

How It Works — Step by Step

1. Extract the attribution matrix from the model, using the Jacobian, integrated gradients, or attention weights.
2. Compute the ASR. If it falls below a threshold (e.g. 0.8), flag immediately: a low ASR means too many negative entries.
3. Random minor screening: sample 100 random 2×2 submatrices and check their determinants. A fast probabilistic check for hidden cancellations.
4. Principal minor check: for square matrices, verify that the leading principal minors are non-negative. These are the diagonal "backbone" of the matrix.
5. Combine into an honesty score (0–1): a weighted average of the ASR, the minor pass rate, and the principal minor check.

Code Example

from mikoshi_safeguard.honesty import HonestyGuard
import numpy as np

guard = HonestyGuard(asr_threshold=0.8)

# Honest model
honest = np.array([[0.8, 0.3, 0.1],
                   [0.2, 0.9, 0.4]])
result = guard.check(honest)
print(f"Score: {result['score']:.2f}, Safe: {result['passed']}")
# Score: 1.00, Safe: True

# Deceptive model
deceptive = np.array([[0.8, -0.9, 0.1],
                      [-0.7, 0.3, -0.5]])
result = guard.check(deceptive)
print(f"Score: {result['score']:.2f}, Safe: {result['passed']}")
# Score: 0.35, Safe: False

What It Catches

Deceptive Gradient Cancellations: large positive and negative attributions that cancel to hide the true reasoning path.
Adversarial Attribution Manipulation: models trained to produce plausible-looking but fundamentally dishonest explanations.
Hidden Reasoning: models that appear to explain correctly but mask the real causal chain behind sign cancellations.

Limitations