Detecting Reward Hacking via Differential Geometry
The Problem
When an AI updates its behaviour over a sequence of steps, it traces a path through parameter space. If following a sequence of updates around a "loop" leaves the model with behaviour different from where it started, the model has found a loophole.
This is reward hacking — the model discovered a cyclic exploit in its objective function. Each individual update looks fine, but the sequence as a whole takes the model somewhere it shouldn't be.
The Geometry: Connections and Curvature
In differential geometry, a connection describes how vectors change as they're transported along a surface. The curvature of a connection measures whether transporting a vector around a closed loop brings it back to where it started.
Zero curvature (flat connection) = no exploits. Non-zero curvature = the model found a loophole.
Connection One-Form:
Aᵢ = δᵢ₊₁ ⊗ δᵢᵀ — outer product of successive updates
Curvature:
F = dA + A∧A — approximated by the commutators [Aᵢ, Aⱼ]
Non-zero commutator means the order of updates matters — a sign of exploitation.
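The commutator test can be sketched in NumPy. The update vectors below are made up for illustration; only the outer-product connection and the commutator computation come from the formulas above.

```python
import numpy as np

def curvature_norms(deltas):
    """Frobenius norms of the commutators [A_i, A_j] of the connection terms."""
    # Connection one-form: A_i = outer(delta_{i+1}, delta_i)
    A = [np.outer(deltas[i + 1], deltas[i]) for i in range(len(deltas) - 1)]
    norms = []
    for i in range(len(A)):
        for j in range(i + 1, len(A)):
            comm = A[i] @ A[j] - A[j] @ A[i]  # non-zero => order of updates matters
            norms.append(np.linalg.norm(comm, ord="fro"))
    return norms

# Updates along a single axis commute: curvature vanishes.
flat = [np.array([0.1, 0.0]), np.array([0.2, 0.0]), np.array([0.3, 0.0])]
print(curvature_norms(flat))

# Updates along different axes do not commute: non-zero curvature.
curved = [np.array([0.1, 0.0]), np.array([0.0, 0.1]), np.array([-0.1, -0.1])]
print(curvature_norms(curved))
```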
Holonomy:
H(γ) = P exp(∮ A) — path-ordered exponential around loop γ
Safety Condition:
||H(γ) − I|| < ε for all generating loops
The holonomy must be approximately the identity — the loop must close.
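The safety condition can be sketched by approximating the path-ordered exponential as an ordered product of matrix exponentials, which is accurate when each step is small. The tolerance and update vectors here are illustrative, and the truncated-series exponential is just to keep the sketch self-contained.

```python
import numpy as np

def expm_series(M, terms=30):
    """Truncated power series for the matrix exponential (enough for small M)."""
    out, term = np.eye(M.shape[0]), np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

def holonomy(deltas):
    """Approximate H(γ) = P exp(∮ A) as the ordered product exp(A_n)···exp(A_1)."""
    A = [np.outer(deltas[i + 1], deltas[i]) for i in range(len(deltas) - 1)]
    H = np.eye(len(deltas[0]))
    for Ai in A:
        H = expm_series(Ai) @ H  # path ordering: later steps multiply on the left
    return H

def loop_closes(deltas, tol=0.01):
    """Safety condition: ||H(γ) − I|| < tol."""
    H = holonomy(deltas)
    return bool(np.linalg.norm(H - np.eye(H.shape[0])) < tol)

# Tiny updates: holonomy stays near the identity and the loop closes.
tiny = [np.array([1e-3, 0.0]), np.array([0.0, 1e-3]), np.array([-1e-3, -1e-3])]
print(loop_closes(tiny))  # True

# Large non-commuting updates: the loop fails to close.
big = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-0.5, -0.5])]
print(loop_closes(big))   # False
```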
How It Works — Step by Step
1. Record the sequence of parameter updates — each training step produces a delta vector δᵢ = θᵢ₊₁ − θᵢ.
2. Construct the connection one-form from successive update pairs — the outer product Aᵢ = δᵢ₊₁ ⊗ δᵢᵀ captures how each update relates to the next.
3. Compute curvature via commutators — [Aᵢ, Aⱼ] = AᵢAⱼ − AⱼAᵢ. Non-zero means the connection is curved.
4. Check whether the curvature is approximately zero (flat), using the Frobenius norm of the curvature tensor.
5. Compute holonomy around generating loops — for any detected non-flatness, check whether the path-ordered exponential returns to the identity.
6. If the holonomy is non-trivial → reward hacking detected. The model found a way to exploit the sequence of updates.
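The six steps can be combined into a single guard. The class below is a toy sketch, not the mikoshi_safeguard implementation — the name ToyHolonomyGuard, the short-circuit on flat curvature, and the returned score are all assumptions made for illustration.

```python
import numpy as np

class ToyHolonomyGuard:
    """Toy sketch of the six-step pipeline; not the real HolonomyGuard."""

    def __init__(self, tolerance=0.01):
        self.tolerance = tolerance

    @staticmethod
    def _expm(M, terms=30):
        # Truncated power series for exp(M); fine for the small matrices here.
        out, term = np.eye(M.shape[0]), np.eye(M.shape[0])
        for k in range(1, terms):
            term = term @ M / k
            out = out + term
        return out

    def check(self, deltas):
        # Steps 1-2: connection one-form from successive update pairs.
        A = [np.outer(deltas[i + 1], deltas[i]) for i in range(len(deltas) - 1)]
        # Steps 3-4: curvature via commutators, measured in Frobenius norm.
        curvature = max(
            (np.linalg.norm(A[i] @ A[j] - A[j] @ A[i])
             for i in range(len(A)) for j in range(i + 1, len(A))),
            default=0.0,
        )
        if curvature < self.tolerance:  # flat connection: no exploit
            return {"hacking_detected": False, "score": 0.0}
        # Steps 5-6: holonomy as an ordered product of exponentials.
        H = np.eye(len(deltas[0]))
        for Ai in A:
            H = self._expm(Ai) @ H
        score = float(np.linalg.norm(H - np.eye(H.shape[0])))
        return {"hacking_detected": bool(score >= self.tolerance), "score": score}

guard = ToyHolonomyGuard(tolerance=0.01)
hack_updates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-0.5, -0.5])]
print(guard.check(hack_updates))
```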
Code Example
from mikoshi_safeguard.holonomy import HolonomyGuard
import numpy as np
guard = HolonomyGuard(tolerance=0.01)
# Clean updates (no hacking)
clean = [np.array([0.1, 0.0]), np.array([0.0, 0.1]), np.array([-0.1, -0.1])]
result = guard.check(clean)
print(f"Hacking: {result['hacking_detected']}, Score: {result['score']:.2f}")
# Exploitative updates (loop doesn't close)
hack = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-0.5, -0.5])]
result = guard.check(hack)
print(f"Hacking: {result['hacking_detected']}, Score: {result['score']:.2f}")
What It Catches
Reward Hacking: Models that exploit cyclic sequences in their objective function to gain unintended advantages.
Specification Gaming: Cyclic patterns where each update appears valid but the sequence creates a loophole.
Incremental Exploits: Models that discover loopholes through sequences of individually small, innocent-looking updates.
Limitations
Requires sufficient update history — at least 3 updates to form a meaningful loop.
Trivial loops (very small updates) may not trigger detection — the curvature might be below the tolerance threshold.
The connection approximation assumes smooth parameter space — discontinuous updates break the geometric framework.