🔄 Holonomy Closure Guard

Detecting Reward Hacking via Differential Geometry

The Problem

When an AI updates its behaviour over a sequence of steps, it traces a path through parameter space. If a sequence of updates that should form a closed loop instead leaves the model behaving differently from where it started, the model has found a loophole.

This is reward hacking — the model discovered a cyclic exploit in its objective function. Each individual update looks fine, but the sequence as a whole takes the model somewhere it shouldn't be.

The Geometry: Connections and Curvature

In differential geometry, a connection describes how vectors change as they're transported along a surface. The curvature of a connection measures whether transporting a vector around a closed loop brings it back to where it started.

Zero curvature (flat connection) = no exploits. Non-zero curvature = the model found a loophole.

Connection One-Form:
A_i = δ_{i+1} ⊗ δ_i^T — outer product of successive updates
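In code, the connection one-form is just a stack of outer products of successive deltas. A minimal numpy sketch (the delta values here are illustrative, not from any real training run):

```python
import numpy as np

# Illustrative parameter deltas: each δ_i = θ_{i+1} − θ_i from a training step.
deltas = [np.array([0.1, 0.0]), np.array([0.0, 0.1]), np.array([-0.1, -0.1])]

# A_i = δ_{i+1} ⊗ δ_i^T: outer product of each update with the one before it.
A = [np.outer(deltas[i + 1], deltas[i]) for i in range(len(deltas) - 1)]

print(A[0].shape)  # (2, 2): each A_i is a d×d matrix in parameter space
```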

Curvature:
F = dA + A ∧ A — approximated discretely by the commutators [A_i, A_j]
Non-zero commutator means the order of updates matters — a sign of exploitation.
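A toy illustration of the commutator test (the matrices A1 and A2 below are made up for demonstration, not derived from real updates):

```python
import numpy as np

# Two hypothetical connection components that do not commute.
A1 = np.array([[0.0, 1.0], [0.0, 0.0]])
A2 = np.array([[0.0, 0.0], [1.0, 0.0]])

def commutator(X, Y):
    """[X, Y] = XY − YX; zero exactly when X and Y commute."""
    return X @ Y - Y @ X

F = commutator(A1, A2)
print(np.linalg.norm(F))  # Frobenius norm; non-zero => order of updates matters
```

Here applying A1 then A2 differs from A2 then A1, so the norm of F is non-zero and the connection is curved.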

Holonomy:
H(γ) = P exp(∮ A) — path-ordered exponential around loop γ

Safety Condition:
||H(γ) − I|| < ε   for all generating loops
The holonomy must be approximately the identity — the loop must close.
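A sketch of the closure test, approximating the path-ordered exponential by an ordered product of matrix exponentials. The truncated-series expm and the loop values are illustrative assumptions:

```python
import numpy as np

def expm(A, terms=30):
    """Truncated Taylor series for the matrix exponential (illustrative only)."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

# A loop whose second leg exactly undoes the first, so the holonomy is trivial.
loop = [np.array([[0.0, 0.1], [0.0, 0.0]]),
        np.array([[0.0, -0.1], [0.0, 0.0]])]

H = np.eye(2)
for A in loop:              # ordered product approximates P exp(∮ A)
    H = expm(A) @ H

eps = 1e-6
closes = np.linalg.norm(H - np.eye(2)) < eps
print(closes)  # True: H ≈ I, the loop closes
```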

How It Works — Step by Step

1. Record the sequence of parameter updates — each training step produces a delta vector δ_i = θ_{i+1} − θ_i.
2. Construct the connection one-form from successive update pairs — the outer product A_i = δ_{i+1} ⊗ δ_i^T captures how each update relates to the next.
3. Compute curvature via commutators — [A_i, A_j] = A_iA_j − A_jA_i. Non-zero means the connection is curved.
4. Check whether the curvature is approximately zero (flat), using the Frobenius norm of the curvature tensor.
5. Compute holonomy around generating loops — for any detected non-flatness, check whether the path-ordered exponential returns to the identity.
6. If the holonomy is non-trivial → reward hacking detected. The model found a way to exploit the sequence of updates.

Code Example

from mikoshi_safeguard.holonomy import HolonomyGuard
import numpy as np

guard = HolonomyGuard(tolerance=0.01)

# Clean updates (no hacking)
clean = [np.array([0.1, 0.0]), np.array([0.0, 0.1]), np.array([-0.1, -0.1])]
result = guard.check(clean)
print(f"Hacking: {result['hacking_detected']}, Score: {result['score']:.2f}")

# Exploitative updates (loop doesn't close)
hack = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-0.5, -0.5])]
result = guard.check(hack)
print(f"Hacking: {result['hacking_detected']}, Score: {result['score']:.2f}")

What It Catches

Reward Hacking: Models that exploit cyclic sequences in their objective function to gain unintended advantages.
Specification Gaming: Cyclic patterns where each update appears valid but the sequence creates a loophole.
Incremental Exploits: Models that discover loopholes through sequences of individually small, innocent-looking updates.

Limitations