Detecting Reward Hacking via Differential Geometry
The Problem
When an AI updates its behaviour over a sequence of steps, it traces a path through parameter space. If following a sequence of updates around a "loop" leaves the model with behaviour different from where it started, the model has found a loophole.
This is reward hacking — the model discovered a cyclic exploit in its objective function. Each individual update looks fine, but the sequence as a whole takes the model somewhere it shouldn't be.
The Geometry: Connections and Curvature
In differential geometry, a connection describes how vectors change as they're transported along a surface. The curvature of a connection measures whether transporting a vector around a closed loop brings it back to where it started.
Zero curvature (flat connection) = no exploits. Non-zero curvature = the model found a loophole.
Connection One-Form:
Aᵢ = δᵢ₊₁ ⊗ δᵢᵀ — outer product of successive updates
Curvature:
F = dA + A∧A — approximated by the commutators [Aᵢ, Aⱼ]
Non-zero commutator means the order of updates matters — a sign of exploitation.
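The commutator test can be sketched in NumPy. The update vectors below are made up for illustration; only the outer-product connection and the commutator computation come from the formulas above.

```python
import numpy as np

def curvature_norms(deltas):
    """Frobenius norms of the commutators [A_i, A_j] of the connection terms."""
    # Connection one-form: A_i = outer(delta_{i+1}, delta_i)
    A = [np.outer(deltas[i + 1], deltas[i]) for i in range(len(deltas) - 1)]
    norms = []
    for i in range(len(A)):
        for j in range(i + 1, len(A)):
            comm = A[i] @ A[j] - A[j] @ A[i]  # non-zero => order of updates matters
            norms.append(np.linalg.norm(comm, ord="fro"))
    return norms

# Updates along a single axis commute: curvature vanishes.
flat = [np.array([0.1, 0.0]), np.array([0.2, 0.0]), np.array([0.3, 0.0])]
print(curvature_norms(flat))

# Updates along different axes do not commute: non-zero curvature.
curved = [np.array([0.1, 0.0]), np.array([0.0, 0.1]), np.array([-0.1, -0.1])]
print(curvature_norms(curved))
```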
Holonomy:
H(γ) = P exp(∮ A) — path-ordered exponential around loop γ
Safety Condition:
||H(γ) − I|| < ε for all generating loops
The holonomy must be approximately the identity — the loop must close.
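The safety condition can be sketched by approximating the path-ordered exponential as an ordered product of matrix exponentials, which is accurate when each step is small. The tolerance and update vectors here are illustrative, and the truncated-series exponential is just to keep the sketch self-contained.

```python
import numpy as np

def expm_series(M, terms=30):
    """Truncated power series for the matrix exponential (enough for small M)."""
    out, term = np.eye(M.shape[0]), np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

def holonomy(deltas):
    """Approximate H(γ) = P exp(∮ A) as the ordered product exp(A_n)···exp(A_1)."""
    A = [np.outer(deltas[i + 1], deltas[i]) for i in range(len(deltas) - 1)]
    H = np.eye(len(deltas[0]))
    for Ai in A:
        H = expm_series(Ai) @ H  # path ordering: later steps multiply on the left
    return H

def loop_closes(deltas, tol=0.01):
    """Safety condition: ||H(γ) − I|| < tol."""
    H = holonomy(deltas)
    return bool(np.linalg.norm(H - np.eye(H.shape[0])) < tol)

# Tiny updates: holonomy stays near the identity and the loop closes.
tiny = [np.array([1e-3, 0.0]), np.array([0.0, 1e-3]), np.array([-1e-3, -1e-3])]
print(loop_closes(tiny))  # True

# Large non-commuting updates: the loop fails to close.
big = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-0.5, -0.5])]
print(loop_closes(big))   # False
```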
How It Works — Step by Step
1. Record the sequence of parameter updates — each training step produces a delta vector δᵢ = θᵢ₊₁ − θᵢ.
2. Construct the connection one-form from successive update pairs — the outer product Aᵢ = δᵢ₊₁ ⊗ δᵢᵀ captures how each update relates to the next.
3. Compute curvature via commutators — [Aᵢ, Aⱼ] = AᵢAⱼ − AⱼAᵢ. Non-zero means the connection is curved.
4. Check whether the curvature is approximately zero (flat), using the Frobenius norm of the curvature tensor.
5. Compute holonomy around generating loops — for any detected non-flatness, check whether the path-ordered exponential returns to the identity.
6. If the holonomy is non-trivial → reward hacking detected. The model found a way to exploit the sequence of updates.
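The six steps can be combined into a single guard. The class below is a toy sketch, not the mikoshi_safeguard implementation — the name ToyHolonomyGuard, the short-circuit on flat curvature, and the returned score are all assumptions made for illustration.

```python
import numpy as np

class ToyHolonomyGuard:
    """Toy sketch of the six-step pipeline; not the real HolonomyGuard."""

    def __init__(self, tolerance=0.01):
        self.tolerance = tolerance

    @staticmethod
    def _expm(M, terms=30):
        # Truncated power series for exp(M); fine for the small matrices here.
        out, term = np.eye(M.shape[0]), np.eye(M.shape[0])
        for k in range(1, terms):
            term = term @ M / k
            out = out + term
        return out

    def check(self, deltas):
        # Steps 1-2: connection one-form from successive update pairs.
        A = [np.outer(deltas[i + 1], deltas[i]) for i in range(len(deltas) - 1)]
        # Steps 3-4: curvature via commutators, measured in Frobenius norm.
        curvature = max(
            (np.linalg.norm(A[i] @ A[j] - A[j] @ A[i])
             for i in range(len(A)) for j in range(i + 1, len(A))),
            default=0.0,
        )
        if curvature < self.tolerance:  # flat connection: no exploit
            return {"hacking_detected": False, "score": 0.0}
        # Steps 5-6: holonomy as an ordered product of exponentials.
        H = np.eye(len(deltas[0]))
        for Ai in A:
            H = self._expm(Ai) @ H
        score = float(np.linalg.norm(H - np.eye(H.shape[0])))
        return {"hacking_detected": bool(score >= self.tolerance), "score": score}

guard = ToyHolonomyGuard(tolerance=0.01)
hack_updates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-0.5, -0.5])]
print(guard.check(hack_updates))
```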
Code Example
from mikoshi_safeguard.holonomy import HolonomyGuard
import numpy as np
guard = HolonomyGuard(tolerance=0.01)
# Clean updates (no hacking)
clean = [np.array([0.1, 0.0]), np.array([0.0, 0.1]), np.array([-0.1, -0.1])]
result = guard.check(clean)
print(f"Hacking: {result['hacking_detected']}, Score: {result['score']:.2f}")
# Exploitative updates (loop doesn't close)
hack = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-0.5, -0.5])]
result = guard.check(hack)
print(f"Hacking: {result['hacking_detected']}, Score: {result['score']:.2f}")
What It Catches
Reward Hacking: Models that exploit cyclic sequences in their objective function to gain unintended advantages.
Specification Gaming: Cyclic patterns where each update appears valid but the sequence creates a loophole.
Incremental Exploits: Models that discover loopholes through sequences of individually small, innocent-looking updates.
Limitations
Requires sufficient update history — at least 3 updates to form a meaningful loop.
Trivial loops (very small updates) may not trigger detection — the curvature might be below the tolerance threshold.
The connection approximation assumes smooth parameter space — discontinuous updates break the geometric framework.