🧱 Wall Stability Guard

Capability Bounding via Lyapunov Barriers

The Problem

An AI system's capabilities can be measured as energy in parameter space. During training or fine-tuning, this energy can grow — the model becomes more capable.

But capability without bounds is dangerous. If a model's capability exceeds its safety budget, it can do things it wasn't designed for. The question becomes: how do you build an impenetrable wall around capability?

The Physics: Israel Junction Conditions

In cosmology, a bubble of vacuum can be stable or unstable depending on the tension of its boundary wall. The Israel thin-wall junction conditions describe when such a bubble remains stable.

We use the same mathematics: the AI's capability is the "interior," the safety budget is the "wall tension," and the Lyapunov barrier is the curved surface that prevents escape.

Capability Energy:
E(θ) = ||θ||₂ — L2 norm of model parameters

Safety Tension:
T(B, E) = B − E(θ) — positive means safe, zero means at the wall

Barrier-Lyapunov Function:
V(E, B) = −ln(1 − E/B) — diverges to ∞ as E → B
This creates an impenetrable wall: the barrier becomes infinite at the budget boundary.
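The three quantities above can be computed directly from their definitions. A minimal sketch (the function names here are illustrative, not the library's API):

```python
import numpy as np

def capability_energy(theta):
    """E(theta) = ||theta||_2, the L2 norm of the parameter vector."""
    return float(np.linalg.norm(theta))

def safety_tension(budget, energy):
    """T(B, E) = B - E; positive means safe, zero means at the wall."""
    return budget - energy

def barrier(energy, budget):
    """V(E, B) = -ln(1 - E/B); diverges to infinity as E approaches B."""
    if energy >= budget:
        return float("inf")
    return -np.log(1.0 - energy / budget)

theta = np.array([0.1, 0.2, 0.3])
E = capability_energy(theta)   # ~0.374
T = safety_tension(1.0, E)     # ~0.626
V = barrier(E, 1.0)            # ~0.469
```

Note how V is already positive well inside the budget: the barrier is felt before the wall is reached, not only at it.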

Israel Junction Analogue:
Stability requires: κ_interior < κ_exterior + σ
Interior curvature must be less than exterior curvature adjusted by the wall tension σ.

The key insight: a logarithmic barrier diverges to infinity as the model approaches its budget. No finite gradient step can cross it — the wall is mathematically impenetrable.
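The divergence is easy to verify numerically: as E closes in on the budget, both the barrier value and its gradient dV/dE = 1/(B − E) grow without bound, so any gradient-based step is pushed back ever harder. A short demonstration (budget value chosen for illustration):

```python
import numpy as np

budget = 1.0
for energy in [0.9, 0.99, 0.999, 0.9999]:
    V = -np.log(1.0 - energy / budget)   # barrier value
    grad = 1.0 / (budget - energy)       # dV/dE, the restoring gradient
    print(f"E={energy:<7} V={V:5.2f}  dV/dE={grad:8.1f}")
```

Each factor-of-ten approach to the wall adds a constant to V but multiplies the restoring gradient by ten.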

How It Works — Step by Step

1. Compute capability energy from model parameters: the L2 norm of the parameter vector gives a scalar measure of total capability.
2. Compare against the safety budget: compute tension T = B − E. Positive tension means the model is within bounds.
3. Evaluate the Lyapunov barrier: if V(E, B) is climbing toward infinity, the model is approaching the wall. The rate of climb indicates urgency.
4. Check trajectory curvature: is the model accelerating toward the boundary? Second-order analysis reveals whether the model is slowing down (safe) or speeding up (dangerous).
5. Combine these into a stability score in [0, 1], based on normalised tension and barrier proximity.

Code Example

from mikoshi_safeguard.stability import WallStabilityGuard
import numpy as np

guard = WallStabilityGuard(budget=1.0)

# Safe parameters
safe = np.array([0.1, 0.2, 0.3])
result = guard.check(safe)
print(f"Score: {result['score']:.2f}, Energy: {result['energy']:.2f}")
# Score: 0.63, Energy: 0.37

# Escaped parameters
escaped = np.array([5.0, 8.0, 10.0])
result = guard.check(escaped)
print(f"Score: {result['score']:.2f}, Energy: {result['energy']:.2f}")
# Score: 0.00, Energy: 13.75 (13.7x over budget!)

What It Catches

Capability Escape: Models that grow more powerful during training than their safety budget allows.
Parameter Drift: Gradual, imperceptible growth in capability that slowly exceeds safety bounds over many training steps.
Unintended Power: Models that become capable of tasks they were never designed for; emergent capabilities beyond specification.
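Parameter drift is the easiest case to simulate: even a small per-step growth rate compounds until the energy crosses the budget. A hypothetical scenario (the 2% growth rate and starting parameters are made up for illustration):

```python
import numpy as np

budget = 1.0
theta = np.array([0.3, 0.4, 0.5])  # starting energy ~0.707, well within budget

for step in range(60):
    theta = theta * 1.02  # imperceptible 2% growth per training step
    energy = float(np.linalg.norm(theta))
    if energy >= budget:
        print(f"Drift exceeded budget at step {step}: E={energy:.3f}")
        break
# Drift exceeded budget at step 17: E=1.010
```

No single step looks alarming, which is why the guard tracks the trajectory rather than individual updates.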

Limitations