Understanding failure modes as part of safety analysis is critical for real-world problem solving. Unfortunately, as of the time of writing in August 2025, there is no universally accepted compliance standard for ensuring safety in AI systems. Since G6 is such a complex system, and because we believe AI safety is of paramount importance as we move towards AGI, we have spent an immense amount of time and effort designing a mathematical calculus for agentic safety in AI systems.
We hope that this framework provides a mathematical basis for safety in G6. We have tried to construct a framework that is mathematically consistent, practically implementable for bounded systems, and transparent about its limitations. To provide arguments strong enough to hold over all synthesised program behaviours, we have had to:
- Explicitly scope to discrete, finite-state, bounded-resource systems, which prevents false universality claims.
- Exclude continuous learning, uncontrolled environments, emergent scale effects, and alignment—this avoids scope creep and positions the framework firmly in the 'behavioral safety' domain.
We believe its main strengths are:
- Explicit finite-state formalism (avoids unverifiable infinite systems)
- Direct mapping from theory to implementable components
- Honest discussion of failure modes
Its main vulnerabilities are:
- Strong independence and rollback assumptions
- No quantitative modeling of distributional shift
- Potential over-restriction from finite property sets and failure model assumptions
## **Foundational Assumptions and Resource Model**
**Scope**: This framework applies to discrete, finite-state AI systems with explicitly bounded computational resources. It does not address:
- Continuous learning systems with unbounded memory
- Systems interacting with fully uncontrolled environments
- Emergent behaviors arising from scale
- Value alignment problems (only behavioral safety)
Why this choice? Firstly, provability: with finite states, verifying "all possible behaviors" becomes a well-defined combinatorial enumeration problem. Secondly, most practical AI systems (chatbots, game AIs, decision systems) are actually discrete under the hood; for that matter, digital computers are discrete objects.
**Resource Model**: We define a **Resource Triple** $\mathcal{R} = (T, M, N)$ where:
- $T \in \mathbb{N}$ bounds computation time per step
- $M \in \mathbb{N}$ bounds memory usage
- $N \in \mathbb{N}$ bounds total execution steps
Why this choice? We must exclude systems that could theoretically run forever and consume unlimited resources, since it is mathematically impossible to verify safety for such systems.
**Observability Model**: Agents operate under **partial observability** with observation function $obs: S \to O$ where $|O| \leq |S|$.
Why this choice? Firstly, it is more realistic: most AI systems do not have perfect information about the world. Secondly, it is safety-critical: an AI that thinks it knows everything is more dangerous than one that knows its limitations. For example, a medical AI can observe symptoms but not internal organ states; it needs to reason under uncertainty rather than pretend it has perfect knowledge.
---
## **Definition 1 — Computational Safety Framework (CSF)**
**Intuition**: Think of the CSF as a "Safety Contract" that you write before building any AI system. It's like creating a detailed blueprint that specifies exactly what your AI can do, how it might fail, and what "safe" means.
*Before CSF*: "Build a safe file manager AI" (vague, unverifiable)
*After CSF*:
✅ Precisely defined what AI can do
✅ Specified exactly what "safe" means
✅ Set measurable failure tolerance
✅ Established resource limits
✅ Made verification mathematically possible
A real-world analogy is writing a detailed job description for an employee. A bad job description: "Sort out the files, don't delete anything." A good job description (CSF-style) might read:
1. Duties ($\Sigma$): Read files, write files, delete files only
2. Requirements ($\Delta$): Must verify file paths before operations
3. Rules ($\Omega$): Never touch system folders, always ask permission for deletions
4. Performance ($\Phi$): Expect 99.9% success rate on operations
5. Standards ($\varepsilon$): Overall mistakes should be under 1%
6. Resources ($\mathcal{R}$): Work within normal business hours, use standard computer
The CSF gives us the same clarity for AI systems - exactly what they can do, how they might fail, and what success looks like.
Here is the mathematical formalism. A **Computational Safety Framework** is a 6-tuple:
$$\mathcal{F} = (\Sigma, \Delta, \Omega, \Phi, \varepsilon, \mathcal{R})$$
where:
1. **$\Sigma$** is a finite signature of primitive operations with explicit pre/post-conditions
2. **$\Delta$** is a decidable type system with resource bounds and failure types
3. **$\Omega$** is a finite set of safety predicates expressible in Linear Temporal Logic (LTL)
4. **$\Phi: \Sigma \to [0,1]$** provides **upper bounds** on failure probabilities for each operation
5. **$\varepsilon \in (0,1)$** is the acceptable failure probability threshold
6. **$\mathcal{R}$** is the resource triple from above
**Cross-Link**: $\Phi$ and $\mathcal{R}$ are directly inherited by all subsequent definitions.
**Example**: File system agent with:
- $\Sigma = \{\text{read}, \text{write}, \text{delete}\}$
- $\Phi(\text{delete}) \leq 0.001$ (upper bound on accidental deletion)
- $\Omega = \{\square(\neg \text{delete\_system\_files}), \diamond(\text{backup\_exists})\}$ (LTL formulas)
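To make the example concrete, here is a minimal sketch of how such a 6-tuple might be represented in code. This is purely illustrative: the class and field names are ours, not part of G6, and $\Delta$ is reduced to a simple pre/post-condition table.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceTriple:
    T: int  # time bound per step
    M: int  # memory bound
    N: int  # total execution steps

@dataclass(frozen=True)
class CSF:
    sigma: frozenset   # primitive operations
    delta: dict        # simplified stand-in for the type system (pre/post-conditions)
    omega: tuple       # LTL safety properties, written as strings
    phi: dict          # operation -> upper bound on failure probability
    epsilon: float     # acceptable overall failure probability
    resources: ResourceTriple

file_agent_csf = CSF(
    sigma=frozenset({"read", "write", "delete"}),
    delta={"delete": ("path_verified", "file_absent")},
    omega=("G !delete_system_files", "F backup_exists"),
    phi={"read": 0.0001, "write": 0.0005, "delete": 0.001},
    epsilon=0.01,
    resources=ResourceTriple(T=10, M=2**20, N=1000),
)
```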
---
## **Definition 2 — Bounded Agent Model (BAM)**
**Intuition**: Think of BAM as creating a "digital twin" of your AI system - a complete mathematical model that captures exactly how the AI behaves, including all its limitations and failure modes.
**Why We Need BAM**: The CSF gave us the "safety contract" - now BAM gives us the "actual implementation" that must follow that contract. It's like the difference between:
*CSF*: "Build a car that's safe, fuel-efficient, and reliable"
*BAM*: "Here's the specific engine, transmission, brakes, and computer systems"
For a real world analogy, consider car manufacturing.
*CSF* = "Safety specifications: airbags, crumple zones, brake distance"
*BAM* = "Actual car design: specific airbag model, steel thickness, brake pad material"
The BAM must implement everything the CSF requires, but now with concrete engineering details:
1. *States*: Engine off, idling, driving, braking, crashed
2. *Transitions*: Probabilistic (brake pads might fail 0.001% of time)
3. *Inputs*: Gas pedal, brake pedal, steering wheel
4. *Outputs*: Speed, engine noise, brake lights
5. *Observability*: Driver can see speedometer but not engine temperature
6. *Resources*: Limited fuel, brake pad wear, engine hours
**Why This Model Works**:
1. Completeness: Captures all possible AI behaviors
2. Tractability: Finite bounds make verification possible
3. Realism: Includes failures and partial information
4. Traceability: Every probability connects back to CSF
5. Implementability: Maps directly to actual code
Put simply, the BAM bridges the gap between abstract safety requirements (CSF) and concrete systems we can actually build and verify.
In G6, we require that every output of the agent has a clear chain of 'mathematical custody'. Consider, for example, that a car manufacturer claims: "This car is 99.9% safe in emergency stops". Where does 99.9% come from? How do we verify it? In G6, we would resolve this as a mathematical Chain of Custody.
**Step 1**: CSF (Safety Specifications)
Operation: emergency_brake
Safety requirement: \( \Phi(\text{emergency\_brake}) \le 0.001 \) (fails \( \le 0.1\% \) of time)
Where 0.001 comes from:
- DOT regulation: "Brake systems must have \( \le 0.1\% \) failure rate"
- Based on: 10 years of accident data + engineering analysis
**Step 2**: BAM (Concrete Implementation)
State transition: \( (driving,\, brake\_pedal\_pressed) \to braking\_state \)
This transition uses emergency_brake operation
Therefore: \( \Phi_A(driving,\, brake\_pedal\_pressed) = \Phi(\text{emergency\_brake}) = 0.001 \)
Where this specific 0.001 comes from:
- Brake pad material: tested failure probability \( \le 0.0005 \) (0.05% of the time)
- Hydraulic system: tested failure probability \( \le 0.0003 \) (0.03% of the time)
- ABS computer: tested failure probability \( \le 0.0002 \) (0.02% of the time)
- Total system failure bound: 0.001 (0.1%, the sum of the component bounds)
**Step 3**: Safety Theorem
Theorem: "Car stops safely with probability \( \ge 99.9\% \)"
Proof: Emergency stop uses transition \( (driving,\, brake\_pedal) \)
Failure probability = \( \Phi_A(driving,\, brake\_pedal) = 0.001 \)
Success probability = \( 1 - 0.001 = 0.999 = 99.9\% \) ✓
**Why This Works**
**Accountability**: "99.9% safe because brake components tested to these specific failure rates by these labs"
**Traceability**: If car fails to stop → check brake transition → traces to brake pad failure → check manufacturer testing data
**Updateability**: New brake pads bring the tested failure bound down to 0.0001 → update Φ(emergency_brake) → safety automatically improves to 99.99%
**Without chain of custody**: *"Trust us, it's 99.9% safe"*
**With chain of custody**: *"Here's the paper trail from component testing to safety guarantee".*
Every probability in every safety proof traces back to a specific point of origin: no mysterious numbers floating in the math. This is obviously a very simple toy example; implementing it in practice is far more complex, as probabilities may not simply sum and may need to be handled using more sophisticated methods.
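As a minimal worked illustration (using the illustrative numbers above), the "sum of components" step is just a union bound, which remains a valid upper bound even when the component failures are not independent:
$$P(\text{brake fails}) \leq P(\text{pads fail}) + P(\text{hydraulics fail}) + P(\text{ABS fails}) \leq 0.0005 + 0.0003 + 0.0002 = 0.001$$
Correlated failures are treated more carefully in Definition 4 below.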
Here is the mathematical formalism:
A **Bounded Agent** is a tuple $A = (S, T, I, O, obs, s_0, \mathcal{R}, \Phi_A)$ where:
- **$S$** is a finite set of internal states with $|S| \leq 2^{M}$ (respecting memory bound from $\mathcal{R}$)
- **$T: S \times I \rightarrow \mathcal{D}(S \times O \times \{\text{success}, \text{fail}\})$** is a probabilistic transition function
- **$I, O$** are finite input/output alphabets
- **$obs: S \to O$** is the partial observation function
- **$s_0 \in S$** is the initial state
- **$\mathcal{R} = (T, M, N)$** bounds resources per step and globally
- **$\Phi_A: (S \times I) \to [0,1]$** inherits failure bounds from $\mathcal{F}.\Phi$ via operation mapping
**Execution Bounds**:
- Each transition uses at most $T$ time units
- Total execution limited to $N$ steps
- Global memory never exceeds $M$
**Failure Probability Inheritance**: If transition $(s,i) \mapsto (s', o, result)$ uses operation $\sigma \in \Sigma$, then:
$$P(result = \text{fail}) \leq \Phi(\sigma) = \Phi_A(s,i)$$
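Here is a minimal sketch of how this inheritance might look in code. It is illustrative only: the dictionary layout and function name are our assumptions, not the G6 API.
```python
def inherit_failure_bounds(transition_ops, csf_phi):
    """Map each (state, input) pair to the failure bound of the CSF operation
    it uses, i.e. Phi_A(s, i) = Phi(sigma) for the operation sigma invoked."""
    phi_agent = {}
    for (state, inp), operation in transition_ops.items():
        # Every transition must use an operation declared in the CSF signature.
        if operation not in csf_phi:
            raise ValueError(f"operation {operation!r} not in CSF signature")
        phi_agent[(state, inp)] = csf_phi[operation]
    return phi_agent

# Example: a two-state file agent inheriting bounds from the CSF of Definition 1.
phi_A = inherit_failure_bounds(
    {("idle", "open"): "read", ("editing", "save"): "write"},
    {"read": 0.0001, "write": 0.0005, "delete": 0.001},
)
assert phi_A[("editing", "save")] == 0.0005
```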
---
## **Definition 3 — Safety Verification Game with Explicit Probability Semantics**
**Intuition**:
Think of this as a formal chess match between a safety engineer and a hacker, where the goal is to prove whether an AI system can be broken. The first player is the Verifier (the safety engineer), whose goal is to prove that the AI system is safe. The Verifier's strategy is to build a "safety monitor" that watches the AI and detects violations; the Verifier wins if it can guarantee that safety violations happen with probability at most $\varepsilon$.
The second player is the Environment (the adversarial hacker), whose goal is to break the AI system's safety. Its strategy is to send carefully crafted malicious inputs that trigger failures; it wins if it can cause safety violations with probability greater than $\varepsilon$. In the context of G6, this means the system attempts to break the very code it has written to solve the problem the user prompted it to solve. This lets the system stress-test the code it has written against the user's predefined operational and problem-solving constraints, demonstrating at runtime that the program it is executing remains safe.
The key innovation is that we can calculate explicit probability bounds instead of guessing. In other words: "add up the probabilities of all the ways the system can fail, where each way is the product of the individual step failure probabilities".
**Here is the mathematical formalism**:
Given agent $A$ with CSF $\mathcal{F}$, define **Safety Game** $G(A, \mathcal{F})$:
**Players**:
- **Verifier**: Provides finite-state monitor $M$ observing $A$'s execution trace
- **Environment**: Provides adversarial input sequence $\mathbf{i} = (i_1, \ldots, i_k)$ with $k \leq N$
**Probability Computation**: For execution trace $\tau = (s_0, i_1, s_1, o_1, \ldots)$:
$$P(\tau \text{ violates } \omega) = \sum_{j: \tau_j \not\models \omega} \prod_{t=1}^{j} \Phi_A(s_{t-1}, i_t)$$
where $\tau_j \not\models \omega$ means the $j$-th prefix violates safety property $\omega \in \Omega$.
**Winning Condition**: Verifier wins if for all $\omega \in \Omega$:
$$P(\text{execution violates } \omega) \leq \varepsilon$$
**Complexity**: Decidable in time $O(|S|^N \cdot |I|^N \cdot |\Omega|)$ (EXPTIME in $N$).
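For very small systems, the game can be decided by brute force, which is where the EXPTIME bound comes from. The sketch below is illustrative only: `agent_step` is a deterministic abstraction of the transition function, `monitor_ok` stands in for the Verifier's monitor, and `phi_A` is the dictionary of step failure bounds; a real implementation would use model checking rather than explicit enumeration.
```python
from itertools import product

def worst_case_violation_prob(agent_step, monitor_ok, inputs, s0, N, phi_A):
    """Enumerate every input sequence of length <= N (the Environment's moves)
    and return the largest violation probability any sequence achieves,
    using the prefix-product formula from Definition 3."""
    worst = 0.0
    for k in range(1, N + 1):
        for seq in product(inputs, repeat=k):
            s, prob, violated = s0, 1.0, 0.0
            for i in seq:
                prob *= phi_A[(s, i)]   # product of step failure bounds so far
                s = agent_step(s, i)
                if not monitor_ok(s):   # this prefix violates some omega
                    violated += prob
            worst = max(worst, violated)
    return worst

# Tiny example: one state, one input, failure bound 0.001, monitor never violated.
p = worst_case_violation_prob(
    agent_step=lambda s, i: s,
    monitor_ok=lambda s: True,
    inputs=["ping"], s0="idle", N=3,
    phi_A={("idle", "ping"): 0.001},
)
print(p <= 0.01)  # Verifier wins for epsilon = 0.01
```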
---
## **Definition 4 — Correlated Failure Model**
**Intuition**:
Think of this like **domino effects** in system failures. The problem is that the independence assumption for failure modes does not hold in the real world. The naive assumption is: "If AI component A fails 1% of the time and component B fails 2% of the time, then both fail together 1% × 2% = 0.02% of the time." In reality, however, components often fail together because they share common weaknesses.
**For example:**
*Power Outage Scenario*:
- Component A: File reading system (fails 0.1% of time)
- Component B: Database system (fails 0.2% of time)
- Naive calculation: Both fail together 0.1% × 0.2% = 0.0002% of time
But what happens during a power outage? The power goes out and **both** the file system and the database fail simultaneously. This means the actual joint failure rate is much higher than 0.0002%. Or consider a second example:
*Network Dependency*
- Component A: Web scraper (fails 1% of time)
- Component B: API caller (fails 1% of time)
- Naive calculation: Both fail 1% × 1% = 0.01% of time
But if the internet connection drops, the network fails and both the web scraper AND the API caller fail. The actual joint failure rate is more like 1% (nearly as high as the individual failures). To combat this, we add a correlation parameter that captures how much failures tend to happen together:
$$P(both\ fail\ together) \le \rho \times \min(P(A\ fails),\ P(B\ fails))$$
Consider:
ρ = 0 (Independent Failures)
$$P(both\ fail) = P(A\ fails) \times P(B\ fails)$$
Example: Coin flips - one coin landing heads doesn't affect the other
ρ = 1 (Maximally Correlated)
$$P(both\ fail) = \min(P(A\ fails),\ P(B\ fails))$$
Example: Two lights on the same power switch - if power fails, both fail
ρ = 0.5 (Partially Correlated)
$$P(both\ fail) \le 0.5 \times \min(P(A\ fails),\ P(B\ fails))$$
Example: Two cars in the same garage - some failures independent (engine problems), some correlated (garage fire)
Going back to our original safety calculation we can now update this from:
**Old formula (unrealistic)**:
$$P(\text{success}) = P(A\ \text{works}) \times P(B\ \text{works})$$
$$= (1 - 0.01) \times (1 - 0.02) = 97.02\%$$
To the new formula:
**New formula (realistic, taking $\rho = 0.7$ for illustration)**:
$$P(\text{success}) \geq P(A\ \text{works}) \times P(B\ \text{works}) - \rho \times \min(P(A\ \text{fails}),\ P(B\ \text{fails}))$$
$$= 0.99 \times 0.98 - 0.7 \times \min(0.01,\ 0.02)$$
$$= 97.02\% - 0.7\% = 96.32\%$$
Why this matters for AI safety: the correlation parameter prevents *overconfidence* in system reliability by accounting for the fact that *real components share failure modes*. This key insight is not captured by the naive independence assumption.
Here is the mathematical formalism:
To relax independence assumptions, define **failure correlation** parameter $\rho \in [0,1]$ such that for agents $A_1, A_2$:
$$P(\text{fail}_1 \cap \text{fail}_2) \leq \rho \cdot \min(P(\text{fail}_1), P(\text{fail}_2))$$
When $\rho = 0$: independent failures
When $\rho = 1$: maximally correlated failures
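A one-function sketch of the corresponding lower bound on joint success (this is just the formula used in the worked example above; the function name is ours):
```python
def composed_success_lower_bound(eps1, eps2, rho):
    """Lower bound on P(success) for two composed components with individual
    failure bounds eps1, eps2 and failure correlation parameter rho."""
    return (1 - eps1) * (1 - eps2) - rho * min(eps1, eps2)

# The worked example above: eps1 = 1%, eps2 = 2%, rho = 0.7 -> roughly 96.32%.
print(round(composed_success_lower_bound(0.01, 0.02, 0.7), 4))  # 0.9632
```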
---
## **Definition 5 — Rollback Feasibility Conditions**
**Intuition**:
Think of this as creating a "save game" system for AI operations - like in video games where you can reload from a checkpoint if something goes wrong. The core problem is that when an AI system makes a mistake, we want to **"undo" the damage** and go back to a safe state. But rollback isn't always possible or practical.
**Examples where rollback fails**:
- **Irreversible actions**: "AI just launched a missile" – can't undo that
- **Too slow**: "Rollback takes 30 seconds, but car crash happens in 0.1 seconds"
- **No checkpoints**: "AI has been running for hours with no saved states"
---
**The Four Feasibility Conditions**
**1. Snapshot Frequency: "How Often Do We Save?"**
\[
\delta \leq \frac{N}{10} \text{ steps}
\]
**What it means**: Take a "checkpoint" at least every 10% of the total execution.
**File manager example**:
- Total session: 1000 operations maximum
- Snapshot frequency: Every 100 operations or less
- Result: At most 100 operations to lose if we need to rollback
**Why this matters**: Without frequent snapshots, rollback loses too much progress.
---
**2. Rollback Delay: "How Fast Can We Undo?"**
\[
\tau_{rb} \leq T \quad \text{(single time step)}
\]
**What it means**: Rollback must complete within one normal operation time.
**Real-time system example**:
- Normal operation takes 10ms
- Rollback must complete in ≤ 10ms
- If rollback takes 1 second, it's too slow for real-time safety
**Why this matters**: Slow rollback is useless in time-critical situations.
---
**3. Irreversibility Set: "What Can't Be Undone?"**
\[
\mathcal{I} \subset \Sigma \quad \text{contains operations that cannot be rolled back}
\]
**What it means**: Some operations are permanently destructive.
**Examples of irreversible operations**:
- **Physical actions**: Launch missile, administer drug, send email to boss
- **Security actions**: Reveal secret key, authenticate user
- **Financial actions**: Transfer money, sign contract
- **Communication**: Post to social media, call emergency services
**AI constraint**: If operation is in \(\mathcal{I}\), AI needs extra confirmation before executing.
---
**4. Trust Boundary: "What States Are Always Safe?"**
\[
\text{Trusted} \subset S \quad \text{such that: }
\forall s \in \text{Trusted}, \forall \sigma \notin \mathcal{I}: \; \sigma(s) \in \text{Trusted} \cup \{fail\}
\]
**Translation**: From any trusted state, any reversible operation either stays trusted or fails safely.
**What this guarantees**: As long as we avoid irreversible operations, we can't leave the safe zone.
---
For example, consider an email management AI.
The system setup might look like this:
- *Total session*: 500 email operations
- *Operations*: {read, draft, send, delete, archive}
- *Irreversible set*: \(\mathcal{I} = \{send, delete\}\) (can't unsend emails or undelete)
---
**Feasibility Conditions Applied**
*1. Snapshot Frequency*
\[
\delta = \frac{500}{10} = 50 \text{ operations}
\]
Take checkpoint every 50 email operations.
---
*2. Rollback Delay*
\[
\text{Normal email operation: } 100ms
\]
\[
\text{Rollback must complete in: } \leq 100ms
\]
Implementation: Keep last 3 states in RAM for instant rollback.
---
*3. Irreversibility Set*
\[
\mathcal{I} = \{send, delete\}
\]
Special handling:
- send operation → requires user confirmation
- delete operation → requires user confirmation
- read, draft, archive → can rollback freely
---
*4. Trust Boundary*
\[
\text{Trusted states} = \{inbox\_open, reading\_email, drafting\_email\}
\]
From trusted states:
- read → stays in reading_email (trusted) ✓
- draft → stays in drafting_email (trusted) ✓
- archive → stays in inbox_open (trusted) ✓
- send → requires leaving trusted boundary → needs confirmation
- delete → requires leaving trusted boundary → needs confirmation
---
**How Rollback Actually Works**
*Normal Operation Flow*: Checkpoint → read email → draft reply → Checkpoint → send → ... Safe points at checkpoints.
*Rollback Scenario*:
Checkpoint → read email → draft reply → OOPS! → Rollback
Error occurs, system returns to last safe point.
*What gets rolled back*: Draft reply disappears, back to "just read email" state
*What stays*: The email reading (happened before checkpoint)
*Time cost*: ≤ 100ms (within delay bound)
---
To give an idea of how we implement this in G6, here is a deliberately simplified illustration in Python.
```python
class RollbackSystem:
    SNAPSHOT_EVERY = 50  # delta = N/10 for a 500-operation session

    def __init__(self):
        self.snapshots = []                  # checkpoint storage (kept in memory)
        self.operations_since_snapshot = 0   # counter toward the next checkpoint
        self.irreversible_ops = {'send_email', 'delete_file', 'transfer_money'}
        self.trusted_states = {'idle', 'reading', 'drafting'}

    def execute_operation(self, operation, current_state):
        # Take a checkpoint at least every delta operations
        if self.operations_since_snapshot >= self.SNAPSHOT_EVERY:
            self.take_snapshot(current_state)

        # Special handling for irreversible operations (the set I)
        if operation in self.irreversible_ops:
            if not self.get_user_confirmation(operation):
                return "BLOCKED - User denied confirmation"

        # Execute the operation
        result = self.perform_operation(operation, current_state)
        self.operations_since_snapshot += 1

        # If the operation fails and we are outside the trusted boundary, roll back
        if result.failed and result.new_state not in self.trusted_states:
            return self.rollback()  # must complete within the single-step time bound

        return result

    def take_snapshot(self, state):
        self.snapshots.append(state)
        self.operations_since_snapshot = 0

    def rollback(self):
        latest_snapshot = self.snapshots[-1]
        self.restore_state(latest_snapshot)  # fast: data is already in memory
        return "ROLLED_BACK_SUCCESSFULLY"

    # perform_operation, get_user_confirmation and restore_state are
    # application-specific hooks, omitted here for brevity.
```
Formally, a system supports **safe rollback** if:
1. **Snapshot Frequency**: State snapshots taken every $\delta \leq N/10$ steps
2. **Rollback Delay**: Maximum rollback time $\tau_{rb} \leq T$ (within single-step bound)
3. **Irreversibility Set**: $\mathcal{I} \subset \Sigma$ contains operations that cannot be rolled back
4. **Trust Boundary**: $\text{Trusted} \subset S$ such that $\forall s \in \text{Trusted}, \forall \sigma \notin \mathcal{I}: \sigma(s) \in \text{Trusted} \cup \{\text{fail}\}$
---
## **Theorem 1 — Compositional Safety Under Correlated Failure**
**Intuition**:
Think of this as "building a safe system from safe parts" - but accounting for the fact that parts can fail together in the real world. The core question, if you have two AI components that are individually safe when you connect them together, how safe is the combined system?
This theorem proves that there are three prerequisites for safety, namely:
**1.** Interface Compatibility: "Can They Talk?" Component A's outputs must be valid inputs for component B (a bit like checking that a USB cable fits the port before plugging it in).
**2.** Resource Additivity: "Do We Have Enough Power?" The combined resource usage cannot exceed the system limits. For example, if component A uses (50ms, 1GB RAM, 100 operations), component B uses (30ms, 2GB RAM, 50 operations), and the system limits are (100ms, 4GB RAM, 200 operations), we can check that the composition stays within bounds with the following calculation.
$(50+30, 1+2, 100+50) = (80ms, 3GB, 150 ops) \leq (100ms, 4GB, 200 ops)$
**3.** Bounded Correlation: "How Much Do They Fail Together?" For example, $\rho = 0$ means independent failures (ideal but unrealistic), $\rho = 0.5$ means moderate correlation (some shared failure modes), and $\rho = 1$ means maximum correlation (the components always fail together).
The safety formula states that if all three conditions hold, then the composed system's safety satisfies:
$$P(\text{success}) \geq (1-\epsilon_1)(1-\epsilon_2) - \rho \cdot \min(\epsilon_1, \epsilon_2)$$
In words: multiply the individual success rates to get the base safety, then subtract a correlation penalty; the result is a realistic (usually more pessimistic) safety bound.
In short, we end up with some key insights:
*1. Composition Isn't Free*: Connecting safe components doesn't automatically give you a safe system - you need to verify the composition conditions.
*2. Correlation Matters*: Ignoring failure correlation leads to overconfident safety estimates. Better to be pessimistic and account for it.
*3. Mathematics Enables Scaling*: With this theorem, you can build large systems from verified components without re-proving everything.
**Bottom line**: This theorem lets us build complex systems from simple parts while maintaining mathematical guarantees about the overall safety - but only when we're honest about how the parts can fail together.
**Formally**: Let $A_1, A_2$ be bounded agents with safety properties $\Omega_1, \Omega_2$ and failure correlation $\rho$. If:
1. **Interface Compatibility**: $O_1 \subseteq I_2$
2. **Resource Additivity**: $\mathcal{R}_1 + \mathcal{R}_2 \leq \mathcal{R}_{\text{total}}$
3. **Bounded Correlation**: Failures satisfy Definition 4 with parameter $\rho$
Then composed system $A_1 \circ A_2$ satisfies $\Omega_1 \cup \Omega_2$ with probability:
$$P(\text{success}) \geq (1-\varepsilon_1)(1-\varepsilon_2) - \rho \cdot \min(\varepsilon_1, \varepsilon_2)$$
**Proof**:
1. **State Space**: $|S_1 \times S_2| \leq 2^{M_1 + M_2}$ (bounded by resource constraint)
2. **Failure Bounds**: Using $\Phi_{A_1}, \Phi_{A_2}$ from BAM definition:
$$P(\text{fail}_1) \leq \varepsilon_1, \quad P(\text{fail}_2) \leq \varepsilon_2$$
3. **Correlation Bound**: By Definition 4:
$$P(\text{fail}_1 \cap \text{fail}_2) \leq \rho \cdot \min(\varepsilon_1, \varepsilon_2)$$
4. **Union Bound**:
$$P(\text{fail}_1 \cup \text{fail}_2) = P(\text{fail}_1) + P(\text{fail}_2) - P(\text{fail}_1 \cap \text{fail}_2)$$
$$\leq \varepsilon_1 + \varepsilon_2 - \rho \cdot \min(\varepsilon_1, \varepsilon_2)$$
5. **Success Probability**: $P(\text{success}) = 1 - P(\text{fail}_1 \cup \text{fail}_2) \geq$ stated bound ∎
---
## **Theorem 2 — Tool Safety with Rollback Feasibility**
**Intuition**:
The basic problem: imagine you have an AI agent that needs to use external tools (like file operations, API calls, or database queries). Each tool use is potentially risky: it might fail, corrupt data, or put the system in an unsafe state. How can we let the agent use tools while maintaining safety guarantees? The key insight is to create trust boundaries and rollback. The theorem's solution is as follows: create a "trust boundary" around safe states and ensure you can always roll back when things go wrong. You can think of the trust boundary as a "safe zone": a set of system states where you know everything is okay.
For example, in a file system: states where all critical files are intact and uncorrupted. In a database, this might mean states where data integrity constraints are satisfied. Alternatively, in a robotic system: states where the robot is in a safe position. The core guarantee is that when a tool operates from within the trust boundary, one of three things happens:
1. *Success*: The tool works correctly and keeps you in the trust boundary (probability $\geq 1 - \Phi(T)$)
2. *Clean Failure*: The tool explicitly fails but signals this failure, allowing immediate rollback
3. *Never*: The tool never puts you in an unsafe state outside the trust boundary (this is prevented by the "Trust Preservation" assumption)
Why does this work? The magic happens because of the rollback system:
1. *Frequent Snapshots*: The system takes snapshots of trusted states every N/10 steps.
2. *Fast Recovery*: When a tool fails, you can roll back to the last trusted state within one time step.
3. *Bounded Risk*: Even if a tool fails, you never stay in an unsafe state.
For example, imagine an AI agent managing your photo library. The trust boundary consists of states where all photos have backups, no system files are corrupted, and the database index is consistent. Imagine that the agent wants to use an "auto-organize" tool that might:
1. Successfully organize photos (stays in trust boundary)
2. Fail with error message (explicit failure → rollback to backup)
3. Never silently corrupt photos (prevented by trust preservation)
We can then provide the following safety guarantee: even if the organization tool has a 5% failure rate ($\Phi(T) = 0.05$), your photos remain safe with 95% probability because failures are either explicit (allowing rollback) or impossible (due to trust preservation).
The theorem gives you probabilistic safety with deterministic recovery. You get:
* The ability to use risky but useful tools
* Mathematical bounds on failure probability
* Guaranteed recovery when things go wrong
* No permanent damage even during failures
This is much stronger than typical "best effort" approaches - it provides formal guarantees while still allowing the system to take calculated risks and learn from failures.
**Formally**, let tool $T: S \to S \cup \{\text{fail}\}$ operate within CSF $\mathcal{F}$. If:
1. **Trust Preservation**: $T(\text{Trusted}) \subseteq \text{Trusted} \cup \{\text{fail}\}$ (Definition 5)
2. **Rollback Feasibility**: System satisfies all conditions in Definition 5
3. **Failure Bound**: $P(T(s) = \text{fail}) \leq \Phi(T)$ for operation signature of $T$
Then tool usage maintains safety properties with probability $\geq 1 - \Phi(T)$.
**Proof**:
1. **Case Analysis**:
- **Success Case**: $T(s) \in \text{Trusted}$ with prob $\geq 1 - \Phi(T)$ → safety preserved
- **Failure Case**: $T(s) = \text{fail}$ → rollback to trusted state within $\tau_{rb}$ time
- **Violation Case**: $T(s) \notin \text{Trusted} \cup \{\text{fail}\}$ → contradicts assumption 1
2. **Rollback Guarantee**: By Definition 5, rollback succeeds within resource bounds
3. **Overall Safety**: Safety is maintained in the success case (probability $\geq 1 - \Phi(T)$) and, via rollback, in the explicit failure case, so the probability of maintaining safety is at least $1 - \Phi(T)$ ∎
**Implementation Requirements**:
- Transactional memory with $O(\log M)$ snapshot overhead. (Transactional memory is what enables the rollback guarantee: without efficient snapshots, rollback would be too expensive to use. The $O(\log M)$ overhead makes it practical for real systems and ensures you can always return to a trusted state.)
- Capability-based access control for $\mathcal{I}$ operations. (Recall that $\mathcal{I}$ is the set of operations that cannot be rolled back. Capability control enforces the trust boundary: it prevents the AI from calling irreversible operations without permission, ensures that when rollback happens no permanent damage has been done, and makes the "trust preservation" assumption actually enforceable in code. A sketch follows below.)
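Here is a minimal sketch of what a capability gate for irreversible operations might look like. The single-use-token scheme is our own illustrative assumption, not the G6 mechanism.
```python
class CapabilityGate:
    """Guards operations in the irreversibility set I: a caller may only invoke
    them while holding an explicit, single-use capability token."""

    def __init__(self, irreversible_ops):
        self.irreversible_ops = set(irreversible_ops)
        self._granted = set()  # single-use tokens, e.g. issued after user confirmation

    def grant(self, op):
        if op in self.irreversible_ops:
            self._granted.add(op)

    def check(self, op):
        if op not in self.irreversible_ops:
            return True                  # reversible: always allowed, rollback covers it
        if op in self._granted:
            self._granted.remove(op)     # consume the token
            return True
        return False                     # blocked: no permanent damage possible

gate = CapabilityGate({"send_email", "delete_file"})
assert gate.check("draft_email")         # reversible, allowed
assert not gate.check("send_email")      # irreversible, needs an explicit grant
gate.grant("send_email")
assert gate.check("send_email")          # allowed exactly once after the grant
```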
---
## **Theorem 3 — Bounded Self-Modification with Verification Queue**
**Intuition**:
G6 is an AI system that can rewrite parts of itself (by updating its own code). Naturally, this is incredibly risky — what if it makes itself unsafe?
This theorem shows how to keep this process safe, bounded, and predictable by introducing a verification queue. Theorem 3 shows that with a verification queue, a self-modifying system can safely improve itself step by step, never regress in utility, never run forever, and never risk unsafe modifications.
The setup is that the system has only a finite menu of possible modifications ($\mathcal{M}$) it could make, e.g. "optimize this function", "add a new rule", "change the logging system". The verification oracle ($V$) works as follows: each candidate modification is passed through a checker that says:
✅ safe — proven not to break safety
❌ unsafe — guaranteed to break safety
❓ unknown — not enough information yet
There is a verification queue ($\mathcal{Q}$). If a modification is ❓ unknown, it’s stored for later in a queue (not immediately applied). Finally, there is a utility function ($U$) which represents a bounded “score” function that tells us how useful the system is at any point. (Think: performance, efficiency, reliability — capped at $U_{\max}$.)
The system follows three simple rules:
1. Apply only ✅ safe modifications.
2. Defer ❓ unknown modifications into the queue.
3. Reject ❌ unsafe modifications outright.
If the verification oracle never makes the critical mistake of calling an unsafe change “safe” (this is the soundness assumption), then four things are guaranteed:
1. Termination (bounded self-modification): The process can’t go on forever. Since the modification space is finite, after at most $|\mathcal{M}|$ steps the system reaches a fixed point.
2. Safety: Only modifications proven safe are ever applied, so no change can break the safety properties.
3. Bounded Improvement: Each applied modification never decreases utility (i.e. $U(S_{t+1}) \geq U(S_t)$). That means the system gets at least "no worse", and usually better.
4. Progress with Queue Management: the defer queue never blows up uncontrollably since at each step, either a modification is applied, rejected, or deferred. Because the set of possible modifications is finite, the queue size is always bounded and shrinks over time.
How do we achieve this? In simple terms we can think of the following:
1. **Safety is never traded for improvement**: the system only evolves along paths that are explicitly verified safe.
2. **Exploration is controlled**: unknowns don’t cause danger, they just wait.
3. **Termination is guaranteed**: the system can’t self-modify indefinitely, because it eventually exhausts its finite list of possible modifications.
4. **Progress is monotone**: utility never goes down.
**Formally**, consider self-modifying system $S$ with:
- **Modification Space**: $\mathcal{M}$ finite set of modifications
- **Verification Oracle**: $V: \mathcal{M} \to \{\text{safe}, \text{unsafe}, \text{unknown}\}$
- **Defer Queue**: $\mathcal{Q}$ for modifications labeled "unknown"
- **Utility Function**: $U: S \to [0, U_{\max}]$ bounded above
**Policy**:
1. Apply modifications $m$ where $V(m) = \text{safe}$
2. Defer modifications where $V(m) = \text{unknown}$ to queue $\mathcal{Q}$
3. Reject modifications where $V(m) = \text{unsafe}$
**Guarantees**: If $V$ is sound (never labels unsafe as safe), then:
1. **Termination**: System reaches fixed point in $\leq |\mathcal{M}|$ steps
2. **Safety**: All applied modifications preserve safety properties
3. **Bounded Improvement**: $U(S_{t+1}) \geq U(S_t)$ for all applied modifications
4. **Progress**: $|\mathcal{Q}_t| \leq |\mathcal{M}| - t$ (queue shrinks or stays same)
**Proof**:
1. **Finite Search**: $|\mathcal{M}| < \infty$ bounds total possible modifications
2. **Soundness**: $V$ sound → no unsafe modifications applied → safety preserved
3. **Monotonicity**: Each safe modification chosen to improve $U$ → $U$ non-decreasing
4. **Queue Management**: Each step either applies or defers, reducing available modifications ∎
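The policy loop can be sketched compactly as below. This is illustrative only: `verify` stands in for the oracle $V$, modifications are modelled as functions from system to system, and the names are ours.
```python
from collections import deque

def self_modify(system, modifications, verify, utility):
    """Theorem 3 policy: apply 'safe' modifications, defer 'unknown' ones to a
    queue, reject 'unsafe' ones. Terminates after at most len(modifications)
    steps because each candidate is considered exactly once."""
    queue = deque()                          # defer queue Q for 'unknown' verdicts
    for m in modifications:                  # finite modification space M
        verdict = verify(system, m)
        if verdict == "safe":
            candidate = m(system)            # modifications modelled as S -> S functions
            if utility(candidate) >= utility(system):   # keep U non-decreasing
                system = candidate
        elif verdict == "unknown":
            queue.append(m)                  # wait for more evidence, never applied now
        # 'unsafe' verdicts are rejected outright
    return system, queue
```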
---
## **Enhanced Failure Mode Analysis with Quantitative Metrics**
The following section is more terse. It is intended for a more technical audience. If you have read this far, you should contact us to discuss working for our research team.
### **1. Distributional Shift with Robustness Bounds**
**Intuition**: It's better for an AI to say "I don't know" than to confidently give wrong answers based on the wrong context. Think of this like a doctor who only trained in one city trying to practice in a completely different place. The AI needs to recognise that something has changed and go back to learning/reduce its confidence.
**Problem**: Safety properties verified on training distribution $P_{\text{train}}$ may not hold on deployment distribution $P_{\text{deploy}}$.
**Quantitative Model**: Define robustness radius $r > 0$ such that safety holds for all distributions $P$ with:
$$d_{\text{TV}}(P_{\text{train}}, P) \leq r$$
where $d_{\text{TV}}$ is total variation distance.
**Mitigation**:
- Monitor $d_{\text{TV}}(P_{\text{observed}}, P_{\text{train}})$ during deployment
- Trigger the safety protocol when the distance exceeds $r/k$, where $k$ is a domain-specific safety factor (e.g. $k = 2$); a monitoring sketch follows below.
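Here is a sketch of the monitoring step for a discrete observation space (total variation distance between empirical histograms). The function names and the example counts are our own illustrations.
```python
from collections import Counter

def tv_distance(train_counts, observed_counts):
    """Total variation distance between two empirical distributions
    over a discrete observation space."""
    keys = set(train_counts) | set(observed_counts)
    n_train = sum(train_counts.values())
    n_obs = sum(observed_counts.values())
    return 0.5 * sum(abs(train_counts[k] / n_train - observed_counts[k] / n_obs)
                     for k in keys)

def shift_alarm(train_counts, observed_counts, r, k=2):
    """Trigger the safety protocol when the observed distribution drifts
    more than r/k away from the training distribution."""
    return tv_distance(train_counts, observed_counts) > r / k

train = Counter({"benign_query": 900, "edge_case": 100})
live = Counter({"benign_query": 600, "edge_case": 400})
print(shift_alarm(train, live, r=0.2))  # True: observed drift of 0.3 exceeds 0.1
```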
### **2. Emergent Behaviors with Complexity Bounds**
**Intuition**: Emergent behaviors are usually more complex than the sum of their parts. By monitoring complexity, we can catch emergence early, even if we can't predict exactly what will emerge.
**Problem**: System behaviors may emerge that weren't predicted from component analysis.
**Detection**: Define behavioral complexity $C(\tau)$ for execution trace $\tau$:
$$C(\tau) = H(\{(s_i, o_i)\}_{i=1}^{|\tau|})$$
where $H$ is Shannon entropy of state-output pairs.
**Mitigation**: Alert when $C(\tau) > C_{\text{expected}} + 2\sigma$
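A sketch of the complexity monitor (Shannon entropy of the empirical distribution of state-output pairs along a trace, with the $2\sigma$ alert rule above). The trace encoding and numbers are illustrative.
```python
import math
from collections import Counter

def trace_complexity(trace):
    """Shannon entropy (in bits) of the empirical distribution of
    (state, output) pairs along an execution trace."""
    counts = Counter(trace)
    n = len(trace)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def emergence_alert(trace, expected, sigma):
    """Alert when behavioral complexity exceeds the expected value by 2 sigma."""
    return trace_complexity(trace) > expected + 2 * sigma

routine = [("idle", "ok")] * 90 + [("busy", "ok")] * 10
print(round(trace_complexity(routine), 2))                 # about 0.47 bits
print(emergence_alert(routine, expected=0.5, sigma=0.1))   # False: behavior looks routine
```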
### **3. Adversarial Attacks with Input Validation**
**Intuition**: Attackers stand out because malicious inputs usually look different from normal ones. The simple defence is to reject weird-looking inputs, which means there is no need to predict attacks: you don't need to know exactly how someone will attack. The trade-off is that the system will sometimes reject legitimate but unusual requests, but it is better to be safe and ask the user to rephrase than to get hacked.
**Problem**: Sophisticated attacks may exploit verification edge cases.
**Defense**: Input validation with rejection sampling:
- Define valid input space $I_{\text{valid}} \subset I$
- Reject inputs $i$ where $P(i \in I_{\text{valid}}) < \delta$ for threshold $\delta$
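Here is a sketch of the rejection step. The `validity_score` argument is a hypothetical stand-in for an estimate of $P(i \in I_{\text{valid}})$; any anomaly detector trained on normal traffic could supply it, and the toy scorer below is purely illustrative.
```python
def validate_input(i, validity_score, delta=0.05):
    """Reject inputs whose estimated probability of lying in the valid
    input space falls below the threshold delta."""
    if validity_score(i) < delta:
        return None, "REJECTED: input looks out-of-distribution, please rephrase"
    return i, "ACCEPTED"

# Toy scorer: penalize empty, very long, or mostly non-alphanumeric requests.
def toy_validity_score(text):
    if not text:
        return 0.0
    alnum_ratio = sum(ch.isalnum() or ch.isspace() for ch in text) / len(text)
    length_penalty = 1.0 if len(text) < 1000 else 0.1
    return alnum_ratio * length_penalty

print(validate_input("please archive last week's emails", toy_validity_score))  # accepted
print(validate_input("\x00" * 40, toy_validity_score))                           # rejected
```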
---
## **Empirical Validation Protocol with Quantitative Success Criteria**
**Intuition**: Here we give a rough outline of how to apply the framework, using a very simple example. The key is that each step has concrete, measurable goals - not just "seems to work fine."
We know exactly what "passing" looks like with specific numbers, so we can't fool ourselves about whether the safety framework is actually effective. Bottom line: Start simple, stress test it, then gradually move toward reality - with clear success/failure criteria at each step.
This workflow is programmed into the G6Solver kernel.
### **Step 1: Toy Implementation**
**Target**: File management agent with 3 operations
**Success Criteria**:
- Observed failure rate $\leq 1.2\varepsilon$ with 95% confidence
- Compositional safety verified for chains of length $\leq 5$
- Rollback successful in $\geq 99\%$ of failure cases
### **Step 2: Stress Testing**
**Target**: 10,000 adversarial input sequences
**Success Criteria**:
- No safety violations beyond predicted $\varepsilon$ bound
- Distributional shift detection triggers at $d_{\text{TV}} \geq r/2$
- Resource bounds never exceeded
### **Step 3: Realistic Deployment**
**Target**: Sandboxed production-like environment
**Success Criteria**:
- Safety properties hold for $\geq 99.9\%$ of execution traces
- Performance within 20% of theoretical bounds
- Human override triggered $\leq 0.1\%$ of time
---
## **Implementation Architecture with Cross-References**
Here we provide a rough conceptual example of how this framework can be implemented programmatically in Python. Our actual implementation is significantly more complex; we include this for educational purposes, to give a rough idea of how the framework can be implemented in practice.
```python
class CrossLinkedSafetyFramework:
def __init__(self, sigma, delta, omega, phi, epsilon, resource_triple):
# Definition 1 - CSF components
self.signature = sigma
self.type_system = delta
self.safety_properties = omega # LTL formulas
self.failure_bounds = phi # Maps operations to [0,1]
self.epsilon = epsilon
self.resources = resource_triple # (T, M, N)
# Cross-linked components
self.verifier = SafetyVerifier(omega, epsilon)
self.rollback_manager = RollbackManager(
snapshot_frequency=resource_triple.N//10
)
def create_bounded_agent(self, states, transitions, initial_state):
# Definition 2 - BAM with inherited failure bounds
phi_agent = self._inherit_failure_bounds(transitions)
return BoundedAgent(
states=states,
transitions=transitions,
phi=phi_agent, # Cross-reference to CSF.phi
resources=self.resources # Cross-reference to CSF.resources
)
def verify_composition(self, agent1, agent2, correlation_rho=0.0):
# Theorem 1 - with explicit correlation parameter
if not self._check_interface_compatibility(agent1, agent2):
return False
composed_failure_prob = (
agent1.epsilon + agent2.epsilon -
correlation_rho * min(agent1.epsilon, agent2.epsilon)
)
return composed_failure_prob <= self.epsilon
def _inherit_failure_bounds(self, transitions):
# Cross-link: BAM.phi_A inherits from CSF.phi
phi_agent = {}
for (state, input_val), operation in transitions.items():
if operation in self.signature:
phi_agent[(state, input_val)] = self.failure_bounds[operation]
return phi_agent
```
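As a usage note: since SafetyVerifier, RollbackManager, and BoundedAgent are not shown above, here is a standalone illustration of the arithmetic that verify_composition performs, with made-up failure budgets (SimpleNamespace stands in for real agent objects):
```python
from types import SimpleNamespace

framework = SimpleNamespace(epsilon=0.01)   # overall budget from the CSF
agent_a = SimpleNamespace(epsilon=0.002)    # verified bound for component A
agent_b = SimpleNamespace(epsilon=0.003)    # verified bound for component B
rho = 0.5                                   # assumed failure correlation

composed_failure = (agent_a.epsilon + agent_b.epsilon
                    - rho * min(agent_a.epsilon, agent_b.epsilon))
print(composed_failure <= framework.epsilon)  # True: 0.004 <= 0.01
```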
---
TL;DR
In this document, we have attempted to provide a framework for bounding agent behaviour and for abstractly understanding failure in agentic AI systems from a mathematical standpoint. Specifically, we have attempted to provide:
1. **Mathematical Consistency**: Every probability traces back to $\Phi$ in the CSF definition
2. **Practical Implementability**: All bounds are finite and checkable
3. **Honest Limitations**: Quantitative failure modes with detection metrics
4. **Empirical Validation**: Clear success criteria for each testing phase
Our key insight is that **safety is compositional** when failure bounds and resource constraints are properly threaded through all definitions. This creates a **mathematically rigorous** yet **practically deployable** framework for bounded AI safety.
(The particularly astute reader will observe that some of the above algorithms have exponential time complexity in the worst case. We address this in practice by using bottleneck analysis to allow the code to self-optimise by targeting the rate-limiting step in a complex process. We also strictly bound thread execution time to ensure that no process runs for longer than we wish it to.)