# How to Build a Formal Verification AI Platform (Cajal Clone)

A step-by-step technical guide for building a system that discovers and formally verifies mathematical proofs using multi-agent AI. Each step is scoped for a developer working with Claude Code and modern tooling.

## Step 1: Set Up the Proof Assistant Environment

**Goal:** Get Lean 4 and mathlib running, expose them programmatically, and establish your baseline proof-checking infrastructure.

Install Lean 4 via `elan` (the Lean version manager):

```bash
curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh
lake new proof_env math   # the `math` template adds a mathlib dependency
cd proof_env && lake exe cache get   # fetch prebuilt mathlib instead of compiling it
```

Mathlib is the Lean community’s massive mathematics library — over 150,000 theorems. This is your ground truth corpus and your starting vocabulary. You’ll need it.

Build a thin Python wrapper around the Lean REPL (Read-Eval-Print Loop) using the community REPL project (`leanprover-community/repl`) or by spawning Lean processes directly:

```python
# lean_env.py
import json
import subprocess

class LeanEnvironment:
    def __init__(self):
        # Assumes the JSON-lines REPL binary from leanprover-community/repl is
        # built and on PATH; `lake env` runs it with the project's mathlib visible.
        self.proc = subprocess.Popen(
            ["lake", "env", "repl"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE
        )

    def check_proof(self, tactic_block: str) -> dict:
        payload = json.dumps({"cmd": tactic_block, "env": 0})
        self.proc.stdin.write((payload + "\n\n").encode())
        self.proc.stdin.flush()
        # The REPL replies with a JSON object terminated by a blank line
        lines = []
        while (line := self.proc.stdout.readline().decode()).strip():
            lines.append(line)
        return json.loads("".join(lines))
```

**Key metric:** Proof check latency should be under 50ms for simple tactics. Optimize this aggressively — it’s your inner loop for everything that follows.
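
To keep yourself honest about that budget, time the inner loop directly. A minimal benchmark sketch against the `LeanEnvironment` wrapper above (the sample commands are placeholders; use theorems from your own corpus):

```python
# bench_latency.py -- rough p50/p99 for the proof-check inner loop
import statistics
import time

from lean_env import LeanEnvironment

# Placeholder workload; substitute representative tactics from your corpus
SAMPLE_CMDS = [
    "example : 1 + 1 = 2 := by rfl",
    "example (n : Nat) : n + 0 = n := by simp",
]

def bench(env: LeanEnvironment, rounds: int = 100) -> None:
    latencies_ms = []
    for _ in range(rounds):
        for cmd in SAMPLE_CMDS:
            start = time.perf_counter()
            env.check_proof(cmd)
            latencies_ms.append((time.perf_counter() - start) * 1000)
    latencies_ms.sort()
    p99 = latencies_ms[int(len(latencies_ms) * 0.99)]
    print(f"p50={statistics.median(latencies_ms):.1f}ms p99={p99:.1f}ms")

if __name__ == "__main__":
    bench(LeanEnvironment())
```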

Database schema for tracking proof states:

```sql
-- Create theorems first so the foreign key below resolves
CREATE TABLE theorems (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    statement TEXT NOT NULL,
    lean_statement TEXT NOT NULL,
    domain TEXT, -- 'algebra', 'topology', 'number_theory', etc.
    difficulty_estimate FLOAT,
    source TEXT,
    verified_proof TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE proof_attempts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    theorem_id UUID NOT NULL REFERENCES theorems(id),
    tactic_sequence JSONB NOT NULL,
    lean_output TEXT,
    verified BOOLEAN DEFAULT FALSE,
    error_msg TEXT,
    check_latency_ms INTEGER,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
```

## Step 2: Build the Proof Search Engine

**Goal:** Implement Monte Carlo Tree Search (MCTS) over the tactic space, using Lean as the state evaluator.

Proof search is a tree problem. Each node is a proof state (a set of goals remaining), each edge is a tactic applied, and success is a leaf with zero remaining goals. MCTS is a strong baseline because it balances exploration (trying novel tactics) with exploitation (following paths that have worked before).

```python
# mcts.py
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProofNode:
    state: str  # Lean tactic state as string
    tactic: Optional[str] = None
    parent: Optional["ProofNode"] = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0
    is_terminal: bool = False
    is_proved: bool = False

    def ucb_score(self, exploration_weight=1.4) -> float:
        if self.visits == 0:
            return float("inf")
        exploitation = self.value / self.visits
        exploration = exploration_weight * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )
        return exploitation + exploration

class MCTSProofSearch:
    def __init__(self, lean_env, llm_client, num_simulations=500):
        self.lean = lean_env
        self.llm = llm_client
        self.num_simulations = num_simulations

    def search(self, theorem: str) -> Optional[list[str]]:
        root = ProofNode(state=theorem)
        for _ in range(self.num_simulations):
            node = self._select(root)
            result = self._expand_and_simulate(node)
            self._backpropagate(node, result["value"])
            if result["proved_node"] is not None:
                return self._extract_proof_path(result["proved_node"])
        return None

    def _select(self, node: ProofNode) -> ProofNode:
        while node.children and not node.is_terminal:
            node = max(node.children, key=lambda n: n.ucb_score())
        return node

    def _expand_and_simulate(self, node: ProofNode) -> dict:
        # Ask LLM for candidate tactics given current proof state
        tactics = self.llm.suggest_tactics(node.state, n=8)
        for tactic in tactics:
            result = self.lean.apply_tactic(node.state, tactic)
            child = ProofNode(
                state=result["new_state"],
                tactic=tactic,
                parent=node,
                is_terminal=result["is_terminal"],
                is_proved=result["is_proved"]
            )
            node.children.append(child)
        proved = next((c for c in node.children if c.is_proved), None)
        return {"proved_node": proved, "value": 1.0 if proved else 0.0}

    def _backpropagate(self, node: Optional[ProofNode], value: float) -> None:
        # Propagate the simulation result up to the root
        while node is not None:
            node.visits += 1
            node.value += value
            node = node.parent

    def _extract_proof_path(self, node: ProofNode) -> list[str]:
        # Walk from the proved leaf back to the root, collecting tactics in order
        path = []
        while node.parent is not None:
            path.append(node.tactic)
            node = node.parent
        return path[::-1]
```

Add beam search as a complementary strategy for simpler theorems where MCTS overhead isn’t worth it. Switch between strategies based on estimated theorem difficulty.
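
A minimal beam search sketch reusing the same `lean_env` and `llm_client` interfaces as the MCTS code above (the beam width, depth cap, and goal-counting heuristic are illustrative, untuned choices):

```python
# beam_search.py -- keep the k most promising partial proofs per depth
from typing import Optional

def beam_search(theorem: str, lean_env, llm_client,
                beam_width: int = 4, max_depth: int = 20) -> Optional[list[str]]:
    # Each beam entry is (proof state, tactics applied so far)
    beam = [(theorem, [])]
    for _ in range(max_depth):
        candidates = []
        for state, path in beam:
            for tactic in llm_client.suggest_tactics(state, n=beam_width):
                result = lean_env.apply_tactic(state, tactic)
                if result["is_proved"]:
                    return path + [tactic]
                if not result["is_terminal"]:
                    candidates.append((result["new_state"], path + [tactic]))
        if not candidates:
            return None
        # Illustrative heuristic: prefer states with fewer open goals
        candidates.sort(key=lambda c: c[0].count("⊢"))
        beam = candidates[:beam_width]
    return None
```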

## Step 3: Train a Proof-Generation Model

**Goal:** Fine-tune a language model specifically on formal proof corpora so it generates valid Lean tactics rather than plausible-looking nonsense.

Start with a strong base model (Qwen2.5-Math or DeepSeek-Prover are solid open-source options). Fine-tune on Lean 4 proof data using next-token prediction on tactic sequences.

Data format for fine-tuning:

```jsonl
{"messages": [
  {"role": "system", "content": "You are a Lean 4 proof assistant. Given a theorem statement and current proof state, suggest the next tactic."},
  {"role": "user", "content": "Theorem: ∀ n : ℕ, n + 0 = n\nCurrent state: ⊢ ∀ n : ℕ, n + 0 = n"},
  {"role": "assistant", "content": "intro n\nsimp [Nat.add_zero]"}
]}
```
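
For the training loop itself, a minimal sketch using Hugging Face TRL's `SFTTrainer` (argument names shift between TRL versions, and the model id and data path are placeholders):

```python
# sft_train.py -- supervised fine-tuning on tactic-prediction pairs
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Expects the chat-format records shown above, one JSON object per line
dataset = load_dataset("json", data_files="tactic_pairs.jsonl", split="train")

trainer = SFTTrainer(
    model="deepseek-ai/DeepSeek-Prover-V1.5-Base",  # placeholder base model
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="prover-sft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=2,
    ),
)
trainer.train()
```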

After supervised fine-tuning, apply GRPO (Group Relative Policy Optimization) with the Lean kernel as the reward function:

```python
def compute_reward(proof_attempt: list[str], theorem: str, lean_env) -> float:
    result = lean_env.check_full_proof(theorem, proof_attempt)
    if result["verified"]:
        return 1.0
    # Partial credit for making progress (fewer goals remaining)
    progress = result.get("goals_closed", 0) / max(result.get("total_goals", 1), 1)
    return progress * 0.3
```
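
One way to wire that reward into training is TRL's `GRPOTrainer`, which accepts reward functions and forwards extra dataset columns to them. A sketch assuming a plain-text prompt dataset (`prompt_dataset`, a placeholder) with a `theorem` column; signatures vary by TRL version:

```python
# grpo_train.py -- RL fine-tuning with the Lean kernel as the reward
from trl import GRPOConfig, GRPOTrainer

lean_env = LeanEnvironment()  # the Step 1 wrapper

def lean_reward(completions, theorem, **kwargs) -> list[float]:
    # TRL forwards extra dataset columns (here `theorem`) to reward functions
    return [
        compute_reward(completion.splitlines(), thm, lean_env)
        for completion, thm in zip(completions, theorem)
    ]

trainer = GRPOTrainer(
    model="prover-sft",            # the SFT checkpoint from above
    reward_funcs=lean_reward,
    args=GRPOConfig(output_dir="prover-grpo", num_generations=8),
    train_dataset=prompt_dataset,  # prompts plus a `theorem` column
)
trainer.train()
```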

The reward signal is clean and binary at the top level — the proof either checks out or it doesn’t. This is what makes formal verification uniquely powerful for RL: no learned reward model, no reward hacking, no ambiguity.

## Step 4: Build the Tau Multi-Agent Orchestration System

**Goal:** Coordinate multiple specialized agents to collaborate on proof discovery.

Different agents handle different parts of the search. Implement a supervisor that routes tasks and aggregates results:

```python
# orchestrator.py
from enum import Enum
from typing import Protocol

class AgentRole(Enum):
    STRATEGIST = "strategist"    # High-level proof plan
    TACTICIAN = "tactician"      # Low-level tactic generation
    CRITIC = "critic"            # Evaluates partial proofs
    SPECIALIST = "specialist"    # Domain expert (algebra, analysis, etc.)
    VERIFIER = "verifier"        # Calls Lean kernel

class ProofAgent(Protocol):
    role: AgentRole
    async def act(self, state: dict) -> dict: ...

class TauOrchestrator:
    def __init__(self, agents: list[ProofAgent], lean_env, max_rounds=50):
        self.agents = {a.role: a for a in agents}
        self.lean = lean_env
        self.max_rounds = max_rounds

    async def prove(self, theorem: str) -> dict:
        state = {
            "theorem": theorem,
            "proof_steps": [],
            "current_goals": [theorem],
            "failed_tactics": [],
            "round": 0
        }

        while state["round"] < self.max_rounds and state["current_goals"]:
            # Strategist sets the plan
            strategy = await self.agents[AgentRole.STRATEGIST].act(state)
            state["strategy"] = strategy["plan"]

            # Tactician generates concrete steps
            tactics = await self.agents[AgentRole.TACTICIAN].act(state)

            # Critic filters bad moves before wasting Lean calls
            filtered = await self.agents[AgentRole.CRITIC].act({
                **state, "proposed_tactics": tactics["tactics"]
            })

            # Apply surviving tactics, verify with Lean
            for tactic in filtered["approved_tactics"]:
                result = self.lean.apply_tactic(
                    state["current_goals"][0], tactic
                )
                if result["success"]:
                    state["proof_steps"].append(tactic)
                    state["current_goals"] = result["remaining_goals"]
                    break
                else:
                    state["failed_tactics"].append(tactic)

            state["round"] += 1

        verified = self.lean.check_full_proof(theorem, state["proof_steps"])
        return {"proof": state["proof_steps"], "verified": verified["success"]}
```

Use a message queue (Redis Streams or RabbitMQ) for agent coordination in production. Each agent is a separate service; the orchestrator is the control plane.
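
A minimal sketch of that coordination layer using Redis Streams with `redis-py` (stream and consumer-group names are illustrative):

```python
# agent_bus.py -- orchestrator publishes tasks, agents consume them
import json
import redis

r = redis.Redis()

def publish_task(role: str, state: dict) -> None:
    # One stream per agent role, e.g. "tasks:tactician"
    r.xadd(f"tasks:{role}", {"state": json.dumps(state)})

def consume_tasks(role: str, consumer: str):
    stream, group = f"tasks:{role}", f"{role}-workers"
    try:
        r.xgroup_create(stream, group, id="0", mkstream=True)
    except redis.ResponseError:
        pass  # consumer group already exists
    while True:
        for _, entries in r.xreadgroup(group, consumer, {stream: ">"},
                                       count=1, block=5000):
            for entry_id, fields in entries:
                yield json.loads(fields[b"state"])
                r.xack(stream, group, entry_id)
```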

## Step 5: Build the Dataset Pipeline

**Goal:** Curate, formalize, and verify mathematical corpora at scale for sale to AI labs.

This is your primary business asset. Build it like it matters — because it does.

Pipeline stages:

1. **Ingest** — Scrape arXiv math papers, ProofWiki, existing Lean/Coq/Isabelle libraries. Parse LaTeX with `latexml` or `plasTeX`.
2. **Formalize** — Use your proof-generation model to translate informal math into Lean 4 statements.
3. **Verify** — Every statement gets checked by the Lean kernel. Failed verifications go to a human review queue or back to the model.
4. **Grade** — Assign difficulty scores, domain tags, and proof complexity metrics.
5. **Deduplicate** — Embedding-based dedup to remove near-identical theorems; a sketch follows this list.
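
A sketch of that dedup stage, assuming a hypothetical `embed()` helper (any sentence-embedding model will do) that returns unit-normalized vectors:

```python
# dedup.py -- drop theorems whose embedding nearly matches a kept one
import numpy as np

SIMILARITY_THRESHOLD = 0.98  # illustrative cutoff; tune against labeled pairs

def dedup(statements: list[str], embed) -> list[str]:
    kept: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for stmt in statements:
        vec = embed(stmt)  # assumed to return a unit-normalized vector
        # Cosine similarity reduces to a dot product on normalized vectors
        if kept_vecs and max(float(vec @ v) for v in kept_vecs) > SIMILARITY_THRESHOLD:
            continue
        kept.append(stmt)
        kept_vecs.append(vec)
    return kept
```

This greedy scan is quadratic in corpus size; at scale, swap in an approximate nearest-neighbor index (FAISS or similar).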

```sql
-- Dataset versioning schema
CREATE TABLE dataset_versions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    version_tag TEXT UNIQUE NOT NULL, -- 'v1.2.0'
    proof_assistant TEXT NOT NULL, -- 'lean4', 'coq', 'isabelle'
    theorem_count INTEGER,
    verified_count INTEGER,
    domain_breakdown JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    s3_path TEXT NOT NULL
);

CREATE TABLE theorem_provenance (
    theorem_id UUID REFERENCES theorems(id),
    dataset_version_id UUID REFERENCES dataset_versions(id),
    source_url TEXT,
    formalization_model TEXT,
    human_reviewed BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (theorem_id, dataset_version_id)
);
```

Export in multiple formats: raw Lean files, JSONL for training, Parquet for analytics. Automate nightly builds and publish checksums.
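
A small sketch of the checksum step for those nightly builds (the export path is a placeholder):

```python
# publish_checksums.py -- SHA-256 manifest for each nightly artifact
import hashlib
import json
from pathlib import Path

def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def build_manifest(artifact_dir: Path) -> dict[str, str]:
    return {p.name: sha256_file(p)
            for p in sorted(artifact_dir.glob("*")) if p.is_file()}

if __name__ == "__main__":
    out = Path("exports/v1.2.0")  # placeholder export path
    manifest = build_manifest(out)
    (out / "SHA256SUMS.json").write_text(json.dumps(manifest, indent=2))
```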

## Step 6: Build the API Layer

**Goal:** Expose your RL environments, datasets, and evaluation endpoints to paying customers.

Three distinct API surfaces, each with different latency and throughput requirements.

**RL Environment API** (latency-critical, single-digit-millisecond target):

```python
# FastAPI with async Lean pool
@app.post("/v1/env/step")
async def env_step(request: StepRequest, api_key: APIKey = Depends(verify_key)):
    env = await lean_pool.acquire(request.env_id)
    result = await env.apply_tactic_async(request.tactic)
    return {
        "observation": result.new_state,
        "reward": 1.0 if result.proved else 0.0,
        "done": result.is_terminal,
        "info": {"goals_remaining": result.goal_count}
    }

@app.post("/v1/env/reset")
async def env_reset(request: ResetRequest, api_key: APIKey = Depends(verify_key)):
    env_id = await lean_pool.spawn(request.theorem)
    return {"env_id": env_id, "observation": request.theorem}
```

Maintain a warm pool of pre-initialized Lean processes. Cold-starting Lean is slow (200–500ms); warm instances check tactics in under 5ms.
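
A minimal asyncio sketch of such a pool; `AsyncLeanEnv` is an assumed async variant of the Step 1 wrapper with `boot`, `load_theorem`, and `reset` methods:

```python
# lean_pool.py -- keep N Lean processes warm; hand them out per session
import asyncio
import uuid

class LeanPool:
    def __init__(self, size: int = 32):
        self.size = size
        self.idle: asyncio.Queue = asyncio.Queue()
        self.sessions: dict[str, "AsyncLeanEnv"] = {}

    async def start(self) -> None:
        # Pay the 200-500ms cold-start cost up front, not per request
        for _ in range(self.size):
            env = AsyncLeanEnv()       # assumed async wrapper around Lean
            await env.boot()
            await self.idle.put(env)

    async def spawn(self, theorem: str) -> str:
        env = await self.idle.get()    # blocks if the pool is exhausted
        await env.load_theorem(theorem)
        env_id = str(uuid.uuid4())
        self.sessions[env_id] = env
        return env_id

    async def acquire(self, env_id: str) -> "AsyncLeanEnv":
        return self.sessions[env_id]

    async def release(self, env_id: str) -> None:
        env = self.sessions.pop(env_id)
        await env.reset()              # return a clean instance to the pool
        await self.idle.put(env)
```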

**Dataset API** (throughput-optimized):

```python
@app.get("/v1/datasets/{version}/theorems")
async def get_theorems(
    version: str,
    domain: Optional[str] = None,
    min_difficulty: float = 0.0,
    limit: int = 1000,
    offset: int = 0,
    api_key: APIKey = Depends(verify_key)
):
    # Stream from S3 or serve from a read replica
    ...
```

**Eval API:**

```python
@app.post("/v1/eval/run")
async def run_evaluation(request: EvalRequest, api_key: APIKey = Depends(verify_key)):
    job_id = await eval_queue.enqueue({
        "model_endpoint": request.model_endpoint,
        "benchmark_id": request.benchmark_id,
        "pass_at_k": request.k,
        "timeout_per_problem": request.timeout_s
    })
    return {"job_id": job_id, "status": "queued"}
```

## Step 7: Deploy and Productize

**Goal:** Ship to production, onboard customers, and build the billing/usage infrastructure.

**Infrastructure:**

- API layer: Kubernetes on GKE or EKS, autoscaled on request latency
- Lean pool: Stateful pods, pre-warmed, drained gracefully before termination
- Database: Postgres (RDS or Cloud SQL) with read replicas for dataset queries
- Queue: Redis for RL environment session state, RabbitMQ for eval jobs
- Storage: S3 for dataset artifacts, versioned with lifecycle policies

**Billing schema:**

```sql
CREATE TABLE usage_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id UUID REFERENCES organizations(id),
    event_type TEXT NOT NULL, -- 'env_step', 'dataset_download', 'eval_run'
    quantity INTEGER DEFAULT 1,
    metadata JSONB,
    billed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE subscriptions (
    org_id UUID REFERENCES organizations(id) PRIMARY KEY,
    plan TEXT NOT NULL, -- 'research', 'enterprise', 'lab'
    env_steps_quota BIGINT, -- monthly RL environment steps
    dataset_gb_quota INTEGER,
    eval_runs_quota INTEGER,
    overage_rate_usd NUMERIC(10,4),
    stripe_subscription_id TEXT
);
```
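
A sketch of the metering write path against that schema, using `psycopg` (the DSN and org id are placeholders):

```python
# metering.py -- append-only usage events; billing aggregates them later
import psycopg
from psycopg.types.json import Jsonb

def record_usage(conn, org_id: str, event_type: str,
                 quantity: int = 1, metadata: dict | None = None) -> None:
    conn.execute(
        "INSERT INTO usage_events (org_id, event_type, quantity, metadata)"
        " VALUES (%s, %s, %s, %s)",
        (org_id, event_type, quantity, Jsonb(metadata or {})),
    )

with psycopg.connect("postgresql://localhost/cajal") as conn:  # placeholder DSN
    record_usage(conn, org_id="...",  # substitute a real org UUID
                 event_type="env_step", quantity=1000)
```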

**Customer onboarding checklist:**
- Provision org + API key via internal admin panel
- Send Lean environment quickstart (Python SDK + example RL training loop)
- Slack connect for enterprise customers
- Weekly usage report email

**Monitoring:** Track `env_step_p99_latency`, `proof_verification_error_rate`, `dataset_download_throughput`. Page on p99 > 10ms for RL endpoints — your customers are training on this in real time.
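
A sketch of exporting that latency histogram with `prometheus_client`, so the p99 alert has something to fire on (metric names mirror the ones above):

```python
# metrics.py -- export the histogram the p99 alert fires on
from prometheus_client import Counter, Histogram, start_http_server

ENV_STEP_LATENCY = Histogram(
    "env_step_latency_seconds",
    "Time to apply one tactic in the RL environment",
    buckets=(0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1),
)
VERIFICATION_ERRORS = Counter(
    "proof_verification_errors_total",
    "Lean kernel errors raised while checking proofs",
)

start_http_server(9090)  # scrape endpoint for Prometheus

async def timed_apply(env, tactic: str):
    # Wrap the inner-loop call so every step feeds the histogram
    with ENV_STEP_LATENCY.time():
        return await env.apply_tactic_async(tactic)
```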

The hardest part of this build is not the code. It’s accumulating enough verified theorems that your dataset is worth paying for, and getting your RL environment trusted enough that a lab plugs it into a live training run. Both of those are slow, trust-based processes. Start building both on day one.