Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

Verification and Validation Loops for Agent Reliability Through Runtime Checks

14 Nov 2025

When an AI agent writes code, queries a database, or makes a critical decision, how do you know it got it right? Unlike traditional software where logic is deterministic, AI agents introduce probabilistic behavior that can fail in subtle ways. Verification and Validation (V&V) loops are the safety net that catches these failures before they cause harm.

Concept Introduction

V&V loops are architectural patterns where agents:

  1. Generate an output (code, text, decision, action)
  2. Verify syntactic correctness (does it parse? follow format rules?)
  3. Validate semantic correctness (does it solve the right problem? meet constraints?)
  4. Correct or retry if checks fail

This creates a feedback loop where the agent acts as its own quality assurance system, improving reliability in production environments.

Historical & Theoretical Context

The concept comes from software engineering’s V-model (1980s), where each development phase has a corresponding testing phase. For AI agents, the pattern gained prominence around 2020-2023, as LLM-based agents began shipping to production and their probabilistic failure modes became costly.

V&V loops embody the “trust but verify” principle from control theory and systems engineering: an agent’s output is treated as a proposal, not a fact, until independent checks confirm it.

Algorithms & Math

Basic V&V Loop Algorithm

def vv_loop(task, max_retries=3):
    """
    Execute task with verification and validation.
    (agent, verify, validate, get_*_errors, and enhance_with_feedback
    are assumed to be defined elsewhere; this shows the loop skeleton.)
    """
    for attempt in range(max_retries):
        # Generate
        output = agent.generate(task)

        # Verify (syntactic checks)
        if not verify(output):
            feedback = get_verification_errors(output)
            task = enhance_with_feedback(task, feedback)
            continue

        # Validate (semantic checks)
        if not validate(output, task.constraints):
            feedback = get_validation_errors(output, task)
            task = enhance_with_feedback(task, feedback)
            continue

        # Success
        return output

    # All retries exhausted
    raise RuntimeError(f"Failed to produce valid output in {max_retries} attempts")

Probabilistic Formulation

Let $G$ be the generation function, $V$ the verification function, and $C$ the correction function. The probability of success after $n$ attempts is:

$$P(\text{success})_n = 1 - \prod_{i=1}^{n} (1 - P(V(G(C^{i-1}(prompt)))))$$

Where $C^{i-1}$ represents $i-1$ correction iterations. Each retry with feedback typically increases success probability.
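To build intuition for this compounding effect, here is a small simulation. The base success rate and per-retry improvement (`p0` and `gain`) are illustrative assumptions, not values implied by the formula:

```python
from functools import reduce

def p_success(n, p0=0.6, gain=0.1):
    """P(at least one attempt succeeds in n tries), where feedback
    raises the per-attempt success rate by `gain` each retry."""
    per_attempt = [min(1.0, p0 + i * gain) for i in range(n)]
    # Product of failure probabilities across all attempts
    p_all_fail = reduce(lambda acc, p: acc * (1 - p), per_attempt, 1.0)
    return 1 - p_all_fail

for n in range(1, 5):
    print(f"{n} attempt(s): P(success) = {p_success(n):.3f}")
```

With these numbers, success probability climbs from 0.6 on a single attempt to above 0.97 by the third, which is why even a small retry budget pays off.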

Multi-Validator Ensemble

For critical applications, use multiple validators:

def ensemble_validation(output, validators):
    """
    Require majority agreement among validators
    """
    votes = [v(output) for v in validators]
    confidence = sum(votes) / len(votes)

    return {
        'valid': confidence > 0.5,  # strict majority
        'confidence': confidence,
        'disagreements': [i for i, v in enumerate(votes) if not v]
    }

Design Patterns & Architectures

Generate-Verify-Correct (GVC)

graph LR
    A[Task] --> B[Generate]
    B --> C{Verify}
    C -->|Pass| D[Execute]
    C -->|Fail| E[Analyze Error]
    E --> F[Generate with Feedback]
    F --> C
  

Use when: Output has clear correctness criteria (code, structured data)

Dual-Agent Validator

graph LR
    A[Task] --> B[Generator Agent]
    B --> C[Validator Agent]
    C -->|Accept| D[Execute]
    C -->|Reject| E[Feedback to Generator]
    E --> B
  

Use when: Validation requires complex reasoning (security review, logic checking)
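A minimal sketch of the dual-agent pattern. The `ask(prompt) -> str` interface and the ACCEPT/REJECT protocol are illustrative assumptions, not a specific framework’s API:

```python
def dual_agent_loop(task, generator, validator, max_rounds=3):
    """Generator proposes drafts; validator critiques until one is accepted."""
    prompt = task
    for _ in range(max_rounds):
        draft = generator.ask(prompt)
        verdict = validator.ask(
            f"Task: {task}\n\nDraft:\n{draft}\n\n"
            "Reply ACCEPT, or REJECT: <reason>."
        )
        if verdict.strip().upper().startswith("ACCEPT"):
            return draft
        # Route the rejection reason back into the generator's next prompt
        prompt = f"{task}\n\nReviewer feedback: {verdict}\n\nPlease revise."
    raise RuntimeError("Validator never accepted a draft")
```

Keeping the validator’s prompt free of the generator’s instructions reduces the chance that both agents share the same blind spot.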

Hierarchical V&V

class HierarchicalVV:
    def __init__(self):
        self.checks = [
            ('syntax', fast_syntax_check),
            ('type', type_checker),
            ('security', security_scan),
            ('logic', expensive_formal_verification)  # most expensive last
        ]

    def validate(self, output):
        """Run checks in order of cost/speed"""
        for name, check in self.checks:
            if not check(output):
                return False, f"Failed {name} check"
        return True, "All checks passed"

Use when: Some checks are expensive. Fail fast on cheap checks first.

Integration with Agent Architecture

V&V loops fit into the Planner-Executor-Memory pattern:

Planner → Executor → [V&V Loop] → Memory
              ↑            │
              └────────────┘

The V&V loop gates entry to execution and memory storage, preventing bad outputs from propagating.
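A sketch of that gate, assuming the planner step, executor, and V&V check are plain callables (the names here are illustrative, not from a specific framework):

```python
def run_step(plan_step, executor, vv_check, memory, max_retries=3):
    """Execute one planner step; only validated results reach memory."""
    for _ in range(max_retries):
        result = executor(plan_step)
        ok, feedback = vv_check(result)
        if ok:
            memory.append(result)  # gate passed: safe to persist
            return result
        # Gate failed: retry with feedback, memory stays untouched
        plan_step = f"{plan_step}\n[feedback] {feedback}"
    raise RuntimeError("Step failed V&V; nothing written to memory")
```

Because memory writes happen only after validation, downstream steps never retrieve an unchecked result.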

Practical Application

Code Validation Example

import ast
import subprocess
from typing import Tuple

class CodeValidator:
    def __init__(self, llm):
        self.llm = llm

    def generate_code(self, task: str, max_retries: int = 3) -> str:
        """Generate Python code with V&V loop"""
        prompt = task

        for attempt in range(max_retries):
            # Generate
            code = self.llm.generate(prompt)

            # Verify: Syntax check
            syntax_valid, syntax_error = self._verify_syntax(code)
            if not syntax_valid:
                prompt = f"{task}\n\nPrevious attempt had syntax error:\n{syntax_error}\n\nPlease fix."
                continue

            # Verify: Runs without error
            runtime_valid, runtime_error = self._verify_runtime(code)
            if not runtime_valid:
                prompt = f"{task}\n\nPrevious code raised error:\n{runtime_error}\n\nPlease fix."
                continue

            # Validate: Meets requirements
            semantic_valid, semantic_feedback = self._validate_semantics(code, task)
            if not semantic_valid:
                prompt = f"{task}\n\nCode runs but: {semantic_feedback}\n\nPlease revise."
                continue

            return code

        raise ValueError(f"Failed to generate valid code after {max_retries} attempts")

    def _verify_syntax(self, code: str) -> Tuple[bool, str]:
        """Check if code is syntactically valid Python"""
        try:
            ast.parse(code)
            return True, ""
        except SyntaxError as e:
            return False, str(e)

    def _verify_runtime(self, code: str) -> Tuple[bool, str]:
        """Execute code in sandbox and catch errors"""
        try:
            result = subprocess.run(
                ['python', '-c', code],
                capture_output=True,
                text=True,
                timeout=5
            )
            if result.returncode != 0:
                return False, result.stderr
            return True, ""
        except subprocess.TimeoutExpired:
            return False, "Code execution timed out"

    def _validate_semantics(self, code: str, task: str) -> Tuple[bool, str]:
        """Use LLM to check if code solves the task"""
        validation_prompt = f"""
        Task: {task}

        Generated code:
        ```python
        {code}
        ```

        Does this code correctly solve the task?
        Answer with YES or NO, followed by explanation.
        """

        response = self.llm.generate(validation_prompt)
        valid = response.strip().upper().startswith('YES')
        feedback = response.split('\n', 1)[1] if '\n' in response else ""

        return valid, feedback

# Usage with LangChain. ChatOpenAI does not expose a plain
# generate(str) -> str method, so wrap it in a thin adapter.
from langchain_openai import ChatOpenAI

class ChatAdapter:
    def __init__(self, chat):
        self.chat = chat

    def generate(self, prompt: str) -> str:
        return self.chat.invoke(prompt).content

validator = CodeValidator(ChatAdapter(ChatOpenAI(model="gpt-4", temperature=0)))

task = "Write a function that finds the longest palindromic substring"
code = validator.generate_code(task)
print(code)

Integration with LangGraph

from typing import TypedDict

from langgraph.graph import StateGraph, END

class VVState(TypedDict):
    task: str
    output: str
    feedback: str

def create_vv_agent():
    # Node functions (generate_output, verify_output, validate_output,
    # apply_corrections) and the should_retry router are assumed to be
    # defined elsewhere; each takes and returns a VVState.
    workflow = StateGraph(VVState)  # StateGraph requires a state schema

    # Nodes
    workflow.add_node("generate", generate_output)
    workflow.add_node("verify", verify_output)
    workflow.add_node("validate", validate_output)
    workflow.add_node("correct", apply_corrections)
    # Edges
    workflow.set_entry_point("generate")
    workflow.add_edge("generate", "verify")

    # Conditional edges
    workflow.add_conditional_edges(
        "verify",
        should_retry,
        {
            "pass": "validate",
            "fail": "correct"
        }
    )

    workflow.add_conditional_edges(
        "validate",
        should_retry,
        {
            "pass": END,
            "fail": "correct"
        }
    )

    workflow.add_edge("correct", "generate")

    return workflow.compile()

Latest Developments & Research

Recent Research (2023-2025)

Self-Debugging (Chen et al., 2023)

Constitutional AI for Validation (Anthropic, 2024)

Formal Verification Integration (2024)

Open Problems

  1. Validator training: How to train specialized validator models?
  2. Feedback quality: What feedback maximizes correction success?
  3. Optimal retry strategies: When to give up vs. keep trying?
  4. Cost optimization: How to minimize validation overhead?

Cross-Disciplinary Insight

From Software Engineering: Design by Contract

Bertrand Meyer’s Design by Contract (1986) introduced preconditions, postconditions, and invariants. V&V loops are the runtime equivalent:

def divide(a: float, b: float) -> float:
    # Precondition (verify)
    assert b != 0, "Divisor cannot be zero"

    result = a / b

    # Postcondition (validate)
    assert abs(result * b - a) < 1e-10, "Result failed sanity check"

    return result

From Control Theory: Closed-Loop Control

V&V loops implement closed-loop control: the validator acts as a sensor measuring output quality, the gap between requirements and output is the error signal, and the feedback-enhanced prompt is the control action that steers the next generation. This provides stability and error correction.

From Biology: Immune System

The immune system uses multi-layered validation:

  1. Skin barrier (syntax checks)
  2. Innate immunity (pattern matching)
  3. Adaptive immunity (learned validation)

Effective V&V uses the same defense-in-depth approach.

Daily Challenge

Challenge: Build a JSON Schema Validator Agent

Task: Create an agent that generates JSON data matching a schema, with a V&V loop ensuring compliance.

import json
from jsonschema import validate, ValidationError

# Your task:
# Write a function that uses an LLM to generate JSON for this schema
# Add verification (valid JSON syntax)
# Add validation (matches schema)
# Implement retry with feedback
# Test with edge cases

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0, "maximum": 120},
        "email": {"type": "string", "format": "email"}
    },
    "required": ["name", "age", "email"]
}

task = "Generate profile for a 25-year-old software engineer named Alice"

# Your implementation here
def generate_json_with_vv(task, schema, llm):
    # TODO: Implement V&V loop
    pass

# Test cases to handle:
# - Invalid JSON syntax
# - Missing required fields
# - Type mismatches
# - Constraint violations (age out of range)
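One possible reference sketch, reusing the `llm.generate(prompt) -> str` interface assumed in the earlier examples (peek only after trying it yourself):

```python
import json

from jsonschema import ValidationError, validate

def generate_json_with_vv(task, schema, llm, max_retries=3):
    """Generate schema-compliant JSON with a verify/validate/retry loop."""
    prompt = f"{task}\nReturn only JSON matching this schema:\n{json.dumps(schema)}"
    for _ in range(max_retries):
        raw = llm.generate(prompt)
        # Verify: is it syntactically valid JSON?
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as e:
            prompt += f"\nPrevious output was not valid JSON ({e}). Fix it."
            continue
        # Validate: does it conform to the schema?
        try:
            validate(instance=data, schema=schema)
            return data
        except ValidationError as e:
            prompt += f"\nJSON did not match the schema ({e.message}). Fix it."
    raise ValueError(f"No schema-compliant JSON after {max_retries} attempts")
```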

Extension: Add a semantic validator that checks if generated data is plausible (e.g., name matches gender, age matches job seniority).

Time: 30 minutes


