Design of the Iterative Code Generator

The AutoCoderAgent implements a self-correcting code generation system that turns high-level natural language prompts into executable, tested Python scripts. At its core, the system relies on an iterative Generate-Test-Fix cycle that treats code generation not as a single-shot task but as a sequence of refinements run inside an isolated environment.

Orchestration and State Management

The design separates configuration from execution state using two primary classes:

AutoCoderAgent: Serves as the immutable configuration layer. It defines the LLM model, resource requirements (CPU/Memory), base packages, and iteration limits (max_iterations).
_CodeGenSession: An internal, stateful manager created for every generate() call. It tracks the evolution of the solution, including conversation history, detected dependencies, token usage, and the current sandbox image.

This separation ensures that the AutoCoderAgent remains reusable across different tasks, while the _CodeGenSession encapsulates the complexity of a specific retry loop.
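The split between reusable configuration and per-call state can be sketched as follows. This is a minimal illustration, not the plugin's actual classes: AgentConfig, Session, and the module-level generate() are hypothetical stand-ins for AutoCoderAgent, _CodeGenSession, and the agent's generate() method, and the field names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical stand-in for AutoCoderAgent's configuration fields."""
    model: str
    max_iterations: int = 5
    base_packages: tuple[str, ...] = ()


@dataclass
class Session:
    """Hypothetical stand-in for _CodeGenSession: mutable, one per generate() call."""
    config: AgentConfig
    history: list[str] = field(default_factory=list)
    packages: set[str] = field(default_factory=set)
    total_input_tokens: int = 0
    total_output_tokens: int = 0

    def __post_init__(self) -> None:
        # Each session starts from the agent's base packages.
        self.packages.update(self.config.base_packages)


def generate(config: AgentConfig, prompt: str) -> Session:
    # A fresh session per call, so no state leaks between tasks.
    session = Session(config=config)
    session.history.append(prompt)
    return session


config = AgentConfig(model="example-model", base_packages=("pytest",))
first = generate(config, "parse a CSV file")
second = generate(config, "compute a moving average")
first.packages.add("pandas")  # only the first session's environment changes
```

Because the configuration is frozen and every call builds its own session, a package discovered while solving one task does not appear in the environment of the next.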

The Generate-Test-Fix Cycle

The iterative loop, implemented in _CodeGenSession.run(), follows a structured sequence to ensure reliability:

Planning: Before writing code, the agent generates a CodePlan using generate_plan. This forces the LLM to reason about the approach and required libraries before implementation.
Generation: The agent produces the solution code (CodeSolution) and a corresponding suite of pytest tests.
Environment Sync: The system dynamically detects required packages from the generated code using detect_and_track_packages.
Sandbox Execution: Code and tests are executed within a Flyte sandbox. The sandbox provides isolation and enforces constraints like network blocking or resource limits.
Diagnosis: If tests fail, the system does not simply retry. It performs a structured diagnosis that determines what the next iteration must fix.
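The shape of this loop can be sketched with the LLM, sandbox, and diagnosis steps replaced by plain callables. Everything here is illustrative: run_cycle, RunResult, and the fake_* helpers are hypothetical names, the planning and package-detection steps are omitted for brevity, and the real _CodeGenSession.run() may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    passed: bool
    output: str


def run_cycle(
    generate: Callable[[list[str]], str],
    execute: Callable[[str], RunResult],
    diagnose: Callable[[RunResult], str],
    max_iterations: int,
) -> tuple[Optional[str], int]:
    """Return (passing code or None, iterations used)."""
    feedback: list[str] = []
    for iteration in range(1, max_iterations + 1):
        code = generate(feedback)        # Generation, informed by earlier diagnoses
        result = execute(code)           # Sandbox execution of code and tests
        if result.passed:
            return code, iteration
        feedback.append(diagnose(result))  # Diagnosis feeds the next attempt
    return None, max_iterations


def fake_generate(feedback: list[str]) -> str:
    # Produces a buggy solution until it has received one diagnosis.
    return "return a + b" if feedback else "return a - b"


def fake_execute(code: str) -> RunResult:
    passed = code == "return a + b"
    return RunResult(passed, "ok" if passed else "assert add(1, 2) == 3 failed")


def fake_diagnose(result: RunResult) -> str:
    return f"logic: {result.output}"


code, used = run_cycle(fake_generate, fake_execute, fake_diagnose, max_iterations=3)
```

The loop stops as soon as the tests pass and otherwise gives up after max_iterations attempts, which bounds both latency and token cost.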

Intelligent Error Diagnosis

The system uses the ErrorDiagnosis and TestFailure models (found in flyteplugins.codegen.core.types) to categorize failures into three distinct types:

Environment: Missing system or Python packages.
Logic: Bugs in the generated solution code.
Test Error: Bugs or incorrect expectations in the generated test suite itself.

This categorization is critical because it dictates the "fix" strategy. Environment errors trigger an image rebuild, logic errors trigger a code patch, and test errors trigger a test suite update.
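One way to picture this mapping is a small dispatch table from error type to fix action. The sketch below is an assumption-laden illustration: the ErrorType enum, the "environment" string value, and the action names are hypothetical and are not taken from flyteplugins.codegen.core.types.

```python
from enum import Enum


class ErrorType(Enum):
    ENVIRONMENT = "environment"
    LOGIC = "logic"
    TEST_ERROR = "test_error"


# Which artifact each failure category causes the agent to change.
FIX_STRATEGY = {
    ErrorType.ENVIRONMENT: "rebuild_image",
    ErrorType.LOGIC: "patch_code",
    ErrorType.TEST_ERROR: "update_tests",
}


def plan_fixes(error_types: list[str]) -> list[str]:
    """Map diagnosed error types to an ordered, de-duplicated list of actions."""
    actions: list[str] = []
    for raw in error_types:
        action = FIX_STRATEGY[ErrorType(raw)]  # raises ValueError on unknown types
        if action not in actions:
            actions.append(action)
    return actions


actions = plan_fixes(["logic", "environment", "logic"])
```

Keeping the mapping in one table makes it explicit that a single test run can require several kinds of fix at once, for example a code patch plus an image rebuild.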

Error Reclassification Logic

A key innovation in the _CodeGenSession is the _reclassify_errors method. LLMs occasionally misdiagnose the root cause of a failure, for example by blaming the code logic when the test's expected value is actually wrong.

To keep the agent from spending its iterations fixing the wrong artifact, the session tracks fix attempts for specific failure signatures. If a test_error persists after more than max_test_attempts fix attempts, the system reclassifies it as a logic error (and vice versa). This design choice acknowledges LLM fallibility and allows the agent to "change its mind" about where the bug resides.

# Example of reclassification logic in _CodeGenSession
if self.test_fix_attempts[key] > self.max_test_attempts:
    failure.error_type = "logic"
    failure.root_cause = f"Test failed {self.max_test_attempts + 1} times... The code logic could be wrong."
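The excerpt above shows only the test-to-logic direction. A self-contained sketch of the counting and flipping in both directions might look like the following. Reclassifier, Failure, the (test name, error type) signature, and the single shared max_attempts limit are assumptions made for illustration and are not the session's actual data structures.

```python
from collections import defaultdict
from dataclasses import dataclass

# Diagnoses that can be swapped for one another; other types are left alone.
FLIP = {"test_error": "logic", "logic": "test_error"}


@dataclass
class Failure:
    test_name: str
    error_type: str
    root_cause: str = ""


class Reclassifier:
    def __init__(self, max_attempts: int = 2) -> None:
        self.max_attempts = max_attempts
        # Fix attempts per (test name, error type) signature.
        self.attempts: dict[tuple[str, str], int] = defaultdict(int)

    def record(self, failure: Failure) -> Failure:
        """Count one more fix attempt; flip the diagnosis once the limit is exceeded."""
        key = (failure.test_name, failure.error_type)
        self.attempts[key] += 1
        if failure.error_type in FLIP and self.attempts[key] > self.max_attempts:
            flipped = FLIP[failure.error_type]
            failure.root_cause = (
                f"Still failing after {self.attempts[key]} attempts as "
                f"{failure.error_type}; treating it as {flipped} instead."
            )
            failure.error_type = flipped
        return failure


tracker = Reclassifier(max_attempts=2)
for _ in range(3):
    failure = tracker.record(Failure("test_total", "test_error"))
```

After the third recorded attempt the same failing test is handed to the code-fixing path instead of the test-fixing path.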

Sandbox Isolation and Environment Evolution

The generator manages environment drift through deterministic image naming. The _compute_image_name method derives the image name from a truncated SHA-256 hash of the language, Python packages, and system packages, so the same environment always maps to the same name.

import hashlib
import json

def _compute_image_name(self, packages: list[str], system_packages: list[str]) -> str:
    # Sort both lists so that package order does not change the hash.
    spec = {
        "language": self.language,
        "packages": sorted(packages),
        "system_packages": sorted(system_packages),
    }
    # Keep the first 12 hex characters of the SHA-256 digest for a short name.
    config_hash = hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()[:12]
    return f"auto-coder-agent-{self.language}-{config_hash}"

If the diagnosis identifies a missing package (e.g., the system package gcc or the Python package pandas), the _CodeGenSession updates its package list. The changed list produces a different image name, which triggers a rebuild of the sandbox image for the next iteration.
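The properties that matter here, order independence and a new name whenever the package set changes, can be checked with a standalone copy of the hashing logic. In this sketch compute_image_name takes the language as an explicit argument instead of reading self.language, and ensure_image with its built list is a hypothetical stand-in for whatever build-or-reuse step the system actually performs.

```python
import hashlib
import json


def compute_image_name(language: str, packages: list[str], system_packages: list[str]) -> str:
    # Same spec layout as the method above, with language passed explicitly.
    spec = {
        "language": language,
        "packages": sorted(packages),
        "system_packages": sorted(system_packages),
    }
    config_hash = hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()[:12]
    return f"auto-coder-agent-{language}-{config_hash}"


built: list[str] = []  # records every simulated image build


def ensure_image(name: str) -> bool:
    """Return True if a (simulated) build was needed, False if the name already exists."""
    if name in built:
        return False
    built.append(name)
    return True


a = compute_image_name("python", ["pandas", "numpy"], ["gcc"])
b = compute_image_name("python", ["numpy", "pandas"], ["gcc"])
c = compute_image_name("python", ["numpy", "pandas", "scipy"], ["gcc"])

needed_first = ensure_image(a)   # first time this environment is seen
needed_again = ensure_image(b)   # same packages in another order: reused
needed_new = ensure_image(c)     # scipy added: a different name, so a new build
```

Note that sorted() does not remove duplicates, so a list containing the same package twice would hash differently from a list containing it once.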

Verification and Forceful Prompting

To ensure that the LLM actually follows the suggested fixes, the agent employs a multi-stage verification process in _generate_code. If a diagnosis provides specific logic fixes, the agent calls verify_logic_fixes_applied after the next generation attempt.

If the LLM fails to apply the fixes, the system becomes progressively more "forceful" in its prompting. The _generate_code method appends critical warnings and final-attempt ultimatums to the conversation history to push the LLM to prioritize the specific patches over general code generation.
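The escalation can be pictured as a prompt builder whose wording depends on how often the fixes were ignored and how many attempts remain. This is a rough sketch under stated assumptions: build_fix_prompt, fixes_applied, the thresholds, and the message wording are all hypothetical, and the naive substring check merely stands in for verify_logic_fixes_applied, whose real behavior is not shown in this document.

```python
def build_fix_prompt(fixes: list[str], failed_applications: int, attempts_left: int) -> str:
    """Compose a fix request whose tone escalates with each ignored fix."""
    lines = ["Apply the following fixes:"] + [f"- {fix}" for fix in fixes]
    if failed_applications >= 1:
        # The previous generation ignored the fixes, so lead with a warning.
        lines.insert(0, "CRITICAL: the previous response did not apply the required fixes.")
    if attempts_left <= 1:
        lines.append("FINAL ATTEMPT: change only what is listed above.")
    return "\n".join(lines)


def fixes_applied(code: str, required_snippets: list[str]) -> bool:
    # Naive textual check: every required snippet must appear in the new code.
    return all(snippet in code for snippet in required_snippets)


fixes = ["replace `a - b` with `a + b`"]
polite = build_fix_prompt(fixes, failed_applications=0, attempts_left=3)
forceful = build_fix_prompt(fixes, failed_applications=1, attempts_left=1)
applied = fixes_applied("def add(a, b):\n    return a + b\n", ["a + b"])
```

Escalating only after a verified miss keeps early prompts short while still reserving the strongest wording for the last remaining attempt.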

Design Constraints and Tradeoffs

Persistence: The system enforces a strict constraint on the /var/outputs directory. The LLM is explicitly instructed that generated code must never delete or recreate this directory, as it is the primary mechanism for returning data from the sandbox.
Token Usage: The iterative loop tracks total_input_tokens and total_output_tokens across all attempts. While iteration increases reliability, it also increases cost, which is why max_iterations is configurable.
Isolation vs. Performance: Building a new image for every package change ensures a clean environment but adds latency. The system mitigates this by caching images under their hash-derived names, so an environment that has already been built is reused.
