Design of the Iterative Code Generator
The AutoCoderAgent implements a robust, self-healing code generation system designed to bridge the gap between high-level natural language prompts and executable, verified Python scripts. At its core, the system relies on an iterative Generate-Test-Fix cycle that treats code generation not as a single-shot task, but as an evolutionary process within an isolated environment.
Orchestration and State Management
The design separates configuration from execution state using two primary classes:
- AutoCoderAgent: Serves as the immutable configuration layer. It defines the LLM model, resource requirements (CPU/Memory), base packages, and iteration limits (max_iterations).
- _CodeGenSession: An internal, stateful manager created for every generate() call. It tracks the evolution of the solution, including conversation history, detected dependencies, token usage, and the current sandbox image.
This separation ensures that the AutoCoderAgent remains reusable across different tasks, while the _CodeGenSession encapsulates the complexity of a specific retry loop.
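The split can be illustrated with a minimal sketch, assuming a frozen dataclass for the configuration and a mutable dataclass for the per-call state. The class names AgentConfig and Session, the placeholder model name, and every field other than max_iterations and the token counters are hypothetical stand-ins, not the actual AutoCoderAgent API.

```python
from dataclasses import dataclass, field, FrozenInstanceError


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical stand-in for AutoCoderAgent: immutable settings only."""
    model: str = "example-model"  # placeholder name, an assumption
    max_iterations: int = 5
    base_packages: tuple[str, ...] = ()

    def generate(self, prompt: str) -> "Session":
        # A fresh session per call keeps per-task state off the config object.
        return Session(config=self, prompt=prompt)


@dataclass
class Session:
    """Hypothetical stand-in for _CodeGenSession: mutable per-call state."""
    config: AgentConfig
    prompt: str
    history: list[str] = field(default_factory=list)
    detected_packages: set[str] = field(default_factory=set)
    total_input_tokens: int = 0
    total_output_tokens: int = 0


agent = AgentConfig(base_packages=("pytest",))
first = agent.generate("parse a CSV file")
second = agent.generate("compute a moving average")
first.detected_packages.add("pandas")  # does not leak into `second`
```

Because the configuration object is frozen, two sessions can share it safely, and state accumulated during one task never affects the next.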
The Generate-Test-Fix Cycle
The iterative loop, implemented in _CodeGenSession.run(), follows a structured sequence to ensure reliability:
- Planning: Before writing code, the agent generates a CodePlan using generate_plan. This forces the LLM to reason about the approach and required libraries before implementation.
- Generation: The agent produces the solution code (CodeSolution) and a corresponding suite of pytest tests.
- Environment Sync: The system dynamically detects required packages from the generated code using detect_and_track_packages.
- Sandbox Execution: Code and tests are executed within a Flyte sandbox. The sandbox provides isolation and enforces constraints like network blocking or resource limits.
- Diagnosis: If tests fail, the system doesn't just retry; it performs a structured diagnosis.
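The control flow of the steps above can be sketched as a bounded loop. This is a simplified illustration, not the body of _CodeGenSession.run(): the function run_cycle, its callback parameters, and the toy stand-ins for the LLM and the sandbox are all hypothetical, and planning and package detection are omitted.

```python
from typing import Callable, Optional


def run_cycle(
    generate: Callable[[Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    max_iterations: int,
) -> tuple[Optional[str], int]:
    """Return (passing_code, attempts_used); passing_code is None on exhaustion."""
    feedback: Optional[str] = None
    for attempt in range(1, max_iterations + 1):
        code = generate(feedback)    # generation, informed by the last diagnosis
        feedback = run_tests(code)   # test run; None means every test passed
        if feedback is None:
            return code, attempt
    return None, max_iterations


# Toy stand-ins: the "LLM" only produces correct code once it has feedback.
def fake_generate(feedback: Optional[str]) -> str:
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def fake_run_tests(code: str) -> Optional[str]:
    namespace: dict = {}
    exec(code, namespace)  # a real system would run this inside the sandbox
    if namespace["add"](2, 3) == 5:
        return None
    return "logic: add(2, 3) returned the wrong value"


code, attempts = run_cycle(fake_generate, fake_run_tests, max_iterations=3)
```

In this toy run the first attempt fails its test, the diagnosis string is fed back into generation, and the second attempt passes.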
Intelligent Error Diagnosis
The system uses the ErrorDiagnosis and TestFailure models (found in flyteplugins.codegen.core.types) to categorize failures into three distinct types:
- Environment: Missing system or Python packages.
- Logic: Bugs in the generated solution code.
- Test Error: Bugs or incorrect expectations in the generated test suite itself.
This categorization is critical because it dictates the "fix" strategy. Environment errors trigger an image rebuild, logic errors trigger a code patch, and test errors trigger a test suite update.
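One way to picture this dispatch is a lookup table from error type to fix action. The sketch below is an assumption about shape only: the ErrorType enum, the action names, and the "environment" string value are hypothetical, while "logic" and "test_error" follow the values used elsewhere in this document.

```python
from enum import Enum


class ErrorType(str, Enum):
    ENVIRONMENT = "environment"  # assumed value, not confirmed by the source
    LOGIC = "logic"
    TEST_ERROR = "test_error"


# Hypothetical action names, one per fix strategy described above.
FIX_STRATEGY = {
    ErrorType.ENVIRONMENT: "rebuild_image",
    ErrorType.LOGIC: "patch_code",
    ErrorType.TEST_ERROR: "update_tests",
}


def plan_fixes(error_types: list[str]) -> list[str]:
    """Return the distinct fix actions needed, in first-seen order."""
    actions: list[str] = []
    for raw in error_types:
        action = FIX_STRATEGY[ErrorType(raw)]  # raises ValueError on unknown types
        if action not in actions:
            actions.append(action)
    return actions


actions = plan_fixes(["logic", "environment", "logic"])
```

A single test run can surface several failure types at once, so the sketch collects every distinct action rather than stopping at the first.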
Error Reclassification Logic
The _CodeGenSession guards against misdiagnosis through its _reclassify_errors method. LLMs occasionally misdiagnose the root cause of a failure, for example by blaming the code logic when the test's expected value is actually wrong.
To prevent the agent from spending its iterations fixing the wrong entity, the session tracks fix attempts for specific failure signatures. If a test_error persists after more than max_test_attempts fix attempts, the system reclassifies it as a logic error (and vice versa). This design choice acknowledges LLM fallibility and allows the agent to "change its mind" about where the bug resides.
# Example of reclassification logic in _CodeGenSession
if self.test_fix_attempts[key] > self.max_test_attempts:
    failure.error_type = "logic"
    failure.root_cause = f"Test failed {self.max_test_attempts + 1} times... The code logic could be wrong."
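To see the counter in action, here is a self-contained sketch built around the same condition. The Failure dataclass, the Reclassifier class, and the use of the test name as the failure signature are hypothetical. Only the test_error to logic direction is shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Failure:
    """Hypothetical, trimmed-down failure record."""
    test_name: str
    error_type: str
    root_cause: str = ""


class Reclassifier:
    def __init__(self, max_test_attempts: int = 2):
        self.max_test_attempts = max_test_attempts
        self.test_fix_attempts: dict[str, int] = defaultdict(int)

    def reclassify(self, failure: Failure) -> Failure:
        if failure.error_type != "test_error":
            return failure
        key = failure.test_name  # assumed signature: the test name alone
        self.test_fix_attempts[key] += 1
        if self.test_fix_attempts[key] > self.max_test_attempts:
            # The test has been "fixed" too often; suspect the code instead.
            failure.error_type = "logic"
            failure.root_cause = (
                f"Test failed {self.max_test_attempts + 1} times... "
                "The code logic could be wrong."
            )
        return failure


reclassifier = Reclassifier(max_test_attempts=2)
results = [
    reclassifier.reclassify(Failure("test_parse", "test_error")).error_type
    for _ in range(3)
]
```

With max_test_attempts set to 2, the first two occurrences keep their test_error label and the third is relabelled as a logic error.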
Sandbox Isolation and Environment Evolution
The generator manages environment drift through deterministic image naming. The _compute_image_name method generates a hash based on the language, Python packages, and system packages.
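import hashlib
import json

# Excerpt: a method, shown here outside its class body.
def _compute_image_name(self, packages: list[str], system_packages: list[str]) -> str:
    # Sort the package lists so the same set of packages always yields the same hash.
    spec = {
        "language": self.language,
        "packages": sorted(packages),
        "system_packages": sorted(system_packages),
    }
    # A 12-character SHA-256 prefix keeps the image name short.
    config_hash = hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()[:12]
    return f"auto-coder-agent-{self.language}-{config_hash}"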
If the diagnosis identifies a missing package (e.g., gcc or pandas), the _CodeGenSession updates its package list, which triggers a rebuild of the sandbox image for the next iteration.
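The properties that make this naming scheme useful can be checked with a standalone version of the same hashing logic. The free function compute_image_name below is a hypothetical adaptation that takes the language as an argument instead of reading it from self.

```python
import hashlib
import json


def compute_image_name(language: str, packages: list[str], system_packages: list[str]) -> str:
    """Standalone adaptation of the hashing logic, for illustration only."""
    spec = {
        "language": language,
        "packages": sorted(packages),
        "system_packages": sorted(system_packages),
    }
    digest = hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()
    return f"auto-coder-agent-{language}-{digest[:12]}"


# Same package set in a different order: identical name, so the image is reusable.
name_a = compute_image_name("python", ["pandas", "numpy"], ["gcc"])
name_b = compute_image_name("python", ["numpy", "pandas"], ["gcc"])

# Adding a package changes the hash, which signals that a new image is needed.
name_c = compute_image_name("python", ["numpy", "pandas", "scipy"], ["gcc"])
```

Note that sorting does not collapse duplicates, so a list containing the same package twice would hash differently from the deduplicated list. Whether the real implementation deduplicates before hashing is not stated here.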
Verification and Forceful Prompting
To ensure that the LLM actually follows the suggested fixes, the agent employs a multi-stage verification process in _generate_code. If a diagnosis provides specific logic fixes, the agent calls verify_logic_fixes_applied after the next generation attempt.
If the LLM fails to apply the fixes, the system becomes progressively more "forceful" in its prompting. The _generate_code method appends critical warnings and final-attempt ultimatums to the conversation history, ensuring the LLM prioritizes the specific patches over general code generation.
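The escalation can be sketched as a prompt builder whose wording hardens as unapplied attempts accumulate. The function build_fix_prompt, its parameters, and the exact warning text are hypothetical. The source does not show the actual messages or the thresholds used by _generate_code.

```python
def build_fix_prompt(fixes: list[str], failed_attempts: int, attempts_left: int) -> str:
    """Compose a fix request that grows more forceful over time (illustrative)."""
    lines = ["Apply the following fixes:"]
    lines += [f"- {fix}" for fix in fixes]
    if failed_attempts >= 1:
        # The previous response ignored the fixes: lead with a warning.
        lines.insert(0, "CRITICAL: the previous response did not apply the required fixes.")
    if attempts_left <= 1:
        # Last chance before the iteration budget is exhausted.
        lines.append("FINAL ATTEMPT: apply every fix listed above exactly as written.")
    return "\n".join(lines)


fixes = ["Return an empty list instead of None when the input is empty"]
first_prompt = build_fix_prompt(fixes, failed_attempts=0, attempts_left=3)
second_prompt = build_fix_prompt(fixes, failed_attempts=1, attempts_left=2)
last_prompt = build_fix_prompt(fixes, failed_attempts=2, attempts_left=1)
```

Each returned string would be appended to the conversation history before the next generation call.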
Design Constraints and Tradeoffs
- Persistence: The system enforces a strict constraint on the /var/outputs directory. Generated code is explicitly instructed never to delete or recreate this directory, as it is the primary mechanism for returning data from the sandbox.
- Token Usage: The iterative loop tracks total_input_tokens and total_output_tokens across all attempts. While iteration increases reliability, it also increases cost, which is why max_iterations is configurable.
- Isolation vs. Performance: Building a new image for every package change ensures a clean environment but adds latency. The system mitigates this by caching images based on the package hash.
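The caching tradeoff can be illustrated with a small in-memory cache keyed by image name. This is a sketch under assumptions: the ImageCache class, the build callback, and the registry URL are hypothetical, and the real system may rely on a registry or builder-level cache instead.

```python
from typing import Callable


class ImageCache:
    """Build each distinct image name at most once (in-memory, illustrative)."""

    def __init__(self, build: Callable[[str], str]):
        self._build = build
        self._images: dict[str, str] = {}
        self.build_count = 0

    def get(self, image_name: str) -> str:
        if image_name not in self._images:
            # Cache miss: pay the build latency once for this package set.
            self.build_count += 1
            self._images[image_name] = self._build(image_name)
        return self._images[image_name]


cache = ImageCache(build=lambda name: f"registry.example/{name}:latest")
for name in ["img-aaa", "img-aaa", "img-bbb", "img-aaa"]:
    cache.get(name)
```

Four requests across two distinct names trigger only two builds, so iterations that leave the package set unchanged pay no rebuild cost.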