Architecture Overview
This section contains architecture diagrams and documentation for flyte-sdk.
Available Diagrams
Flytekit SDK System Context Diagram
The Flytekit SDK serves as the primary interface for users (Data Scientists and Developers) to author, manage, and execute workflows on the Flyte platform.
Key Components and Interactions:
- Flytekit SDK: The core Python library and CLI tool. It handles workflow/task definition, serialization, and communication with the Flyte backend.
- Flyte backend services: A set of gRPC/ConnectRPC services (Project, Task, Run, etc.) that manage the lifecycle of Flyte entities. The SDK uses these APIs for registration and execution requests.
- Propeller: The core execution engine that orchestrates workflow graphs and manages task execution. The SDK does not talk to Propeller directly, but it defines the instructions Propeller follows.
- Storage: S3, GCS, or Azure Blob Storage used for storing task inputs, outputs, and artifacts. The SDK interacts with storage via fsspec, obstore, and signed URLs provided by the DataProxy service.
- Container Images: Stores the Docker images used for task execution. The SDK can build and push images locally (using Docker/Podman) or trigger remote builds.
- Connectors: Integrated via a robust plugin system, allowing Flyte tasks to interact with distributed computing (Spark, Ray), data warehouses (BigQuery, Snowflake), and AI services (OpenAI, Anthropic).
- Flyte Console: The web UI for monitoring and managing executions. The SDK provides helper methods to generate direct links to resources in the Console.
Key Architectural Findings:
- Flytekit SDK uses ConnectRPC to communicate with a suite of backend services including Project, Task, Run, and DataProxy services.
- Storage abstraction is handled via fsspec and obstore, supporting S3, GCS, and Azure Blob Storage through signed URLs.
- Image management includes local builds via Docker/Podman and remote builds triggered through the Flyte backend.
- A comprehensive plugin architecture enables direct integration with external platforms like Spark, Ray, Databricks, Snowflake, and various LLM providers.
- The SDK includes a rich CLI (built with click) for deploying applications, managing resources, and fetching logs.
Flytekit Internal Architecture
This diagram illustrates the internal structure of the Flytekit Python SDK, showing how Core Concepts interact with flyte.remote and flyte.extend. It highlights the separation between the flyte.cli and the underlying flyte.models used for communication with the Flyte backend.
Task Registration and Execution Flow
This diagram illustrates the lifecycle of a Flyte task, starting from its definition using the @task decorator, through registration via flyte.remote, and finally its execution on the Flyte backend managed by the Project Administration service.
Flytekit Deployment and Execution Environment
This diagram illustrates the lifecycle of a Flyte workflow from local development to execution on a Kubernetes cluster. It highlights the role of flyte.cli in packaging code, the interaction with the Project Administration service for registration, and how Propeller orchestrates task execution within Kubernetes Pods, using object storage for data persistence.
Execution Lifecycle States
This state diagram represents the lifecycle of a Flyte execution (referred to as an "Action" in the SDK). The states are derived from the ActionPhase enum found in src/flyte/models.py.
The lifecycle begins in an Undefined state (representing the protobuf ACTION_PHASE_UNSPECIFIED) and moves to Queued upon creation. From there, it progresses through resource allocation (Waiting for Resources) and setup (Initializing) before entering the Running state.
Terminal states include Succeeded, Failed, Aborted, and Timed Out. The diagram also captures the retry mechanism where a Running action can transition back to Queued if a retryable failure occurs. Transitions can be triggered by the Flyte backend (scheduling, execution completion, timeouts) or by the user via the SDK (e.g., calling Action.abort()).
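The transitions described above can be sketched as a small transition table. This is a minimal illustration, not the SDK's code: the Phase enum, ALLOWED, can_transition, and validate_path are hypothetical names, and the table is a simplification read off the prose (in particular, allowing an abort from every non-terminal state is an assumption).

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical stand-in for the SDK's ActionPhase enum."""

    UNDEFINED = "undefined"
    QUEUED = "queued"
    WAITING_FOR_RESOURCES = "waiting_for_resources"
    INITIALIZING = "initializing"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    ABORTED = "aborted"
    TIMED_OUT = "timed_out"


# Assumed transition table. Terminal phases have no outgoing edges.
# RUNNING -> QUEUED models the retry path after a retryable failure.
ALLOWED = {
    Phase.UNDEFINED: {Phase.QUEUED},
    Phase.QUEUED: {Phase.WAITING_FOR_RESOURCES, Phase.ABORTED},
    Phase.WAITING_FOR_RESOURCES: {Phase.INITIALIZING, Phase.ABORTED},
    Phase.INITIALIZING: {Phase.RUNNING, Phase.ABORTED},
    Phase.RUNNING: {
        Phase.SUCCEEDED,
        Phase.FAILED,
        Phase.ABORTED,
        Phase.TIMED_OUT,
        Phase.QUEUED,
    },
}


def can_transition(src: Phase, dst: Phase) -> bool:
    """Return True if the sketch's table permits moving from src to dst."""
    return dst in ALLOWED.get(src, set())


def validate_path(path: list) -> bool:
    """Check that every consecutive pair of phases in path is permitted."""
    return all(can_transition(a, b) for a, b in zip(path, path[1:]))
```

Encoding the lifecycle as data rather than branching logic keeps the retry edge and the terminal states visible in one place, which is convenient when checking a recorded phase history against the diagram.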
Key findings from the code:
- The ActionPhase enum defines the core states.
- The is_terminal property in ActionPhase identifies the final states.
- The Action.abort() method in src/flyte/remote/_action.py explicitly triggers a transition to the Aborted state.
- The Action.wait() and Action.watch() methods allow users to monitor these state transitions in real-time.
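To make the monitoring pattern concrete, here is a minimal, self-contained sketch of watching an action until it reaches a terminal phase. It does not reproduce the real Action.wait() or Action.watch() signatures, which are not shown in this document: Phase, FakeAction, watch, and wait are hypothetical names, and the action simply replays a scripted list of phases instead of contacting a backend.

```python
from enum import Enum
from typing import Iterator, List, Optional


class Phase(Enum):
    """Hypothetical, reduced stand-in for the SDK's ActionPhase enum."""

    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    ABORTED = "aborted"
    TIMED_OUT = "timed_out"

    @property
    def is_terminal(self) -> bool:
        return self in (
            Phase.SUCCEEDED,
            Phase.FAILED,
            Phase.ABORTED,
            Phase.TIMED_OUT,
        )


class FakeAction:
    """Hypothetical in-memory action that replays a scripted phase list."""

    def __init__(self, script: List[Phase]) -> None:
        self._script = list(script)
        self._index = 0
        self.phase = self._script[0]

    def refresh(self) -> None:
        # Advance to the next scripted phase; stay on the last one afterwards.
        if self._index < len(self._script) - 1:
            self._index += 1
        self.phase = self._script[self._index]

    def abort(self) -> None:
        # Model a user-triggered transition to the ABORTED state.
        self._script = [Phase.ABORTED]
        self._index = 0
        self.phase = Phase.ABORTED


def watch(action: FakeAction, max_polls: int = 100) -> Iterator[Phase]:
    """Yield each newly observed phase, stopping after a terminal one."""
    last: Optional[Phase] = None
    for _ in range(max_polls):
        if action.phase is not last:
            last = action.phase
            yield last
        if last.is_terminal:
            return
        action.refresh()


def wait(action: FakeAction, max_polls: int = 100) -> Optional[Phase]:
    """Drain watch() and return the last phase seen (terminal if reached)."""
    final: Optional[Phase] = None
    for final in watch(action, max_polls):
        pass
    return final
```

With a script of QUEUED, RUNNING, QUEUED, RUNNING, SUCCEEDED, watch yields each of those phases in order, which also shows how a retry appears to an observer as a return to QUEUED before the final state.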
Key Architectural Findings:
- The ActionPhase enum in src/flyte/models.py defines the primary execution states: QUEUED, WAITING_FOR_RESOURCES, INITIALIZING, RUNNING, SUCCEEDED, FAILED, ABORTED, and TIMED_OUT.
- Terminal states are explicitly defined in the code via the is_terminal property and the _action_done_check utility function.
- The SDK provides an abort() method on the Action class to manually transition an execution to the ABORTED state.
- Retry logic (though managed by the backend) is reflected in the SDK through attempt tracking and the ability for an action to return to a non-terminal state after a failure.
- The ActionPhase.from_protobuf method handles the mapping from the underlying Flyte IDL phases to the SDK's enum, including handling of the UNSPECIFIED (Undefined) state.
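The mapping from wire-level phases to an SDK enum, together with a terminal-state check, might look roughly like the sketch below. It is an illustration under assumptions, not the code in src/flyte/models.py: Phase and from_wire_name are hypothetical names, phases are represented as plain strings rather than protobuf enum values, and every wire name other than ACTION_PHASE_UNSPECIFIED is assumed to follow the same ACTION_PHASE_ prefix pattern.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical stand-in for the SDK's ActionPhase enum."""

    UNDEFINED = "undefined"
    QUEUED = "queued"
    WAITING_FOR_RESOURCES = "waiting_for_resources"
    INITIALIZING = "initializing"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    ABORTED = "aborted"
    TIMED_OUT = "timed_out"

    @property
    def is_terminal(self) -> bool:
        """True for the four final states listed above."""
        return self in (
            Phase.SUCCEEDED,
            Phase.FAILED,
            Phase.ABORTED,
            Phase.TIMED_OUT,
        )

    @classmethod
    def from_wire_name(cls, name: str) -> "Phase":
        """Map an assumed wire-level phase name to a Phase member."""
        # The unspecified protobuf value surfaces as the Undefined state.
        if name == "ACTION_PHASE_UNSPECIFIED":
            return cls.UNDEFINED
        for member in cls:
            if member is cls.UNDEFINED:
                continue
            # Assumed naming pattern: ACTION_PHASE_<MEMBER_NAME>.
            if name == f"ACTION_PHASE_{member.name}":
                return member
        raise ValueError(f"unknown phase name: {name!r}")
```

Raising on unknown names, rather than silently falling back to UNDEFINED, is a design choice of this sketch; the document does not say how the SDK treats unrecognized values.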