
Architecture Overview

This section contains architecture diagrams and documentation for flyte-sdk.

Available Diagrams

Flytekit SDK System Context Diagram

The Flytekit SDK serves as the primary interface for users (Data Scientists and Developers) to author, manage, and execute workflows on the Flyte platform.

Key Components and Interactions:

  • Flytekit SDK: The core Python library and CLI tool. It handles workflow and task definition, serialization, and communication with the Flyte backend.
  • Backend Services: A set of gRPC/ConnectRPC services (Project, Task, Run, etc.) that manage the lifecycle of Flyte entities. The SDK uses these APIs for registration and execution requests.
  • Flyte Propeller: The core execution engine that orchestrates workflow graphs and manages task execution. The SDK does not talk to Propeller directly, but it defines the instructions Propeller follows.
  • Storage: S3, GCS, or Azure Blob Storage used for storing task inputs, outputs, and artifacts. The SDK interacts with storage via fsspec, obstore, and signed URLs provided by the DataProxy service.
  • Container Images: Stores the Docker images used for task execution. The SDK can build and push images locally (using Docker/Podman) or trigger remote builds.
  • Connectors: Integrated via a robust plugin system, allowing Flyte tasks to interact with distributed computing (Spark, Ray), data warehouses (BigQuery, Snowflake), and AI services (OpenAI, Anthropic).
  • Flyte Console: The web UI for monitoring and managing executions. The SDK provides helper methods to generate direct links to resources in the Console.

Key Architectural Findings:

  • Flytekit SDK uses ConnectRPC to communicate with a suite of backend services including Project, Task, Run, and DataProxy services.
  • Storage abstraction is handled via fsspec and obstore, supporting S3, GCS, and Azure Blob Storage through signed URLs.
  • Image management includes local builds via Docker/Podman and remote builds triggered through the Flyte backend.
  • A comprehensive plugin architecture enables direct integration with external platforms like Spark, Ray, Databricks, Snowflake, and various LLM providers.
  • The SDK includes a rich CLI (built with click) for deploying applications, managing resources, and fetching logs.

Flytekit Internal Architecture

This diagram illustrates the internal structure of the Flytekit Python SDK, showing how Core Concepts interact with flyte.remote and flyte.extend. It highlights the separation between the flyte.cli and the underlying flyte.models used for communication with the Flyte backend.

Task Registration and Execution Flow

This diagram illustrates the lifecycle of a Flyte task: its definition in user code with the @task decorator, its registration via flyte.remote, and finally its execution on the Flyte backend, managed by Project Administration.
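
The define, serialize, and register steps described above can be sketched with a toy decorator. This is not the Flyte API: the task decorator, TaskSpec, REGISTRY, and register names below are hypothetical stand-ins, and the "backend" is just an in-memory dictionary. The sketch only shows the general shape of capturing a function's signature, serializing it, and handing it to a registry.

```python
import inspect
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical, simplified stand-in for a serialized task definition."""
    name: str
    inputs: Dict[str, str]
    output: str


# Hypothetical in-memory "backend"; a real SDK would send the spec over RPC.
REGISTRY: Dict[str, str] = {}


def _type_name(annotation) -> str:
    # Use the class name when available, otherwise fall back to str().
    return getattr(annotation, "__name__", str(annotation))


def task(fn: Callable) -> Callable:
    """Toy decorator: attach a TaskSpec built from the function signature."""
    sig = inspect.signature(fn)
    fn.spec = TaskSpec(
        name=fn.__name__,
        inputs={n: _type_name(p.annotation) for n, p in sig.parameters.items()},
        output=_type_name(sig.return_annotation),
    )
    return fn  # the function itself stays callable locally


def register(fn: Callable) -> str:
    """Serialize the spec to JSON and store it in the toy registry."""
    REGISTRY[fn.spec.name] = json.dumps(asdict(fn.spec), sort_keys=True)
    return fn.spec.name


@task
def add(a: int, b: int) -> int:
    return a + b


registered_name = register(add)
```

The decorated function remains an ordinary Python callable, which is what makes local development possible before anything is registered remotely.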

Flytekit Deployment and Execution Environment

This diagram illustrates the lifecycle of a Flyte workflow from local development to execution on a Kubernetes cluster. It highlights the role of flyte.cli in packaging code, the interaction with the Project Administration service for registration, and how Flyte Propeller orchestrates task execution within Kubernetes Pods, using object storage for data persistence.

Execution Lifecycle States

This state diagram represents the lifecycle of a Flyte execution (referred to as an "Action" in the SDK). The states are derived from the ActionPhase enum found in src/flyte/models.py.

The lifecycle begins in an Undefined state (representing the protobuf ACTION_PHASE_UNSPECIFIED) and moves to Queued upon creation. From there, it progresses through resource allocation (Waiting for Resources) and setup (Initializing) before entering the Running state.

Terminal states include Succeeded, Failed, Aborted, and Timed Out. The diagram also captures the retry mechanism where a Running action can transition back to Queued if a retryable failure occurs. Transitions can be triggered by the Flyte backend (scheduling, execution completion, timeouts) or by the user via the SDK (e.g., calling Action.abort()).
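
The transitions described above can be written down as a small lookup table. This is an illustrative sketch rather than code from the SDK: the Phase enum and helper names are hypothetical, and which non-terminal states can reach ABORTED, FAILED, or TIMED_OUT is an assumption, since the text only says that these are terminal states.

```python
from enum import Enum
from typing import Dict, FrozenSet, Sequence


class Phase(Enum):
    """Hypothetical mirror of the states named in the lifecycle description."""
    UNDEFINED = "undefined"
    QUEUED = "queued"
    WAITING_FOR_RESOURCES = "waiting_for_resources"
    INITIALIZING = "initializing"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    ABORTED = "aborted"
    TIMED_OUT = "timed_out"


# Allowed transitions. The ABORTED edges from pre-running states are an
# assumption; the source text only describes abort as user-triggered.
TRANSITIONS: Dict[Phase, FrozenSet[Phase]] = {
    Phase.UNDEFINED: frozenset({Phase.QUEUED}),
    Phase.QUEUED: frozenset({Phase.WAITING_FOR_RESOURCES, Phase.ABORTED}),
    Phase.WAITING_FOR_RESOURCES: frozenset({Phase.INITIALIZING, Phase.ABORTED}),
    Phase.INITIALIZING: frozenset({Phase.RUNNING, Phase.ABORTED}),
    Phase.RUNNING: frozenset({
        Phase.SUCCEEDED,
        Phase.FAILED,
        Phase.ABORTED,
        Phase.TIMED_OUT,
        Phase.QUEUED,  # retry after a retryable failure
    }),
}


def can_transition(src: Phase, dst: Phase) -> bool:
    """Terminal states have no entry in TRANSITIONS, so nothing follows them."""
    return dst in TRANSITIONS.get(src, frozenset())


def is_valid_path(path: Sequence[Phase]) -> bool:
    """Check every consecutive pair of phases against the table."""
    return all(can_transition(a, b) for a, b in zip(path, path[1:]))


happy_path = [
    Phase.UNDEFINED, Phase.QUEUED, Phase.WAITING_FOR_RESOURCES,
    Phase.INITIALIZING, Phase.RUNNING, Phase.SUCCEEDED,
]
retry_path = [
    Phase.RUNNING, Phase.QUEUED, Phase.WAITING_FOR_RESOURCES,
    Phase.INITIALIZING, Phase.RUNNING, Phase.FAILED,
]
```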

Key findings from the code:

  • The ActionPhase enum defines the core states.
  • The is_terminal property in ActionPhase identifies the final states.
  • The Action.abort() method in src/flyte/remote/_action.py explicitly triggers a transition to the Aborted state.
  • The Action.wait() and Action.watch() methods allow users to monitor these state transitions in real time.
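
Monitoring an action until it finishes generally amounts to observing phases until a terminal one appears. The sketch below is a generic polling loop, not the implementation or signature of Action.wait() or Action.watch(): wait_for_terminal, get_phase, and on_change are hypothetical names, and phases are plain strings fed from a canned list instead of a backend.

```python
import time
from typing import Callable, List, Optional

# Terminal phase names taken from the lifecycle description above.
TERMINAL_PHASES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED_OUT"}


def wait_for_terminal(
    get_phase: Callable[[], str],
    poll_interval: float = 0.0,
    max_polls: int = 100,
    on_change: Optional[Callable[[str], None]] = None,
) -> str:
    """Poll get_phase() until it reports a terminal phase, then return it."""
    last: Optional[str] = None
    for _ in range(max_polls):
        phase = get_phase()
        if phase != last:
            # Report each phase only once, when it first appears.
            if on_change is not None:
                on_change(phase)
            last = phase
        if phase in TERMINAL_PHASES:
            return phase
        time.sleep(poll_interval)
    raise TimeoutError(f"no terminal phase after {max_polls} polls")


# Canned phase sequence standing in for a remote action.
feed = iter(["QUEUED", "QUEUED", "RUNNING", "RUNNING", "SUCCEEDED"])
seen: List[str] = []
final_phase = wait_for_terminal(lambda: next(feed), on_change=seen.append)
```

The on_change callback plays the role of a watcher that reacts to each transition, while the return value corresponds to blocking until completion.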

Key Architectural Findings:

  • The ActionPhase enum in src/flyte/models.py defines the primary execution states: QUEUED, WAITING_FOR_RESOURCES, INITIALIZING, RUNNING, SUCCEEDED, FAILED, ABORTED, and TIMED_OUT.
  • Terminal states are explicitly defined in the code via the is_terminal property and the _action_done_check utility function.
  • The SDK provides an abort() method on the Action class to manually transition an execution to the ABORTED state.
  • Retry logic (though managed by the backend) is reflected in the SDK through attempt tracking and the ability for an action to return to a non-terminal state after a failure.
  • The ActionPhase.from_protobuf method handles the mapping from the underlying Flyte IDL phases to the SDK's enum, including handling of the UNSPECIFIED (Undefined) state.
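
The findings above suggest an enum that carries both a terminal-state check and a mapping from protobuf phase names. The following is a minimal sketch of that pattern, not the contents of src/flyte/models.py: SketchActionPhase and from_protobuf_name are hypothetical names, and apart from ACTION_PHASE_UNSPECIFIED, the ACTION_PHASE_ prefixed names are assumed to follow the same convention.

```python
from enum import Enum


class SketchActionPhase(Enum):
    """Hypothetical stand-in for the SDK's ActionPhase enum."""
    UNDEFINED = "undefined"
    QUEUED = "queued"
    WAITING_FOR_RESOURCES = "waiting_for_resources"
    INITIALIZING = "initializing"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    ABORTED = "aborted"
    TIMED_OUT = "timed_out"

    @property
    def is_terminal(self) -> bool:
        """True for phases from which no further transition is expected."""
        return self in _TERMINAL

    @classmethod
    def from_protobuf_name(cls, name: str) -> "SketchActionPhase":
        """Map an assumed 'ACTION_PHASE_<NAME>' string to an enum member."""
        prefix = "ACTION_PHASE_"
        if not name.startswith(prefix):
            raise ValueError(f"unexpected phase name: {name!r}")
        short = name[len(prefix):]
        if short == "UNSPECIFIED":
            # The protobuf default value maps to the Undefined state.
            return cls.UNDEFINED
        try:
            return cls[short]
        except KeyError:
            raise ValueError(f"unknown phase: {name!r}") from None


# Defined after the class so it can reference the members.
_TERMINAL = frozenset({
    SketchActionPhase.SUCCEEDED,
    SketchActionPhase.FAILED,
    SketchActionPhase.ABORTED,
    SketchActionPhase.TIMED_OUT,
})

terminal_names = sorted(p.name for p in SketchActionPhase if p.is_terminal)
```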