Container Reuse & Scaling
This project provides two primary mechanisms for managing container lifecycles and scaling: ReusePolicy for optimizing task execution through warm container pools, and Scaling for managing the replica lifecycle of long-running applications.
Optimizing Task Performance with ReusePolicy
When environment creation is expensive (e.g., loading large ML models or initializing GPU drivers), the cold-start overhead can dominate task runtime. The ReusePolicy class in flyte._reusable_environment allows you to maintain a pool of warm containers that persist across multiple task invocations.
The Warm Pool Concept
By default, Flyte tasks run in isolated, ephemeral containers. When a TaskEnvironment is configured with a ReusePolicy, the system keeps a set of replicas running. Subsequent calls to tasks within that environment are dispatched to an already-running Python process on one of those replicas instead of a fresh container, which removes the cold-start cost from every call after the first.
import flyte

# Configure a warm pool for GPU tasks
gpu_env = flyte.TaskEnvironment(
    name="gpu_worker",
    resources=flyte.Resources(gpu="L4:1", memory="16Gi"),
    reusable=flyte.ReusePolicy(
        replicas=(1, 5),  # Scale between 1 and 5 replicas
        concurrency=10,   # Support 10 concurrent tasks per replica
        idle_ttl=300,     # Shut down the pool after 5 minutes of total idleness
    ),
)
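To see why process reuse helps, here is a minimal sketch in plain Python that does not call any Flyte API. The names load_model, predict, and LOAD_CALLS are hypothetical, and the small dictionary stands in for an expensive resource such as model weights. It assumes only what is described above: a warm replica keeps module-level objects alive between invocations.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for an expensive resource (e.g. model weights).
_MODEL: Optional[Dict[str, List[float]]] = None
LOAD_CALLS = 0  # counts how many times the expensive load actually ran


def load_model() -> Dict[str, List[float]]:
    """Load the model once per process and cache it at module level."""
    global _MODEL, LOAD_CALLS
    if _MODEL is None:
        LOAD_CALLS += 1  # in a real task this would be the slow step
        _MODEL = {"weights": [0.5, 0.25, 0.25]}
    return _MODEL


def predict(features: List[float]) -> float:
    """Simulated task body: a dot product against the cached weights."""
    weights = load_model()["weights"]
    return sum(w * x for w, x in zip(weights, features))


# Three invocations in the same process trigger only one load.
results = [predict([1.0, 2.0, 4.0]) for _ in range(3)]
```

In an ephemeral container, every invocation would start with _MODEL unset and pay the load cost again; in a reused process, only the first call does.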
Key Configuration Parameters
- replicas: Can be a fixed int or a tuple (min, max). A minimum of 2 replicas is recommended to prevent starvation (see below).
- concurrency: The maximum number of concurrent tasks per replica. Values greater than 1 are only supported for async tasks.
- idle_ttl: The duration (seconds or timedelta) after which the entire environment shuts down if no tasks are running.
- scaledown_ttl: The minimum time an individual replica must be idle before it is removed during auto-scaling.
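The accepted value shapes can be illustrated with a small stand-alone sketch. The helpers normalize_replicas and ttl_seconds are hypothetical and not part of Flyte; they only mirror the forms listed above (an int or a (min, max) tuple, seconds or a timedelta). Treating a fixed int n as the range (n, n) is an assumption made for illustration, not a statement about how ReusePolicy validates its arguments internally.

```python
from datetime import timedelta
from typing import Tuple, Union


def normalize_replicas(replicas: Union[int, Tuple[int, int]]) -> Tuple[int, int]:
    """Return (min, max). Assumption: a fixed int n means the range (n, n)."""
    if isinstance(replicas, int):
        lo, hi = replicas, replicas
    else:
        lo, hi = replicas
    if lo < 0 or hi < lo:
        raise ValueError(f"invalid replica range: ({lo}, {hi})")
    return lo, hi


def ttl_seconds(ttl: Union[int, timedelta]) -> int:
    """Convert a TTL given as seconds or as a timedelta into whole seconds."""
    if isinstance(ttl, timedelta):
        return int(ttl.total_seconds())
    return int(ttl)
```

For example, normalize_replicas(2) and normalize_replicas((2, 2)) describe the same fixed pool, and ttl_seconds(timedelta(minutes=5)) is equivalent to passing 300.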
Important Constraints and Gotchas
- Starvation: If a parent task (running on a replica) invokes a child task that requires the same environment, and no other replicas are available, the child waits for a free slot while the parent waits for the child, so neither can finish. The ReusePolicy implementation in src/flyte/_reusable_environment.py issues a warning if max_replicas and concurrency are both set to 1.
- Resource Overrides: When a task is part of a reusable environment, you cannot override its resources, env_vars, or secrets using task.override() unless you explicitly disable reusability for that call. This is enforced in src/flyte/_task.py:

    # This will raise a ValueError because reusable is active
    my_task.override(resources=flyte.Resources(cpu="4"))

    # Correct way to override resources for a specific call
    my_task.override(reusable="off", resources=flyte.Resources(cpu="4"))

- State Management: Because the Python process is reused, global state (like caches or singleton objects) persists between invocations. You must manage memory and global variables carefully to avoid side effects.
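The starvation scenario described above can be reproduced without Flyte. The following is a simulation, not Flyte's scheduler: it models the pool as an asyncio.Semaphore whose size is assumed to equal replicas × concurrency, and the function names simulate and _run_parent are hypothetical. A timeout stands in for "waits indefinitely" so the example terminates.

```python
import asyncio


async def _run_parent(total_slots: int, timeout: float) -> str:
    # Assumption: the pool offers replicas * concurrency execution slots.
    slots = asyncio.Semaphore(total_slots)

    async def child() -> str:
        async with slots:  # the child needs its own slot in the same pool
            return "child done"

    async def parent() -> str:
        async with slots:  # the parent holds a slot while awaiting the child
            return await child()

    try:
        return await asyncio.wait_for(parent(), timeout=timeout)
    except asyncio.TimeoutError:
        # The child never obtained a slot, so the parent never returned.
        return "starved"


def simulate(replicas: int, concurrency: int, timeout: float = 0.2) -> str:
    """Run one parent that calls one child inside the same simulated pool."""
    return asyncio.run(_run_parent(replicas * concurrency, timeout))


single_slot = simulate(replicas=1, concurrency=1)  # parent blocks its own child
two_slots = simulate(replicas=2, concurrency=1)    # child runs on the spare slot
```

With one slot the parent and child block each other; with a second slot (from another replica or from concurrency above 1) the child completes and the parent returns.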
Scaling Long-Running Apps
For long-running services like APIs or model servers, the Scaling class in flyte.app._types controls how many replicas are active and how they respond to traffic. This is used primarily with AppEnvironment and its subclasses (e.g., FastAPIAppEnvironment).
Scaling Patterns
The Scaling class supports several common deployment patterns:
- Scale-to-Zero (Default): Scaling(replicas=(0, 1)) ensures no resources are consumed when the app is idle. The first request will trigger a cold start.
- Always-On: Scaling(replicas=(1, 1)) maintains exactly one replica at all times, eliminating cold starts for the first user.
- Burstable: Scaling(replicas=(1, 10)) maintains a baseline of 1 replica and scales up to 10 based on traffic metrics.
- High-Availability: Scaling(replicas=(2, 10)) ensures at least two replicas are always running across different nodes.
Autoscaling Metrics
You can define the trigger for scaling using the metric parameter:
- Scaling.Concurrency(val): Scales up when the number of concurrent requests per replica exceeds val.
- Scaling.RequestRate(val): Scales up when the number of requests per second per replica exceeds val.
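The text above does not specify the exact formula used to pick a replica count. A common approach in request-driven autoscalers is to divide the observed load by the per-replica target and clamp the result to the configured range; the sketch below assumes that approach. The function desired_replicas is hypothetical and is not a Flyte API.

```python
import math
from typing import Tuple


def desired_replicas(
    observed_load: float,
    target_per_replica: float,
    replicas: Tuple[int, int],
) -> int:
    """Estimate a replica count for a given load.

    observed_load is total concurrent requests (or requests per second),
    target_per_replica plays the role of val, and replicas is (min, max).
    Assumption: ceil(load / target), clamped to the replica range.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    lo, hi = replicas
    needed = math.ceil(observed_load / target_per_replica)
    return max(lo, min(hi, needed))
```

Under this assumption, 25 concurrent requests with a target of 10 would call for 3 replicas, which a (0, 2) range caps at 2, while zero load with the same range allows the app to scale to zero.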
Example: Scaling a Model Server
In examples/ml/batch_inference_saturate_app.py, a FastAPI application is configured to handle high-concurrency batch inference:
from flyte.app import Scaling, FastAPIAppEnvironment

# my_fastapi_app is assumed to be a FastAPI instance defined elsewhere
app_env = FastAPIAppEnvironment(
    name="batch-inference-app",
    app=my_fastapi_app,
    scaling=Scaling(
        replicas=(0, 2),
        metric=Scaling.Concurrency(val=10),
        scaledown_after=300,  # Wait 5 minutes before scaling down
    ),
)
Comparison: ReusePolicy vs. Scaling
While both manage replicas, they serve different purposes:
- ReusePolicy is for Tasks. It manages a pool of workers that execute discrete units of work (functions) triggered by the Flyte orchestrator.
- Scaling is for Apps. It manages replicas of a persistent server (like FastAPI or Streamlit) that responds to external network requests.