Container Reuse & Scaling

This project provides two primary mechanisms for managing container lifecycles and scaling: ReusePolicy for optimizing task execution through warm container pools, and Scaling for managing the replica lifecycle of long-running applications.

Optimizing Task Performance with ReusePolicy

When environment creation is expensive (e.g., loading large ML models or initializing GPU drivers), the cold-start overhead can dominate task runtime. The ReusePolicy class in flyte._reusable_environment allows you to maintain a pool of warm containers that persist across multiple task invocations.

The Warm Pool Concept

By default, Flyte tasks run in isolated, ephemeral containers. When a TaskEnvironment is configured with a ReusePolicy, the system keeps a set of replicas running. Subsequent calls to tasks within that environment reuse the same Python process, significantly reducing latency.

import flyte

# Configure a warm pool for GPU tasks
gpu_env = flyte.TaskEnvironment(
    name="gpu_worker",
    resources=flyte.Resources(gpu="L4:1", memory="16Gi"),
    reusable=flyte.ReusePolicy(
        replicas=(1, 5),  # Scale between 1 and 5 replicas
        concurrency=10,   # Support 10 concurrent tasks per replica
        idle_ttl=300      # Shut down the pool after 5 minutes of total idleness
    ),
)

Key Configuration Parameters

replicas: Can be a fixed int or a tuple(min, max). A minimum of 2 replicas is recommended to prevent starvation (see below).
concurrency: The maximum number of concurrent tasks per replica. Values greater than 1 are only supported for async tasks.
idle_ttl: The duration (seconds or timedelta) after which the entire environment shuts down if no tasks are running.
scaledown_ttl: The minimum time an individual replica must be idle before it is removed during auto-scaling.

Important Constraints and Gotchas

Starvation: If a parent task (running on a replica) invokes a child task that requires the same environment, and no other replicas are available, the child task will wait indefinitely for the parent to finish. The ReusePolicy implementation in src/flyte/_reusable_environment.py issues a warning if max_replicas and concurrency are both set to 1.
Resource Overrides: When a task is part of a reusable environment, you cannot override its resources, env_vars, or secrets using task.override() unless you explicitly disable reusability for that call. This is enforced in src/flyte/_task.py:
```
# This will raise a ValueError because reusable is active
my_task.override(resources=flyte.Resources(cpu="4"))

# Correct way to override resources for a specific call
my_task.override(reusable="off", resources=flyte.Resources(cpu="4"))
```
State Management: Because the Python process is reused, global state (like caches or singleton objects) persists between invocations. You must manage memory and global variables carefully to avoid side effects.

Scaling Long-Running Apps

For long-running services like APIs or model servers, the Scaling class in flyte.app._types controls how many replicas are active and how they respond to traffic. This is used primarily with AppEnvironment and its subclasses (e.g., FastAPIAppEnvironment).

Scaling Patterns

The Scaling class supports several common deployment patterns:

Scale-to-Zero (Default): Scaling(replicas=(0, 1)) ensures no resources are consumed when the app is idle. The first request will trigger a cold start.
Always-On: Scaling(replicas=(1, 1)) maintains exactly one replica at all times, eliminating cold starts for the first user.
Burstable: Scaling(replicas=(1, 10)) maintains a baseline of 1 replica and scales up to 10 based on traffic metrics.
High-Availability: Scaling(replicas=(2, 10)) ensures at least two replicas are always running across different nodes.

Autoscaling Metrics

You can define the trigger for scaling using the metric parameter:

Scaling.Concurrency(val): Scales up when the number of concurrent requests per replica exceeds val.
Scaling.RequestRate(val): Scales up when the number of requests per second per replica exceeds val.

Example: Scaling a Model Server

In examples/ml/batch_inference_saturate_app.py, a FastAPI application is configured to handle high-concurrency batch inference:

from flyte.app import Scaling, FastAPIAppEnvironment

app_env = FastAPIAppEnvironment(
    name="batch-inference-app",
    app=my_fastapi_app,
    scaling=Scaling(
        replicas=(0, 2),
        metric=Scaling.Concurrency(val=10),
        scaledown_after=300, # Wait 5 minutes before scaling down
    ),
)

Comparison: ReusePolicy vs. Scaling

While both manage replicas, they serve different purposes:

ReusePolicy is for Tasks. It manages a pool of workers that execute discrete units of work (functions) triggered by the Flyte orchestrator.
Scaling is for Apps. It manages replicas of a persistent server (like FastAPI or Streamlit) that responds to external network requests.

Optimizing Task Performance with ReusePolicy​

The Warm Pool Concept​

Key Configuration Parameters​

Important Constraints and Gotchas​

Scaling Long-Running Apps​

Scaling Patterns​

Autoscaling Metrics​

Example: Scaling a Model Server​

Comparison: ReusePolicy vs. Scaling​