Resource Management
To allocate compute resources such as CPU, Memory, GPU, and TPU to your tasks, use the Resources class within a TaskEnvironment or as an override during task execution.
import flyte

# Define resources for all tasks in an environment
env = flyte.TaskEnvironment(
    name="ml-env",
    resources=flyte.Resources(
        cpu=2,
        memory="4Gi",
        gpu="A100:1",
        shm="auto",
    ),
)

@env.task
async def train_model():
    ...

# Or override resources for a specific task call
await train_model.override(
    resources=flyte.Resources(cpu=4, memory="16Gi", gpu="A100:2")
)()
Specifying CPU and Memory
The Resources class (defined in src/flyte/_resources.py) allows you to specify CPU and Memory using either single values or request/limit ranges.
- CPU: Accepts int, float, or Kubernetes-style strings (e.g., "500m").
- Memory: Accepts strings with Kubernetes units (e.g., "1Gi", "512Mi").
To set separate requests and limits, provide a tuple:
# Request 1 CPU (limit 2) and 2Gi memory (limit 4Gi)
flyte.Resources(
    cpu=(1, 2),
    memory=("2Gi", "4Gi"),
)
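To make the accepted value formats concrete, the following is a minimal, standalone sketch of how Kubernetes-style quantities and (request, limit) tuples could be interpreted. It does not import or reproduce Flyte's own code: the helper names cpu_to_cores, memory_to_bytes, and split_request_limit are hypothetical, and treating a single value as a request with no explicit limit is an assumption of this sketch rather than documented Flyte behavior.

```python
from decimal import Decimal
from typing import Optional, Tuple, Union

Value = Union[int, float, str]

# Suffix multipliers for Kubernetes-style memory quantities.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,  # binary units
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,      # decimal units
}


def cpu_to_cores(value: Value) -> float:
    """Convert an int, float, or string such as "500m" to a number of cores."""
    if isinstance(value, str) and value.endswith("m"):
        # Millicores: 1000m equals one core.
        return float(Decimal(value[:-1]) / 1000)
    return float(value)


def memory_to_bytes(value: str) -> int:
    """Convert a string such as "512Mi" or "1Gi" to a byte count."""
    # Try two-character suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(_MEMORY_SUFFIXES, key=len, reverse=True):
        if value.endswith(suffix):
            number = Decimal(value[: -len(suffix)])
            return int(number * _MEMORY_SUFFIXES[suffix])
    return int(value)  # no suffix: a plain byte count


def split_request_limit(
    spec: Union[Value, Tuple[Value, Value]],
) -> Tuple[Value, Optional[Value]]:
    """Return (request, limit) for a single value or a two-element tuple."""
    if isinstance(spec, tuple):
        if len(spec) != 2:
            raise ValueError(
                f"expected (request, limit), got {len(spec)} elements"
            )
        return spec[0], spec[1]
    # Assumption of this sketch: a single value carries no explicit limit.
    return spec, None
```

For example, split_request_limit(("2Gi", "4Gi")) yields the pair ("2Gi", "4Gi"), and each element can then be passed to memory_to_bytes for comparison.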
Allocating Accelerators (GPU, TPU, Neuron)
You can allocate accelerators using three different formats for the gpu parameter:
1. Simple Count
Pass an int to request a generic GPU.
flyte.Resources(gpu=1)
2. String Format (Type and Quantity)
Pass a string in the format "Type:Quantity". The type must match one of the supported Accelerators literals in src/flyte/_resources.py (e.g., T4, L4, A100, H100, V100).
flyte.Resources(gpu="A100 80G:8")
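Because a device name may itself contain a space (as in "A100 80G"), it can help to see how a "Type:Quantity" string breaks apart. The sketch below is illustrative only: parse_accelerator is a hypothetical helper, not part of Flyte, and it checks only the shape of the string, not whether the device name is one of the supported accelerator types.

```python
from typing import Tuple


def parse_accelerator(spec: str) -> Tuple[str, int]:
    """Split a "Type:Quantity" string into (device name, quantity)."""
    # Split on the last colon so the device name is kept whole.
    device, sep, count = spec.rpartition(":")
    if not sep or not device:
        raise ValueError(f"expected 'Type:Quantity', got {spec!r}")
    quantity = int(count)  # raises ValueError if the quantity is not an integer
    if quantity < 1:
        raise ValueError("quantity must be at least 1")
    return device, quantity
```

With this helper, parse_accelerator("A100 80G:8") separates the device name "A100 80G" from the quantity 8, while a string without a colon is rejected.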
3. Advanced Device Configuration
For complex requirements like MIG partitioning or TPU slices, use the Device helper functions: GPU, TPU, Neuron, AMD_GPU, or HABANA_GAUDI.
# A100 with MIG partitioning (1g.5gb slice)
flyte.Resources(
    gpu=flyte.GPU(device="A100", quantity=1, partition="1g.5gb")
)

# Google Cloud TPU v5p with a specific slice
flyte.Resources(
    gpu=flyte.TPU(device="V5P", partition="2x2x1")
)
Shared Memory and Disk
- Disk: Use the disk parameter to request ephemeral storage.
- Shared Memory (shm): Useful for ML data loading. Setting shm="auto" automatically requests the maximum shared memory available on the node.
flyte.Resources(
    disk="100Gi",
    shm="16Gi",  # Or "auto"
)
Troubleshooting and Constraints
- GPU Quantity: When using the Device class or GPU() helper directly, the quantity must be at least 1.
- Validation: Resources validates that CPU and Memory tuples contain exactly two elements.
- String Literals: If using the string format for GPUs (e.g., "H100:1"), the device name must exactly match the supported types defined in the Accelerators literal in src/flyte/_resources.py.
- OOM Recovery: You can use .override() inside a try/except block to retry a task with more memory if it fails with an OOMError.
try:
    await my_task()
except flyte.errors.OOMError:
    # Retry with more memory
    await my_task.override(resources=flyte.Resources(memory="16Gi"))()
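The single retry above can be generalized to several memory sizes tried in order. The following is a minimal sketch of that pattern written against plain asyncio so that it runs without Flyte installed. The function name run_with_memory_ladder and its parameters are hypothetical; the caller supplies both the coroutine that launches the task and the exception type that signals an out-of-memory failure.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, Type, TypeVar

T = TypeVar("T")


async def run_with_memory_ladder(
    run: Callable[[str], Awaitable[T]],
    sizes: Sequence[str],
    oom_error: Type[BaseException],
) -> T:
    """Await run(size) for each size in order until one does not raise oom_error."""
    if not sizes:
        raise ValueError("sizes must not be empty")
    for index, size in enumerate(sizes):
        try:
            return await run(size)
        except oom_error:
            if index == len(sizes) - 1:
                raise  # no larger size left; surface the last failure
    raise AssertionError("unreachable")  # the loop always returns or raises
```

In a Flyte task, run could be a function such as lambda mem: my_task.override(resources=flyte.Resources(memory=mem))() and oom_error could be flyte.errors.OOMError, with sizes like ["4Gi", "8Gi", "16Gi"]. Errors other than the given exception type propagate immediately, so unrelated failures are not retried.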