Structured Data: Dataclasses and Pydantic

Flyte supports complex, structured data as task inputs and outputs through Python's standard @dataclass and Pydantic's BaseModel. This implementation allows developers to group related data and metadata into single objects while maintaining strong typing and automatic serialization.

The core of this system lies in the DataclassTransformer and PydanticTransformer classes within flyte.types._type_engine. These transformers manage the lifecycle of structured data, from Python objects to binary transport formats and JSON schemas for the Flyte UI.

Dataclasses and Mashumaro

The DataclassTransformer handles standard Python dataclasses. It relies on the mashumaro library to perform efficient serialization.

Serialization and Transport

When a dataclass is used as a task output, the transformer converts it into MessagePack bytes. This format is chosen for its efficiency and compactness compared to standard JSON.

The lifecycle in DataclassTransformer.to_literal follows these steps:

  1. Lazy Uploading: Before serialization, the transformer calls _invoke_lazy_uploaders. This ensures that any nested Flyte IO types (like File or Dir from flyte.io) are uploaded to the blob store before the parent object is serialized.
  2. Encoding: It uses a MessagePackEncoder (from mashumaro) to convert the dataclass into bytes.
  3. IDL Representation: The bytes are wrapped in a Flyte Literal as a Binary scalar with the MESSAGEPACK tag.
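The ordering of the three steps above can be sketched with the standard library alone. This is a minimal illustration, not Flyte's implementation: BinaryLiteral, invoke_lazy_uploaders, encode, and SKETCH_TAG are hypothetical names, and JSON bytes stand in for MessagePack so the example has no third-party dependencies.

```python
import asyncio
import json
from dataclasses import asdict, dataclass
from typing import Any, List

SKETCH_TAG = "json"  # Flyte tags the real payload as MESSAGEPACK; this sketch encodes JSON


@dataclass
class InferenceRequest:
    feature_a: float
    feature_b: float


@dataclass
class BinaryLiteral:
    """Hypothetical stand-in for a Flyte Literal holding a Binary scalar."""
    value: bytes
    tag: str


calls: List[str] = []  # records the order in which the steps run


async def invoke_lazy_uploaders(obj: Any) -> None:
    # Step 1 stand-in: a real implementation would upload nested IO types here.
    calls.append("upload")


def encode(obj: Any) -> bytes:
    # Step 2 stand-in: JSON bytes instead of MessagePack bytes.
    calls.append("encode")
    return json.dumps(asdict(obj)).encode("utf-8")


async def to_literal(obj: Any) -> BinaryLiteral:
    await invoke_lazy_uploaders(obj)   # uploads must finish before encoding
    payload = encode(obj)
    calls.append("wrap")               # Step 3: wrap the bytes with a format tag
    return BinaryLiteral(value=payload, tag=SKETCH_TAG)


literal = asyncio.run(to_literal(InferenceRequest(feature_a=1.0, feature_b=2.5)))
```

Running the upload step first matters because the encoded bytes should carry the remote location of each nested file, not a local path that the next task cannot reach.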

JSON Schema Generation

To enable the Flyte UI to display and validate these structures, DataclassTransformer.get_literal_type generates a JSON schema (Draft 2020-12) using mashumaro.jsonschema.build_json_schema. This schema is embedded in the LiteralType metadata.

# Example of a nested dataclass structure
from dataclasses import dataclass
from typing import List

@dataclass
class InferenceRequest:
    feature_a: float
    feature_b: float

@dataclass
class BatchRequest:
    requests: List[InferenceRequest]

Pydantic Models

The PydanticTransformer provides support for Pydantic's BaseModel. This is particularly useful for data validation and when working with existing Pydantic-based codebases.

Handling Flyte IO Types

When using Flyte-specific types like File or Dir inside a Pydantic model, you must enable arbitrary_types_allowed in the model's configuration. This allows Pydantic to accept Flyte's internal IO classes which are not standard Python types.

from typing import List

from pydantic import BaseModel
from flyte.io import File

class BatchPredictionResults(BaseModel):
    predictions: List[float]
    results_file: File

    # Let Pydantic accept Flyte IO classes as field types
    class Config:
        arbitrary_types_allowed = True

Enum Consistency

Flyte maintains consistency between standalone Enums and Enums nested within Pydantic models. The CustomPydanticJsonSchemaGenerator ensures that Enums are represented by their member names (e.g., "RED") rather than their values (e.g., 1) in the generated JSON schema. This matches the behavior of Flyte's EnumTransformer.
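The difference between name-based and value-based enum schemas can be illustrated with a small standalone sketch. The helpers enum_schema_by_name and enum_schema_by_value are hypothetical and exist only for this example; they are not part of CustomPydanticJsonSchemaGenerator, and the exact schema keys Flyte emits may differ.

```python
from enum import Enum
from typing import Any, Dict, Type


class Color(Enum):
    RED = 1
    GREEN = 2


def enum_schema_by_name(enum_cls: Type[Enum]) -> Dict[str, Any]:
    # Name-based form: members are listed by name, as the text above describes.
    return {
        "title": enum_cls.__name__,
        "type": "string",
        "enum": [member.name for member in enum_cls],
    }


def enum_schema_by_value(enum_cls: Type[Enum]) -> Dict[str, Any]:
    # Value-based form, shown only for contrast.
    return {
        "title": enum_cls.__name__,
        "enum": [member.value for member in enum_cls],
    }


by_name = enum_schema_by_name(Color)
by_value = enum_schema_by_value(Color)
```

With the name-based form, a schema consumer sees the same strings whether the enum is a top-level task input or a field inside a model.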

Nested Flyte IO Types

Both transformers handle Flyte IO types recursively. If a dataclass or Pydantic model contains a flyte.io.File, flyte.io.Dir, or StructuredDataset at any nesting depth, the Flyte SDK automatically manages the data movement.

In to_literal, both transformers execute:

await _invoke_lazy_uploaders(python_val)

This function traverses the object tree and triggers the upload of any local files to the remote blob store (e.g., S3/GCS). When the task on the receiving end deserializes the object, these IO types are reconstructed with their remote paths, ready for the next task to download or stream.
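A recursive traversal of this kind might look like the following sketch. It is an assumption about the general shape, not the body of _invoke_lazy_uploaders: LocalFile, upload_nested, and the example-bucket path are hypothetical, and the "upload" only rewrites the path instead of copying any bytes.

```python
import asyncio
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, List


@dataclass
class LocalFile:
    """Hypothetical stand-in for a Flyte IO type such as flyte.io.File."""
    path: str
    uploaded: bool = False

    async def upload(self) -> None:
        # A real IO type would copy bytes to the blob store; this only rewrites the path.
        self.path = "s3://example-bucket/" + self.path.lstrip("/")
        self.uploaded = True


@dataclass
class Shard:
    name: str
    data: LocalFile


@dataclass
class Batch:
    shards: List[Shard]
    manifest: LocalFile


async def upload_nested(obj: Any) -> None:
    """Walk dataclass fields and containers, uploading every LocalFile found."""
    if isinstance(obj, LocalFile):
        if not obj.uploaded:
            await obj.upload()
    elif is_dataclass(obj) and not isinstance(obj, type):
        for f in fields(obj):
            await upload_nested(getattr(obj, f.name))
    elif isinstance(obj, (list, tuple, set)):
        for item in obj:
            await upload_nested(item)
    elif isinstance(obj, dict):
        for value in obj.values():
            await upload_nested(value)


batch = Batch(
    shards=[Shard("a", LocalFile("/tmp/a.csv")), Shard("b", LocalFile("/tmp/b.csv"))],
    manifest=LocalFile("/tmp/manifest.json"),
)
asyncio.run(upload_nested(batch))
```

After the traversal, every nested file carries a remote path, so serializing the parent object records locations that a downstream task can resolve.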

Interoperability: Pydantic inside Dataclasses

The codebase provides a PydanticSchemaPlugin for mashumaro. This plugin allows DataclassTransformer to correctly generate JSON schemas even when a dataclass contains a Pydantic BaseModel as a field. It bridges the two systems by delegating schema generation for the Pydantic field to Pydantic's own model_json_schema method.

Performance and Metadata

Both transformers use MessagePack for binary transport, which is indicated in the LiteralType metadata:

meta_struct.update(
    {
        CACHE_KEY_METADATA: {
            SERIALIZATION_FORMAT: MESSAGEPACK,
        }
    }
)

This metadata ensures that the Flyte engine and other SDKs know how to decode the binary payload. If the original Python class is unavailable (e.g., when inspecting an execution from a different environment), the guess_python_type method in both transformers can reconstruct a dynamic class from the JSON schema metadata stored in the LiteralType.
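The idea of rebuilding a class from a stored schema can be sketched with dataclasses.make_dataclass. This is a simplified illustration, not the logic of guess_python_type: class_from_schema and JSON_TO_PYTHON are hypothetical names, and the sketch handles only flat objects with primitive fields.

```python
from dataclasses import fields, make_dataclass
from typing import Any, Dict

# Mapping from JSON schema primitive types to Python types (sketch only).
JSON_TO_PYTHON = {"string": str, "integer": int, "number": float, "boolean": bool}


def class_from_schema(schema: Dict[str, Any]) -> type:
    """Build a dataclass whose fields mirror the schema's top-level properties."""
    field_defs = [
        (name, JSON_TO_PYTHON.get(prop.get("type"), Any))
        for name, prop in schema.get("properties", {}).items()
    ]
    return make_dataclass(schema.get("title", "Anonymous"), field_defs)


schema = {
    "title": "InferenceRequest",
    "type": "object",
    "properties": {
        "feature_a": {"type": "number"},
        "feature_b": {"type": "number"},
    },
    "required": ["feature_a", "feature_b"],
}

InferenceRequest = class_from_schema(schema)
req = InferenceRequest(feature_a=0.5, feature_b=1.5)
```

A production version would also need to handle nested objects, arrays, optional fields, and enums, which this sketch leaves out.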