Structured Data: Dataclasses and Pydantic
Flyte supports complex, structured data as task inputs and outputs through Python's standard @dataclass and Pydantic's BaseModel. This implementation allows developers to group related data and metadata into single objects while maintaining strong typing and automatic serialization.
The core of this system lies in the DataclassTransformer and PydanticTransformer classes within flyte.types._type_engine. These transformers manage the lifecycle of structured data, from Python objects to binary transport formats and JSON schemas for the Flyte UI.
Dataclasses and Mashumaro
The DataclassTransformer handles standard Python dataclasses. It relies on the mashumaro library to perform efficient serialization.
Serialization and Transport
When a dataclass is used as a task output, the transformer converts it into MessagePack bytes. This format is chosen for its efficiency and compactness compared to standard JSON.
The lifecycle in DataclassTransformer.to_literal follows these steps:
- Lazy Uploading: Before serialization, the transformer calls _invoke_lazy_uploaders. This ensures that any nested Flyte IO types (like FlyteFile or FlyteDirectory) are uploaded to the blob store before the parent object is serialized.
- Encoding: It uses a MessagePackEncoder (from mashumaro) to convert the dataclass into bytes.
- IDL Representation: The bytes are wrapped in a Flyte Literal as a Binary scalar with the MESSAGEPACK tag.
JSON Schema Generation
To enable the Flyte UI to display and validate these structures, DataclassTransformer.get_literal_type generates a JSON schema (Draft 2020-12) using mashumaro.jsonschema.build_json_schema. This schema is embedded in the LiteralType metadata.
# Example of a nested dataclass structure
from dataclasses import dataclass
from typing import List


@dataclass
class InferenceRequest:
    feature_a: float
    feature_b: float


@dataclass
class BatchRequest:
    requests: List[InferenceRequest]
Pydantic Models
The PydanticTransformer provides support for Pydantic's BaseModel. This is particularly useful for data validation and when working with existing Pydantic-based codebases.
Handling Flyte IO Types
When using Flyte-specific types like File or Dir inside a Pydantic model, you must enable arbitrary_types_allowed in the model's configuration. This allows Pydantic to accept Flyte's internal IO classes which are not standard Python types.
from typing import List

from pydantic import BaseModel
from flyte.io import File


class BatchPredictionResults(BaseModel):
    predictions: List[float]
    results_file: File

    class Config:
        # Let Pydantic accept Flyte IO classes as field types
        arbitrary_types_allowed = True
Enum Consistency
Flyte maintains consistency between standalone Enums and Enums nested within Pydantic models. The CustomPydanticJsonSchemaGenerator ensures that Enums are represented by their member names (e.g., "RED") rather than their values (e.g., 1) in the generated JSON schema. This matches the behavior of Flyte's EnumTransformer.
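The name-versus-value distinction can be illustrated without Pydantic. The following is a minimal sketch and not the CustomPydanticJsonSchemaGenerator itself: the Color enum and both helper functions are hypothetical, and they only show what a name-based schema fragment looks like next to a value-based one.

```python
from enum import Enum
from typing import Any, Dict, Type


class Color(Enum):
    RED = 1
    GREEN = 2


def enum_schema_by_name(enum_cls: Type[Enum]) -> Dict[str, Any]:
    """Build a JSON schema fragment that lists member names, not values."""
    return {
        "title": enum_cls.__name__,
        "type": "string",
        "enum": [member.name for member in enum_cls],
    }


def enum_schema_by_value(enum_cls: Type[Enum]) -> Dict[str, Any]:
    """For contrast: a fragment that lists the raw member values."""
    return {
        "title": enum_cls.__name__,
        "enum": [member.value for member in enum_cls],
    }


by_name = enum_schema_by_name(Color)    # enum entries are "RED", "GREEN"
by_value = enum_schema_by_value(Color)  # enum entries are 1, 2
```

Listing names keeps the schema stable when a member's underlying value changes, and it gives the same user-facing choices whether the enum is a standalone input or a field inside a model.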
Nested Flyte IO Types
One of the most powerful features of the structured data implementation is the recursive handling of Flyte IO types. If a dataclass or Pydantic model contains a flyte.io.File, flyte.io.Dir, or StructuredDataset, the Flyte SDK automatically manages the data movement.
In to_literal, both transformers execute:
await _invoke_lazy_uploaders(python_val)
This function traverses the object tree and triggers the upload of any local files to the remote blob store (e.g., S3 or GCS). When the receiving task deserializes the object, these IO types are reconstructed with their remote paths, ready to be downloaded or streamed.
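The traversal idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the body of _invoke_lazy_uploaders: LocalFile, its upload method, the bucket name, and upload_nested are all hypothetical stand-ins for Flyte's IO types and internals.

```python
import asyncio
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, List, Optional


@dataclass
class LocalFile:
    """Hypothetical stand-in for an IO type that uploads lazily."""

    local_path: str
    remote_path: Optional[str] = None

    async def upload(self) -> None:
        # A real implementation would copy bytes to the blob store here.
        self.remote_path = "s3://example-bucket/" + self.local_path.lstrip("/")


async def upload_nested(value: Any) -> None:
    """Recursively visit a value and upload every pending LocalFile."""
    if isinstance(value, LocalFile):
        if value.remote_path is None:
            await value.upload()
    elif is_dataclass(value) and not isinstance(value, type):
        # Descend into each field of a dataclass instance.
        for f in fields(value):
            await upload_nested(getattr(value, f.name))
    elif isinstance(value, dict):
        for item in value.values():
            await upload_nested(item)
    elif isinstance(value, (list, tuple, set)):
        for item in value:
            await upload_nested(item)


@dataclass
class Report:
    name: str
    attachments: List[LocalFile] = field(default_factory=list)


report = Report("daily", [LocalFile("/tmp/a.csv"), LocalFile("/tmp/b.csv")])
asyncio.run(upload_nested(report))
```

After the call, every nested LocalFile carries a remote path, so the parent object can be serialized without referring to files that exist only on the local disk.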
Interoperability: Pydantic inside Dataclasses
The codebase provides a PydanticSchemaPlugin for mashumaro. This plugin allows DataclassTransformer to correctly generate JSON schemas even when a dataclass contains a Pydantic BaseModel as a field. It bridges the two systems by delegating schema generation for the Pydantic field to Pydantic's own model_json_schema method.
Performance and Metadata
Both transformers use MessagePack for binary transport, which is indicated in the LiteralType metadata:
meta_struct.update(
    {
        CACHE_KEY_METADATA: {
            SERIALIZATION_FORMAT: MESSAGEPACK,
        }
    }
)
This metadata ensures that the Flyte engine and other SDKs know how to decode the binary payload. If the original Python class is unavailable (e.g., when inspecting an execution from a different environment), the guess_python_type method in both transformers can reconstruct a dynamic class from the JSON schema metadata stored in the LiteralType.
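Rebuilding a class from a schema can be approximated with dataclasses.make_dataclass. The following is a simplified sketch, not the guess_python_type implementation: the helper names are hypothetical, and it handles only primitive, array, and nested object types while ignoring required fields, defaults, and references.

```python
from dataclasses import fields, is_dataclass, make_dataclass
from typing import Any, Dict, List

# Maps JSON schema primitive types to Python types (deliberately incomplete).
_PRIMITIVES = {"string": str, "integer": int, "number": float, "boolean": bool}


def python_type_from_schema(schema: Dict[str, Any]) -> Any:
    """Translate one schema node into a Python type annotation."""
    kind = schema.get("type")
    if kind in _PRIMITIVES:
        return _PRIMITIVES[kind]
    if kind == "array":
        return List[python_type_from_schema(schema.get("items", {}))]
    if kind == "object" and "properties" in schema:
        return dataclass_from_schema(schema)
    return Any  # Fall back when the node is not understood.


def dataclass_from_schema(schema: Dict[str, Any]) -> type:
    """Create a dataclass whose fields mirror the schema's properties."""
    field_specs = [
        (name, python_type_from_schema(sub_schema))
        for name, sub_schema in schema["properties"].items()
    ]
    return make_dataclass(schema.get("title", "Dynamic"), field_specs)


schema = {
    "title": "InferenceRequest",
    "type": "object",
    "properties": {
        "feature_a": {"type": "number"},
        "feature_b": {"type": "number"},
    },
}

DynamicRequest = dataclass_from_schema(schema)
request = DynamicRequest(feature_a=0.5, feature_b=1.5)
```

The resulting class has the same field names and basic types as the original, which is enough to decode and inspect a payload even though none of the original class's methods or validators are recovered.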