Error Handling and Troubleshooting

The vik-advani-flyte-sdk-9b3ce04 codebase implements a structured error handling system designed to distinguish between different failure modes (user, system, and unknown) and to provide clean, actionable feedback in distributed execution environments.

The Error Hierarchy

All runtime exceptions in the SDK inherit from BaseRuntimeError (found in src/flyte/errors.py). This base class provides a consistent interface for error reporting, including an error code, a kind (e.g., "user", "system"), and an optional worker identifier.

BaseRuntimeError and Arithmetic Re-raising

BaseRuntimeError implements the arithmetic dunder methods (such as __add__ and __mul__) so that they re-raise the exception itself. This pattern is designed for flyte.map calls that set return_exceptions=True.

When exceptions are returned as values from a map task, user code might inadvertently attempt to perform operations on them (e.g., sum(results)). Instead of failing with a confusing TypeError: unsupported operand type(s) for +: 'int' and 'RuntimeUserError', the SDK re-raises the original subtask error.

# src/flyte/errors.py

class BaseRuntimeError(RuntimeError):
    def _reraise(self, *_args):
        """Re-raise this error when user code mistakenly treats it as a value."""
        raise self

    __add__ = _reraise
    __radd__ = _reraise
    # ... other arithmetic methods
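The effect of this pattern can be reproduced without the SDK. The sketch below uses a stand-in class, SubtaskError, which is hypothetical and only mirrors the _reraise idea from the excerpt; it is not the SDK's BaseRuntimeError. It shows sum() surfacing the original error, followed by one possible way to separate failures from values before aggregating.

```python
class SubtaskError(RuntimeError):
    """Hypothetical stand-in that mirrors the re-raise pattern; not the SDK class."""

    def _reraise(self, *_args):
        raise self

    __add__ = _reraise
    __radd__ = _reraise


# Pretend these came back from a map call that returns exceptions as values.
results = [1, 2, SubtaskError("subtask 2 failed"), 4]

surfaced = None
try:
    # int + SubtaskError is not supported by int, so Python falls back to
    # SubtaskError.__radd__, which raises the original error.
    total = sum(results)
except SubtaskError as err:
    surfaced = err
    print(f"sum() surfaced the original error: {err}")

# Separate failures from values before aggregating.
failures = [r for r in results if isinstance(r, BaseException)]
values = [r for r in results if not isinstance(r, BaseException)]
print(sum(values), len(failures))
```

Checking each result with isinstance before doing arithmetic keeps the aggregation explicit, while the re-raise behavior acts as a safety net when that check is forgotten.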

RuntimeUserError

The RuntimeUserError class is the primary exception raised when a task fails due to an error in the user's code. It categorizes the failure as kind="user", which informs the platform that retries should follow user-defined retry policies.

Common specialized subclasses include:

  • OOMError: Raised when a task exceeds memory limits.
  • TaskTimeoutError: Raised when execution exceeds the allocated time.
  • RetriesExhaustedError: Raised when all retry attempts for a task have failed.
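Because these classes share a base, handling code can check the most specific class first and fall back to the general one. The sketch below uses simplified stand-in classes that reuse the names from this section. The constructor signatures, the code strings passed in, and the triage helper are assumptions for illustration and are not taken from src/flyte/errors.py.

```python
class BaseRuntimeError(RuntimeError):
    """Simplified stand-in: carries a code, a kind, and an optional worker."""

    def __init__(self, code, kind, message, worker=None):
        super().__init__(message)
        self.code = code
        self.kind = kind
        self.worker = worker


class RuntimeUserError(BaseRuntimeError):
    def __init__(self, code, message, worker=None):
        super().__init__(code, "user", message, worker)


class OOMError(RuntimeUserError):
    pass


class TaskTimeoutError(RuntimeUserError):
    pass


def triage(exc):
    """Hypothetical helper: check specific subclasses before the general one."""
    if isinstance(exc, OOMError):
        return "memory limit exceeded"
    if isinstance(exc, TaskTimeoutError):
        return "time limit exceeded"
    if isinstance(exc, RuntimeUserError):
        return f"user code failed ({exc.code})"
    return "unexpected failure"


print(triage(OOMError("OOMError", "container exceeded its memory limit")))
print(triage(RuntimeUserError("ValueError", "bad input")))
```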

Data Validation and Serialization

RuntimeDataValidationError is a specialized RuntimeUserError raised at the boundary between tasks. It signals that the SDK failed to serialize inputs or deserialize outputs, often because of type mismatches or missing type annotations.

# src/flyte/errors.py

class RuntimeDataValidationError(RuntimeUserError):
    def __init__(self, var: str, e: Exception | str, task_name: str = ""):
        super().__init__(
            "DataValidationError",
            f"In task {task_name} variable {var}, failed to serialize/deserialize because of {e}"
        )

This error is critical for troubleshooting "contract" failures between tasks in a workflow. If a task produces a value that the next task cannot consume, the RuntimeDataValidationError provides the specific variable name and the underlying serialization error.
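To see what such a message looks like, the sketch below rebuilds the error with stand-in classes and wraps a failing conversion. The message format follows the excerpt from src/flyte/errors.py; the RuntimeUserError stand-in, the load_input helper, and the task and variable names are hypothetical.

```python
class RuntimeUserError(RuntimeError):
    """Simplified stand-in that stores a code alongside the message."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


class RuntimeDataValidationError(RuntimeUserError):
    def __init__(self, var, e, task_name=""):
        super().__init__(
            "DataValidationError",
            f"In task {task_name} variable {var}, failed to serialize/deserialize because of {e}",
        )


def load_input(var, raw, expected_type, task_name):
    """Hypothetical helper: convert a raw value, reporting which variable failed."""
    try:
        return expected_type(raw)
    except (TypeError, ValueError) as exc:
        raise RuntimeDataValidationError(var, exc, task_name) from exc


caught = None
try:
    load_input("count", "three", int, task_name="aggregate")
except RuntimeDataValidationError as err:
    caught = err
    message = str(err)
    print(err.code)
    print(message)
```

The message names both the task and the variable, so the failing side of the contract can be located without reading the full traceback.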

Troubleshooting and Observability

Traceback Filtering

To reduce noise in logs, the SDK implements a custom exception hook in src/flyte/_excepthook.py that filters out internal framework frames. Frames containing strings like _internal, syncify, or _code_bundle are suppressed by default.

This ensures that when a RuntimeUserError is displayed, the traceback focuses on the user's logic rather than the SDK's internal orchestration.
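The filtering idea can be sketched with the standard traceback module. This is not the code in src/flyte/_excepthook.py: it assumes the marker strings are matched against each frame's file path, and the frames and paths are made up for illustration.

```python
import traceback

# Marker strings named in the text; matching on the file path is an assumption.
HIDDEN_MARKERS = ("_internal", "syncify", "_code_bundle")


def visible_frames(frames, hidden=HIDDEN_MARKERS):
    """Keep only frames whose file path contains none of the hidden markers."""
    return [f for f in frames if not any(marker in f.filename for marker in hidden)]


# Hypothetical frames standing in for a real traceback.
frames = [
    traceback.FrameSummary("/app/workflow.py", 12, "main", line="await my_task()"),
    traceback.FrameSummary("/lib/flyte/_internal/controller.py", 88, "submit", line="..."),
    traceback.FrameSummary("/lib/flyte/syncify/_api.py", 40, "wrapper", line="..."),
    traceback.FrameSummary("/app/tasks.py", 7, "my_task", line="raise ValueError('bad input')"),
]

kept = visible_frames(frames)
print("".join(traceback.format_list(kept)))
```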

Interpreting Error Codes

When catching exceptions in a workflow or test, you should inspect the .code attribute. This string identifier allows for programmatic handling of specific failure cases without relying on string matching against the error message.

# Example based on examples/basics/exception_handling.py

import flyte.errors

# Inside an async function or task:
try:
    await task_that_might_fail()
except flyte.errors.RuntimeUserError as e:
    if e.code == "ValueError":
        # Handle a specific ValueError raised by user code
        print(f"User code raised a ValueError: {e}")
    elif e.code == "DataValidationError":
        # Handle serialization issues
        print(f"Type mismatch detected: {e}")

Debugging with Logs

If the filtered tracebacks are insufficient for debugging, the SDK's logging behavior can be adjusted. The SDK uses rich for formatting in the CLI, and setting FLYTE_SDK_LOGGING_LEVEL to DEBUG enables full tracebacks. This bypasses the filtering logic in _excepthook.py and shows the complete execution stack, including internal SDK frames.
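A minimal sketch of raising the log level from Python, assuming FLYTE_SDK_LOGGING_LEVEL is read from the process environment when the SDK starts up. The text here does not say how the setting is supplied, so treat this as an assumption to check against your installation.

```python
import os

# Assumption: the SDK reads this environment variable at import or startup,
# so it is set before any flyte import in the same process.
os.environ["FLYTE_SDK_LOGGING_LEVEL"] = "DEBUG"

print(os.environ["FLYTE_SDK_LOGGING_LEVEL"])
```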