[DataCatalog]: Add a data schema evaluation mechanism #3943

Closed
ElenaKhaustova opened this issue Jun 6, 2024 · 2 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@ElenaKhaustova
Contributor

Description

Users have expressed the need for data schema evaluation to enable "fail-fast" capabilities during data loading and consistency checks before execution. They highlight the potential benefits of schema evaluation for integrating with other services, validating pipelines before execution, and running API checks.

We propose exploring the feasibility and necessity of implementing a data schema evaluation mechanism.

Relates to #3613
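
As a rough illustration of the "fail-fast" idea, here is a minimal sketch of what a pre-run schema check could look like. This is an assumption about a possible design, not an existing Kedro API: the `SCHEMAS` registry, the `SchemaValidationHooks` class, and the `companies` dataset name are all hypothetical; in a catalog-level design the schemas could instead live alongside each dataset's configuration.

```python
import pandera as pa
from kedro.framework.hooks import hook_impl

# Hypothetical schema registry, keyed by catalog dataset name.
SCHEMAS = {
    "companies": pa.DataFrameSchema(
        {
            "id": pa.Column(int, nullable=False),
            "rating": pa.Column(float, pa.Check.between(0, 5)),
        }
    ),
}


class SchemaValidationHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Fail fast: validate every free pipeline input before any node
        # runs, so schema mismatches surface at load time, not mid-run.
        for dataset_name in pipeline.inputs():
            schema = SCHEMAS.get(dataset_name)
            if schema is not None:
                schema.validate(catalog.load(dataset_name))  # raises SchemaError
```

Registered through a project's `HOOKS` setting, something like this would make a malformed input fail the run before any compute happens.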

Context

Responses obtained during user research interviews:

  • Integration with Services and Pre-Run Checks: Schema evaluation is crucial for integrating with external services and for running checks before pipeline execution. It ensures that data conforms to expected schemas, allowing validation before processing begins, which improves reliability and reduces errors at runtime.
  • Implementation Concerns and Flexibility: Implementing schema checks at the catalog level could complicate the system, because static data configuration would need to be bridged with dynamic runtime requirements. The current method annotates Python functions directly (as sketched after this list), which ties schemas more tightly to the execution logic and provides immediate feedback during development, helping maintain type safety and contractual adherence.
  • Potential for Catalog-Level Implementation: While the current approach focuses on runtime validation tied to Python code, there is recognised potential in extending schema validation to the catalog level. This would allow offline checks, enabling "fail-fast" capabilities during data loading and consistency checks before execution. The dual approach could provide comprehensive coverage, ensuring data integrity both at rest and in motion, and would align with practices in other data management frameworks such as dbt, which supports schema checks both at rest and during execution.
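
For contrast, here is a minimal sketch of the function-annotation approach described in the second point, using pandera's typed DataFrames. The `CompanySchema` model and `preprocess_companies` function are illustrative, not taken from any existing project:

```python
import pandera as pa
from pandera.typing import DataFrame, Series


class CompanySchema(pa.DataFrameModel):
    id: Series[int] = pa.Field(nullable=False)
    rating: Series[float] = pa.Field(ge=0, le=5)


@pa.check_types
def preprocess_companies(companies: DataFrame[CompanySchema]) -> DataFrame[CompanySchema]:
    # The decorator validates inputs and outputs at call time, so a schema
    # violation is reported against this exact function during development.
    return companies.dropna()
```

This couples the schema to the execution logic, whereas a catalog-level check would validate the data at rest, independent of any node.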
@merelcht
Member

Is this at all related to "Move kedro catalog validation schema to kedro-datasets"?

@yury-fedotov
Contributor

yury-fedotov commented Jun 30, 2024

Shouldn't that be leveraging kedro-pandera?

@kedro-org kedro-org locked and limited conversation to collaborators Jan 20, 2025
@astrojuanlu astrojuanlu converted this issue into discussion #4432 Jan 20, 2025

This issue was moved to a discussion. You can continue the conversation in discussion #4432.
