[DataCatalog]: Add a data schema evaluation mechanism #3943

Closed
ElenaKhaustova opened this issue Jun 6, 2024 · 2 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@ElenaKhaustova
Contributor

Description

Users have expressed the need for data schema evaluation to enable "fail-fast" capabilities during data loading and consistency checks before execution. They highlight the potential benefits of schema evaluation for integrating with other services, validating pipelines before execution, and running API checks.

We propose exploring the feasibility and necessity of implementing a data schema evaluation mechanism.

Relates to #3613
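
As a rough illustration of the "fail-fast" idea, here is a minimal sketch of what a pre-run schema check could look like. This is an assumption about a possible design, not an existing Kedro API: the `SCHEMAS` registry, the `SchemaValidationHooks` class, and the `companies` dataset name are all hypothetical; in a catalog-level design the schemas could instead live alongside each dataset's configuration.

```python
import pandera as pa
from kedro.framework.hooks import hook_impl

# Hypothetical schema registry, keyed by catalog dataset name.
SCHEMAS = {
    "companies": pa.DataFrameSchema(
        {
            "id": pa.Column(int, nullable=False),
            "rating": pa.Column(float, pa.Check.between(0, 5)),
        }
    ),
}


class SchemaValidationHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Fail fast: validate every free pipeline input before any node
        # runs, so schema mismatches surface at load time, not mid-run.
        for dataset_name in pipeline.inputs():
            schema = SCHEMAS.get(dataset_name)
            if schema is not None:
                schema.validate(catalog.load(dataset_name))  # raises SchemaError
```

Registered through a project's `HOOKS` setting, something like this would make a malformed input fail the run before any compute happens.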

Context

Responses obtained during user research interviews:

  • Integration with Services and Pre-Run Checks: Schema evaluation is crucial for integrating with external services and for running checks before pipeline execution. It ensures that data conforms to expected schemas, allowing validation before processing begins, which improves reliability and reduces errors at runtime.
  • Implementation Concerns and Flexibility: Implementing schema checks at the catalog level could complicate the system, because static data configuration would need to be bridged with dynamic runtime requirements. The current method annotates Python functions directly (as sketched after this list), which ties schemas more tightly to the execution logic and provides immediate feedback during development, helping maintain type safety and contractual adherence.
  • Potential for Catalog-Level Implementation: While the current approach focuses on runtime validation tied to Python code, there is recognised potential in extending schema validation to the catalog level. This would allow offline checks, enabling "fail-fast" capabilities during data loading and consistency checks before execution. The dual approach could provide comprehensive coverage, ensuring data integrity both at rest and in motion, and would align with practices in other data management frameworks such as dbt, which supports schema checks both at rest and during execution.
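
For contrast, here is a minimal sketch of the function-annotation approach described in the second point, using pandera's typed DataFrames. The `CompanySchema` model and `preprocess_companies` function are illustrative, not taken from any existing project:

```python
import pandera as pa
from pandera.typing import DataFrame, Series


class CompanySchema(pa.DataFrameModel):
    id: Series[int] = pa.Field(nullable=False)
    rating: Series[float] = pa.Field(ge=0, le=5)


@pa.check_types
def preprocess_companies(companies: DataFrame[CompanySchema]) -> DataFrame[CompanySchema]:
    # The decorator validates inputs and outputs at call time, so a schema
    # violation is reported against this exact function during development.
    return companies.dropna()
```

This couples the schema to the execution logic, whereas a catalog-level check would validate the data at rest, independent of any node.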
@merelcht
Member

Is this at all related to "Move kedro catalog validation schema to kedro-datasets"?

@yury-fedotov
Contributor

yury-fedotov commented Jun 30, 2024

Shouldn't that be leveraging kedro-pandera?

@kedro-org kedro-org locked and limited conversation to collaborators Jan 20, 2025
@astrojuanlu astrojuanlu converted this issue into discussion #4432 Jan 20, 2025

This issue was moved to a discussion. You can continue the conversation in discussion #4432.
