feat: In each record to filter and transform, publish a local service field holding the original object the record is extracted from #214

Open · wants to merge 27 commits into main

Conversation


@rpopov rpopov commented Jan 13, 2025

Implement airbytehq/airbyte#50395 discussed in airbytehq/airbyte#49971

Implemented a new DpathEnhancingExtractor that navigates through a path of field names to reach the records to extract (exactly as the DpathExtractor does) and:

  • in each record, it binds the $root service field to the original response object
  • in each object/mapping of the JSON structure (and thus in each record in particular), it binds the $parent service field to a copy of the parent object's fields, including the parent's own $parent field.

Published the DpathEnhancingExtractor to the Builder UI.

These service fields are accessible in the filter and transform steps, after which they are removed from the records.
This allows the filter and transformation steps to navigate to fields and structures that are left out of the selected records.
Examples:

  • {{ record["$root"]["attribute name"] }} accesses the attributes of the root object of the response, while the selected records are nested somewhere inside the response
  • {{ record["$parent"]["attribute name"] }} accesses the attributes of the object that owns the selected record(s)
  • {{ record["$parent"]["$parent"]["attribute name"] }} accesses the attributes of the object that owns the owner of the selected record(s)
  • {{ record["$parent"]...["$parent"]["attribute name"] }} accesses the attributes of an owner at any level of nesting.
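
In plain Python, such a filter condition reduces to ordinary dictionary lookups over the service fields. A sketch, with hypothetical field names, over records whose service fields were already bound by the extractor:

```python
# Emulate a record_filter condition such as
#   {{ record["$parent"]["owner"] == "acme" }}
owner_obj = {"owner": "acme"}
response = {"region": "eu", "accounts": owner_obj}
records = [{"id": 1}, {"id": 2}]
for r in records:
    r["$parent"] = owner_obj   # bound by the extractor in the real flow
    r["$root"] = response

kept = [r for r in records if r["$parent"]["owner"] == "acme"]
regions = {r["$root"]["region"] for r in kept}
```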

More service fields could be added in the future.
The service fields are available in the record's filtering and transform steps.

This design avoids:

  • building cyclic structures in the JSON response
  • transforming the service fields in the Flatten transformation.
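
A short sketch of the first point: binding the parent by reference would create a cycle that breaks JSON serialization, while binding a copy does not (illustration only, not the CDK code):

```python
import json

parent = {"owner": "x"}
child = {"id": 1}
parent["records"] = [child]

# Binding by reference creates a cycle:
child["$parent"] = parent
try:
    json.dumps(parent)
    cyclic_serializable = True
except ValueError:  # "Circular reference detected"
    cyclic_serializable = False

# Binding a copy that omits the back-pointing field stays acyclic:
child["$parent"] = {k: v for k, v in parent.items() if k != "records"}
flat = json.dumps(child, sort_keys=True)
```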

Explicitly cleaning up the service field(s) after the transform step makes them:

  • local for the filter and transform steps
  • not visible to the next mapping and store steps (as they should be)
  • not visible in the tests beyond the test_record_selector (as they should be)
    This allows the record transformation logic to define its "local variables" to reuse
    some interim calculations.
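
The cleanup step can be pictured with the helper names listed later in this PR's change summary (SERVICE_KEY_PREFIX, is_service_key, exclude_service_keys); the signatures below are assumptions, not the exact CDK code:

```python
from typing import Any, Mapping

SERVICE_KEY_PREFIX = "$"  # assumed prefix marking service fields

def is_service_key(key: str) -> bool:
    return key.startswith(SERVICE_KEY_PREFIX)

def exclude_service_keys(mapping: Mapping[str, Any]) -> dict[str, Any]:
    """Copy the mapping, dropping every service field."""
    return {k: v for k, v in mapping.items() if not is_service_key(k)}

# A transform may use "$root"/"$parent" (and any "$"-prefixed interim
# values it defined) before the selector strips them:
record = {"id": 7, "$root": {"meta": 1}, "$parent": {"owner": "x"}}
clean = exclude_service_keys(record)
```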

The contract of body parsing seems irregular in representing the cases of bad JSON, no JSON, and empty JSON.
It cannot be unified, as existing code already relies on that irregularity.

Updated the development environment setup documentation

  • to organize and present the setup steps explicitly
  • to avoid misunderstandings and wasted efforts.

Updated CONTRIBUTING.md to:

  • collect and organize the knowledge on running the tests locally;
  • state the actual testing steps;
  • clarify and make explicit the procedures and steps.

The unit, integration, and acceptance tests in this exact version fail due to pipeline configuration issues, not related to the changes themselves.

Summary by CodeRabbit

  • New Features

    • Introduced an enhanced record extraction mode that improves handling of nested data and includes contextual reference fields.
    • Added a new DpathEnhancingExtractor for improved data extraction capabilities.
  • Bug Fixes / Improvements

    • Enhanced error reporting during JSON parsing and introduced default behaviors to prevent unexpected interruptions.
    • Streamlined pagination handling to manage empty responses more reliably.
  • Documentation

    • Updated contributing guidelines, release notes, and installation instructions for improved clarity.
  • Tests

    • Expanded test coverage to validate the new extraction, transformation, and state management behaviors, including comprehensive tests for the DpathEnhancingExtractor.
  • Chores

    • Upgraded formatting tool dependencies.

At the record extraction step, add to each record the service field $root holding a reference to the object the record is extracted from:
* the root response object, when parsing the JSON format
* the original record, when parsing the JSONL format
More service fields could be added in future.
The service fields are available in the record's filtering and transform steps.

Avoid:
* reusing the maps/dictionaries produced, thus avoiding cyclic structures
* transforming the service fields in the Flatten transformation.

Explicitly clean up the service field(s) after the transform step, thus making them:
* local for the filter and transform steps
* not visible to the next mapping and store steps (as they should be)
* not visible in the tests beyond the test_record_selector (as they should be)
This allows the record transformation logic to define its "local variables" to reuse
some interim calculations.

The contract of body parsing seems irregular in representing the cases of bad JSON, no JSON, and empty JSON.
It cannot be unified, as existing code already relies on that irregularity.

Update the development environment setup documentation
* to organize and present the setup steps explicitly
* to avoid misunderstandings and wasted efforts.

Update CONTRIBUTING.md to
* collect and organize the knowledge on running the tests locally;
* state the actual testing steps;
* clarify and make explicit the procedures and steps.

The unit, integration, and acceptance tests in this exact version succeed under Fedora 41, while
one of them fails under Oracle Linux 8.7, for reasons unrelated to the contents of this PR.
The integration tests of the CDK fail due to missing `secrets/config.json` file for the Shopify source.
See airbytehq#197

coderabbitai bot commented Jan 13, 2025

📝 Walkthrough

Walkthrough

This pull request introduces a variety of updates across the Airbyte CDK. The JSON decoding logic has been improved with enhanced error logging and simplified JSON body parsing. Several extractor modules have been extended: DpathExtractor now features new methods for updating the response body and individual records, and service key management functions were added. A new extractor, DpathEnhancingExtractor, has been introduced along with its corresponding schema, factory method, parser mappings, and tests. In addition, pagination strategies, documentation, and dependency versions have been refined and updated.

Changes

File(s) Change Summary
airbyte_cdk/.../decoders/json_decoder.py
unit_tests/.../decoders/test_json_decoder.py
Updated JSON decoding to improve error handling and simplify JSON body parsing; test cases modified to update parameter names and expected output.
airbyte_cdk/.../extractors/dpath_extractor.py
airbyte_cdk/.../extractors/record_extractor.py
airbyte_cdk/.../extractors/record_selector.py
unit_tests/.../extractors/test_dpath_extractor.py
unit_tests/.../extractors/test_record_extractor.py
unit_tests/.../extractors/test_record_selector.py
Revised extraction logic by replacing certain empty checks, introducing new methods (update_body, update_record), and adding service key management functions (e.g., SERVICE_KEY_PREFIX, add_service_key, exclude_service_keys, is_service_key, remove_service_keys); RecordSelector updated to filter out service keys before normalization.
airbyte_cdk/.../transformations/flatten_fields.py
unit_tests/.../transformations/test_flatten_fields.py
Integrated the is_service_key function into the flattening transformation to preserve service keys; added new tests to verify behavior with list and set values.
airbyte_cdk/.../requesters/paginators/strategies/cursor_pagination_strategy.py
airbyte_cdk/.../requesters/paginators/strategies/offset_increment.py
airbyte_cdk/.../requesters/paginators/strategies/page_increment.py
unit_tests/.../retrievers/test_simple_retriever.py
Refined pagination logic: ensuring consistent assignment of page_size and providing a default value for next page tokens; updated tests now assert that retriever state reflects completed full refresh sync.
airbyte_cdk/__init__.py
airbyte_cdk/.../declarative_component_schema.yaml
airbyte_cdk/.../extractors/__init__.py
airbyte_cdk/.../extractors/dpath_enhancing_extractor.py
airbyte_cdk/.../models/declarative_component_schema.py
airbyte_cdk/.../parsers/manifest_component_transformer.py
airbyte_cdk/.../parsers/model_to_component_factory.py
unit_tests/.../extractors/test_dpath_enhancing_extractor.py
Introduced a new extractor type, DpathEnhancingExtractor, with methods to update the response body (including recursive parent setting) and modify extracted records by adding service fields; corresponding schema, parser mappings, factory method, and tests were added, and the public API was updated to export the new component.
docs/CONTRIBUTING.md
docs/RELEASES.md
package.json
Updated documentation for clarity and organization (including setup instructions and formatting adjustments) and upgraded the Prettier dependency version in package.json.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Extractor as DpathEnhancingExtractor
  participant Selector as RecordSelector

  Client->>Extractor: Receive HTTP response
  Extractor->>Extractor: update_body() (set parent context recursively)
  Extractor->>Extractor: update_record() (add root service field)
  Extractor-->>Client: Yield enhanced record
  Client->>Selector: Submit record for filtering and normalization
  Selector->>Selector: remove_service_keys() from record
  Selector-->>Client: Return normalized record

Suggested labels

enhancement

Suggested reviewers

  • maxi297 – Would you be available to review these changes, wdyt?
  • natikgadzhi – Could you take a look at the modifications when you have a moment, wdyt?
  • brianjlai – Would you be able to provide your insights on this PR?


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (17)
airbyte_cdk/sources/declarative/extractors/record_extractor.py (3)

10-14: Small typo in the documentation

There's a typo in "transormation" -> "transformation". Otherwise, the documentation clearly explains the convention and lifecycle of service fields.

- The service fields are kept only during the record's filtering and transormation.
+ The service fields are kept only during the record's filtering and transformation.

21-27: Consider removing type ignore and adding return type annotation

The # type: ignore[no-untyped-def] seems unnecessary here. What do you think about adding a return type annotation of None since the function modifies the dict in place? wdyt?

-def remove_service_keys(mapping: dict[str, Any]):  # type: ignore[no-untyped-def]
+def remove_service_keys(mapping: dict[str, Any]) -> None:

34-35: Consider removing type ignore and improving error message

Similar to above, the type ignore seems unnecessary. Also, would it be helpful to include the actual mapping in the assertion message to aid debugging? wdyt?

-def assert_service_keys_exist(mapping: Mapping[str, Any]):  # type: ignore[no-untyped-def]
-    assert mapping != exclude_service_keys(mapping), "The mapping should contain service keys"
+def assert_service_keys_exist(mapping: Mapping[str, Any]) -> None:
+    assert mapping != exclude_service_keys(mapping), f"The mapping should contain service keys, got: {mapping}"
airbyte_cdk/sources/declarative/transformations/flatten_fields.py (1)

36-44: Consider adjusting indentation for better readability

The logic looks good! The service keys are correctly preserved. Would you consider adjusting the indentation of the else clause to match the if condition for better readability? wdyt?

                    if not is_service_key(current_key):
                        new_key = (
                            f"{parent_key}.{current_key}"
                            if (current_key in transformed_record or force_with_parent_name)
                            else current_key
                        )
                        stack.append((value, new_key))
-                    else:  # transfer the service fields without change
-                        transformed_record[current_key] = value
+                    else:
+                        # transfer the service fields without change
+                        transformed_record[current_key] = value
unit_tests/sources/declarative/transformations/test_flatten_fields.py (1)

50-67: Consider adding test cases for nested service keys

The test coverage for service keys looks good! Would you consider adding a test case for nested objects containing service keys to ensure they're handled correctly at any depth? For example:

(
    {
        SERVICE_KEY_PREFIX + "root": {
            "nested": {SERVICE_KEY_PREFIX + "child": "value"}
        },
        "regular": "field"
    },
    {
        SERVICE_KEY_PREFIX + "root": {
            "nested": {SERVICE_KEY_PREFIX + "child": "value"}
        },
        "regular": "field"
    }
)
unit_tests/sources/declarative/extractors/test_record_extractor.py (2)

62-67: Consider catching specific assertion error

The test looks good, but would it be safer to catch the specific AssertionError instead of using a bare except? This would prevent masking other unexpected exceptions. wdyt?

    try:
        assert_service_keys_exist(original)
        success = False
-    except:  # OK, expected
+    except AssertionError:  # OK, expected
        success = True

71-76: Consider adding edge cases for service key detection

The basic test cases look good! Would you consider adding some edge cases to ensure robust handling of special characters and edge cases? For example:

def test_service_key_edge_cases():
    assert is_service_key(SERVICE_KEY_PREFIX)  # just the prefix
    assert is_service_key(SERVICE_KEY_PREFIX + "$nested")  # double prefix
    assert not is_service_key("prefix" + SERVICE_KEY_PREFIX)  # prefix in middle
airbyte_cdk/sources/declarative/decoders/json_decoder.py (1)

38-41: Consider making stack trace logging configurable?

The debug logging with stack_info=True provides great debugging context but might be verbose in production. What do you think about making it configurable through a parameter, wdyt?

-            logger.debug("Response to parse: %s", response.text, exc_info=True, stack_info=True)
+            logger.debug(
+                "Response to parse: %s",
+                response.text,
+                exc_info=True,
+                stack_info=self.parameters.get("debug_stack_trace", False)
+            )
airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (2)

23-29: Consider using dict constructor for a cleaner copy?

The implementation safely handles both dict and non-dict cases. For the dict copy, what do you think about using the dict constructor for a slightly cleaner approach, wdyt?

-        copy = {k: v for k, v in record.items()}
+        copy = dict(record)

92-109: Consider consolidating the yield statements?

The implementation handles all cases well. What do you think about consolidating the yield statements for better maintainability, wdyt?

-                if isinstance(extracted, list):
-                    for record in extracted:
-                        yield update_record(record, root_response)
-                elif isinstance(extracted, dict):
-                    yield update_record(extracted, root_response)
-                elif extracted:
-                    yield extracted
-                else:
-                    yield from []
+                if not extracted:
+                    return
+                records = [extracted] if isinstance(extracted, dict) else extracted if isinstance(extracted, list) else [extracted]
+                yield from (update_record(record, root_response) for record in records)
airbyte_cdk/sources/declarative/extractors/record_selector.py (1)

112-113: Consider adding a docstring to explain the service fields removal step?

The addition of service fields removal before normalization is a significant change in the data processing pipeline. Would you consider adding a brief docstring to explain why this step is necessary here? wdyt?

unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)

196-201: Consider extracting the service key validation into a helper function?

The service key validation logic is repeated across test cases. Would it make sense to extract this into a helper function to improve maintainability? wdyt?

+def validate_service_keys(records):
+    for record in records:
+        if record != {}:
+            assert_service_keys_exist(record)
+    return [exclude_service_keys(record) for record in records]
+
 def test_dpath_extractor(field_path: List, decoder: Decoder, body, expected_records: List):
     # ...
-    for record in actual_records:
-        if record != {}:
-            assert_service_keys_exist(record)
-
-    actual_records = [exclude_service_keys(record) for record in actual_records]
+    actual_records = validate_service_keys(actual_records)
unit_tests/sources/declarative/extractors/test_record_selector.py (1)

169-171: Consider consistent whitespace around service key operations?

The whitespace around service key operations differs between test methods. Would you like to make it consistent? For example, always having one blank line before and after? wdyt?

Also applies to: 244-245

unit_tests/sources/declarative/partition_routers/test_parent_state_stream.py (1)

325-325: Nice addition of deduplication logic! 👍

The deduplication using dictionary comprehension with orjson.dumps as keys is elegant. Would you consider extracting this into a helper function since it's used twice in the test? Something like _dedupe_records(records), wdyt?

+def _dedupe_records(records):
+    return list({orjson.dumps(record): record for record in records}.values())

 # Usage:
-cumulative_records_state_deduped = list(
-    {orjson.dumps(record): record for record in cumulative_records_state}.values()
-)
+cumulative_records_state_deduped = _dedupe_records(cumulative_records_state)
docs/CONTRIBUTING.md (2)

97-115: Comprehensive test execution instructions! 🎯

The detailed breakdown of test commands with explanations is very helpful. Consider adding a note about expected test execution times for the different commands to help developers plan their workflow, wdyt?

🧰 Tools
🪛 LanguageTool

[style] ~101-~101: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast to run the subset of PyTest tests, which a...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~108-~108: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format to format your Python code. ### Run Code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


167-167: Minor: Missing comma in the sentence

-On the **Settings / Domains** page find the subdomain
+On the **Settings / Domains** page, find the subdomain
🧰 Tools
🪛 LanguageTool

[uncategorized] ~167-~167: Possible missing comma found.
Context: ...file. - On the Settings / Domains page find the subdomain of myshopify.com a...

(AI_HYDRA_LEO_MISSING_COMMA)

docs/RELEASES.md (1)

60-61: Clear instructions for testing pre-release versions

The instructions for testing pre-release versions are well-structured. Consider adding a note about the expected time for the publishing pipeline to complete, wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e7802a and 4cdc87c.

⛔ Files ignored due to path filters (2)
  • package-lock.json is excluded by !**/package-lock.json
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
  • airbyte_cdk/sources/declarative/decoders/json_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (2 hunks)
  • airbyte_cdk/sources/declarative/extractors/record_extractor.py (1 hunks)
  • airbyte_cdk/sources/declarative/extractors/record_selector.py (3 hunks)
  • airbyte_cdk/sources/declarative/requesters/paginators/strategies/cursor_pagination_strategy.py (1 hunks)
  • airbyte_cdk/sources/declarative/transformations/flatten_fields.py (2 hunks)
  • docs/CONTRIBUTING.md (5 hunks)
  • docs/RELEASES.md (2 hunks)
  • package.json (1 hunks)
  • unit_tests/sources/declarative/decoders/test_json_decoder.py (1 hunks)
  • unit_tests/sources/declarative/extractors/test_dpath_extractor.py (5 hunks)
  • unit_tests/sources/declarative/extractors/test_record_extractor.py (1 hunks)
  • unit_tests/sources/declarative/extractors/test_record_selector.py (5 hunks)
  • unit_tests/sources/declarative/partition_routers/test_parent_state_stream.py (1 hunks)
  • unit_tests/sources/declarative/retrievers/test_simple_retriever.py (2 hunks)
  • unit_tests/sources/declarative/test_manifest_declarative_source.py (2 hunks)
  • unit_tests/sources/declarative/transformations/test_flatten_fields.py (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • package.json
  • unit_tests/sources/declarative/test_manifest_declarative_source.py
🧰 Additional context used
🪛 LanguageTool
docs/CONTRIBUTING.md

[style] ~101-~101: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast to run the subset of PyTest tests, which a...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~108-~108: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format to format your Python code. ### Run Code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[uncategorized] ~167-~167: Possible missing comma found.
Context: ...file. - On the Settings / Domains page find the subdomain of myshopify.com a...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~269-~269: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...

(DO_ACHIEVE)


[locale-violation] ~287-~287: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....

(HAVE_A_LOOK)

docs/RELEASES.md

[grammar] ~68-~68: Did you mean the noun “publishing”?
Context: ...ng Low-Code Python connectors Once the publish pipeline has completed, set the version...

(PREPOSITION_VERB)

🪛 Markdownlint (0.37.0)
docs/CONTRIBUTING.md

140-140: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


239-239: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


86-86: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (14)
airbyte_cdk/sources/declarative/decoders/json_decoder.py (1)

47-50: Nice simplification! 👍

The use of yield from makes the code more elegant and Pythonic. Great improvement!

airbyte_cdk/sources/declarative/requesters/paginators/strategies/cursor_pagination_strategy.py (1)

73-73: Good defensive programming! 👍

Adding a default empty dict to next() prevents StopIteration exceptions when the decoder yields no items. This aligns well with the JsonDecoder changes.
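
The pattern being praised here, reduced to its essence (a sketch, not the CDK code):

```python
from typing import Any, Iterator

def decode(parts: list) -> Iterator[dict]:
    # Stand-in for a decoder that may yield nothing at all.
    yield from parts

# Without a default, next() on an exhausted generator raises StopIteration;
# passing a default keeps the pagination code on the happy path.
body = next(decode([]), {})
```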

airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (2)

12-20: Clean implementation of service field! 👍

Good use of the service key prefix convention for the root field. The implementation aligns well with the PR objectives.


88-90: Clear empty case handling! 👍

The explicit empty dict check and the explanatory comment make the contract compliance clear.

unit_tests/sources/declarative/decoders/test_json_decoder.py (2)

16-27: Test cases align well with the new behavior! 👍

The parameter rename to expected_json and updated test cases clearly reflect the decoder's behavior.


34-48: Great test coverage! 👍

The additional test cases and descriptive IDs improve the test suite's comprehensiveness and readability.

airbyte_cdk/sources/declarative/extractors/record_selector.py (2)

11-14: LGTM! Clean import organization.

The imports are well-organized and the new exclude_service_keys import aligns with the PR's objective of managing service fields.


162-167: LGTM! Clean implementation of service keys removal.

The _remove_service_keys method is simple, focused, and effectively uses the imported exclude_service_keys function. The implementation aligns well with the PR's objective of making service fields local to filtering and transformation.

unit_tests/sources/declarative/extractors/test_dpath_extractor.py (2)

18-21: LGTM! Well-organized imports.

The new imports for service key management functions are properly organized.


40-94: Great test coverage for edge cases!

The test cases thoroughly cover various scenarios including empty strings, objects, arrays, and nested structures. The parameterization is well-organized and the test data is clear.

unit_tests/sources/declarative/extractors/test_record_selector.py (2)

12-19: LGTM! Clean import organization.

The imports are well-organized and properly grouped.


50-79: Great test coverage for nested records!

The test cases thoroughly cover scenarios with nested records and properly validate the behavior with and without filters.

unit_tests/sources/declarative/retrievers/test_simple_retriever.py (1)

247-247: LGTM! Clear state management assertions.

The assertions properly verify that the state is updated to indicate sync completion after reading records.

Also applies to: 362-362

docs/CONTRIBUTING.md (1)

17-23: Great addition of Python version requirements!

The explicit Python version requirements and Fedora-specific installation instructions make it easier for contributors to get started.

@rpopov rpopov changed the title In each record to filter and transform, publish a local service field holding the original object the record is extracted from feat: In each record to filter and transform, publish a local service field holding the original object the record is extracted from Jan 13, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
airbyte_cdk/sources/declarative/extractors/record_extractor.py (4)

10-14: Documentation looks great! Would you consider adding an example? 🤔

The convention is well-documented. To make it even more helpful for developers, what do you think about adding a simple example of a service field, like $root? wdyt?


21-27: The implementation looks solid! Quick question about typing 🤔

The # type: ignore[no-untyped-def] comment seems unnecessary since the function has type hints. Could we remove it? wdyt?


30-31: How about using str.startswith() for better readability? 🤔

-    return key.find(SERVICE_KEY_PREFIX) == 0
+    return key.startswith(SERVICE_KEY_PREFIX)

Both work the same way, but startswith() might be more intuitive. What do you think?


34-35: The assertion looks good! A couple of suggestions to consider 🤔

  1. The # type: ignore[no-untyped-def] seems unnecessary since we have type hints
  2. Would you like to make the assertion message more descriptive? Maybe something like:
    assert mapping != exclude_service_keys(mapping), f"The mapping should contain at least one service key with prefix '{SERVICE_KEY_PREFIX}'"

What do you think?

airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (3)

23-29: The implementation looks solid! A few enhancement ideas 🤔

  1. Would you like to add a docstring explaining the function's purpose and return type?
  2. We could narrow the return type using Union:
-def update_record(record: Any, root: Any) -> Any:
+def update_record(record: Any, root: Any) -> Union[dict[str, Any], Any]:

What are your thoughts on these suggestions?


88-91: The empty response handling looks good! One suggestion for clarity 🤔

What do you think about making the contract more explicit in the comment? Maybe:

-            # An empty/invalid JSON parsed, keep the contract
+            # Return empty dict for empty/invalid JSON to maintain the contract that we always yield a mapping

92-109: The extraction logic looks solid! A couple of suggestions to consider 🤔

  1. Would you like to add type hints for root_response and extracted?
  2. The elif extracted: condition on line 106 might be more explicit as elif not isinstance(extracted, (list, dict)) and extracted: to clearly show the intent. wdyt?
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4cdc87c and 7b6ce9b.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (2 hunks)
  • airbyte_cdk/sources/declarative/extractors/record_extractor.py (1 hunks)
  • airbyte_cdk/sources/declarative/transformations/flatten_fields.py (2 hunks)
  • unit_tests/sources/declarative/transformations/test_flatten_fields.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • unit_tests/sources/declarative/transformations/test_flatten_fields.py
  • airbyte_cdk/sources/declarative/transformations/flatten_fields.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/extractors/record_extractor.py (1)

17-18: Clean and efficient implementation! 👍

Nice use of dictionary comprehension and type hints.

airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (1)

12-20: Clean imports and clear constant definition! 👍

Good practice using the SERVICE_KEY_PREFIX to construct RECORD_ROOT_KEY.

@rpopov
Copy link
Author

rpopov commented Jan 20, 2025

Issue #197 is still a problem; it blocks the build.

Created a DPath Enhancing Extractor
Refactored the record enhancement logic - moved it into the extractor class
Split the tests of DPathExtractor and DPathEnhancingExtractor

Fixed the failing tests:

FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_custom_components[test_create_custom_component_with_subcomponent_that_uses_parameters]
FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_custom_components_do_not_contain_extra_fields
FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_parse_custom_component_fields_if_subcomponent
FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_page_increment
FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_offset_increment
FAILED unit_tests/sources/file_based/test_file_based_scenarios.py::test_file_based_read[simple_unstructured_scenario]
FAILED unit_tests/sources/file_based/test_file_based_scenarios.py::test_file_based_read[no_file_extension_unstructured_scenario]

They failed because string and int values of the public page_size attribute were compared.
Imposed an invariant:
  on construction, page_size can be set to a string or an int
  page_size keeps values of only one type, for uniform comparison (values of the other type are converted)
  _page_size holds the internal / working value
... unless manipulated directly.
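The invariant above can be sketched roughly as follows (class and attribute names follow the description in this comment; this is not the actual CDK implementation, and interpolated string values are out of scope here):

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class PageIncrement:
    # On construction, page_size may be an int or a str; it is normalized
    # to a single type so later comparisons are uniform.
    page_size: Optional[Union[int, str]] = None

    def __post_init__(self) -> None:
        # Keep the public attribute as a string...
        if isinstance(self.page_size, int):
            self.page_size = str(self.page_size)
        # ...and hold the internal / working value in _page_size.
        self._page_size: Optional[int] = (
            int(self.page_size) if self.page_size is not None else None
        )


p = PageIncrement(page_size=100)
# p.page_size == "100", p._page_size == 100
```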

Merged:
feat(low-code concurrent): Allow async job low-code streams that are incremental to be run by the concurrent framework (airbytehq#228)
fix(low-code): Fix declarative low-code state migration in SubstreamPartitionRouter (airbytehq#267)
feat: combine slash command jobs into single job steps (airbytehq#266)
feat(low-code): add items and property mappings to dynamic schemas (airbytehq#256)
feat: add help response for unrecognized slash commands (airbytehq#264)
ci: post direct links to html connector test reports (airbytehq#252) (airbytehq#263)
fix(low-code): Fix legacy state migration in SubstreamPartitionRouter (airbytehq#261)
fix(airbyte-cdk): Fix RequestOptionsProvider for PerPartitionWithGlobalCursor (airbytehq#254)
feat(low-code): add profile assertion flow to oauth authenticator component (airbytehq#236)
feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor (airbytehq#111)
fix: don't mypy unit_tests (airbytehq#241)
fix: handle backoff_strategies in CompositeErrorHandler (airbytehq#225)
feat(concurrent cursor): attempt at clamping datetime (airbytehq#234)
ci: use `ubuntu-24.04` explicitly (resolves CI warnings) (airbytehq#244)
Fix(sdm): module ref issue in python components import (airbytehq#243)
feat(source-declarative-manifest): add support for custom Python components from dynamic text input (airbytehq#174)
chore(deps): bump avro from 1.11.3 to 1.12.0 (airbytehq#133)
docs: comments on what the `Dockerfile` is for (airbytehq#240)
chore: move ruff configuration to dedicated ruff.toml file (airbytehq#237)
feat(low-code): add DpathFlattenFields (airbytehq#227)
Trigger a new build. Hopefully, the integration test infrastructure is fixed.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (11)
airbyte_cdk/sources/declarative/extractors/dpath_enhancing_extractor.py (2)

72-108: Performance caution when reshaping large JSON objects.
Recursively constructing nested dictionaries and adding the parent service key could be computationally expensive for very large or deeply nested JSON objects. Perhaps we could optionally disable this recursive parent assignment for known large data sets, wdyt?


135-142: Collision avoidance with existing 'root' fields.
update_record sets a service key named "root". If the record already contains a key named root, this might overwrite it or create confusion. Maybe prefix it with a special character or handle collisions differently, wdyt?
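A quick illustration of why a reserved prefix sidesteps this collision (helper names are illustrative; the `$` prefix follows the PR description):

```python
SERVICE_KEY_PREFIX = "$"


def is_service_key(key: str) -> bool:
    return key.startswith(SERVICE_KEY_PREFIX)


# A record that naturally contains a "root" field is not clobbered,
# because the service field lives under "$root" instead.
record = {"root": "business value", "$root": {"whole": "response"}}
user_fields = {k: v for k, v in record.items() if not is_service_key(k)}
# user_fields == {"root": "business value"}
```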

airbyte_cdk/sources/declarative/extractors/__init__.py (1)

21-21: Confirm __all__ alignment with project naming conventions.
DpathEnhancingExtractor is appended to __all__. This is consistent, though sometimes teams prefer alphabetical order. Would you like to reorder these entries to maintain a standard format, wdyt?

airbyte_cdk/sources/declarative/requesters/paginators/strategies/page_increment.py (1)

34-42: Simplify page_size initialization further?
The updated logic is easier to follow, but the flow around lines 34–42 could be extracted into a helper method for clarity. That might remove the nested ifs and keep it tidy, wdyt?

unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (1)

43-340: Great test coverage! A few suggestions to consider.

The test suite is comprehensive and well-structured. Really nice job on covering edge cases! 👍

A few minor suggestions to consider:

  1. Would it be helpful to add docstrings to explain the purpose of complex test cases?
  2. What about adding some negative test cases to verify error handling?
  3. Consider grouping related test cases using pytest's @pytest.mark.group decorator for better organization

What are your thoughts on these suggestions?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1824-1842: The implementation looks good, but should we add some validation?

The implementation follows the pattern of other extractors, but would it be helpful to add validation for the field path to ensure it's not empty? wdyt?

 def create_dpath_enhancing_extractor(
     self,
     model: DpathEnhancingExtractorModel,
     config: Config,
     decoder: Optional[Decoder] = None,
     **kwargs: Any,
 ) -> DpathEnhancingExtractor:
+    if not model.field_path:
+        raise ValueError("field_path cannot be empty")
     if decoder:
         decoder_to_use = decoder
     else:
         decoder_to_use = JsonDecoder(parameters={})
airbyte_cdk/sources/declarative/extractors/record_extractor.py (1)

29-40: The recursive implementation looks good, but should we add a depth limit?

The exclude_service_keys function handles nested structures well, but would it be worth adding a max depth parameter to prevent potential stack overflow with deeply nested structures? wdyt?

-def exclude_service_keys(struct: Any) -> Any:
+def exclude_service_keys(struct: Any, max_depth: int = 100) -> Any:
     """
     :param struct: any object/JSON structure
+    :param max_depth: maximum recursion depth to prevent stack overflow
     :return: a copy of struct without any service fields at any level of nesting
     """
+    if max_depth <= 0:
+        return struct
     if isinstance(struct, dict):
-        return {k: exclude_service_keys(v) for k, v in struct.items() if not is_service_key(k)}
+        return {k: exclude_service_keys(v, max_depth - 1) for k, v in struct.items() if not is_service_key(k)}
     elif isinstance(struct, list):
-        return [exclude_service_keys(v) for v in struct]
+        return [exclude_service_keys(v, max_depth - 1) for v in struct]
     else:
         return struct
airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (2)

73-75: Consider adding a docstring to explain the empty body contract?

The empty body handling looks good, but adding a docstring might help future maintainers understand why we need to maintain this contract, wdyt?


98-113: The template methods look great! Consider adding @AbstractMethod?

The update_body and update_record template methods are well-documented. Since they're meant to be overridden by subclasses, would you consider marking them with @AbstractMethod to make the contract more explicit, wdyt?
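For context, marking the template methods abstract could look like the sketch below (method names come from this review; signatures and class names are assumed, and whether the base extractor should remain directly instantiable is a separate design question):

```python
from abc import ABC, abstractmethod
from typing import Any


class RecordExtractorBase(ABC):
    """Template-method skeleton: subclasses decide how to enrich records."""

    @abstractmethod
    def update_body(self, body: Any) -> Any:
        """Hook: adjust the parsed response body before extraction."""

    @abstractmethod
    def update_record(self, record: Any, root: Any) -> Any:
        """Hook: enrich an extracted record, e.g. attach service fields."""


class PassthroughExtractor(RecordExtractorBase):
    def update_body(self, body: Any) -> Any:
        return body

    def update_record(self, record: Any, root: Any) -> Any:
        return record


extractor = PassthroughExtractor()
result = extractor.update_record({"id": 1}, None)
```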

unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)

44-93: Comprehensive nested data test coverage!

The test cases thoroughly cover various nested data structures and extraction patterns. Would you consider adding a test case for very deep nesting to verify any potential recursion limits, wdyt?

airbyte_cdk/sources/declarative/requesters/paginators/strategies/offset_increment.py (1)

55-58: Consider adding type validation for page_size conversion?

The current implementation converts integers to strings without validation. We could add a type check to ensure page_size is either an int or str before conversion, wdyt?

-        self.page_size = str(self.page_size) if isinstance(self.page_size, int) else self.page_size
+        if not isinstance(self.page_size, (int, str, type(None))):
+            raise ValueError(f"page_size must be an integer or string, got {type(self.page_size)}")
+        self.page_size = str(self.page_size) if isinstance(self.page_size, int) else self.page_size
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b6ce9b and 73b8fb8.

📒 Files selected for processing (18)
  • airbyte_cdk/__init__.py (2 hunks)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
  • airbyte_cdk/sources/declarative/extractors/__init__.py (2 hunks)
  • airbyte_cdk/sources/declarative/extractors/dpath_enhancing_extractor.py (1 hunks)
  • airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (1 hunks)
  • airbyte_cdk/sources/declarative/extractors/record_extractor.py (1 hunks)
  • airbyte_cdk/sources/declarative/extractors/record_selector.py (2 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
  • airbyte_cdk/sources/declarative/parsers/manifest_component_transformer.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
  • airbyte_cdk/sources/declarative/requesters/paginators/strategies/offset_increment.py (1 hunks)
  • airbyte_cdk/sources/declarative/requesters/paginators/strategies/page_increment.py (1 hunks)
  • airbyte_cdk/sources/declarative/transformations/flatten_fields.py (2 hunks)
  • docs/CONTRIBUTING.md (5 hunks)
  • unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (1 hunks)
  • unit_tests/sources/declarative/extractors/test_dpath_extractor.py (4 hunks)
  • unit_tests/sources/declarative/extractors/test_record_extractor.py (1 hunks)
  • unit_tests/sources/declarative/extractors/test_record_selector.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • airbyte_cdk/sources/declarative/transformations/flatten_fields.py
  • airbyte_cdk/sources/declarative/extractors/record_selector.py
  • docs/CONTRIBUTING.md
  • unit_tests/sources/declarative/extractors/test_record_extractor.py
  • unit_tests/sources/declarative/extractors/test_record_selector.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (17)
airbyte_cdk/sources/declarative/extractors/dpath_enhancing_extractor.py (2)

66-71: Looks clean and straightforward!
The __post_init__ method delegates seamlessly to the parent class, ensuring necessary initialization logic is still applied.


109-134: Double-check handling of None and empty values.
Currently, _set_parent filters out values if they are dictionaries or lists (lines 124-125). If v can be zero, empty string, or False, are we confident none of those cases should be filtered out? Might we risk losing relevant boolean or numeric fields, wdyt?

airbyte_cdk/sources/declarative/extractors/__init__.py (1)

5-7: Importing the new extractor looks good!
Having DpathEnhancingExtractor in a separate file helps keep the code organization clear.

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1674-1674: LGTM! The new extractor type is correctly integrated.

The addition of DpathEnhancingExtractor to the extractor union type in RecordSelector is clean and follows the existing pattern.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

80-80: LGTM! Clean import addition.

The import is correctly placed with other extractor imports.


562-562: LGTM! Mapping is correctly added.

The new extractor is properly mapped to its creation method in the factory's initialization.

airbyte_cdk/sources/declarative/extractors/record_extractor.py (4)

10-15: LGTM! Clear convention documentation.

The convention for service fields is well-documented and the prefix constant is appropriately defined.


17-27: LGTM! Clean utility function implementation.

The add_service_key function is well-documented and follows immutable pattern by returning a new dict.


42-44: LGTM! Simple and focused implementation.

The is_service_key function is clean and follows single responsibility principle.


46-49: LGTM! Generator pattern is well-utilized.

The remove_service_keys function efficiently uses generator pattern to process records one at a time.

airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (1)

77-78: LGTM! Smart way to preserve the root response.

Storing the original response before modification and passing it through the update chain enables powerful contextual transformations.

unit_tests/sources/declarative/extractors/test_dpath_extractor.py (2)

39-41: Great test coverage for edge cases!

The empty input test cases effectively verify the contract compliance for empty string, object, and array inputs.


125-128: LGTM! Clear JSONL format test cases.

The JSONL-specific test cases effectively verify the handling of array records and line-based parsing.

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

1458-1488: Well-structured component definition!

The DpathEnhancingExtractor component is well-documented and maintains consistency with other extractors. The service fields ($root, $parent) are clearly explained.


2806-2806: LGTM! Clean integration with RecordSelector.

The addition of DpathEnhancingExtractor to the RecordSelector's extractor property enables seamless integration.

airbyte_cdk/sources/declarative/parsers/manifest_component_transformer.py (1)

50-51: LGTM! Clean addition of DpathEnhancingExtractor mapping

The new entry follows the established pattern and correctly maps the decoder to JsonDecoder.

airbyte_cdk/__init__.py (1)

95-95: LGTM! Clean addition of DpathEnhancingExtractor to public API

The new extractor is properly exposed in both the imports and the __all__ list, following the module's conventions.

Also applies to: 237-237

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
docs/CONTRIBUTING.md (4)

17-17: The installation instructions are well-organized, but could we make the Python version requirement more prominent?

The Python version requirement is crucial for development. What do you think about moving it to the very top of the setup instructions and adding a note about version compatibility? Something like:

-1. Clone the CDK repo. If you will be testing connectors, you should clone the CDK into the same parent directory as `airbytehq/airbyte`, which contains the connector definitions.
-1. Make sure your Python version is 3.11
+## Prerequisites
+
+- Python 3.11 (required)
+- Poetry 2.0 or higher
+
+## Setup Steps
+
+1. Clone the CDK repo. If you will be testing connectors, you should clone the CDK into the same parent directory as `airbytehq/airbyte`, which contains the connector definitions.

wdyt? 🤔

Also applies to: 26-86


111-138: The testing section is comprehensive, but could we improve the organization?

The testing instructions are detailed but could be more structured. What do you think about grouping related commands under subsections with descriptive headers? For example:

### Run Unit Tests

+#### Basic Test Commands
- `poetry run pytest` to run all unit tests.
- `poetry run pytest -k <suite or test name>` to run specific unit tests.
+
+#### Performance Optimized Testing
- `poetry run pytest-fast` to run the subset of PyTest tests, which are not flagged as `slow`. (It should take <5 min for fast tests only.)
+
+#### Advanced Options
- `poetry run pytest -s unit_tests` if you want to pass pytest options.

This organization might make it easier for contributors to find the specific command they need. Thoughts? 🤔

🧰 Tools
🪛 LanguageTool

[style] ~116-~116: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast to run the subset of PyTest tests, which a...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~123-~123: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format to format your Python code. ### Run Code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


346-352: Consider adding a note about container cleanup.

When using Docker for MockServer testing, it might be helpful to add a note about cleaning up the container after testing. What do you think about adding:

docker run -d --rm -v $(pwd)/secrets/mock_server_config:/config -p 8113:8113 --env MOCKSERVER_LOG_LEVEL=TRACE --env MOCKSERVER_SERVER_PORT=8113 --env MOCKSERVER_WATCH_INITIALIZATION_JSON=true --env MOCKSERVER_PERSISTED_EXPECTATIONS_PATH=/config/expectations.json --env MOCKSERVER_INITIALIZATION_JSON_PATH=/config/expectations.json mockserver/mockserver:5.15.0

+# To stop the MockServer container when done:
+docker ps | grep mockserver | awk '{print $1}' | xargs docker stop

wdyt? This would help prevent orphaned containers. 🐳


100-100: Add language specifiers to code blocks.

Some code blocks are missing language specifiers. This affects syntax highlighting and markdown linting.

-```
+```yaml
 connector_image: airbyte/source-shopify:dev    to    connector_image: airbyte/source-shopify:<version>

Also, remove spaces inside code spans:

-`     domain: nw0ipt-vr.myshopify.com`
+`domain: nw0ipt-vr.myshopify.com`

Also applies to: 184-184

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

100-100: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73b8fb8 and c82151b.

📒 Files selected for processing (1)
  • docs/CONTRIBUTING.md (5 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/CONTRIBUTING.md

[style] ~116-~116: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast to run the subset of PyTest tests, which a...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~123-~123: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format to format your Python code. ### Run Code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~287-~287: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...

(DO_ACHIEVE)


[locale-violation] ~305-~305: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....

(HAVE_A_LOOK)

🪛 markdownlint-cli2 (0.17.2)
docs/CONTRIBUTING.md

100-100: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


155-155: Emphasis used instead of a heading
null

(MD036, no-emphasis-as-heading)


184-184: Spaces inside code span elements
null

(MD038, no-space-in-code)


254-254: Emphasis used instead of a heading
null

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (1)
docs/CONTRIBUTING.md (1)

8-8: LGTM! Nice improvement in capitalization.

The capitalization of "GitHub" has been corrected, which aligns with proper noun conventions.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (6)
docs/CONTRIBUTING.md (6)

16-81: Setup instructions could be more explicit about Python version requirements, wdyt?

The Python version requirement is mentioned but could be more explicit about whether this is a minimum or exact version requirement. Consider clarifying if versions > 3.11 are also supported.

-1. Make sure your Python version is 3.11
+1. Make sure your Python version is 3.11 or higher

93-101: Consider adding a note about version synchronization, what do you think?

The instructions about changing the connector image version could benefit from a note explaining why version synchronization is important and what problems it prevents.

 where the **version** comes from airbyte/airbyte-integrations/connectors/source-shopify/metadata.yaml, as
-the value of the **dockerImageTag** parameter.
+the value of the **dockerImageTag** parameter. This ensures version consistency between the connector image and its metadata, preventing potential compatibility issues.
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

95-95: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


106-124: The testing section looks great, but what about adding examples? 🤔

The testing instructions are clear, but could be even more helpful with example commands showing typical usage patterns. For instance, showing how to run a specific test suite or how to use the coverage report.

 - `poetry run pytest` to run all unit tests.
+Example:
+```bash
+# Run a specific test suite
+poetry run pytest unit_tests/test_source.py
+
+# Run tests with coverage report
+poetry run pytest --cov=airbyte_cdk
+```
🧰 Tools
🪛 LanguageTool

[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast to run the subset of PyTest tests, which a...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format to format your Python code. ### Run Code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


179-181: Fix code block formatting for better readability?

The domain and shop examples are using inconsistent formatting. Consider using proper code blocks for better readability.

-    domain: nw0ipt-vr.myshopify.com
-    shop: nw0ipt-vr
+```yaml
+domain: nw0ipt-vr.myshopify.com
+shop: nw0ipt-vr
+```
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

179-179: Spaces inside code span elements
null

(MD038, no-space-in-code)


339-345: Consider adding a troubleshooting section for MockServer setup?

The MockServer setup instructions are good, but common issues and their solutions could be helpful. For example, what to do if the port is already in use or if Docker can't mount the volume.

 HTTP requests to `localhost:8113/data` should now return the body defined in the expectations file. To test this, the implementer either has to change the code that defines the base URL for the Python source or update the `url_base` from low-code. With the Connector Builder running in docker, you will have to use domain `host.docker.internal` instead of `localhost` as the requests are executed within docker.

+### Troubleshooting MockServer Setup
+
+- If port 8113 is already in use, you can change it to another port in both the docker command and your configuration
+- If you see volume mount errors, ensure the `secrets/mock_server_config` directory exists and has proper permissions
+- For Windows users using WSL2, you might need to use `%cd%` instead of `$(pwd)` in the volume mount

298-298: Minor style suggestion: consider using American English consistently?

For consistency with other documentation, consider using "take a look" instead of "have a look".

-Please have a look at the [Release Management](./RELEASES.md) guide
+Please take a look at the [Release Management](./RELEASES.md) guide
🧰 Tools
🪛 LanguageTool

[locale-violation] ~298-~298: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....

(HAVE_A_LOOK)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c82151b and 51dad94.

📒 Files selected for processing (1)
  • docs/CONTRIBUTING.md (4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/CONTRIBUTING.md

[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast to run the subset of PyTest tests, which a...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format to format your Python code. ### Run Code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~280-~280: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...

(DO_ACHIEVE)


[locale-violation] ~298-~298: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....

(HAVE_A_LOOK)

🪛 markdownlint-cli2 (0.17.2)
docs/CONTRIBUTING.md

95-95: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


150-150: Emphasis used instead of a heading
null

(MD036, no-emphasis-as-heading)


179-179: Spaces inside code span elements
null

(MD038, no-space-in-code)


249-249: Emphasis used instead of a heading
null

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

661-675: Consider adding more descriptive documentation for the new extractor.

The new DpathEnhancingExtractor class seems to be a key component for implementing the $root service field feature mentioned in the PR objectives. However, its purpose and behavior aren't immediately clear from the documentation. Would you consider adding:

  1. A description of how it enhances the extraction process
  2. An example showing how it adds the $root service field
  3. Documentation about any limitations or considerations when using this extractor
 class DpathEnhancingExtractor(BaseModel):
     type: Literal["DpathEnhancingExtractor"]
     field_path: List[str] = Field(
         ...,
-        description='List of potentially nested fields describing the full path of the field to extract. Use "*" to extract all values from an array. See more info in the [docs](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/record-selector).',
+        description='''List of potentially nested fields describing the full path of the field to extract. Use "*" to extract all values from an array.
+
+        This extractor enhances each extracted record by adding a `$root` service field containing a reference to:
+        - The root response object when parsing JSON format
+        - The original record when parsing JSONL format
+
+        This enables advanced filtering and transformation capabilities by providing access to the complete context.
+        Note: Service fields are local to filtering and transformation steps and are cleaned afterward.
+
+        See more info in the [docs](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/record-selector).
+        ''',
         examples=[
             ["data"],
             ["data", "records"],
             ["data", "{{ parameters.name }}"],
             ["data", "*", "record"],
         ],
         title="Field Path",
     )
     parameters: Optional[Dict[str, Any]] = Field(None, alias="$parameters")
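To make the suggested documentation concrete, here is a minimal sketch of what binding `$root` at extraction time could look like. All names (`extract_with_root`, its signature) are illustrative only, not the actual CDK API:

```python
# Hypothetical sketch of the $root binding described above; not the CDK API.
from typing import Any, Dict, List


def extract_with_root(
    response_body: Dict[str, Any], field_path: List[str]
) -> List[Dict[str, Any]]:
    # Walk the field path, exactly as a dpath-style extractor would.
    node: Any = response_body
    for key in field_path:
        node = node[key]
    records = node if isinstance(node, list) else [node]
    # Bind the $root service field on each extracted record.
    return [{**record, "$root": response_body} for record in records]


body = {"meta": {"request_id": "abc"}, "data": [{"id": 1}, {"id": 2}]}
records = extract_with_root(body, ["data"])
# Each record can now reach response-level context via record["$root"],
# e.g. records[0]["$root"]["meta"] in a filter or transform expression.
```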
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1865-1882: Consider reducing code duplication with create_dpath_extractor?

The implementation is correct, but I notice it shares a lot of logic with create_dpath_extractor. Would you consider extracting the common decoder initialization and field path handling into a shared private method to reduce duplication? Something like this perhaps:

+    def _create_dpath_extractor_base(
+        self,
+        decoder: Optional[Decoder],
+        field_path: List[str],
+        config: Config,
+        parameters: Optional[Dict[str, Any]] = None,
+    ) -> Tuple[Decoder, List[Union[InterpolatedString, str]]]:
+        decoder_to_use = decoder if decoder else JsonDecoder(parameters={})
+        model_field_path: List[Union[InterpolatedString, str]] = [x for x in field_path]
+        return decoder_to_use, model_field_path
+
     def create_dpath_extractor(
         self,
         model: DpathExtractorModel,
         config: Config,
         decoder: Optional[Decoder] = None,
         **kwargs: Any,
     ) -> DpathExtractor:
-        if decoder:
-            decoder_to_use = decoder
-        else:
-            decoder_to_use = JsonDecoder(parameters={})
-        model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path]
+        decoder_to_use, model_field_path = self._create_dpath_extractor_base(
+            decoder, model.field_path, config, model.parameters
+        )
         return DpathExtractor(
             decoder=decoder_to_use,
             field_path=model_field_path,
             config=config,
             parameters=model.parameters or {},
         )

     def create_dpath_enhancing_extractor(
         self,
         model: DpathEnhancingExtractorModel,
         config: Config,
         decoder: Optional[Decoder] = None,
         **kwargs: Any,
     ) -> DpathEnhancingExtractor:
-        if decoder:
-            decoder_to_use = decoder
-        else:
-            decoder_to_use = JsonDecoder(parameters={})
-        model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path]
+        decoder_to_use, model_field_path = self._create_dpath_extractor_base(
+            decoder, model.field_path, config, model.parameters
+        )
         return DpathEnhancingExtractor(
             decoder=decoder_to_use,
             field_path=model_field_path,
             config=config,
             parameters=model.parameters or {},
         )

This would make the code more maintainable and reduce the risk of divergence in the future. What do you think?

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

1463-1493: New Component Addition: DpathEnhancingExtractor
The new DpathEnhancingExtractor component is added to support the inclusion of local service fields ($root and $parent) for use during the filter and transform phases. This aligns nicely with the PR objective of retaining the original object in each record. Would you consider adding more detailed examples or clarifying comments on how these service fields could be utilized (and later cleaned up) to further assist connector developers? wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51dad94 and d3904f7.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1685-1685: LGTM!

The update to include DpathEnhancingExtractor in the RecordSelector's extractor field union type is correct and necessary for the new feature.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

80-80: LGTM! Clean import addition.

The DpathEnhancingExtractor import is well-placed within the existing extractors import group.


218-220: LGTM! Clean model mapping integration.

The DpathEnhancingExtractorModel import and constructor mapping are properly integrated into the factory's model mapping system.

Also applies to: 562-562

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2808-2811: RecordSelector Update: Inclusion of DpathEnhancingExtractor
By adding the reference to DpathEnhancingExtractor in the RecordSelector component’s extractor property, the PR now enables users to select this new extractor. Could you also verify that corresponding tests cover this new behavior—ensuring that the $root field is correctly populated and later removed during transformations? wdyt?

Trigger a new build
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (23)
docs/CONTRIBUTING.md (23)

16-22: Python Version Requirement Added:
Great job specifying that contributors must use Python 3.11. Would it be helpful to include a link to Python 3.11 release notes or a brief note on compatibility issues when using different versions? wdyt?


24-30: Pip Installation Instructions for Fedora:
The new instructions to install pip on Fedora are clear. Would you consider adding a quick note on verifying the installed pip version (for example, by running pip --version) to help less experienced users? wdyt?


32-47: Poetry Installation Details:
The alternative methods for installing Poetry (via pip and via curl) are nicely detailed. Would you consider including a brief comment on when one method might be preferable over the other or adding a link to further documentation? wdyt?


50-63: Using the Python 3.11 Environment:
The section showing how to use a Python 3.11 environment with Poetry is very clear. Would you consider adding a troubleshooting tip (for example, what to do if the correct environment isn’t activated) to further assist contributors? wdyt?


65-65: Cloning the CDK Repository:
The instruction to clone the CDK repo is concise. Do you think adding an example git clone command or a reference link to our Git documentation might further help newcomers? wdyt?


66-70: Installing Unit Test Prerequisites:
The instruction to run poetry install --all-extras is straightforward. Would you consider mentioning briefly why the --all-extras flag is required, so that contributors understand its purpose? wdyt?


73-81: RHEL Module Setup Instructions:
These instructions to execute commands for RHEL (or compatible OS) users are clear. Would you consider adding a short explanation of what the iptable_nat module does and why it’s needed? wdyt?


83-87: 'See Also' References:
The “See also” section with links (including Dagger/Podman integration and issue #197) is useful. Could you please verify that these links are current and consider removing any redundant references if necessary? wdyt?


91-97: Editing Shopify Connector Config:
The guidance on editing the acceptance test configuration for the Shopify connector is clear. Would you consider providing a brief sample diff or additional explanation to help users understand the required changes? wdyt?

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

95-95: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


109-113: Running Unit Tests:
The updated unit test commands are concise and useful. Would you consider varying the sentence structure slightly to avoid repeated phrasing? wdyt?

🧰 Tools
🪛 LanguageTool

[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast to run the subset of PyTest tests, which a...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


114-119: Code Formatting Commands:
In this section, several sentences start similarly. Would you consider rewording a couple of these sentences for improved readability, or adding transitions between them? wdyt?

🧰 Tools
🪛 LanguageTool

[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format to format your Python code. ### Run Code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


126-133: More Tools and Options:
This section is quite helpful. Would it add clarity to include a short explanation of when to use each tool before listing the commands? wdyt?


134-139: Local Testing Against Connector Changes:
The instructions for testing locally with CDK changes are detailed. Would you consider including a brief troubleshooting tip for common issues encountered during this process? wdyt?


150-189: Integration Test Preparation Steps:
This lengthy block is very detailed and informative. Would you consider breaking it into sub-sections (for example, “Project Setup,” “Configuration Changes,” and “File Creation”) to enhance readability for contributors? wdyt?

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

150-150: Emphasis used instead of a heading
null

(MD036, no-emphasis-as-heading)


179-179: Spaces inside code span elements
null

(MD038, no-space-in-code)


199-211: Running Connector Tests:
The step-by-step commands for executing connector tests are clear. Would adding a note on how to interpret test outcomes or what to do when a test fails be beneficial? wdyt?


213-223: Shopify Connector Example:
The example provided for the Shopify connector is practical. Would you consider clarifying the placeholder <connector name> (or providing an example) so that the instructions are completely unambiguous? wdyt?


233-243: Extended Acceptance Test Examples:
This block provides various command options (including debug and timeout). Would it be helpful to label these options explicitly (e.g., "With Debug Option", "With Timeout Option") for even easier reference? wdyt?


251-267: airbyte-ci CLI and Build Commands:
The instructions for installing the airbyte-ci CLI and building the connector image using the local CDK are well organized. Would you consider adding a link or further reference to detailed documentation on airbyte-ci for contributors who want to dive deeper? wdyt?


272-276: Working with Poe Tasks:
This section explains Poe Tasks clearly. Would you consider including a short troubleshooting tip or common pitfalls when using Poe tasks to further support contributors? wdyt?


280-285: Auto-Generating the Declarative Schema File:
The instructions are clear, though the phrasing “you may need to regenerate them” could be more expressive. Would you consider rewording it to, for example, “please regenerate the schemas to ensure they align with the latest changes”? wdyt?

🧰 Tools
🪛 LanguageTool

[style] ~280-~280: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...

(DO_ACHIEVE)


297-299: Release Management Section Wording:
In the Release Management section, the instruction “please have a look at the Release Management guide” might sound friendlier if changed to “please take a look at...”. Would you consider this minor tweak to align with American English usage? wdyt?

🧰 Tools
🪛 LanguageTool

[locale-violation] ~298-~298: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....

(HAVE_A_LOOK)


311-316: GitHub Slash Commands Formatting:
The list of GitHub slash commands is very useful. Would you consider adjusting the formatting of the /autofix entry—perhaps by ensuring a consistent separation between the command and its description—as done with /poetry-lock? wdyt?


347-351: Connector Acceptance Tests Header:
The header for “Running Connector Acceptance Tests for a single connector in Docker” is clear. Would you consider adding a brief note on any prerequisites (if not mentioned elsewhere) for running these tests? wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d3904f7 and 0e1e5b3.

📒 Files selected for processing (1)
  • docs/CONTRIBUTING.md (4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/CONTRIBUTING.md

[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast to run the subset of PyTest tests, which a...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format to format your Python code. ### Run Code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~280-~280: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...

(DO_ACHIEVE)


[locale-violation] ~298-~298: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....

(HAVE_A_LOOK)

🪛 markdownlint-cli2 (0.17.2)
docs/CONTRIBUTING.md

95-95: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


150-150: Emphasis used instead of a heading
null

(MD036, no-emphasis-as-heading)


179-179: Spaces inside code span elements
null

(MD038, no-space-in-code)


249-249: Emphasis used instead of a heading
null

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (15)
docs/CONTRIBUTING.md (15)

8-8: Clarify Contribution Avenues:
You updated the bullet point to include “GitHub discussions” (with proper capitalization) as a way to contribute. Would you please confirm that this revised wording aligns with our intended communication and style guidelines? wdyt?


88-88: Docker Installation Note:
The brief note on ensuring Docker is installed locally is succinct. Is this meant to serve only as a reminder, or would adding a link to more detailed Docker installation instructions improve clarity? wdyt?


120-125: Code Linting Instructions:
The linting commands are clearly listed and align well with our best practices. Great work here!


140-141: GitHub Pipelines Integration Note:
The note that the GitHub pipelines run integration tests against the Shopify source is concise and useful. Thank you for the clarity!


144-148: Assumptions Section:
The assumptions about checkout locations and project structure are clearly laid out. This adds valuable context—great job!


225-231: Acceptance Tests Locally:
The instructions for running acceptance tests locally are straightforward and well explained. Excellent work!


245-246: Reverting Test Changes Reminder:
The reminder to revert test changes is a nice touch to ensure cleanup after testing. Looks good to me!


247-250: Test CDK Changes Section Header:
The header and subsequent preparation steps for testing CDK changes against connectors are clear. Thank you for the detailed guidance!

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

249-249: Emphasis used instead of a heading
null

(MD036, no-emphasis-as-heading)


269-271: Local CDK Build Note:
The explanation that the local CDK is injected at build time is concise and clear. Great job!


277-277: Duplicate Ruff Configuration Paragraph:
It appears that the paragraph about the Ruff configuration is duplicated. Would you consider removing the redundant copy to improve document clarity? wdyt?


294-296: docs-generate Task Mapping:
The explanation of how the docs-generate task is mapped and what it produces is clear and useful. Nice work here!


302-305: FAQ on Maintainers:
The FAQ entry explaining who the “maintainers” are is clear and informative. No changes needed here—great job!


323-327: MockServer Introduction:
The introduction about using MockServer in place of direct API access is clear and sets the context well. Thanks for including this helpful information!


339-343: Docker Command for MockServer:
The Docker command provided for running MockServer is detailed and should be very useful for our testing. Everything looks good here!


345-346: HTTP Requests Explanation:
The follow-up explanation that HTTP requests to localhost:8113/data will return the defined body is clear. No issues here—great work!

At the record extraction step, add to each record a $root service field holding a reference to the object the record was extracted from:
* the root response object, when parsing JSON format
* the original record, when parsing JSONL format
More service fields could be added in the future.
The service fields are available in the record's filter and transform steps.

Avoid:
* reusing the maps/dictionaries produced, thus avoid building cyclic structures
* transforming the service fields in the Flatten transformation.

Explicitly cleanup the service field(s) after the transform step, thus making them:
* local for the filter and transform steps
* not visible to the next mapping and store steps (as they should be)
* not visible in the tests beyond the test_record_selector (as they should be)
This allows the record transformation logic to define its "local variables" to reuse
some interim calculations.
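The cleanup step described above can be sketched as follows (illustrative names, not the actual CDK code):

```python
# Strip the service fields after the transform step, so they stay local to
# filtering/transformation and are invisible to the mapping and store steps.
SERVICE_FIELDS = frozenset({"$root", "$parent"})


def drop_service_fields(record: dict) -> dict:
    return {k: v for k, v in record.items() if k not in SERVICE_FIELDS}


transformed = {
    "id": 7,
    "owner": "acme",
    "$root": {"status": "ok"},
    "$parent": {"group": "g1"},
}
emitted = drop_service_fields(transformed)
# emitted keeps only the business fields: {"id": 7, "owner": "acme"}
```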

The contract of body parsing is irregular in how it represents the cases of bad JSON, missing JSON, and empty JSON.
It cannot be unified, because existing code already relies on that irregularity.

Update the development environment setup documentation
* to organize and present the setup steps explicitly
* to avoid misunderstandings and wasted efforts.

Update CONTRIBUTING.md to
* collect and organize the knowledge on running the tests locally.
* state the actual testing steps.
* clarify and make explicit the procedures and steps.

The unit, integration, and acceptance tests in this exact version succeed under Fedora 41, while one of them fails under Oracle Linux 8.7; that failure is not related to the contents of this PR.
The integration tests of the CDK fail due to a missing `secrets/config.json` file for the Shopify source.
See airbytehq#197

Polish

Integrate the DpathEnhancingExtractor in the UI of Airbyte.
Created a DpathEnhancingExtractor
Refactored the record enhancement logic, moving it into the extracted class
Split the tests of DpathExtractor and DpathEnhancingExtractor

Fix the failing tests:

FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_custom_components[test_create_custom_component_with_subcomponent_that_uses_parameters]
FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_custom_components_do_not_contain_extra_fields
FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_parse_custom_component_fields_if_subcomponent
FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_page_increment
FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_offset_increment
FAILED unit_tests/sources/file_based/test_file_based_scenarios.py::test_file_based_read[simple_unstructured_scenario]
FAILED unit_tests/sources/file_based/test_file_based_scenarios.py::test_file_based_read[no_file_extension_unstructured_scenario]

They failed because string and int values of the page_size (public) attribute were compared.
Imposed an invariant:
  on construction, page_size can be set to a string or an int
  keep only values of one type in page_size for uniform comparison (convert values of the other type)
  _page_size holds the internal / working value
... unless manipulated directly.
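The invariant can be illustrated with a minimal sketch (hypothetical class, not the actual CDK paginator):

```python
# On construction, page_size may be a str or an int; normalize to a single
# internal type (int here) so later comparisons are uniform. _page_size
# holds the internal / working value.
class PageIncrement:
    def __init__(self, page_size):
        self._page_size = (
            int(page_size) if isinstance(page_size, str) else page_size
        )

    @property
    def page_size(self) -> int:
        return self._page_size


# "50" and 50 now compare equal through the public attribute.
assert PageIncrement("50").page_size == PageIncrement(50).page_size == 50
```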

Merged:
feat(low-code concurrent): Allow async job low-code streams that are incremental to be run by the concurrent framework (airbytehq#228)
fix(low-code): Fix declarative low-code state migration in SubstreamPartitionRouter (airbytehq#267)
feat: combine slash command jobs into single job steps (airbytehq#266)
feat(low-code): add items and property mappings to dynamic schemas (airbytehq#256)
feat: add help response for unrecognized slash commands (airbytehq#264)
ci: post direct links to html connector test reports (airbytehq#252) (airbytehq#263)
fix(low-code): Fix legacy state migration in SubstreamPartitionRouter (airbytehq#261)
fix(airbyte-cdk): Fix RequestOptionsProvider for PerPartitionWithGlobalCursor (airbytehq#254)
feat(low-code): add profile assertion flow to oauth authenticator component (airbytehq#236)
feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor (airbytehq#111)
fix: don't mypy unit_tests (airbytehq#241)
fix: handle backoff_strategies in CompositeErrorHandler (airbytehq#225)
feat(concurrent cursor): attempt at clamping datetime (airbytehq#234)
ci: use `ubuntu-24.04` explicitly (resolves CI warnings) (airbytehq#244)
Fix(sdm): module ref issue in python components import (airbytehq#243)
feat(source-declarative-manifest): add support for custom Python components from dynamic text input (airbytehq#174)
chore(deps): bump avro from 1.11.3 to 1.12.0 (airbytehq#133)
docs: comments on what the `Dockerfile` is for (airbytehq#240)
chore: move ruff configuration to dedicated ruff.toml file (airbytehq#237)
feat(low-code): add DpathFlattenFields (airbytehq#227)

formatted

Update record_extractor.py

Trigger a new build. Hopefully, the integration test infrastructure is fixed.

Update CONTRIBUTING.md

Trigger a new build
  docs/CONTRIBUTING.md
  unit_tests/sources/declarative/transformations/test_flatten_fields.py

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (14)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (2)

33-35: LGTM! Would you consider adding a docstring? 🤔

The enhanced function signature with default values looks great! For better maintainability, maybe we could add a docstring explaining the parameters? wdyt?

 def generate_csv(
     encoding: str = "utf-8", delimiter: str = ",", should_compress: bool = False
 ) -> bytes:
+    """Generate CSV test data with configurable encoding, delimiter and compression.
+    
+    Args:
+        encoding (str, optional): Character encoding. Defaults to "utf-8".
+        delimiter (str, optional): CSV field separator. Defaults to ",".
+        should_compress (bool, optional): Whether to gzip compress output. Defaults to False.
+    
+    Returns:
+        bytes: Generated CSV data, optionally compressed
+    """

180-200: Great test coverage! Maybe we could make the test data DRY? 🤔

The parameterized test looks excellent! One tiny suggestion: since the expected data is identical to what's in generate_csv, would you consider extracting it to a shared constant to avoid duplication? wdyt?

+TEST_CSV_DATA = [
+    {"id": "1", "name": "John", "age": "28"},
+    {"id": "2", "name": "Alice", "age": "34"},
+    {"id": "3", "name": "Bob", "age": "25"},
+]
+
 def generate_csv(
     encoding: str = "utf-8", delimiter: str = ",", should_compress: bool = False
 ) -> bytes:
-    data = [
-        {"id": "1", "name": "John", "age": "28"},
-        {"id": "2", "name": "Alice", "age": "34"},
-        {"id": "3", "name": "Bob", "age": "25"},
-    ]
+    data = TEST_CSV_DATA
     ...

 def test_composite_raw_decoder_csv_parser_values(requests_mock, encoding: str, delimiter: str):
     ...
-    expected_data = [
-        {"id": "1", "name": "John", "age": "28"},
-        {"id": "2", "name": "Alice", "age": "34"},
-        {"id": "3", "name": "Bob", "age": "25"},
-    ]
+    expected_data = TEST_CSV_DATA
docs/CONTRIBUTING.md (12)

109-120: Local Development – Unit Test Commands
The “Run Unit Tests” section now provides several command options (including using pytest’s -k and -s flags), which is very useful for different testing situations. Might adding a one-line explanation for when to use each option assist less experienced contributors? wdyt?

🧰 Tools
🪛 LanguageTool

[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast to run the subset of PyTest tests, which a...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


131-135: Additional Tools and Options Reference
The “More tools and options” section offers quick access to useful commands. Would linking to a more detailed description of each tool be beneficial, or is brevity preferred here? wdyt?


143-148: Assumptions and Pipeline Use
The note that GitHub pipelines run the integration tests along with clear assumptions about directory structure provides valuable context. Have you considered adding a simple diagram to visually represent this structure? wdyt?


172-180: Configuring Secrets for Integration Tests
The detailed instructions—complete with an example for the Shopify connector—for creating the secrets/config.json file are very practical. Would you consider revising the formatting of the in-line example (e.g., extra indentation) for clarity? wdyt?


198-209: Running Connector Test Commands
The “Steps:” section with detailed commands for running connector tests is comprehensive. Do you think it would help to annotate which command verifies which aspect of the connector (spec, check, discover, read)? wdyt?


211-220: Shopify Connector Test Example
The example provided for the Shopify source (including running spec, check, discover, and read commands) is clear and instructive. Would a note on expected outputs help ensure users know when a test passes? wdyt?


229-238: Extended Acceptance Test Options
Providing examples with debug and timeout options is highly useful for troubleshooting. Would it be beneficial to include a brief explanation of when to use the timeout option versus the debug option? wdyt?


242-257: Testing CDK Changes Against Connectors
The detailed instructions using the airbyte-ci CLI for building connector images and running tests are very well defined. Would a pointer to common errors during the build/test process be useful for contributors? wdyt?

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

256-256: Emphasis used instead of a heading
null

(MD036, no-emphasis-as-heading)


273-279: Generating API Reference Documentation
The section on generating API docs using pdoc is detailed and helpful. Should we perhaps mention how to preview the generated docs locally if issues arise? wdyt?


280-282: Release Management Guidance
The directive to “have a look” at the Release Management guide is clear. Given regional language preferences, would you consider revising it to “Please take a look at…” for consistency with American English? wdyt?


283-299: FAQ Section Clarity and Consistency
The FAQ is well structured and covers key questions. A couple of suggestions:
• Consider varying the starting words of consecutive sentences to enhance readability.
• The bullet for “/autofix” could use consistent spacing (e.g., adding a space after /autofix).
Would these refinements help improve the flow? wdyt?

🧰 Tools
🪛 LanguageTool

[style] ~287-~287: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...

(DO_ACHIEVE)


302-334: Advanced Topics and Appendix
The advanced topics—covering using MockServer and running connector acceptance tests in Docker—are comprehensive and provide useful real-world guidance. Would including a brief troubleshooting checklist or common pitfalls in these sections further assist contributors? wdyt?

🧰 Tools
🪛 LanguageTool

[locale-violation] ~305-~305: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....

(HAVE_A_LOOK)


[uncategorized] ~317-~317: Use a comma before ‘and’ if it connects two independent clauses (unless they are closely connected and short).
Context: ...What GitHub slash commands are available and who can run them? Only Airbyte CDK mai...

(COMMA_COMPOUND_SENTENCE_2)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e1e5b3 and 338d117.

📒 Files selected for processing (7)
  • .github/workflows/pypi_publish.yml (1 hunks)
  • airbyte_cdk/sources/declarative/auth/oauth.py (2 hunks)
  • airbyte_cdk/sources/streams/http/requests_native_auth/abstract_oauth.py (1 hunks)
  • docs/CONTRIBUTING.md (4 hunks)
  • unit_tests/sources/declarative/auth/test_oauth.py (3 hunks)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (2 hunks)
  • unit_tests/sources/declarative/retrievers/test_simple_retriever.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .github/workflows/pypi_publish.yml
🧰 Additional context used
🪛 LanguageTool
docs/CONTRIBUTING.md

[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast to run the subset of PyTest tests, which a...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~125-~125: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format to format your Python code. ### Run Code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[uncategorized] ~184-~184: Possible missing comma found.
Context: ...file. - On the Settings / Domains page find the subdomain of myshopify.com a...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~287-~287: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...

(DO_ACHIEVE)


[locale-violation] ~305-~305: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....

(HAVE_A_LOOK)

🪛 markdownlint-cli2 (0.17.2)
docs/CONTRIBUTING.md

102-102: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


157-157: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


186-186: Spaces inside code span elements

(MD038, no-space-in-code)


256-256: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (32)
unit_tests/sources/declarative/retrievers/test_simple_retriever.py (2)

257-260: LGTM! The state completion assertion looks good.

The added assertion verifies that the state correctly indicates sync completion after reading all records. This aligns well with the PR objective of enhancing record filtering and transformation.


372-375: LGTM! The state completion assertion looks good.

The added assertion verifies that the state correctly indicates sync completion after reading all records. This aligns well with the PR objective of enhancing record filtering and transformation.

unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

37-39: Nice improvements to the test data and CSV generation! 👍

The changes look great:

  • Using strings for all fields is more CSV-friendly
  • Clean separation of compression logic
  • Configurable delimiter enhances test coverage

Also applies to: 42-43, 48-53

airbyte_cdk/sources/declarative/auth/oauth.py (2)

234-237: LGTM! Nice defensive programming.

The addition of the access token initialization check before returning the expiry date helps prevent potential issues with uninitialized tokens.
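A minimal sketch of the defensive pattern these two comments describe, assuming hypothetical names (`TokenHolder`, `_has_access_token_been_initialized`) rather than the CDK's actual `oauth.py` API:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class TokenHolder:
    """Illustrative stand-in for an OAuth authenticator's token state."""

    def __init__(self) -> None:
        self._access_token: Optional[str] = None
        self._token_expiry_date: Optional[datetime] = None

    def _has_access_token_been_initialized(self) -> bool:
        # Single-responsibility helper, mirroring the reviewed change.
        return self._access_token is not None

    def get_token_expiry_date(self) -> Optional[datetime]:
        # Guard: without an initialized token, an expiry date is meaningless.
        if not self._has_access_token_been_initialized():
            return None
        return self._token_expiry_date

    def set_token(self, token: str, ttl_seconds: int) -> None:
        self._access_token = token
        self._token_expiry_date = datetime.now(timezone.utc) + timedelta(
            seconds=ttl_seconds
        )
```

The guard plus the tiny helper keeps the "is the token usable?" decision in one place instead of scattering `is not None` checks through callers.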


239-241: LGTM! Clean and focused helper method.

The method follows the single responsibility principle and has a clear, descriptive name.

airbyte_cdk/sources/streams/http/requests_native_auth/abstract_oauth.py (1)

264-266: LGTM! Good edge case handling.

The addition of the empty value check with token expiry validation provides a graceful fallback mechanism.

unit_tests/sources/declarative/auth/test_oauth.py (2)

317-344: LGTM! Great test coverage for the new functionality.

The test case thoroughly verifies the behavior when no access token is present but the expiry date is in the future.


304-304: LGTM! Good test enhancement.

Adding the access token value parameter improves the test coverage.

docs/CONTRIBUTING.md (24)

8-8: Refine Contribution Instructions
The updated bullet now explicitly lists “reporting bugs, posting GitHub discussions, opening issues…” which broadens the contribution options. Was this expansion intended to better guide potential contributors? wdyt?


16-17: Specify Python Version Requirement
Introducing “1. Make sure your Python version is 3.11” establishes a clear baseline. Have you considered whether contributors using other versions might need additional guidance or troubleshooting steps? wdyt?


18-23: Fedora Instructions for Python Installation
Including “Fedora 41:” with a dedicated code block to install Python 3.11 is very helpful for users on that platform. Did you verify these commands on recent Fedora releases? wdyt?


24-31: Pip Installation Guidance on Fedora
The step “2. Install pip” along with Fedora-specific commands provides clear direction. Would you consider adding a brief note on what to do if the pip package name or command changes in future Fedora releases? wdyt?


32-47: Multiple Methods for Installing Poetry
The section now offers two alternatives for installing Poetry (via pip and via curl) as well as a Fedora-specific block. This flexibility is great. Would you like to mention which method is preferred in our official recommendations, or should we let users choose freely? wdyt?


48-49: Poe Tasks Note Clarity
The note explaining that “Poe” tasks can be used for common actions is informative. Is the linked “Poe the Poet” documentation up to date with these instructions? wdyt?


50-63: Guidance on Using the Python 3.11 Environment
The clear instructions and code block for verifying and switching the Poetry environment to Python 3.11 will help prevent environment mismatches. Would you consider adding a brief troubleshooting tip if the command fails? wdyt?


65-70: Repository Cloning and Test Prerequisites
The instructions to clone the CDK repo and install the unit test prerequisites are neatly detailed. Do you think a note on branch or version management could further assist contributors? wdyt?


74-88: RHEL Module Loading Instructions
The steps for RHEL (or compatible OS) to load the IP table NAT module—with both a temporary and persistent solution—are very clear. Should we add a reminder that these commands typically require root privileges? wdyt?


92-93: Additional Reference Links
The “See also:” links to the Dagger-Podman integration and CDK Issue 197 offer useful extra context. Do you think any further annotations are needed, or is the current level of detail sufficient? wdyt?


95-98: Docker Installation Recommendation
Emphasizing that Docker should be installed (instead of Podman) is a strong clarification. Is this recommendation documented elsewhere as well, or should we consolidate related notes here? wdyt?


100-107: Acceptance Test Config Update
The instructions for editing the Shopify acceptance-test configuration—with the before/after code block—are precise. Would it help to remind users to verify that no other fields are unintentionally modified during this update? wdyt?


121-124: Code Formatting Instructions
The step-by-step formatting commands using ruff (both checking and auto-fixing) are clear and concise. Do these instructions match our latest formatting guidelines closely? wdyt?


126-129: Linting and Type-checking Guidance
Including commands for both Poe-based linting and mypy type-checking provides thorough coverage of code quality. Are there any additional common pitfalls we should warn about in this section? wdyt?


137-141: Integration Testing Scenarios
The description of various scenarios for running integration tests (e.g., running a connector that uses the new feature) is very practical. Would it be helpful to explicitly list prerequisites for these tests? wdyt?


150-156: Acceptance Test Preparation
The “Preparation steps” for running connector acceptance tests (including entering the correct directory and installing dependencies) are straightforward. Would a brief troubleshooting tip (e.g., common errors) be useful here? wdyt?


158-163: Updating Test Connector Configuration
The instructions to modify the <test connector>/pyproject.toml file and replace the airbyte_cdk line are clear. Do you think it’s necessary to mention potential version mismatches during this update? wdyt?


165-170: Reinstalling airbyte_cdk for Testing
The step-by-step commands for reinstalling airbyte_cdk in the test connector directory are concise. Are there any caching issues or additional cleanup steps we might need to warn about? wdyt?


183-192: Example JSON Configuration for Secrets
The JSON snippet is well-formatted and serves as a good reference for contributors. Do you think adding comments within the JSON (if permitted) would further aid understanding? wdyt?



194-196: Reference Links for Testing Resources
Including links to the Acceptance Tests Reference and Connector Acceptance Tests documentation adds valuable context. Would you like to include a brief summary of what each link covers? wdyt?


222-227: Local Acceptance Test Execution
The command block for running acceptance tests locally is straightforward. Do you feel that adding information on environment prerequisites (e.g., necessary Docker settings) might prevent future confusion? wdyt?


240-240: Test Changes Reversion Reminder
A simple reminder to revert test changes is a neat touch to keep the repository clean. Does that cover all scenarios, or should we expand on how to revert if using version control? wdyt?


258-263: Working with Poe Tasks
The explanation regarding the use of Poe tasks (and the option to pre-install Poe via pipx) is clear and friendly. Do you think a short example of a commonly used Poe command might further enhance this section? wdyt?


264-272: Auto-Generating the Declarative Schema File
The instructions for regenerating the schema file from the YAML spec are concise and easy to follow. Would linking to an example of a schema change be helpful for context? wdyt?


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1882-1899: Consider reducing code duplication with create_dpath_extractor?

The implementation is almost identical to create_dpath_extractor. What do you think about extracting the common logic into a private helper method to reduce duplication? Something like this, wdyt?

+    def _create_dpath_extractor_base(
+        self,
+        decoder: Optional[Decoder],
+        model_field_path: List[Union[InterpolatedString, str]],
+        config: Config,
+        parameters: Optional[Dict[str, Any]],
+        extractor_class: Type[Union[DpathExtractor, DpathEnhancingExtractor]]
+    ) -> Union[DpathExtractor, DpathEnhancingExtractor]:
+        decoder_to_use = decoder if decoder else JsonDecoder(parameters={})
+        return extractor_class(
+            decoder=decoder_to_use,
+            field_path=model_field_path,
+            config=config,
+            parameters=parameters or {},
+        )
+
     def create_dpath_extractor(
         self,
         model: DpathExtractorModel,
         config: Config,
         decoder: Optional[Decoder] = None,
         **kwargs: Any,
     ) -> DpathExtractor:
-        if decoder:
-            decoder_to_use = decoder
-        else:
-            decoder_to_use = JsonDecoder(parameters={})
         model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path]
-        return DpathExtractor(
-            decoder=decoder_to_use,
-            field_path=model_field_path,
-            config=config,
-            parameters=model.parameters or {},
-        )
+        return self._create_dpath_extractor_base(
+            decoder=decoder,
+            model_field_path=model_field_path,
+            config=config,
+            parameters=model.parameters,
+            extractor_class=DpathExtractor
+        )

     def create_dpath_enhancing_extractor(
         self,
         model: DpathEnhancingExtractorModel,
         config: Config,
         decoder: Optional[Decoder] = None,
         **kwargs: Any,
     ) -> DpathEnhancingExtractor:
-        if decoder:
-            decoder_to_use = decoder
-        else:
-            decoder_to_use = JsonDecoder(parameters={})
         model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path]
-        return DpathEnhancingExtractor(
-            decoder=decoder_to_use,
-            field_path=model_field_path,
-            config=config,
-            parameters=model.parameters or {},
-        )
+        return self._create_dpath_extractor_base(
+            decoder=decoder,
+            model_field_path=model_field_path,
+            config=config,
+            parameters=model.parameters,
+            extractor_class=DpathEnhancingExtractor
+        )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fabc665 and ceeea8f.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)

rpopov added 3 commits March 4, 2025 21:06
* main:
  fix: update cryptography package to latest version to address CVE (airbytehq#377)
  fix: (CDK) (HttpRequester) - Make the `HttpRequester.path` optional (airbytehq#370)
  feat: improved custom components handling (airbytehq#350)
  feat: add microseconds timestamp format (airbytehq#373)
  fix: Replace Unidecode with anyascii for permissive license (airbytehq#367)
  feat: add IncrementingCountCursor (airbytehq#346)
  feat: (low-code cdk)  datetime format with milliseconds (airbytehq#369)
  fix: (CDK) (AsyncRetriever) - Improve UX on variable naming and interpolation (airbytehq#368)
  fix: (CDK) (AsyncRetriever) - Add the `request` and `response` to each `async` operations (airbytehq#356)
  fix: (CDK) (ConnectorBuilder) - Add `auxiliary requests` to slice; support `TestRead` for AsyncRetriever (part 1/2) (airbytehq#355)
  feat(concurrent perpartition cursor): Add parent state updates (airbytehq#343)
  fix: update csv parser for builder compatibility (airbytehq#364)
  feat(low-code cdk): add interpolation for limit field in Rate (airbytehq#353)
  feat(low-code cdk): add AbstractStreamFacade processing as concurrent streams in declarative source (airbytehq#347)
  fix: (CDK) (CsvParser) - Fix the `\\` escaping when passing the `delimiter` from Builder's UI (airbytehq#358)
  feat: expose `str_to_datetime` jinja macro (airbytehq#351)
  fix: update CDK migration for 6.34.0 (airbytehq#348)
  feat: Removes `stream_state` interpolation from CDK (airbytehq#320)
  fix(declarative): Pass `extra_fields` in `global_substream_cursor` (airbytehq#195)
  feat(concurrent perpartition cursor): Refactor ConcurrentPerPartitionCursor (airbytehq#331)
  feat(HttpMocker): adding support for PUT requests and bytes responses (airbytehq#342)
  chore: use certified source for manifest-only test (airbytehq#338)
  feat: check for request_option mapping conflicts in individual components (airbytehq#328)
  feat(file-based): sync file acl permissions and identities (airbytehq#260)
  fix: (CDK) (Connector Builder) - refactor the `MessageGrouper` > `TestRead` (airbytehq#332)
  fix(low code): Fix missing cursor for ClientSideIncrementalRecordFilterDecorator (airbytehq#334)
  feat(low-code): Add API Budget (airbytehq#314)
  chore(decoder): clean decoders and make csvdecoder available (airbytehq#326)
* main:
  ci: add Docker build workflow for PR checks (airbytehq#385)
  fix: Update DEFAULT_ERROR_MAPPING for InvalidURL to RETRY (airbytehq#384)
  chore: use python base image 4.0.0 in SDM (airbytehq#382)
  ci: add check for 'do not merge' in PR titles (airbytehq#383)
  chore: remove embedded module (airbytehq#380)
  build: update pyarrow to ^19.0.0 (do not merge) (airbytehq#379)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (2)
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (2)

188-196: 🛠️ Refactor suggestion

The test function doesn't verify the service fields; should we extend the assertions?

The test function's implementation doesn't appear to verify the presence or behavior of service fields like $root that are mentioned in the PR objectives.

Perhaps we could enhance this test function by:

  1. Checking that extracted records initially contain the $root service field
  2. Verifying that exclude_service_keys(actual_records) correctly removes these service fields
  3. Then comparing the result with expected_records

This would more thoroughly test the functionality described in the PR. What do you think about adding these additional assertions?

def test_dpath_extractor(field_path: List, decoder: Decoder, body, expected_records: List):
    extractor = DpathExtractor(
        field_path=field_path, config=config, decoder=decoder, parameters=parameters
    )

    response = create_response(body)
    actual_records = list(extractor.extract_records(response))

+   # Verify presence of service fields
+   for record in actual_records:
+       assert "$root" in record, "Service field $root should be present in extracted records"
+   
+   # Verify service fields are correctly excluded
+   cleaned_records = exclude_service_keys(actual_records)
+   assert cleaned_records == expected_records, "Records after excluding service fields should match expected records"
-   assert actual_records == expected_records

37-155: 💡 Verification agent

❓ Verification inconclusive

Test cases expanded but missing verification of service fields - could we improve this?

You've added many helpful test cases covering edge cases like empty strings, objects, and arrays, as well as more complex nested structures. However, according to the PR objectives, a key feature being added is the $root service field for each record.

None of these test cases seem to verify that:

  1. The $root service field is correctly added to each record
  2. The service field contains the original object reference
  3. The service fields can be properly excluded after transformation

Could we enhance some of these test cases to verify this core functionality of the PR? Perhaps comparing the records before and after applying exclude_service_keys?


🏁 Script executed:

#!/bin/bash
# Check for tests or implementations that handle the $root service field.
# Single quotes keep the backslash, so rg receives the escaped literal `\$root`;
# with double quotes the shell would strip it and rg would see a bare `$` anchor.
rg --type py '\$root' -A 3 -B 3 --no-heading

Length of output: 45


Action Required: Enhance tests to verify the $root service field functionality

The current test suite in unit_tests/sources/declarative/extractors/test_dpath_extractor.py covers many edge cases, but it doesn't verify that the new $root service field is correctly added, contains the original input data, or is properly excluded when exclude_service_keys is applied. Would you consider adding tests that:

  • Verify each record has the $root service field when expected.
  • Check that the $root field contains the original object reference.
  • Compare the records before and after applying exclude_service_keys to ensure the service field is properly removed.

Wdyt?

🧹 Nitpick comments (5)
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)

40-40: Comment placement might cause confusion; should we adjust the formatting?

The comment "The JSON contract is irregular, compare with JSONL" is placed at the end of the line rather than on a separate line above or below the test case. This is inconsistent with the comment style used on lines 127-128 and might make it harder to read.

Would it make sense to move this comment to a new line for consistency and clarity?

-        ([], decoder_json, b"", [{}]),  # The JSON contract is irregular, compare with JSONL
+        # The JSON contract is irregular, compare with JSONL
+        ([], decoder_json, b"", [{}]),
unit_tests/sources/declarative/extractors/test_record_selector.py (1)

47-76: Good addition of nested record test cases

These new test cases extend coverage to include nested data structures, which aligns nicely with the PR's goal of enhancing filtering and transformation.

However, I'm curious if we should also explicitly verify that the $root service field is added to each record? That seems to be the primary objective mentioned in the PR. Perhaps we could add assertions to check for the presence and correctness of these service fields?

 def test_record_filter(test_name, field_path, filter_template, body, expected_data):
     # ... existing code ...
     actual_records = list(
         record_selector.select_records(
             response=response,
             records_schema=schema,
             stream_state=stream_state,
             stream_slice=stream_slice,
             next_page_token=next_page_token,
         )
     )
     assert actual_records == [
         Record(data=data, associated_slice=stream_slice, stream_name="") for data in expected_data
     ]
+    
+    # For the nested record test cases, verify service fields
+    if test_name in ["test_no_record_filter_returns_all_records_with_nested", "test_true_record_filter_returns_all_records_with_nested"]:
+        for record in actual_records:
+            # Verify service fields are present
+            assert SERVICE_KEY_ROOT in record.data
+            # Verify root contains the original object
+            assert record.data[SERVICE_KEY_ROOT] == body
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

703-717: Is this a more functional version of DpathExtractor?

This new DpathEnhancingExtractor class appears to have the same structure as the existing DpathExtractor class. According to the PR description, this is meant to implement a feature to add service fields like $root to records during extraction.

Since both extractors have identical field structures, it might be worth clarifying in the documentation how specifically this enhances the standard extractor functionality to make it clearer for future users and maintainers. Perhaps adding a brief note to highlight the service field functionality would be helpful, wdyt?
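For future maintainers, a hedged illustration of what the service-field behavior described in the PR could look like. `bind_service_fields` and its simplified parent tracking are assumptions made for this sketch, not the CDK's actual implementation (which, per the PR description, also chains `$parent` fields recursively):

```python
from typing import Any, Dict, List

SERVICE_KEY_ROOT = "$root"
SERVICE_KEY_PARENT = "$parent"


def bind_service_fields(
    response: Dict[str, Any], field_path: List[str]
) -> List[Dict[str, Any]]:
    """Walk field_path like a Dpath extractor, then attach $root/$parent."""
    parent: Any = response
    node: Any = response
    for key in field_path:
        parent = node  # the object that owns the node we're descending into
        node = node[key]
    records = node if isinstance(node, list) else [node]
    enriched_records = []
    for record in records:
        enriched = dict(record)  # shallow copy; never mutate the response
        enriched[SERVICE_KEY_ROOT] = response
        enriched[SERVICE_KEY_PARENT] = parent
        enriched_records.append(enriched)
    return enriched_records
```

With a response like `{"meta": {...}, "data": {"owner": "a", "items": [...]}}` and `field_path=["data", "items"]`, a filter such as `{{ record["$parent"]["owner"] }}` can then reach the owning object even though it is outside the selected records.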

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

1995-2013: Implementation looks solid, but could we reduce duplication?

The factory method is correctly implemented and follows the same pattern as other factory methods, which is great for consistency. However, I notice it's nearly identical to the create_dpath_extractor method just above it.

Would it make sense to extract the common logic into a private helper method to avoid duplication? This would improve maintainability for future changes to both extractors. Wdyt?

-    def create_dpath_enhancing_extractor(
-        self,
-        model: DpathEnhancingExtractorModel,
-        config: Config,
-        decoder: Optional[Decoder] = None,
-        **kwargs: Any,
-    ) -> DpathEnhancingExtractor:
-        if decoder:
-            decoder_to_use = decoder
-        else:
-            decoder_to_use = JsonDecoder(parameters={})
-        model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path]
-        return DpathEnhancingExtractor(
-            decoder=decoder_to_use,
-            field_path=model_field_path,
-            config=config,
-            parameters=model.parameters or {},
-        )
+    def _prepare_dpath_extractor_args(
+        self,
+        model: Union[DpathExtractorModel, DpathEnhancingExtractorModel],
+        config: Config,
+        decoder: Optional[Decoder] = None,
+    ) -> Dict[str, Any]:
+        if decoder:
+            decoder_to_use = decoder
+        else:
+            decoder_to_use = JsonDecoder(parameters={})
+        model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path]
+        return {
+            "decoder": decoder_to_use,
+            "field_path": model_field_path,
+            "config": config,
+            "parameters": model.parameters or {},
+        }
+
+    def create_dpath_extractor(
+        self,
+        model: DpathExtractorModel,
+        config: Config,
+        decoder: Optional[Decoder] = None,
+        **kwargs: Any,
+    ) -> DpathExtractor:
+        args = self._prepare_dpath_extractor_args(model, config, decoder)
+        return DpathExtractor(**args)
+
+    def create_dpath_enhancing_extractor(
+        self,
+        model: DpathEnhancingExtractorModel,
+        config: Config,
+        decoder: Optional[Decoder] = None,
+        **kwargs: Any,
+    ) -> DpathEnhancingExtractor:
+        args = self._prepare_dpath_extractor_args(model, config, decoder)
+        return DpathEnhancingExtractor(**args)

1995-1995: Consider adding a docstring to explain the new extractor's purpose.

Adding a brief docstring explaining what the DpathEnhancingExtractor does and how it differs from the regular DpathExtractor would help future developers understand its purpose. Based on the PR description, it seems this extractor adds service fields like $root to the extracted records.

    def create_dpath_enhancing_extractor(
        self,
        model: DpathEnhancingExtractorModel,
        config: Config,
        decoder: Optional[Decoder] = None,
        **kwargs: Any,
    ) -> DpathEnhancingExtractor:
+        """
+        Creates a DpathEnhancingExtractor that extracts records while preserving the root object.
+        
+        This extractor works similarly to DpathExtractor but also adds service fields to each record,
+        particularly a `$root` field containing a reference to the original object from which the
+        record was extracted. This enables more powerful filtering and transformation operations
+        that need access to data beyond the record itself.
+        """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceeea8f and 96e2a59.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
  • airbyte_cdk/sources/declarative/decoders/json_decoder.py (2 hunks)
  • airbyte_cdk/sources/declarative/extractors/record_selector.py (3 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
  • unit_tests/sources/declarative/decoders/test_json_decoder.py (1 hunks)
  • unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (1 hunks)
  • unit_tests/sources/declarative/extractors/test_dpath_extractor.py (4 hunks)
  • unit_tests/sources/declarative/extractors/test_record_selector.py (2 hunks)
  • unit_tests/sources/declarative/retrievers/test_simple_retriever.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • airbyte_cdk/sources/declarative/decoders/json_decoder.py
  • unit_tests/sources/declarative/retrievers/test_simple_retriever.py
  • airbyte_cdk/sources/declarative/extractors/record_selector.py
  • unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py
🔇 Additional comments (14)
unit_tests/sources/declarative/decoders/test_json_decoder.py (6)

15-17: Parameter name change is more descriptive and aligns with implementation.

Renaming first_element to expected_json makes it clearer what the test actually expects from the decoder: the full expected JSON output, not just the first element. A good improvement to test clarity, wdyt?


18-20: Updated test expectations align with the new decoder behavior.

These changes to the expected outputs reflect the core functionality change where the decoder now returns a list of records instead of a single record. This is consistent with the PR objective of making service fields available for filtering and transformation, where each record in the list can maintain its own reference to the original object. The changes correctly handle:

  • Empty object case: [{}] instead of {}
  • Empty array case: [] instead of {}
  • Regular JSON object case: [{"healthcheck": {"status": "ok"}}] instead of the object directly

The test cases now properly validate that the decoder always returns a list of elements.
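The "always a list of records" contract these test cases encode can be sketched as follows; `decode_to_records` is a stand-in built on `json.loads`, not the CDK decoder itself:

```python
import json
from typing import Any, List


def decode_to_records(body: bytes) -> List[Any]:
    """Normalize a JSON response body into a list of records."""
    # An empty body is treated as an empty object, matching the test's
    # irregular-but-documented JSON contract.
    parsed = json.loads(body) if body else {}
    # A top-level array already is the record list; anything else is wrapped.
    return parsed if isinstance(parsed, list) else [parsed]
```

Callers can then iterate uniformly, whether the response was an object, an array, or empty.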


23-23: Function parameter name updated to match the parameter definition.

The function parameter has been updated to use expected_json to be consistent with the parameter definition in the test parametrization. This maintains consistency throughout the test.


26-26: Assertion updated to use the new parameter name.

The assertion correctly references the renamed parameter expected_json instead of the old first_element.


32-33: Added test cases for edge conditions.

The addition of test cases for empty object ("{}", [{}]) and updated empty list handling ("[]", [[]]) improve the test coverage. However, I'm curious about the empty list case - should it really be returning a nested list [[]] instead of just []? This seems different from how the JSON decoder handles empty lists in the test above. Is this intentional behavior for the JSONL parser, or should it be consistent with the JSON decoder?


40-45: Test IDs updated to clearly describe test cases.

The test IDs have been updated to better describe what each test case is checking:

  • "empty_object_response" for the empty object case
  • "empty_list_response" for the empty list case
  • Keeping "one_line_json" and "multi_line_json" for those respective cases

This improves the readability of test output and makes it clearer what functionality is being tested in each case.

unit_tests/sources/declarative/extractors/test_record_selector.py (1)

12-15:

❓ Verification inconclusive

New imports for service keys

I notice you've added imports for SERVICE_KEY_PARENT and SERVICE_KEY_ROOT from the dpath_enhancing_extractor module. These seem related to the service fields mentioned in the PR objectives, but I don't see them being explicitly used in the test cases. Are they used implicitly by the classes being tested, or are they intended for future use? A comment explaining their purpose might be helpful for future developers, wdyt?


🏁 Script executed:

#!/bin/bash
# Description: Check usage of SERVICE_KEY_PARENT and SERVICE_KEY_ROOT in the codebase

echo "Searching for usage of SERVICE_KEY_PARENT and SERVICE_KEY_ROOT in tests:"
rg "SERVICE_KEY_(PARENT|ROOT)" unit_tests/

Length of output: 821


Clarification Needed on Service Key Imports

I noticed that both SERVICE_KEY_PARENT and SERVICE_KEY_ROOT are imported in this test file, much like in test_dpath_enhancing_extractor.py where these keys are combined with SERVICE_KEY_PREFIX. However, their explicit use in test_record_selector.py isn’t immediately clear. Are they intended to be used implicitly by the underlying classes or reserved for upcoming test enhancements? Perhaps adding an inline comment explaining their purpose could improve clarity for future developers. wdyt?

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

1672-1703: New Component Declaration: DpathEnhancingExtractor

You've added a new extractor component that clearly extends the built‐in Dpath extraction functionality to include local service fields ($root and $parent) during transformation. The title, description, required fields, and property definitions are all consistent with existing components (e.g., DpathExtractor).

A couple of questions for you:

  • Would you consider adding some documentation or inline examples regarding the expected types/contents of the $root and $parent service fields? Although these fields are added dynamically and later cleaned, a note here could help future maintainers understand that they are only available during filtering and transformation.
  • Also, do you think it might be beneficial to explicitly flag (or document) that these service fields are protected against unintended transformation (for example, by the FlattenFields transformation) as described in the PR objectives?

Overall, the schema for DpathEnhancingExtractor is well organized and follows the established conventions. wdyt?

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3)

1779-1779: Looks good - union type properly updated.

You've correctly extended the RecordSelector.extractor field's union type to include the new DpathEnhancingExtractor, which means declarative configurations can now use this new extractor type.


703-717: Consider potential memory implications of storing root objects

Per the PR description, the implementation will add a $root service field that holds a reference to the original response object. This could potentially increase memory usage, especially with large API responses.

I see you've noted that the implementation "avoids reusing maps or dictionaries to prevent cyclic structures" which is good. Have you also considered:

  1. Memory impact when processing large batches of records?
  2. Ensuring the service fields are properly cleaned after transformation as mentioned in the PR?
  3. Any edge cases with deeply nested objects?
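On point 2, a minimal sketch of the cleanup step being asked about. The real exclude_service_keys lives in record_extractor.py; the "$" prefix and the copy-not-mutate behavior are assumptions here:

```python
SERVICE_KEY_PREFIX = "$"  # assumption: all service keys share this prefix

def strip_service_fields(record: dict) -> dict:
    """Return a copy of the record without service fields, mirroring what
    exclude_service_keys is expected to do after filter/transform."""
    return {k: v for k, v in record.items() if not k.startswith(SERVICE_KEY_PREFIX)}

record = {"id": 1, "$root": {"id": 1}, "$parent": {}}
clean = strip_service_fields(record)
# clean contains only "id"; the original record keeps its service fields
```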

703-717:

❓ Verification inconclusive

Verify that this schema change has a corresponding implementation

The file we're reviewing is a schema definition, but for this new extractor to work, there should be a corresponding implementation class elsewhere. Can you confirm that you've created the actual implementation of DpathEnhancingExtractor that adds the $root field to records?


🏁 Script executed:

#!/bin/bash
# Find the implementation file for DpathEnhancingExtractor

echo "Searching for DpathEnhancingExtractor implementation files..."
rg -l "class DpathEnhancingExtractor" | grep -v "declarative_component_schema.py"

echo "Checking imports of DpathEnhancingExtractor..."
rg "from .* import .*DpathEnhancingExtractor"

Length of output: 561


Implementation file located and imported: please confirm the $root logic?

I verified that an implementation of DpathEnhancingExtractor exists in airbyte_cdk/sources/declarative/extractors/dpath_enhancing_extractor.py and that it is properly imported in airbyte_cdk/__init__.py.

Could you please confirm that the implementation in this file also includes the logic to add the $root field to records as intended? wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

78-78: Looks like we're adding a new extractor type!

The addition of the DpathEnhancingExtractor import aligns well with the PR objective to add a service field ($root) to each record during extraction.


213-215: The model import looks good!

Properly importing the Pydantic model for the new extractor component.


582-582: Nice registration in the component mapping!

Adding the new extractor model to the PYDANTIC_MODEL_TO_CONSTRUCTOR dictionary ensures the factory will correctly instantiate it when encountered in a configuration.

Comment on lines +19 to +21
from airbyte_cdk.sources.declarative.extractors.record_extractor import (
exclude_service_keys,
)

💡 Verification agent

🧩 Analysis chain

New import added without corresponding usage validation - should we check the functionality?

I see you've added an import for exclude_service_keys, but I don't notice any tests that verify how this function works with the extractor. Based on the PR objectives that mention adding a $root service field, shouldn't there be tests to verify:

  1. That service fields are properly added to records during extraction
  2. That exclude_service_keys correctly removes these service fields when needed

Would it make sense to add assertions that check for the presence of service fields in the extracted records, and then verify they can be properly excluded? wdyt?


🏁 Script executed:

#!/bin/bash
# Check if there are any tests for exclude_service_keys in the codebase
rg --type py "exclude_service_keys" -A 5 -B 5 --no-heading

Length of output: 5973


Action Required: Enhance Tests for Service Field Handling in dpath_extractor

Hey, I noticed that in unit_tests/sources/declarative/extractors/test_dpath_extractor.py there's an import for exclude_service_keys that isn’t directly validated by any tests. We do have tests in test_record_extractor.py confirming that exclude_service_keys removes service fields as expected. However, given the PR’s focus on introducing the $root service field, shouldn’t we also add specific assertions in the dpath extractor tests to ensure that:

  • Service fields are correctly added to records during extraction.
  • exclude_service_keys properly removes these service fields when required.

Would you consider enhancing the tests in this file to cover these points, or do you plan to rely on the existing tests in test_record_extractor.py for this functionality? wdyt?
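The suggested round trip could be sketched roughly like this (the record literal stands in for output of the real extractor, and the "$" prefix convention is an assumption; the exclusion step is inlined instead of calling exclude_service_keys):

```python
# Stand-in for a record produced by DpathEnhancingExtractor
extracted = {"id": 7, "$root": {"id": 7}, "$parent": {}}

# 1. Service fields are present right after extraction
assert "$root" in extracted and "$parent" in extracted

# 2. After exclusion, no service fields remain and the payload is untouched
filtered = {k: v for k, v in extracted.items() if not k.startswith("$")}
assert filtered == {"id": 7}
```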

rpopov added 2 commits March 6, 2025 20:54
* main:
  feat: add formatting for epoch_microseconds (airbytehq#388)
  feat(low-code cdk): remove unnecessary manifest print (airbytehq#390)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (2)
unit_tests/sources/declarative/interpolation/test_macros.py (2)

1-190: ⚠️ Potential issue

Run ruff format to fix linting errors

The pipeline is failing with the following error:

[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.

Please run ruff format to automatically fix these style issues.

🧰 Tools
🪛 GitHub Actions: Linters

[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.


145-145: 💡 Verification agent

🧩 Analysis chain

There's an existing test for timezone handling

I notice there's already a test at line 145 (test_utc_datetime_to_local_timestamp_conversion) that's specifically designed to ensure correct timezone handling. This test could serve as a model for how to handle the timezone issues in the parameterized tests.

Perhaps we could modify the problematic tests to use a similar approach? wdyt?


🏁 Script executed:

#!/bin/bash
# Look for other timezone-related tests or utilities in the codebase
rg -i "timezone" --glob "*.py" . | grep -v test_macros.py

Length of output: 34489


Timezone Handling Consistency in Parameterized Tests
I noticed that our test_utc_datetime_to_local_timestamp_conversion (line 145 in test_macros.py) already demonstrates a solid approach to handling timezones. Given that many parts of the codebase follow similar patterns for timezone awareness, would you consider updating the parameterized tests to adopt this same technique for consistently addressing timezone issues? wdyt?

🧹 Nitpick comments (3)
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)

184-192: Should we enhance test assertions to validate service fields?

The current test function doesn't have any assertions specifically for validating the service fields mentioned in the PR (like $root), nor does it test the exclude_service_keys functionality. Would it make sense to enhance this test to either:

  1. Verify the service keys are present in the raw output before comparison, or
  2. Test that exclude_service_keys correctly removes these service fields?

A quick addition could look something like:

    response = create_response(body)
    actual_records = list(extractor.extract_records(response))
+   
+   # Verify service keys can be properly excluded
+   filtered_records = [exclude_service_keys(record) for record in actual_records]

    assert actual_records == expected_records

What do you think?

unit_tests/sources/declarative/interpolation/test_macros.py (2)

78-78: Helpful comment about timezone dependency, but could we fix the test?

Good catch documenting that this test fails outside of GMT timezone! This is valuable information for other developers. However, instead of just documenting the issue, we might want to make the test more robust.

Perhaps we could:

  1. Use timezone-aware assertions for these timestamp tests
  2. Mock the timezone to ensure consistent results
  3. Make the expected values dynamic based on the local timezone

What do you think about implementing one of these approaches instead? wdyt?


84-84: Same timezone issue as line 78, consider a comprehensive fix

This is the same timezone-dependency issue as the previous test case. Rather than documenting both failures, it would be better to implement a solution that works across all environments.

For the whole test suite, we could add a helper method that handles timezone differences, or patch the local time zone during tests to ensure consistent behavior. This would make the tests more reliable and eliminate the need for these comments.
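One concrete way to pin the zone is sketched below (POSIX-only, since time.tzset is unavailable on Windows; this helper is an illustration, not part of the CDK):

```python
import os
import time
from contextlib import contextmanager

@contextmanager
def pinned_timezone(tz_name: str):
    """Temporarily pin the process-local time zone (POSIX only)."""
    old = os.environ.get("TZ")
    os.environ["TZ"] = tz_name
    time.tzset()
    try:
        yield
    finally:
        if old is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = old
        time.tzset()

# With the zone pinned to UTC, local-time conversions become deterministic:
with pinned_timezone("UTC"):
    ts = time.mktime((2021, 1, 1, 0, 0, 0, 0, 0, 0))  # 2021-01-01T00:00:00Z
```

The same idea could be wrapped as a pytest fixture (e.g., via monkeypatch) so individual tests no longer need comments documenting per-zone failures.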

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96e2a59 and aedeefb.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
  • unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (1 hunks)
  • unit_tests/sources/declarative/extractors/test_dpath_extractor.py (4 hunks)
  • unit_tests/sources/declarative/interpolation/test_macros.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
🧰 Additional context used
🪛 GitHub Actions: Linters
unit_tests/sources/declarative/interpolation/test_macros.py

[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: SDM Docker Image Build
🔇 Additional comments (4)
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)

19-21: Unused import - need to add test verification for service keys?

I see you've imported exclude_service_keys, but I don't notice it being used in the test file. Based on the PR objectives about adding a $root service field, shouldn't we add assertions to verify that service fields are properly handled and can be excluded? This would address the previous review feedback as well.

#!/bin/bash
# Check how exclude_service_keys is used in other test files
rg --type py "exclude_service_keys" -A 2 -B 2 unit_tests/
unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (3)

1-24: Well-organized imports and constants for the new extractor test

Good job setting up the proper imports and constants for testing the new DpathEnhancingExtractor. The organization with clearly defined service key constants makes the test code more readable.


42-302: Comprehensive test cases for service fields

The test cases are thorough and cover a wide range of scenarios, effectively validating that both $root and $parent service fields are correctly added to extracted records in different JSON structures. This comprehensive approach will help ensure robustness of the feature.


33-34:

❓ Verification inconclusive

Unused decoder - is JsonlDecoder needed?

I notice that JsonlDecoder is imported on line 24, but neither a decoder_jsonl instance is created nor is it used in any of the test cases, unlike in the other test file. Is this intentional or should there be tests using this decoder as well?


🏁 Script executed:

#!/bin/bash
# Check if JsonlDecoder is used elsewhere and how
rg --type py "JsonlDecoder" -A 2 -B 2 

Length of output: 7189


Review Note: Unused JsonlDecoder Import in Test File
I noticed that in unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (line 24), the JsonlDecoder is imported but not actually used in any test cases. Since we already cover JsonlDecoder functionality in unit_tests/sources/declarative/decoders/test_decoders_memory_usage.py, should we remove the unused import here to keep the file clean, or was this intended for potential future tests? wdyt?

Comment on lines +329 to +337
    def test_dpath_extractor(field_path: List, decoder: Decoder, body, expected_records: List):
        extractor = DpathEnhancingExtractor(
            field_path=field_path, config=config, decoder=decoder, parameters=parameters
        )

        response = create_response(body)
        actual_records = list(extractor.extract_records(response))

        assert actual_records == expected_records

🛠️ Refactor suggestion

Consider adding tests for exclude_service_keys functionality

The test currently verifies that service fields are correctly added to records, but there's no verification that these service fields can be properly excluded using the exclude_service_keys function. This was mentioned in the previous review comments.

Would it make sense to add additional assertions to verify this functionality? For example:

    response = create_response(body)
    actual_records = list(extractor.extract_records(response))

    assert actual_records == expected_records
+   
+   # Additionally verify service keys can be properly excluded
+   from airbyte_cdk.sources.declarative.extractors.record_extractor import exclude_service_keys
+   filtered_records = [exclude_service_keys(record) for record in actual_records]
+   # Verify no service keys remain in filtered records
+   for record in filtered_records:
+       assert all(not key.startswith(SERVICE_KEY_PREFIX) for key in record.keys())

What do you think?

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    def test_dpath_extractor(field_path: List, decoder: Decoder, body, expected_records: List):
        extractor = DpathEnhancingExtractor(
            field_path=field_path, config=config, decoder=decoder, parameters=parameters
        )

        response = create_response(body)
        actual_records = list(extractor.extract_records(response))

        assert actual_records == expected_records

    def test_dpath_extractor(field_path: List, decoder: Decoder, body, expected_records: List):
        extractor = DpathEnhancingExtractor(
            field_path=field_path, config=config, decoder=decoder, parameters=parameters
        )

        response = create_response(body)
        actual_records = list(extractor.extract_records(response))

        assert actual_records == expected_records

        # Additionally verify service keys can be properly excluded
        from airbyte_cdk.sources.declarative.extractors.record_extractor import exclude_service_keys

        filtered_records = [exclude_service_keys(record) for record in actual_records]
        # Verify no service keys remain in filtered records
        for record in filtered_records:
            assert all(not key.startswith(SERVICE_KEY_PREFIX) for key in record.keys())
