feat: In each record to filter and transform, publish a local service field holding the original object the record is extracted from #214
base: main
Conversation
At the record extraction step, add to each record the service field `$root`, holding a reference to:
* the root response object, when parsing JSON format
* the original record the processed record is extracted from, when parsing JSONL format

More service fields could be added in the future. The service fields are available in the record's filtering and transform steps.

Avoid:
* reusing the maps/dictionaries produced, thus avoiding building cyclic structures
* transforming the service fields in the Flatten transformation

Explicitly clean up the service field(s) after the transform step, thus making them:
* local to the filter and transform steps
* not visible to the subsequent mapping and store steps (as they should be)
* not visible in the tests beyond test_record_selector (as they should be)

This allows the record transformation logic to define its own "local variables" to reuse interim calculations.

The contract of body parsing is irregular in representing the cases of bad JSON, no JSON, and empty JSON. It cannot be unified, as that irregularity is already relied upon.

Update the development environment setup documentation:
* to organize and present the setup steps explicitly
* to avoid misunderstandings and wasted effort

Update CONTRIBUTING.md to:
* collect and organize the knowledge on running the tests locally
* state the actual testing steps
* clarify and make the procedures and steps explicit

The unit, integration, and acceptance tests in this exact version succeed under Fedora 41, while one of them fails under Oracle Linux 8.7; this is not related to the contents of this PR. The integration tests of the CDK fail due to a missing `secrets/config.json` file for the Shopify source. See airbytehq#197
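A minimal sketch of that lifecycle (illustrative only, not the CDK implementation; the `$` prefix and the helper names are assumptions based on this description):

```python
# Sketch of the service-field lifecycle described above; "$" as the
# service-key prefix and these helper names are assumptions.
SERVICE_KEY_PREFIX = "$"

def add_root_field(record: dict, root: dict) -> dict:
    # Copy the record instead of mutating it, so no cyclic structure
    # is built between the record and the root response.
    enhanced = dict(record)
    enhanced[SERVICE_KEY_PREFIX + "root"] = root
    return enhanced

def remove_service_fields(record: dict) -> dict:
    # Cleanup after the transform step: drop every prefixed key so the
    # mapping/store steps (and most tests) never see service fields.
    return {k: v for k, v in record.items() if not k.startswith(SERVICE_KEY_PREFIX)}

response = {"status": "ok", "items": [{"id": 1}, {"id": 2}]}
records = [add_root_field(r, response) for r in response["items"]]
assert records[0]["$root"]["status"] == "ok"   # visible during filter/transform
stored = [remove_service_fields(r) for r in records]
assert stored == [{"id": 1}, {"id": 2}]        # invisible afterwards
```

The copy in `add_root_field` is what keeps the structure acyclic: the enhanced record points at the response, but the response still holds the original, un-enhanced record.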
📝 Walkthrough

This pull request introduces a variety of updates across the Airbyte CDK. The JSON decoding logic has been improved with enhanced error logging and simplified JSON body parsing. Several extractor modules have been extended: DpathExtractor now features new methods for updating the response body and individual records, and service key management functions were added. A new extractor, DpathEnhancingExtractor, was also introduced.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Extractor as DpathEnhancingExtractor
    participant Selector as RecordSelector
    Client->>Extractor: Receive HTTP response
    Extractor->>Extractor: update_body() (set parent context recursively)
    Extractor->>Extractor: update_record() (add root service field)
    Extractor-->>Client: Yield enhanced record
    Client->>Selector: Submit record for filtering and normalization
    Selector->>Selector: remove_service_keys() from record
    Selector-->>Client: Return normalized record
```
Actionable comments posted: 0
🧹 Nitpick comments (17)
airbyte_cdk/sources/declarative/extractors/record_extractor.py (3)
10-14: Small typo in the documentation
There's a typo: "transormation" -> "transformation". Otherwise, the documentation clearly explains the convention and lifecycle of service fields.

```diff
-- The service fields are kept only during the record's filtering and transormation.
++ The service fields are kept only during the record's filtering and transformation.
```
21-27: Consider removing type ignore and adding return type annotation
The `# type: ignore[no-untyped-def]` seems unnecessary here. What do you think about adding a return type annotation of `None` since the function modifies the dict in place? wdyt?

```diff
-def remove_service_keys(mapping: dict[str, Any]):  # type: ignore[no-untyped-def]
+def remove_service_keys(mapping: dict[str, Any]) -> None:
```
34-35: Consider removing type ignore and improving error message
Similar to above, the type ignore seems unnecessary. Also, would it be helpful to include the actual mapping in the assertion message to aid debugging? wdyt?

```diff
-def assert_service_keys_exist(mapping: Mapping[str, Any]):  # type: ignore[no-untyped-def]
-    assert mapping != exclude_service_keys(mapping), "The mapping should contain service keys"
+def assert_service_keys_exist(mapping: Mapping[str, Any]) -> None:
+    assert mapping != exclude_service_keys(mapping), f"The mapping should contain service keys, got: {mapping}"
```

airbyte_cdk/sources/declarative/transformations/flatten_fields.py (1)
36-44: Consider adjusting indentation for better readability
The logic looks good! The service keys are correctly preserved. Would you consider adjusting the indentation of the else clause to match the if condition for better readability? wdyt?

```diff
 if not is_service_key(current_key):
     new_key = (
         f"{parent_key}.{current_key}"
         if (current_key in transformed_record or force_with_parent_name)
         else current_key
     )
     stack.append((value, new_key))
-else:  # transfer the service fields without change
-    transformed_record[current_key] = value
+else:
+    # transfer the service fields without change
+    transformed_record[current_key] = value
```

unit_tests/sources/declarative/transformations/test_flatten_fields.py (1)
50-67: Consider adding test cases for nested service keys
The test coverage for service keys looks good! Would you consider adding a test case for nested objects containing service keys to ensure they're handled correctly at any depth? For example:

```python
(
    {
        SERVICE_KEY_PREFIX + "root": {"nested": {SERVICE_KEY_PREFIX + "child": "value"}},
        "regular": "field",
    },
    {
        SERVICE_KEY_PREFIX + "root": {"nested": {SERVICE_KEY_PREFIX + "child": "value"}},
        "regular": "field",
    },
)
```

unit_tests/sources/declarative/extractors/test_record_extractor.py (2)
62-67: Consider catching specific assertion error
The test looks good, but would it be safer to catch the specific AssertionError instead of using a bare except? This would prevent masking other unexpected exceptions. wdyt?

```diff
 try:
     assert_service_keys_exist(original)
     success = False
-except:  # OK, expected
+except AssertionError:  # OK, expected
     success = True
```
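To illustrate why the narrow `except AssertionError` matters, here is a small standalone sketch (helper names are hypothetical): a bare `except` silently swallows unrelated bugs such as a `TypeError`, while the specific clause only treats failed expectations as expected.

```python
def expect_assertion_error(fn) -> bool:
    # Narrow handling: only a failed expectation counts as "expected".
    try:
        fn()
    except AssertionError:
        return True
    return False

def swallowed_by_bare_except(fn) -> bool:
    # The anti-pattern under discussion: masks every exception type.
    try:
        fn()
    except:  # noqa: E722
        return True
    return False

def failing_expectation():
    assert False, "mapping contains no service keys"

def unrelated_bug():
    raise TypeError("a real bug, not a failed expectation")

assert expect_assertion_error(failing_expectation) is True
assert expect_assertion_error(lambda: None) is False
assert swallowed_by_bare_except(unrelated_bug) is True  # the bug is hidden!
```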
71-76: Consider adding edge cases for service key detection
The basic test cases look good! Would you consider adding some edge cases to ensure robust handling of special characters and edge cases? For example:

```python
def test_service_key_edge_cases():
    assert is_service_key(SERVICE_KEY_PREFIX)  # just the prefix
    assert is_service_key(SERVICE_KEY_PREFIX + "$nested")  # double prefix
    assert not is_service_key("prefix" + SERVICE_KEY_PREFIX)  # prefix in middle
```

airbyte_cdk/sources/declarative/decoders/json_decoder.py (1)
38-41: Consider making stack trace logging configurable?
The debug logging with `stack_info=True` provides great debugging context but might be verbose in production. What do you think about making it configurable through a parameter, wdyt?

```diff
-        logger.debug("Response to parse: %s", response.text, exc_info=True, stack_info=True)
+        logger.debug(
+            "Response to parse: %s",
+            response.text,
+            exc_info=True,
+            stack_info=self.parameters.get("debug_stack_trace", False),
+        )
```

airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (2)
23-29: Consider using dict constructor for a cleaner copy?
The implementation safely handles both dict and non-dict cases. For the dict copy, what do you think about using the dict constructor for a slightly cleaner approach, wdyt?

```diff
-    copy = {k: v for k, v in record.items()}
+    copy = dict(record)
```
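For context, `dict(record)` and the comprehension produce equivalent shallow copies; a quick illustrative check:

```python
record = {"id": 1, "nested": {"a": 2}}

comprehension_copy = {k: v for k, v in record.items()}
constructor_copy = dict(record)

# Equal content, and both are shallow: nested objects are shared.
assert comprehension_copy == constructor_copy == record
assert constructor_copy["nested"] is record["nested"]

# Top-level mutation of the copy leaves the original untouched.
constructor_copy["extra"] = True
assert "extra" not in record
```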
92-109: Consider consolidating the yield statements?
The implementation handles all cases well. What do you think about consolidating the yield statements for better maintainability, wdyt?

```diff
-        if isinstance(extracted, list):
-            for record in extracted:
-                yield update_record(record, root_response)
-        elif isinstance(extracted, dict):
-            yield update_record(extracted, root_response)
-        elif extracted:
-            yield extracted
-        else:
-            yield from []
+        if not extracted:
+            return
+        records = [extracted] if isinstance(extracted, dict) else extracted if isinstance(extracted, list) else [extracted]
+        yield from (update_record(record, root_response) for record in records)
```

airbyte_cdk/sources/declarative/extractors/record_selector.py (1)
112-113: Consider adding a docstring to explain the service fields removal step?
The addition of service fields removal before normalization is a significant change in the data processing pipeline. Would you consider adding a brief docstring to explain why this step is necessary here? wdyt?
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)
196-201: Consider extracting the service key validation into a helper function?
The service key validation logic is repeated across test cases. Would it make sense to extract this into a helper function to improve maintainability? wdyt?

```diff
+def validate_service_keys(records):
+    for record in records:
+        if record != {}:
+            assert_service_keys_exist(record)
+    return [exclude_service_keys(record) for record in records]
+
 def test_dpath_extractor(field_path: List, decoder: Decoder, body, expected_records: List):
     # ...
-    for record in actual_records:
-        if record != {}:
-            assert_service_keys_exist(record)
-
-    actual_records = [exclude_service_keys(record) for record in actual_records]
+    actual_records = validate_service_keys(actual_records)
```

unit_tests/sources/declarative/extractors/test_record_selector.py (1)
169-171: Consider consistent whitespace around service key operations?
The whitespace around service key operations differs between test methods. Would you like to make it consistent? For example, always having one blank line before and after? wdyt?
Also applies to: 244-245
unit_tests/sources/declarative/partition_routers/test_parent_state_stream.py (1)
325-325: Nice addition of deduplication logic! 👍
The deduplication using dictionary comprehension with orjson.dumps as keys is elegant. Would you consider extracting this into a helper function since it's used twice in the test? Something like `_dedupe_records(records)`, wdyt?

```diff
+def _dedupe_records(records):
+    return list({orjson.dumps(record): record for record in records}.values())

 # Usage:
-cumulative_records_state_deduped = list(
-    {orjson.dumps(record): record for record in cumulative_records_state}.values()
-)
+cumulative_records_state_deduped = _dedupe_records(cumulative_records_state)
```

docs/CONTRIBUTING.md (2)
97-115: Comprehensive test execution instructions! 🎯
The detailed breakdown of test commands with explanations is very helpful. Consider adding a note about expected test execution times for the different commands to help developers plan their workflow, wdyt?
🧰 Tools
🪛 LanguageTool
[style] ~101-~101: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Context: ... unit tests. - `poetry run pytest-fast` to run the subset of PyTest tests, which a...
[style] ~108-~108: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Context: ...ting issues. - `poetry run ruff format` to format your Python code. ### Run Code ...
167-167: Minor: Missing comma in the sentence

```diff
-On the **Settings / Domains** page find the subdomain
+On the **Settings / Domains** page, find the subdomain
```

🧰 Tools
🪛 LanguageTool
[uncategorized] ~167-~167: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...file. - On the **Settings / Domains** page find the subdomain of myshopify.com a...
docs/RELEASES.md (1)
60-61: Clear instructions for testing pre-release versions
The instructions for testing pre-release versions are well-structured. Consider adding a note about the expected time for the publishing pipeline to complete, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `package-lock.json` is excluded by `!**/package-lock.json`
- `poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (17)
- `airbyte_cdk/sources/declarative/decoders/json_decoder.py` (1 hunks)
- `airbyte_cdk/sources/declarative/extractors/dpath_extractor.py` (2 hunks)
- `airbyte_cdk/sources/declarative/extractors/record_extractor.py` (1 hunks)
- `airbyte_cdk/sources/declarative/extractors/record_selector.py` (3 hunks)
- `airbyte_cdk/sources/declarative/requesters/paginators/strategies/cursor_pagination_strategy.py` (1 hunks)
- `airbyte_cdk/sources/declarative/transformations/flatten_fields.py` (2 hunks)
- `docs/CONTRIBUTING.md` (5 hunks)
- `docs/RELEASES.md` (2 hunks)
- `package.json` (1 hunks)
- `unit_tests/sources/declarative/decoders/test_json_decoder.py` (1 hunks)
- `unit_tests/sources/declarative/extractors/test_dpath_extractor.py` (5 hunks)
- `unit_tests/sources/declarative/extractors/test_record_extractor.py` (1 hunks)
- `unit_tests/sources/declarative/extractors/test_record_selector.py` (5 hunks)
- `unit_tests/sources/declarative/partition_routers/test_parent_state_stream.py` (1 hunks)
- `unit_tests/sources/declarative/retrievers/test_simple_retriever.py` (2 hunks)
- `unit_tests/sources/declarative/test_manifest_declarative_source.py` (2 hunks)
- `unit_tests/sources/declarative/transformations/test_flatten_fields.py` (2 hunks)
✅ Files skipped from review due to trivial changes (2)
- package.json
- unit_tests/sources/declarative/test_manifest_declarative_source.py
🧰 Additional context used
🪛 LanguageTool
docs/CONTRIBUTING.md
[style] ~269-~269: Consider a more expressive alternative. (DO_ACHIEVE)
Context: ...or, you may need to regenerate them. To do that, you can run: poetry run ...
[locale-violation] ~287-~287: In American English, “take a look” is more commonly used. (HAVE_A_LOOK)
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....
docs/RELEASES.md
[grammar] ~68-~68: Did you mean the noun “publishing”? (PREPOSITION_VERB)
Context: ...ng Low-Code Python connectors Once the publish pipeline has completed, set the version...
🪛 Markdownlint (0.37.0)
docs/CONTRIBUTING.md
140-140: Emphasis used instead of a heading (MD036, no-emphasis-as-heading)
239-239: Emphasis used instead of a heading (MD036, no-emphasis-as-heading)
86-86: Fenced code blocks should have a language specified (MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (14)
airbyte_cdk/sources/declarative/decoders/json_decoder.py (1)
47-50: Nice simplification! 👍
The use of `yield from` makes the code more elegant and Pythonic. Great improvement!
airbyte_cdk/sources/declarative/requesters/paginators/strategies/cursor_pagination_strategy.py (1)
73-73: Good defensive programming! 👍
Adding a default empty dict to `next()` prevents `StopIteration` exceptions when the decoder yields no items. This aligns well with the JsonDecoder changes.
airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (2)
12-20: Clean implementation of service field! 👍
Good use of the service key prefix convention for the root field. The implementation aligns well with the PR objectives.
88-90: Clear empty case handling! 👍
The explicit empty dict check and the explanatory comment make the contract compliance clear.
unit_tests/sources/declarative/decoders/test_json_decoder.py (2)
16-27: Test cases align well with the new behavior! 👍
The parameter rename to `expected_json` and updated test cases clearly reflect the decoder's behavior.
34-48: Great test coverage! 👍
The additional test cases and descriptive IDs improve the test suite's comprehensiveness and readability.
airbyte_cdk/sources/declarative/extractors/record_selector.py (2)
11-14: LGTM! Clean import organization.
The imports are well-organized and the new `exclude_service_keys` import aligns with the PR's objective of managing service fields.
162-167: LGTM! Clean implementation of service keys removal.
The `_remove_service_keys` method is simple, focused, and effectively uses the imported `exclude_service_keys` function. The implementation aligns well with the PR's objective of making service fields local to filtering and transformation.
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (2)
18-21: LGTM! Well-organized imports.
The new imports for service key management functions are properly organized.
40-94: Great test coverage for edge cases!
The test cases thoroughly cover various scenarios including empty strings, objects, arrays, and nested structures. The parameterization is well-organized and the test data is clear.
unit_tests/sources/declarative/extractors/test_record_selector.py (2)
12-19: LGTM! Clean import organization.
The imports are well-organized and properly grouped.
50-79: Great test coverage for nested records!
The test cases thoroughly cover scenarios with nested records and properly validate the behavior with and without filters.
unit_tests/sources/declarative/retrievers/test_simple_retriever.py (1)
247-247: LGTM! Clear state management assertions.
The assertions properly verify that the state is updated to indicate sync completion after reading records.
Also applies to: 362-362
docs/CONTRIBUTING.md (1)
17-23: Great addition of Python version requirements!
The explicit Python version requirements and Fedora-specific installation instructions make it easier for contributors to get started.
Actionable comments posted: 0
🧹 Nitpick comments (7)
airbyte_cdk/sources/declarative/extractors/record_extractor.py (4)
10-14: Documentation looks great! Would you consider adding an example? 🤔
The convention is well-documented. To make it even more helpful for developers, what do you think about adding a simple example of a service field, like `$root`? wdyt?
21-27: The implementation looks solid! Quick question about typing 🤔
The `# type: ignore[no-untyped-def]` comment seems unnecessary since the function has type hints. Could we remove it? wdyt?
30-31: How about using str.startswith() for better readability? 🤔

```diff
-    return key.find(SERVICE_KEY_PREFIX) == 0
+    return key.startswith(SERVICE_KEY_PREFIX)
```

Both work the same way, but `startswith()` might be more intuitive. What do you think?
34-35: The assertion looks good! A couple of suggestions to consider 🤔
- The `# type: ignore[no-untyped-def]` seems unnecessary since we have type hints
- Would you like to make the assertion message more descriptive? Maybe something like:

```python
assert mapping != exclude_service_keys(mapping), f"The mapping should contain at least one service key with prefix '{SERVICE_KEY_PREFIX}'"
```

What do you think?
airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (3)
23-29: The implementation looks solid! A few enhancement ideas 🤔
- Would you like to add a docstring explaining the function's purpose and return type?
- We could narrow the return type using Union:

```diff
-def update_record(record: Any, root: Any) -> Any:
+def update_record(record: Any, root: Any) -> Union[dict[str, Any], Any]:
```

What are your thoughts on these suggestions?
88-91: The empty response handling looks good! One suggestion for clarity 🤔
What do you think about making the contract more explicit in the comment? Maybe:

```diff
-    # An empty/invalid JSON parsed, keep the contract
+    # Return empty dict for empty/invalid JSON to maintain the contract that we always yield a mapping
```
92-109: The extraction logic looks solid! A couple of suggestions to consider 🤔
- Would you like to add type hints for `root_response` and `extracted`?
- The `elif extracted:` condition on line 106 might be more explicit as `elif not isinstance(extracted, (list, dict)) and extracted:` to clearly show the intent. wdyt?
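The branching under discussion can be exercised standalone; this sketch mirrors the dict/list/scalar cases (with the record-updating step left out):

```python
def normalize_extracted(extracted):
    # Empty -> no records; dict -> one record; list -> many;
    # any other truthy value -> a single scalar record.
    if not extracted:
        return []
    if isinstance(extracted, dict):
        return [extracted]
    if isinstance(extracted, list):
        return list(extracted)
    return [extracted]

assert normalize_extracted({}) == []
assert normalize_extracted({"a": 1}) == [{"a": 1}]
assert normalize_extracted([{"a": 1}, {"b": 2}]) == [{"a": 1}, {"b": 2}]
assert normalize_extracted("scalar") == ["scalar"]
```

The scalar branch is exactly the one the suggested `elif not isinstance(extracted, (list, dict)) and extracted:` rewrite makes explicit.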
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- `poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- `airbyte_cdk/sources/declarative/extractors/dpath_extractor.py` (2 hunks)
- `airbyte_cdk/sources/declarative/extractors/record_extractor.py` (1 hunks)
- `airbyte_cdk/sources/declarative/transformations/flatten_fields.py` (2 hunks)
- `unit_tests/sources/declarative/transformations/test_flatten_fields.py` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- unit_tests/sources/declarative/transformations/test_flatten_fields.py
- airbyte_cdk/sources/declarative/transformations/flatten_fields.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/extractors/record_extractor.py (1)
17-18: Clean and efficient implementation! 👍
Nice use of dictionary comprehension and type hints.
airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (1)
12-20: Clean imports and clear constant definition! 👍
Good practice using the SERVICE_KEY_PREFIX to construct RECORD_ROOT_KEY.
Issue #197 is still a problem; it blocks the build.
Created a DpathEnhancingExtractor. Refactored the record enhancement logic, moving it to the extracted class. Split the tests of DpathExtractor and DpathEnhancingExtractor.

Fixed the failing tests:
- unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_custom_components[test_create_custom_component_with_subcomponent_that_uses_parameters]
- unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_custom_components_do_not_contain_extra_fields
- unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_parse_custom_component_fields_if_subcomponent
- unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_page_increment
- unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_offset_increment
- unit_tests/sources/file_based/test_file_based_scenarios.py::test_file_based_read[simple_unstructured_scenario]
- unit_tests/sources/file_based/test_file_based_scenarios.py::test_file_based_read[no_file_extension_unstructured_scenario]

They failed because of comparing string and int values of the page_size (public) attribute. Imposed an invariant: on construction, page_size can be set to a string or an int; keep only values of one type in page_size for uniform comparison (converting values of the other type). _page_size holds the internal working value, unless manipulated directly.
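The page_size invariant described above might look like this sketch (the class name and attribute handling are illustrative only; the real components also accept interpolated string values, which this sketch ignores):

```python
class PageSizeHolder:
    # Illustrative only: enforce one public type (str) for page_size so
    # comparisons never mix str and int, while _page_size keeps the
    # internal working (numeric) value.
    def __init__(self, page_size):
        if page_size is not None and not isinstance(page_size, (int, str)):
            raise ValueError(f"page_size must be int or str, got {type(page_size)}")
        self.page_size = str(page_size) if isinstance(page_size, int) else page_size
        self._page_size = int(page_size) if page_size is not None else None

assert PageSizeHolder(10).page_size == "10"    # int normalized to str
assert PageSizeHolder("25").page_size == "25"  # str kept as-is
assert PageSizeHolder("25")._page_size == 25   # internal numeric value
assert PageSizeHolder(None).page_size is None
```

With both construction paths landing on the same public type, equality checks in tests compare like with like.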
Merged:
- feat(low-code concurrent): Allow async job low-code streams that are incremental to be run by the concurrent framework (airbytehq#228)
- fix(low-code): Fix declarative low-code state migration in SubstreamPartitionRouter (airbytehq#267)
- feat: combine slash command jobs into single job steps (airbytehq#266)
- feat(low-code): add items and property mappings to dynamic schemas (airbytehq#256)
- feat: add help response for unrecognized slash commands (airbytehq#264)
- ci: post direct links to html connector test reports (airbytehq#252) (airbytehq#263)
- fix(low-code): Fix legacy state migration in SubstreamPartitionRouter (airbytehq#261)
- fix(airbyte-cdk): Fix RequestOptionsProvider for PerPartitionWithGlobalCursor (airbytehq#254)
- feat(low-code): add profile assertion flow to oauth authenticator component (airbytehq#236)
- feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor (airbytehq#111)
- fix: don't mypy unit_tests (airbytehq#241)
- fix: handle backoff_strategies in CompositeErrorHandler (airbytehq#225)
- feat(concurrent cursor): attempt at clamping datetime (airbytehq#234)
- ci: use `ubuntu-24.04` explicitly (resolves CI warnings) (airbytehq#244)
- Fix(sdm): module ref issue in python components import (airbytehq#243)
- feat(low-code): add DpathFlattenFields (airbytehq#227)
- feat(source-declarative-manifest): add support for custom Python components from dynamic text input (airbytehq#174)
- chore(deps): bump avro from 1.11.3 to 1.12.0 (airbytehq#133)
- docs: comments on what the `Dockerfile` is for (airbytehq#240)
- chore: move ruff configuration to dedicated ruff.toml file (airbytehq#237)
Trigger a new build. Hopefully, the integration test infrastructure is fixed.
Actionable comments posted: 0
🧹 Nitpick comments (11)
airbyte_cdk/sources/declarative/extractors/dpath_enhancing_extractor.py (2)
72-108: Performance caution when reshaping large JSON objects.
Recursively constructing nested dictionaries and adding the `parent` service key could be computationally expensive for very large or deeply nested JSON objects. Perhaps we could optionally disable this recursive parent assignment for known large data sets, wdyt?
135-142: Collision avoidance with existing 'root' fields.
`update_record` sets a service key named `"root"`. If the record already contains a key named `root`, this might overwrite it or create confusion. Maybe prefix it with a special character or handle collisions differently, wdyt?
airbyte_cdk/sources/declarative/extractors/__init__.py (1)
21-21: Confirm `__all__` alignment with project naming conventions.
`DpathEnhancingExtractor` is appended to `__all__`. This is consistent, though sometimes teams prefer alphabetical order. Would you like to reorder these entries to maintain a standard format, wdyt?
airbyte_cdk/sources/declarative/requesters/paginators/strategies/page_increment.py (1)
34-42: Simplify page_size initialization further?
The updated logic is easier to follow, but the flow around lines 34–42 could be extracted into a helper method for clarity. That might remove the nested `if`s and keep it tidy, wdyt?
unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (1)
43-340: Great test coverage! A few suggestions to consider.
The test suite is comprehensive and well-structured. Really nice job on covering edge cases! 👍
A few minor suggestions to consider:
- Would it be helpful to add docstrings to explain the purpose of complex test cases?
- What about adding some negative test cases to verify error handling?
- Consider grouping related test cases using pytest's `@pytest.mark.group` decorator for better organization

What are your thoughts on these suggestions?
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
1824-1842: The implementation looks good, but should we add some validation?
The implementation follows the pattern of other extractors, but would it be helpful to add validation for the field path to ensure it's not empty? wdyt?

```diff
 def create_dpath_enhancing_extractor(
     self,
     model: DpathEnhancingExtractorModel,
     config: Config,
     decoder: Optional[Decoder] = None,
     **kwargs: Any,
 ) -> DpathEnhancingExtractor:
+    if not model.field_path:
+        raise ValueError("field_path cannot be empty")
     if decoder:
         decoder_to_use = decoder
     else:
         decoder_to_use = JsonDecoder(parameters={})
```

airbyte_cdk/sources/declarative/extractors/record_extractor.py (1)
29-40: The recursive implementation looks good, but should we add a depth limit?
The `exclude_service_keys` function handles nested structures well, but would it be worth adding a max depth parameter to prevent potential stack overflow with deeply nested structures? wdyt?

```diff
-def exclude_service_keys(struct: Any) -> Any:
+def exclude_service_keys(struct: Any, max_depth: int = 100) -> Any:
     """
     :param struct: any object/JSON structure
+    :param max_depth: maximum recursion depth to prevent stack overflow
     :return: a copy of struct without any service fields at any level of nesting
     """
+    if max_depth <= 0:
+        return struct
     if isinstance(struct, dict):
-        return {k: exclude_service_keys(v) for k, v in struct.items() if not is_service_key(k)}
+        return {k: exclude_service_keys(v, max_depth - 1) for k, v in struct.items() if not is_service_key(k)}
     elif isinstance(struct, list):
-        return [exclude_service_keys(v) for v in struct]
+        return [exclude_service_keys(v, max_depth - 1) for v in struct]
     else:
         return struct
```

airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (2)
73-75: Consider adding a docstring to explain the empty body contract?
The empty body handling looks good, but adding a docstring might help future maintainers understand why we need to maintain this contract, wdyt?
98-113
: The template methods look great! Consider adding @AbstractMethod?The update_body and update_record template methods are well-documented. Since they're meant to be overridden by subclasses, would you consider marking them with @AbstractMethod to make the contract more explicit, wdyt?
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)
44-93
: Comprehensive nested data test coverage!The test cases thoroughly cover various nested data structures and extraction patterns. Would you consider adding a test case for very deep nesting to verify any potential recursion limits, wdyt?
airbyte_cdk/sources/declarative/requesters/paginators/strategies/offset_increment.py (1)
55-58
: Consider adding type validation for page_size conversion?The current implementation converts integers to strings without validation. We could add a type check to ensure page_size is either an int or str before conversion, wdyt?
```diff
-        self.page_size = str(self.page_size) if isinstance(self.page_size, int) else self.page_size
+        if not isinstance(self.page_size, (int, str, type(None))):
+            raise ValueError(f"page_size must be an integer or string, got {type(self.page_size)}")
+        self.page_size = str(self.page_size) if isinstance(self.page_size, int) else self.page_size
```
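As a quick illustration of the suggested check (a hedged sketch, not the CDK's actual code; the helper name is invented):

```python
def normalize_page_size(page_size):
    # Sketch of the suggested validation: reject unsupported types early,
    # then keep the existing int-to-str normalization behavior.
    if not isinstance(page_size, (int, str, type(None))):
        raise ValueError(f"page_size must be an integer or string, got {type(page_size)}")
    return str(page_size) if isinstance(page_size, int) else page_size

print(normalize_page_size(25))    # prints: 25 (returned as the string "25")
print(normalize_page_size(None))  # prints: None

try:
    normalize_page_size(2.5)
except ValueError as exc:
    print(exc)  # floats are rejected instead of being silently stringified
```

Without the check, a float or list would be passed through unchanged and fail later at request time, which is harder to diagnose.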
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- airbyte_cdk/__init__.py (2 hunks)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
- airbyte_cdk/sources/declarative/extractors/__init__.py (2 hunks)
- airbyte_cdk/sources/declarative/extractors/dpath_enhancing_extractor.py (1 hunks)
- airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (1 hunks)
- airbyte_cdk/sources/declarative/extractors/record_extractor.py (1 hunks)
- airbyte_cdk/sources/declarative/extractors/record_selector.py (2 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
- airbyte_cdk/sources/declarative/parsers/manifest_component_transformer.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
- airbyte_cdk/sources/declarative/requesters/paginators/strategies/offset_increment.py (1 hunks)
- airbyte_cdk/sources/declarative/requesters/paginators/strategies/page_increment.py (1 hunks)
- airbyte_cdk/sources/declarative/transformations/flatten_fields.py (2 hunks)
- docs/CONTRIBUTING.md (5 hunks)
- unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (1 hunks)
- unit_tests/sources/declarative/extractors/test_dpath_extractor.py (4 hunks)
- unit_tests/sources/declarative/extractors/test_record_extractor.py (1 hunks)
- unit_tests/sources/declarative/extractors/test_record_selector.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- airbyte_cdk/sources/declarative/transformations/flatten_fields.py
- airbyte_cdk/sources/declarative/extractors/record_selector.py
- docs/CONTRIBUTING.md
- unit_tests/sources/declarative/extractors/test_record_extractor.py
- unit_tests/sources/declarative/extractors/test_record_selector.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (17)
airbyte_cdk/sources/declarative/extractors/dpath_enhancing_extractor.py (2)
66-71
: Looks clean and straightforward!
The `__post_init__` method delegates seamlessly to the parent class, ensuring necessary initialization logic is still applied.

109-134
: Double-check handling of `None` and empty values.
Currently, `_set_parent` filters out values if they are dictionaries or lists (lines 124-125). If `v` can be zero, empty string, or `False`, are we confident none of those cases should be filtered out? Might we risk losing relevant boolean or numeric fields, wdyt?

airbyte_cdk/sources/declarative/extractors/__init__.py (1)
5-7
: Importing the new extractor looks good!
Having `DpathEnhancingExtractor` in a separate file helps keep the code organization clear.

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1674-1674
: LGTM! The new extractor type is correctly integrated.
The addition of `DpathEnhancingExtractor` to the `extractor` union type in `RecordSelector` is clean and follows the existing pattern.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
80-80
: LGTM! Clean import addition.
The import is correctly placed with other extractor imports.

562-562
: LGTM! Mapping is correctly added.
The new extractor is properly mapped to its creation method in the factory's initialization.
airbyte_cdk/sources/declarative/extractors/record_extractor.py (4)
10-15
: LGTM! Clear convention documentation.
The convention for service fields is well-documented and the prefix constant is appropriately defined.

17-27
: LGTM! Clean utility function implementation.
The `add_service_key` function is well-documented and follows the immutable pattern by returning a new dict.

42-44
: LGTM! Simple and focused implementation.
The `is_service_key` function is clean and follows the single responsibility principle.

46-49
: LGTM! Generator pattern is well-utilized.
The `remove_service_keys` function efficiently uses the generator pattern to process records one at a time.

airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (1)
77-78
: LGTM! Smart way to preserve the root response.
Storing the original response before modification and passing it through the update chain enables powerful contextual transformations.
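To illustrate the intended contract, here is a hedged sketch under assumed names: `$root` per the PR description, a `data` field path, and plain dicts standing in for the CDK's decoder and extractor classes:

```python
import json

def extract_records_with_root(body: bytes, field_path=("data",)):
    # Each line of a JSONL body is parsed independently; records extracted
    # from a line get a "$root" service field referencing that line's object.
    for line in body.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        records = obj
        for key in field_path:
            records = records.get(key, []) if isinstance(records, dict) else []
        if not isinstance(records, list):
            records = [records]
        for record in records:
            # Copy the record instead of mutating it, so "$root" does not
            # introduce a cycle inside the original parsed structure.
            yield {**record, "$root": obj}

body = b'{"data": [{"id": 1}, {"id": 2}]}\n{"data": [{"id": 3}]}\n'
ids = [r["id"] for r in extract_records_with_root(body)]
print(ids)  # [1, 2, 3]
```

Copying each record before attaching `$root` matches the PR's stated goal of avoiding cyclic structures in the produced maps.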
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (2)
39-41
: Great test coverage for edge cases!
The empty input test cases effectively verify the contract compliance for empty string, object, and array inputs.

125-128
: LGTM! Clear JSONL format test cases.
The JSONL-specific test cases effectively verify the handling of array records and line-based parsing.
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)
1458-1488
: Well-structured component definition!
The DpathEnhancingExtractor component is well-documented and maintains consistency with other extractors. The service fields ($root, $parent) are clearly explained.

2806-2806
: LGTM! Clean integration with RecordSelector.
The addition of DpathEnhancingExtractor to the RecordSelector's extractor property enables seamless integration.
airbyte_cdk/sources/declarative/parsers/manifest_component_transformer.py (1)
50-51
: LGTM! Clean addition of DpathEnhancingExtractor mapping
The new entry follows the established pattern and correctly maps the decoder to JsonDecoder.
airbyte_cdk/__init__.py (1)
95-95
: LGTM! Clean addition of DpathEnhancingExtractor to public API
The new extractor is properly exposed in both the imports and the `__all__` list, following the module's conventions.
Also applies to: 237-237
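Taken together, the service-field helpers reviewed in record_extractor.py can be sketched as follows. This is a hedged reconstruction from the comments above, not the actual CDK source; the `$` prefix and the exact signatures are assumptions:

```python
SERVICE_KEY_PREFIX = "$"  # assumed prefix convention for service fields

def add_service_key(record: dict, key: str, value) -> dict:
    # Immutable pattern: return a new dict rather than mutating the record.
    return {**record, SERVICE_KEY_PREFIX + key: value}

def is_service_key(key: str) -> bool:
    return key.startswith(SERVICE_KEY_PREFIX)

def remove_service_keys(records):
    # Generator: clean records one at a time after the transform step,
    # so service fields never reach the mapping and store steps.
    for record in records:
        yield {k: v for k, v in record.items() if not is_service_key(k)}

root = {"data": [{"id": 1}]}
enriched = add_service_key({"id": 1}, "root", root)
print(enriched)                               # {'id': 1, '$root': {'data': [{'id': 1}]}}
print(list(remove_service_keys([enriched])))  # [{'id': 1}]
```

The add/remove pair makes the service fields effectively local to the filter and transform steps, which is the behavior the PR description asks for.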
Actionable comments posted: 1
🧹 Nitpick comments (4)
docs/CONTRIBUTING.md (4)
17-17
: The installation instructions are well-organized, but could we make the Python version requirement more prominent?
The Python version requirement is crucial for development. What do you think about moving it to the very top of the setup instructions and adding a note about version compatibility? Something like:

```diff
-1. Clone the CDK repo. If you will be testing connectors, you should clone the CDK into the same parent directory as `airbytehq/airbyte`, which contains the connector definitions.
-1. Make sure your Python version is 3.11
+## Prerequisites
+
+- Python 3.11 (required)
+- Poetry 2.0 or higher
+
+## Setup Steps
+
+1. Clone the CDK repo. If you will be testing connectors, you should clone the CDK into the same parent directory as `airbytehq/airbyte`, which contains the connector definitions.
```

wdyt? 🤔
Also applies to: 26-86
111-138
: The testing section is comprehensive, but could we improve the organization?
The testing instructions are detailed but could be more structured. What do you think about grouping related commands under subsections with descriptive headers? For example:

```diff
 ### Run Unit Tests
+#### Basic Test Commands
 - `poetry run pytest` to run all unit tests.
 - `poetry run pytest -k <suite or test name>` to run specific unit tests.
+
+#### Performance Optimized Testing
 - `poetry run pytest-fast` to run the subset of PyTest tests, which are not flagged as `slow`. (It should take <5 min for fast tests only.)
+
+#### Advanced Options
 - `poetry run pytest -s unit_tests` if you want to pass pytest options.
```

This organization might make it easier for contributors to find the specific command they need. Thoughts? 🤔
🧰 Tools
🪛 LanguageTool
[style] ~116-~116: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. -poetry run pytest-fast
to run the subset of PyTest tests, which a...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~123-~123: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. -poetry run ruff format
to format your Python code. ### Run Code ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
346-352
: Consider adding a note about container cleanup.When using Docker for MockServer testing, it might be helpful to add a note about cleaning up the container after testing. What do you think about adding:
docker run -d --rm -v $(pwd)/secrets/mock_server_config:/config -p 8113:8113 --env MOCKSERVER_LOG_LEVEL=TRACE --env MOCKSERVER_SERVER_PORT=8113 --env MOCKSERVER_WATCH_INITIALIZATION_JSON=true --env MOCKSERVER_PERSISTED_EXPECTATIONS_PATH=/config/expectations.json --env MOCKSERVER_INITIALIZATION_JSON_PATH=/config/expectations.json mockserver/mockserver:5.15.0 +# To stop the MockServer container when done: +docker ps | grep mockserver | awk '{print $1}' | xargs docker stopwdyt? This would help prevent orphaned containers. 🐳
100-100
: Add language specifiers to code blocks.
Some code blocks are missing language specifiers. This affects syntax highlighting and markdown linting.

````diff
-```
+```yaml
 connector_image: airbyte/source-shopify:dev
 to
 connector_image: airbyte/source-shopify:<version>
````

Also, remove spaces inside code spans:

```diff
-` domain: nw0ipt-vr.myshopify.com`
+`domain: nw0ipt-vr.myshopify.com`
```

Also applies to: 184-184
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
100-100: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/CONTRIBUTING.md
(5 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/CONTRIBUTING.md
[style] ~116-~116: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast
to run the subset of PyTest tests, which a...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~123-~123: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format
to format your Python code. ### Run Code ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~287-~287: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...
(DO_ACHIEVE)
[locale-violation] ~305-~305: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....
(HAVE_A_LOOK)
🪛 markdownlint-cli2 (0.17.2)
docs/CONTRIBUTING.md
100-100: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
155-155: Emphasis used instead of a heading
null
(MD036, no-emphasis-as-heading)
184-184: Spaces inside code span elements
null
(MD038, no-space-in-code)
254-254: Emphasis used instead of a heading
null
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (1)
docs/CONTRIBUTING.md (1)
8-8
: LGTM! Nice improvement in capitalization.The capitalization of "GitHub" has been corrected, which aligns with proper noun conventions.
Actionable comments posted: 0
🧹 Nitpick comments (6)
docs/CONTRIBUTING.md (6)
16-81
: Setup instructions could be more explicit about Python version requirements, wdyt?
The Python version requirement is mentioned but could be more explicit about whether this is a minimum or exact version requirement. Consider clarifying if versions > 3.11 are also supported.

```diff
-1. Make sure your Python version is 3.11
+1. Make sure your Python version is 3.11 or higher
```
93-101
: Consider adding a note about version synchronization, what do you think?
The instructions about changing the connector image version could benefit from a note explaining why version synchronization is important and what problems it prevents.

```diff
 where the **version** comes from airbyte/airbyte-integrations/connectors/source-shopify/metadata.yaml, as
-the value of the **dockerImageTag** parameter.
+the value of the **dockerImageTag** parameter. This ensures version consistency between the connector image and its metadata, preventing potential compatibility issues.
```

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
95-95: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
106-124
: The testing section looks great, but what about adding examples? 🤔
The testing instructions are clear, but could be even more helpful with example commands showing typical usage patterns. For instance, showing how to run a specific test suite or how to use the coverage report.

````diff
 - `poetry run pytest` to run all unit tests.
+Example:
+```bash
+# Run a specific test suite
+poetry run pytest unit_tests/test_source.py
+
+# Run tests with coverage report
+poetry run pytest --cov=airbyte_cdk
+```
````
🪛 LanguageTool
[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. -poetry run pytest-fast
to run the subset of PyTest tests, which a...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. -poetry run ruff format
to format your Python code. ### Run Code ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
179-181
: Fix code block formatting for better readability?
The domain and shop examples are using inconsistent formatting. Consider using proper code blocks for better readability.

````diff
- domain: nw0ipt-vr.myshopify.com
- shop: nw0ipt-vr
+```yaml
+domain: nw0ipt-vr.myshopify.com
+shop: nw0ipt-vr
+```
````

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
179-179: Spaces inside code span elements
null(MD038, no-space-in-code)
339-345
: Consider adding a troubleshooting section for MockServer setup?
The MockServer setup instructions are good, but common issues and their solutions could be helpful. For example, what to do if the port is already in use or if Docker can't mount the volume.

```diff
 HTTP requests to `localhost:8113/data` should now return the body defined in the expectations file. To test this, the implementer either has to change the code that defines the base URL for the Python source or update the `url_base` from low-code. With the Connector Builder running in docker, you will have to use domain `host.docker.internal` instead of `localhost` as the requests are executed within docker.
+
+### Troubleshooting MockServer Setup
+
+- If port 8113 is already in use, you can change it to another port in both the docker command and your configuration
+- If you see volume mount errors, ensure the `secrets/mock_server_config` directory exists and has proper permissions
+- For Windows users using WSL2, you might need to use `%cd%` instead of `$(pwd)` in the volume mount
```
298-298
: Minor style suggestion: consider using American English consistently?
For consistency with other documentation, consider using "take a look" instead of "have a look".

```diff
-Please have a look at the [Release Management](./RELEASES.md) guide
+Please take a look at the [Release Management](./RELEASES.md) guide
```

🧰 Tools
🪛 LanguageTool
[locale-violation] ~298-~298: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....(HAVE_A_LOOK)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/CONTRIBUTING.md
(4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/CONTRIBUTING.md
[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast
to run the subset of PyTest tests, which a...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format
to format your Python code. ### Run Code ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~280-~280: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...
(DO_ACHIEVE)
[locale-violation] ~298-~298: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....
(HAVE_A_LOOK)
🪛 markdownlint-cli2 (0.17.2)
docs/CONTRIBUTING.md
95-95: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
150-150: Emphasis used instead of a heading
null
(MD036, no-emphasis-as-heading)
179-179: Spaces inside code span elements
null
(MD038, no-space-in-code)
249-249: Emphasis used instead of a heading
null
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
Actionable comments posted: 0
🧹 Nitpick comments (3)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
661-675
: Consider adding more descriptive documentation for the new extractor.
The new `DpathEnhancingExtractor` class seems to be a key component for implementing the `$root` service field feature mentioned in the PR objectives. However, its purpose and behavior aren't immediately clear from the documentation. Would you consider adding:

- A description of how it enhances the extraction process
- An example showing how it adds the `$root` service field
- Documentation about any limitations or considerations when using this extractor

```diff
 class DpathEnhancingExtractor(BaseModel):
     type: Literal["DpathEnhancingExtractor"]
     field_path: List[str] = Field(
         ...,
-        description='List of potentially nested fields describing the full path of the field to extract. Use "*" to extract all values from an array. See more info in the [docs](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/record-selector).',
+        description='''List of potentially nested fields describing the full path of the field to extract. Use "*" to extract all values from an array.
+
+        This extractor enhances each extracted record by adding a `$root` service field containing a reference to:
+        - The root response object when parsing JSON format
+        - The original record when parsing JSONL format
+
+        This enables advanced filtering and transformation capabilities by providing access to the complete context.
+        Note: Service fields are local to filtering and transformation steps and are cleaned afterward.
+
+        See more info in the [docs](https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/record-selector).
+        ''',
         examples=[
             ["data"],
             ["data", "records"],
             ["data", "{{ parameters.name }}"],
             ["data", "*", "record"],
         ],
         title="Field Path",
     )
     parameters: Optional[Dict[str, Any]] = Field(None, alias="$parameters")
```

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
1865-1882
: Consider reducing code duplication with create_dpath_extractor?
The implementation is correct, but I notice it shares a lot of logic with `create_dpath_extractor`. Would you consider extracting the common decoder initialization and field path handling into a shared private method to reduce duplication? Something like this perhaps:

```diff
+    def _create_dpath_extractor_base(
+        self,
+        decoder: Optional[Decoder],
+        field_path: List[str],
+        config: Config,
+        parameters: Optional[Dict[str, Any]] = None,
+    ) -> Tuple[Decoder, List[Union[InterpolatedString, str]]]:
+        decoder_to_use = decoder if decoder else JsonDecoder(parameters={})
+        model_field_path: List[Union[InterpolatedString, str]] = [x for x in field_path]
+        return decoder_to_use, model_field_path
+
     def create_dpath_extractor(
         self,
         model: DpathExtractorModel,
         config: Config,
         decoder: Optional[Decoder] = None,
         **kwargs: Any,
     ) -> DpathExtractor:
-        if decoder:
-            decoder_to_use = decoder
-        else:
-            decoder_to_use = JsonDecoder(parameters={})
-        model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path]
+        decoder_to_use, model_field_path = self._create_dpath_extractor_base(
+            decoder, model.field_path, config, model.parameters
+        )
         return DpathExtractor(
             decoder=decoder_to_use,
             field_path=model_field_path,
             config=config,
             parameters=model.parameters or {},
         )

     def create_dpath_enhancing_extractor(
         self,
         model: DpathEnhancingExtractorModel,
         config: Config,
         decoder: Optional[Decoder] = None,
         **kwargs: Any,
     ) -> DpathEnhancingExtractor:
-        if decoder:
-            decoder_to_use = decoder
-        else:
-            decoder_to_use = JsonDecoder(parameters={})
-        model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path]
+        decoder_to_use, model_field_path = self._create_dpath_extractor_base(
+            decoder, model.field_path, config, model.parameters
+        )
         return DpathEnhancingExtractor(
             decoder=decoder_to_use,
             field_path=model_field_path,
             config=config,
             parameters=model.parameters or {},
         )
```

This would make the code more maintainable and reduce the risk of divergence in the future. What do you think?
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
1463-1493
: New Component Addition: DpathEnhancingExtractor
The new `DpathEnhancingExtractor` component is added to support the inclusion of local service fields (`$root` and `$parent`) for use during the filter and transform phases. This aligns nicely with the PR objective of retaining the original object in each record. Would you consider adding more detailed examples or clarifying comments on how these service fields could be utilized (and later cleaned up) to further assist connector developers? wdyt?
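For instance, a manifest snippet along these lines could demonstrate the idea. The field values are hypothetical; the component and service-field names follow this PR's schema, but the filter expression is illustrative only:

```yaml
selector:
  type: RecordSelector
  extractor:
    type: DpathEnhancingExtractor
    field_path: ["data", "records"]
  record_filter:
    type: RecordFilter
    # The filter can consult the enclosing response via the $root service
    # field; the field is removed again before mapping and storage.
    condition: "{{ record['$root']['meta']['has_more'] == false }}"
```

A snippet like this in the schema description would show connector developers both how to opt into the new extractor and what the service fields give them.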
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml
(2 hunks)airbyte_cdk/sources/declarative/models/declarative_component_schema.py
(2 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1685-1685
: LGTM!
The update to include `DpathEnhancingExtractor` in the `RecordSelector`'s extractor field union type is correct and necessary for the new feature.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
80-80
: LGTM! Clean import addition.
The DpathEnhancingExtractor import is well-placed within the existing extractors import group.

218-220
: LGTM! Clean model mapping integration.
The DpathEnhancingExtractorModel import and constructor mapping are properly integrated into the factory's model mapping system.
Also applies to: 562-562
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
2808-2811
: RecordSelector Update: Inclusion of DpathEnhancingExtractor
By adding the reference to `DpathEnhancingExtractor` in the `RecordSelector` component's `extractor` property, the PR now enables users to select this new extractor. Could you also verify that corresponding tests cover this new behavior, ensuring that the `$root` field is correctly populated and later removed during transformations? wdyt?
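A test along those lines could be sketched as follows. This is hedged: plain dicts and an invented `select_records` helper stand in for the CDK's selector API, but the three steps mirror the contract described in the PR:

```python
def select_records(response: dict, field_path=("data",)):
    # 1) Extract records and attach the "$root" service field.
    records = response
    for key in field_path:
        records = records[key]
    enriched = [{**r, "$root": response} for r in records]
    # 2) Filtering may consult "$root" (here: a flag on the response envelope).
    filtered = [r for r in enriched if r["$root"]["meta"]["ok"]]
    # 3) Service fields are removed before records are emitted downstream.
    return [{k: v for k, v in r.items() if not k.startswith("$")} for r in filtered]

good = {"meta": {"ok": True}, "data": [{"id": 1}, {"id": 2}]}
bad = {"meta": {"ok": False}, "data": [{"id": 3}]}
print(select_records(good))  # [{'id': 1}, {'id': 2}]
print(select_records(bad))   # []
```

Asserting both that the filter saw `$root` and that no emitted record still carries it would cover the populate-then-remove behavior asked about above.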
Trigger a new build
Actionable comments posted: 0
🧹 Nitpick comments (23)
docs/CONTRIBUTING.md (23)
16-22
: Python Version Requirement Added:
Great job specifying that contributors must use Python 3.11. Would it be helpful to include a link to Python 3.11 release notes or a brief note on compatibility issues when using different versions? wdyt?
24-30
: Pip Installation Instructions for Fedora:
The new instructions to install pip on Fedora are clear. Would you consider adding a quick note on verifying the installed pip version (for example, by runningpip --version
) to help less experienced users? wdyt?
32-47
: Poetry Installation Details:
The alternative methods for installing Poetry (via pip and via curl) are nicely detailed. Would you consider including a brief comment on when one method might be preferable over the other or adding a link to further documentation? wdyt?
50-63
: Using the Python 3.11 Environment:
The section showing how to use a Python 3.11 environment with Poetry is very clear. Would you consider adding a troubleshooting tip (for example, what to do if the correct environment isn’t activated) to further assist contributors? wdyt?
65-65
: Cloning the CDK Repository:
The instruction to clone the CDK repo is concise. Do you think adding an examplegit clone
command or a reference link to our Git documentation might further help newcomers? wdyt?
66-70
: Installing Unit Test Prerequisites:
The instruction to runpoetry install --all-extras
is straightforward. Would you consider mentioning briefly why the--all-extras
flag is required, so that contributors understand its purpose? wdyt?
73-81
: RHEL Module Setup Instructions:
These instructions to execute commands for RHEL (or compatible OS) users are clear. Would you consider adding a short explanation of what theiptable_nat
module does and why it’s needed? wdyt?
83-87
: 'See Also' References:
The “See also” section with links (including Dagger/Podman integration and issue #197) is useful. Could you please verify that these links are current and consider removing any redundant references if necessary? wdyt?
91-97
: Editing Shopify Connector Config:
The guidance on editing the acceptance test configuration for the Shopify connector is clear. Would you consider providing a brief sample diff or additional explanation to help users understand the required changes? wdyt?🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
95-95: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
109-113
: Running Unit Tests:
The updated unit test commands are concise and useful. Would you consider varying the sentence structure slightly to avoid repeated phrasing? wdyt?🧰 Tools
🪛 LanguageTool
[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. -poetry run pytest-fast
to run the subset of PyTest tests, which a...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
114-119
: Code Formatting Commands:
In this section, several sentences start similarly. Would you consider rewording a couple of these sentences for improved readability, or adding transitions between them? wdyt?🧰 Tools
🪛 LanguageTool
[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. -poetry run ruff format
to format your Python code. ### Run Code ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
126-133
: More Tools and Options:
This section is quite helpful. Would it add clarity to include a short explanation of when to use each tool before listing the commands? wdyt?
134-139
: Local Testing Against Connector Changes:
The instructions for testing locally with CDK changes are detailed. Would you consider including a brief troubleshooting tip for common issues encountered during this process? wdyt?
150-189
: Integration Test Preparation Steps:
This lengthy block is very detailed and informative. Would you consider breaking it into sub-sections (for example, “Project Setup,” “Configuration Changes,” and “File Creation”) to enhance readability for contributors? wdyt?🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
150-150: Emphasis used instead of a heading
null(MD036, no-emphasis-as-heading)
179-179: Spaces inside code span elements
null(MD038, no-space-in-code)
199-211
: Running Connector Tests:
The step-by-step commands for executing connector tests are clear. Would adding a note on how to interpret test outcomes or what to do when a test fails be beneficial? wdyt?
213-223
: Shopify Connector Example:
The example provided for the Shopify connector is practical. Would you consider clarifying the placeholder<connector name>
(or providing an example) so that the instructions are completely unambiguous? wdyt?
233-243
: Extended Acceptance Test Examples:
This block provides various command options (including debug and timeout). Would it be helpful to label these options explicitly (e.g., "With Debug Option", "With Timeout Option") for even easier reference? wdyt?
251-267
: airbyte-ci CLI and Build Commands:
The instructions for installing theairbyte-ci
CLI and building the connector image using the local CDK are well organized. Would you consider adding a link or further reference to detailed documentation onairbyte-ci
for contributors who want to dive deeper? wdyt?
272-276
: Working with Poe Tasks:
This section explains Poe Tasks clearly. Would you consider including a short troubleshooting tip or common pitfalls when using Poe tasks to further support contributors? wdyt?
280-285
: Auto-Generating the Declarative Schema File:
The instructions are clear, though the phrasing “you may need to regenerate them” could be more expressive. Would you consider rewording it to, for example, “please regenerate the schemas to ensure they align with the latest changes”? wdyt?🧰 Tools
🪛 LanguageTool
[style] ~280-~280: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...(DO_ACHIEVE)
297-299
: Release Management Section Wording:
In the Release Management section, the instruction “please have a look at the Release Management guide” might sound friendlier if changed to “please take a look at...”. Would you consider this minor tweak to align with American English usage? wdyt?🧰 Tools
🪛 LanguageTool
[locale-violation] ~298-~298: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....(HAVE_A_LOOK)
311-316
: GitHub Slash Commands Formatting:
The list of GitHub slash commands is very useful. Would you consider adjusting the formatting of the/autofix
entry—perhaps by ensuring a consistent separation between the command and its description—as done with/poetry-lock
? wdyt?
347-351
: Connector Acceptance Tests Header:
The header for “Running Connector Acceptance Tests for a single connector in Docker” is clear. Would you consider adding a brief note on any prerequisites (if not mentioned elsewhere) for running these tests? wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/CONTRIBUTING.md
(4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/CONTRIBUTING.md
[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast
to run the subset of PyTest tests, which a...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format
to format your Python code. ### Run Code ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~280-~280: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...
(DO_ACHIEVE)
[locale-violation] ~298-~298: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....
(HAVE_A_LOOK)
🪛 markdownlint-cli2 (0.17.2)
docs/CONTRIBUTING.md
95-95: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
150-150: Emphasis used instead of a heading
null
(MD036, no-emphasis-as-heading)
179-179: Spaces inside code span elements
null
(MD038, no-space-in-code)
249-249: Emphasis used instead of a heading
null
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (15)
docs/CONTRIBUTING.md (15)
8-8
: Clarify Contribution Avenues:
You updated the bullet point to include “GitHub discussions” (with proper capitalization) as a way to contribute. Would you please confirm that this revised wording aligns with our intended communication and style guidelines? wdyt?
88-88
: Docker Installation Note:
The brief note on ensuring Docker is installed locally is succinct. Is this meant to serve only as a reminder, or would adding a link to more detailed Docker installation instructions improve clarity? wdyt?
120-125
: Code Linting Instructions:
The linting commands are clearly listed and align well with our best practices. Great work here!
140-141
: GitHub Pipelines Integration Note:
The note that the GitHub pipelines run integration tests against the Shopify source is concise and useful. Thank you for the clarity!
144-148
: Assumptions Section:
The assumptions about checkout locations and project structure are clearly laid out. This adds valuable context—great job!
225-231
: Acceptance Tests Locally:
The instructions for running acceptance tests locally are straightforward and well explained. Excellent work!
245-246
: Reverting Test Changes Reminder:
The reminder to revert test changes is a nice touch to ensure cleanup after testing. Looks good to me!
247-250
: Test CDK Changes Section Header:
The header and subsequent preparation steps for testing CDK changes against connectors are clear. Thank you for the detailed guidance!🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
249-249: Emphasis used instead of a heading
null(MD036, no-emphasis-as-heading)
269-271
: Local CDK Build Note:
The explanation that the local CDK is injected at build time is concise and clear. Great job!
277-277
: Duplicate Ruff Configuration Paragraph:
It appears that the paragraph about the Ruff configuration is duplicated. Would you consider removing the redundant copy to improve document clarity? wdyt?
294-296
: docs-generate Task Mapping:
The explanation of how thedocs-generate
task is mapped and what it produces is clear and useful. Nice work here!
302-305
: FAQ on Maintainers:
The FAQ entry explaining who the “maintainers” are is clear and informative. No changes needed here—great job!
323-327
: MockServer Introduction:
The introduction about using MockServer in place of direct API access is clear and sets the context well. Thanks for including this helpful information!
339-343
: Docker Command for MockServer:
The Docker command provided for running MockServer is detailed and should be very useful for our testing. Everything looks good here!
345-346
: HTTP Requests Explanation:
The follow-up explanation that HTTP requests tolocalhost:8113/data
will return the defined body is clear. No issues here—great work!
At record extraction step, in each record add the service field $root holding a reference to: * the root response object, when parsing JSON format * the original record, when parsing JSONL format that each record to process is extracted from. More service fields could be added in future. The service fields are available in the record's filtering and transform steps. Avoid: * reusing the maps/dictionaries produced, thus avoid building cyclic structures * transforming the service fields in the Flatten transformation. Explicitly cleanup the service field(s) after the transform step, thus making them: * local for the filter and transform steps * not visible to the next mapping and store steps (as they should be) * not visible in the tests beyond the test_record_selector (as they should be) This allows the record transformation logic to define its "local variables" to reuse some interim calculations. The contract of body parsing seems irregular in representing the cases of bad JSON, no JSON and empty JSON. Cannot be unified as that that irregularity is already used. Update the development environment setup documentation * to organize and present the setup steps explicitly * to avoid misunderstandings and wasted efforts. Update CONTRIBUTING.md to * collect and organize the knowledge on running the test locally. * state the actual testing steps. * clarify and make explicit the procedures and steps. The unit, integration, and acceptance tests in this exactly version succeed under Fedora 41, while one of them fails under Oracle Linux 8.7. not related to the contents of this PR. The integration tests of the CDK fail due to missing `secrets/config.json` file for the Shopify source. See airbytehq#197 Polish Integrate the DpathEnhancingExtractor in the UI of Airbyte. 
Created a DPath Enhancing Extractor Refactored the record enhancement logic - moved to the extracted class Split the tests of DPathExtractor and DPathEnhancingExtractor Fix the failing tests: FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_custom_components[test_create_custom_component_with_subcomponent_that_uses_parameters] FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_custom_components_do_not_contain_extra_fields FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_parse_custom_component_fields_if_subcomponent FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_page_increment FAILED unit_tests/sources/declarative/parsers/test_model_to_component_factory.py::test_create_offset_increment FAILED unit_tests/sources/file_based/test_file_based_scenarios.py::test_file_based_read[simple_unstructured_scenario] FAILED unit_tests/sources/file_based/test_file_based_scenarios.py::test_file_based_read[no_file_extension_unstructured_scenario] They faile because of comparing string and int values of the page_size (public) attribute. Imposed an invariant: on construction, page_size can be set to a string or int keep only values of one type in page_size for uniform comparison (convert the values of the other type) _page_size holds the internal / working value ... unless manipulated directly. 
Merged: feat(low-code concurrent): Allow async job low-code streams that are incremental to be run by the concurrent framework (airbytehq#228) fix(low-code): Fix declarative low-code state migration in SubstreamPartitionRouter (airbytehq#267) feat: combine slash command jobs into single job steps (airbytehq#266) feat(low-code): add items and property mappings to dynamic schemas (airbytehq#256) feat: add help response for unrecognized slash commands (airbytehq#264) ci: post direct links to html connector test reports (airbytehq#252) (airbytehq#263) fix(low-code): Fix legacy state migration in SubstreamPartitionRouter (airbytehq#261) fix(airbyte-cdk): Fix RequestOptionsProvider for PerPartitionWithGlobalCursor (airbytehq#254) feat(low-code): add profile assertion flow to oauth authenticator component (airbytehq#236) feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor (airbytehq#111) fix: don't mypy unit_tests (airbytehq#241) fix: handle backoff_strategies in CompositeErrorHandler (airbytehq#225) feat(concurrent cursor): attempt at clamping datetime (airbytehq#234) fix(airbyte-cdk): Fix RequestOptionsProvider for PerPartitionWithGlobalCursor (airbytehq#254) feat(low-code): add profile assertion flow to oauth authenticator component (airbytehq#236) feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor (airbytehq#111) fix: don't mypy unit_tests (airbytehq#241) fix: handle backoff_strategies in CompositeErrorHandler (airbytehq#225) feat(concurrent cursor): attempt at clamping datetime (airbytehq#234) ci: use `ubuntu-24.04` explicitly (resolves CI warnings) (airbytehq#244) Fix(sdm): module ref issue in python components import (airbytehq#243) feat(source-declarative-manifest): add support for custom Python components from dynamic text input (airbytehq#174) chore(deps): bump avro from 1.11.3 to 1.12.0 (airbytehq#133) docs: comments on what the `Dockerfile` is for (airbytehq#240) chore: move ruff configuration to dedicated ruff.toml file 
(airbytehq#237) Fix(sdm): module ref issue in python components import (airbytehq#243) feat(low-code): add DpathFlattenFields (airbytehq#227) feat(source-declarative-manifest): add support for custom Python components from dynamic text input (airbytehq#174) chore(deps): bump avro from 1.11.3 to 1.12.0 (airbytehq#133) docs: comments on what the `Dockerfile` is for (airbytehq#240) chore: move ruff configuration to dedicated ruff.toml file (airbytehq#237) formatted Update record_extractor.py Trigger a new build. Hopefully, the integration test infrastructure is fixed. Update CONTRIBUTING.md Trigger a new build
docs/CONTRIBUTING.md unit_tests/sources/declarative/transformations/test_flatten_fields.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (14)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (2)
33-35
: LGTM! Would you consider adding a docstring? 🤔The enhanced function signature with default values looks great! For better maintainability, maybe we could add a docstring explaining the parameters? wdyt?
def generate_csv( encoding: str = "utf-8", delimiter: str = ",", should_compress: bool = False ) -> bytes: + """Generate CSV test data with configurable encoding, delimiter and compression. + + Args: + encoding (str, optional): Character encoding. Defaults to "utf-8". + delimiter (str, optional): CSV field separator. Defaults to ",". + should_compress (bool, optional): Whether to gzip compress output. Defaults to False. + + Returns: + bytes: Generated CSV data, optionally compressed + """
180-200
: Great test coverage! Maybe we could make the test data DRY? 🤔The parameterized test looks excellent! One tiny suggestion: since the expected data is identical to what's in
generate_csv
, would you consider extracting it to a shared constant to avoid duplication? wdyt?+TEST_CSV_DATA = [ + {"id": "1", "name": "John", "age": "28"}, + {"id": "2", "name": "Alice", "age": "34"}, + {"id": "3", "name": "Bob", "age": "25"}, +] + def generate_csv( encoding: str = "utf-8", delimiter: str = ",", should_compress: bool = False ) -> bytes: - data = [ - {"id": "1", "name": "John", "age": "28"}, - {"id": "2", "name": "Alice", "age": "34"}, - {"id": "3", "name": "Bob", "age": "25"}, - ] + data = TEST_CSV_DATA ... def test_composite_raw_decoder_csv_parser_values(requests_mock, encoding: str, delimiter: str): ... - expected_data = [ - {"id": "1", "name": "John", "age": "28"}, - {"id": "2", "name": "Alice", "age": "34"}, - {"id": "3", "name": "Bob", "age": "25"}, - ] + expected_data = TEST_CSV_DATAdocs/CONTRIBUTING.md (12)
109-120
: Local Development – Unit Test Commands
The “Run Unit Tests” section now provides several command options (including using pytest’s -k and -s flags), which is very useful for different testing situations. Might adding a one-line explanation for when to use each option assist less experienced contributors? wdyt?🧰 Tools
🪛 LanguageTool
[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. -poetry run pytest-fast
to run the subset of PyTest tests, which a...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
131-135
: Additional Tools and Options Reference
The “More tools and options” section offers quick access to useful commands. Would linking to a more detailed description of each tool be beneficial, or is brevity preferred here? wdyt?
143-148
: Assumptions and Pipeline Use
The note that GitHub pipelines run the integration tests along with clear assumptions about directory structure provides valuable context. Have you considered adding a simple diagram to visually represent this structure? wdyt?
172-180
: Configuring Secrets for Integration Tests
The detailed instructions—complete with an example for the Shopify connector—for creating thesecrets/config.json
file are very practical. Would you consider revising the formatting of the in-line example (e.g., extra indentation) for clarity? wdyt?
198-209
: Running Connector Test Commands
The “Steps:” section with detailed commands for running connector tests is comprehensive. Do you think it would help to annotate which command verifies which aspect of the connector (spec, check, discover, read)? wdyt?
211-220
: Shopify Connector Test Example
The example provided for the Shopify source (including running spec, check, discover, and read commands) is clear and instructive. Would a note on expected outputs help ensure users know when a test passes? wdyt?
229-238
: Extended Acceptance Test Options
Providing examples with debug and timeout options is highly useful for troubleshooting. Would it be beneficial to include a brief explanation of when to use the timeout option versus the debug option? wdyt?
242-257
: Testing CDK Changes Against Connectors
The detailed instructions using theairbyte-ci
CLI for building connector images and running tests are very well defined. Would a pointer to common errors during the build/test process be useful for contributors? wdyt?🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
256-256: Emphasis used instead of a heading
null(MD036, no-emphasis-as-heading)
273-279
: Generating API Reference Documentation
The section on generating API docs using pdoc is detailed and helpful. Should we perhaps mention how to preview the generated docs locally if issues arise? wdyt?
280-282
: Release Management Guidance
The directive to “have a look” at the Release Management guide is clear. Given regional language preferences, would you consider revising it to “Please take a look at…” for consistency with American English? wdyt?
283-299
: FAQ Section Clarity and Consistency
The FAQ is well structured and covers key questions. A couple of suggestions:
• Consider varying the starting words of consecutive sentences to enhance readability.
• The bullet for “/autofix” could use consistent spacing (e.g., adding a space after/autofix
).
Would these refinements help improve the flow? wdyt?🧰 Tools
🪛 LanguageTool
[style] ~287-~287: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...(DO_ACHIEVE)
302-334
: Advanced Topics and Appendix
The advanced topics—covering using MockServer and running connector acceptance tests in Docker—are comprehensive and provide useful real-world guidance. Would including a brief troubleshooting checklist or common pitfalls in these sections further assist contributors? wdyt?🧰 Tools
🪛 LanguageTool
[locale-violation] ~305-~305: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....(HAVE_A_LOOK)
[uncategorized] ~317-~317: Use a comma before ‘and’ if it connects two independent clauses (unless they are closely connected and short).
Context: ...What GitHub slash commands are available and who can run them? Only Airbyte CDK mai...(COMMA_COMPOUND_SENTENCE_2)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
.github/workflows/pypi_publish.yml
(1 hunks)airbyte_cdk/sources/declarative/auth/oauth.py
(2 hunks)airbyte_cdk/sources/streams/http/requests_native_auth/abstract_oauth.py
(1 hunks)docs/CONTRIBUTING.md
(4 hunks)unit_tests/sources/declarative/auth/test_oauth.py
(3 hunks)unit_tests/sources/declarative/decoders/test_composite_decoder.py
(2 hunks)unit_tests/sources/declarative/retrievers/test_simple_retriever.py
(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- .github/workflows/pypi_publish.yml
🧰 Additional context used
🪛 LanguageTool
docs/CONTRIBUTING.md
[style] ~118-~118: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... unit tests. - poetry run pytest-fast
to run the subset of PyTest tests, which a...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~125-~125: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ting issues. - poetry run ruff format
to format your Python code. ### Run Code ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[uncategorized] ~184-~184: Possible missing comma found.
Context: ...file. - On the Settings / Domains page find the subdomain of myshopify.com a...
(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~287-~287: Consider a more expressive alternative.
Context: ...or, you may need to regenerate them. To do that, you can run: ```bash poetry run ...
(DO_ACHIEVE)
[locale-violation] ~305-~305: In American English, “take a look” is more commonly used.
Context: ...Actions. ## Release Management Please have a look at the [Release Management](./RELEASES....
(HAVE_A_LOOK)
🪛 markdownlint-cli2 (0.17.2)
docs/CONTRIBUTING.md
102-102: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
157-157: Emphasis used instead of a heading
null
(MD036, no-emphasis-as-heading)
186-186: Spaces inside code span elements
null
(MD038, no-space-in-code)
256-256: Emphasis used instead of a heading
null
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (32)
unit_tests/sources/declarative/retrievers/test_simple_retriever.py (2)
257-260
: LGTM! The state completion assertion looks good.The added assertion verifies that the state correctly indicates sync completion after reading all records. This aligns well with the PR objective of enhancing record filtering and transformation.
372-375
: LGTM! The state completion assertion looks good.The added assertion verifies that the state correctly indicates sync completion after reading all records. This aligns well with the PR objective of enhancing record filtering and transformation.
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)
37-39
: Nice improvements to the test data and CSV generation! 👍The changes look great:
- Using strings for all fields is more CSV-friendly
- Clean separation of compression logic
- Configurable delimiter enhances test coverage
Also applies to: 42-43, 48-53
airbyte_cdk/sources/declarative/auth/oauth.py (2)
234-237
: LGTM! Nice defensive programming.The addition of the access token initialization check before returning the expiry date helps prevent potential issues with uninitialized tokens.
239-241
: LGTM! Clean and focused helper method.The method follows the single responsibility principle and has a clear, descriptive name.
airbyte_cdk/sources/streams/http/requests_native_auth/abstract_oauth.py (1)
264-266
: LGTM! Good edge case handling.The addition of the empty value check with token expiry validation provides a graceful fallback mechanism.
unit_tests/sources/declarative/auth/test_oauth.py (2)
317-344
: LGTM! Great test coverage for the new functionality.The test case thoroughly verifies the behavior when no access token is present but the expiry date is in the future.
304-304
: LGTM! Good test enhancement.Adding the access token value parameter improves the test coverage.
docs/CONTRIBUTING.md (24)
8-8
: Refine Contribution Instructions
The updated bullet now explicitly lists “reporting bugs, posting GitHub discussions, opening issues…” which broadens the contribution options. Was this expansion intended to better guide potential contributors? wdyt?
16-17
: Specify Python Version Requirement
Introducing “1. Make sure your Python version is 3.11” establishes a clear baseline. Have you considered whether contributors using other versions might need additional guidance or troubleshooting steps? wdyt?
18-23
: Fedora Instructions for Python Installation
Including “Fedora 41:” with a dedicated code block to install Python 3.11 is very helpful for users on that platform. Did you verify these commands on recent Fedora releases? wdyt?
24-31
: Pip Installation Guidance on Fedora
The step “2. Install pip” along with Fedora-specific commands provides clear direction. Would you consider adding a brief note on what to do if the pip package name or command changes in future Fedora releases? wdyt?
32-47
: Multiple Methods for Installing Poetry
The section now offers two alternatives for installing Poetry (via pip and via curl) as well as a Fedora-specific block. This flexibility is great. Would you like to mention which method is preferred in our official recommendations, or should we let users choose freely? wdyt?
48-49
: Poe Tasks Note Clarity
The note explaining that “Poe” tasks can be used for common actions is informative. Is the linked “Poe the Poet” documentation up to date with these instructions? wdyt?
50-63
: Guidance on Using the Python 3.11 Environment
The clear instructions and code block for verifying and switching the Poetry environment to Python 3.11 will help prevent environment mismatches. Would you consider adding a brief troubleshooting tip if the command fails? wdyt?
65-70
: Repository Cloning and Test Prerequisites
The instructions to clone the CDK repo and install the unit test prerequisites are neatly detailed. Do you think a note on branch or version management could further assist contributors? wdyt?
74-88
: RHEL Module Loading Instructions
The steps for RHEL (or compatible OS) to load the IP table NAT module—with both a temporary and persistent solution—are very clear. Should we add a reminder that these commands typically require root privileges? wdyt?
92-93
: Additional Reference Links
The “See also:” links to the Dagger-Podman integration and CDK Issue 197 offer useful extra context. Do you think any further annotations are needed, or is the current level of detail sufficient? wdyt?
95-98
: Docker Installation Recommendation
Emphasizing that Docker should be installed (instead of Podman) is a strong clarification. Is this recommendation documented elsewhere as well, or should we consolidate related notes here? wdyt?
100-107
: Acceptance Test Config Update
The instructions for editing the Shopify acceptance-test configuration—with the before/after code block—are precise. Would it help to remind users to verify that no other fields are unintentionally modified during this update? wdyt?🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
102-102: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
121-124
: Code Formatting Instructions
The step-by-step formatting commands using ruff (both checking and auto-fixing) are clear and concise. Do these instructions match our latest formatting guidelines closely? wdyt?
126-129
: Linting and Type-checking Guidance
Including commands for both Poe-based linting and mypy type-checking provides thorough coverage of code quality. Are there any additional common pitfalls we should warn about in this section? wdyt?
137-141
: Integration Testing Scenarios
The description of various scenarios for running integration tests (e.g., running a connector that uses the new feature) is very practical. Would it be helpful to explicitly list prerequisites for these tests? wdyt?
150-156
: Acceptance Test Preparation
The “Preparation steps” for running connector acceptance tests (including entering the correct directory and installing dependencies) are straightforward. Would a brief troubleshooting tip (e.g., common errors) be useful here? wdyt?
158-163
: Updating Test Connector Configuration
The instructions to modify the<test connector>/pyproject.toml
file and replace the airbyte_cdk line are clear. Do you think it’s necessary to mention potential version mismatches during this update? wdyt?
165-170
: Reinstalling airbyte_cdk for Testing
The step-by-step commands for reinstalling airbyte_cdk in the test connector directory are concise. Are there any caching issues or additional cleanup steps we might need to warn about? wdyt?
183-192
: Example JSON Configuration for Secrets
The JSON snippet is well-formatted and serves as a good reference for contributors. Do you think adding comments within the JSON (if permitted) would further aid understanding? wdyt?🧰 Tools
🪛 LanguageTool
[uncategorized] ~184-~184: Possible missing comma found.
Context: ...file. - On the Settings / Domains page find the subdomain of myshopify.com a...(AI_HYDRA_LEO_MISSING_COMMA)
🪛 markdownlint-cli2 (0.17.2)
186-186: Spaces inside code span elements
null(MD038, no-space-in-code)
194-196
: Reference Links for Testing Resources
Including links to the Acceptance Tests Reference and Connector Acceptance Tests documentation adds valuable context. Would you like to include a brief summary of what each link covers? wdyt?
222-227
: Local Acceptance Test Execution
The command block for running acceptance tests locally is straightforward. Do you feel that adding information on environment prerequisites (e.g., necessary Docker settings) might prevent future confusion? wdyt?
240-240
: Test Changes Reversion Reminder
A simple reminder to revert test changes is a neat touch to keep the repository clean. Does that cover all scenarios, or should we expand on how to revert if using version control? wdyt?
258-263
: Working with Poe Tasks
The explanation regarding the use of Poe tasks (and the option to pre-install Poe via pipx) is clear and friendly. Do you think a short example of a commonly used Poe command might further enhance this section? wdyt?
264-272
: Auto-Generating the Declarative Schema File
The instructions for regenerating the schema file from the YAML spec are concise and easy to follow. Would linking to an example of a schema change be helpful for context? wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
1882-1899
: Consider reducing code duplication withcreate_dpath_extractor
?The implementation is almost identical to
create_dpath_extractor
. What do you think about extracting the common logic into a private helper method to reduce duplication? Something like this, wdyt?+ def _create_dpath_extractor_base( + self, + decoder: Optional[Decoder], + model_field_path: List[Union[InterpolatedString, str]], + config: Config, + parameters: Optional[Dict[str, Any]], + extractor_class: Type[Union[DpathExtractor, DpathEnhancingExtractor]] + ) -> Union[DpathExtractor, DpathEnhancingExtractor]: + decoder_to_use = decoder if decoder else JsonDecoder(parameters={}) + return extractor_class( + decoder=decoder_to_use, + field_path=model_field_path, + config=config, + parameters=parameters or {}, + ) + def create_dpath_extractor( self, model: DpathExtractorModel, config: Config, decoder: Optional[Decoder] = None, **kwargs: Any, ) -> DpathExtractor: - if decoder: - decoder_to_use = decoder - else: - decoder_to_use = JsonDecoder(parameters={}) model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path] - return DpathExtractor( - decoder=decoder_to_use, - field_path=model_field_path, - config=config, - parameters=model.parameters or {}, - ) + return self._create_dpath_extractor_base( + decoder=decoder, + model_field_path=model_field_path, + config=config, + parameters=model.parameters, + extractor_class=DpathExtractor + ) def create_dpath_enhancing_extractor( self, model: DpathEnhancingExtractorModel, config: Config, decoder: Optional[Decoder] = None, **kwargs: Any, ) -> DpathEnhancingExtractor: - if decoder: - decoder_to_use = decoder - else: - decoder_to_use = JsonDecoder(parameters={}) model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path] - return DpathEnhancingExtractor( - decoder=decoder_to_use, - field_path=model_field_path, - config=config, - parameters=model.parameters or {}, - ) + return self._create_dpath_extractor_base( + decoder=decoder, + model_field_path=model_field_path, + config=config, + parameters=model.parameters, + 
extractor_class=DpathEnhancingExtractor + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
* main: fix: update cryptography package to latest version to address CVE (airbytehq#377) fix: (CDK) (HttpRequester) - Make the `HttpRequester.path` optional (airbytehq#370) feat: improved custom components handling (airbytehq#350) feat: add microseconds timestamp format (airbytehq#373) fix: Replace Unidecode with anyascii for permissive license (airbytehq#367) feat: add IncrementingCountCursor (airbytehq#346) feat: (low-code cdk) datetime format with milliseconds (airbytehq#369) fix: (CDK) (AsyncRetriever) - Improve UX on variable naming and interpolation (airbytehq#368) fix: (CDK) (AsyncRetriever) - Add the `request` and `response` to each `async` operations (airbytehq#356) fix: (CDK) (ConnectorBuilder) - Add `auxiliary requests` to slice; support `TestRead` for AsyncRetriever (part 1/2) (airbytehq#355) feat(concurrent perpartition cursor): Add parent state updates (airbytehq#343) fix: update csv parser for builder compatibility (airbytehq#364) feat(low-code cdk): add interpolation for limit field in Rate (airbytehq#353) feat(low-code cdk): add AbstractStreamFacade processing as concurrent streams in declarative source (airbytehq#347) fix: (CDK) (CsvParser) - Fix the `\\` escaping when passing the `delimiter` from Builder's UI (airbytehq#358) feat: expose `str_to_datetime` jinja macro (airbytehq#351) fix: update CDK migration for 6.34.0 (airbytehq#348) feat: Removes `stream_state` interpolation from CDK (airbytehq#320) fix(declarative): Pass `extra_fields` in `global_substream_cursor` (airbytehq#195) feat(concurrent perpartition cursor): Refactor ConcurrentPerPartitionCursor (airbytehq#331) feat(HttpMocker): adding support for PUT requests and bytes responses (airbytehq#342) chore: use certified source for manifest-only test (airbytehq#338) feat: check for request_option mapping conflicts in individual components (airbytehq#328) feat(file-based): sync file acl permissions and identities (airbytehq#260) fix: (CDK) (Connector Builder) - refactor the 
`MessageGrouper` > `TestRead` (airbytehq#332) fix(low code): Fix missing cursor for ClientSideIncrementalRecordFilterDecorator (airbytehq#334) feat(low-code): Add API Budget (airbytehq#314) chore(decoder): clean decoders and make csvdecoder available (airbytehq#326)
* main: ci: add Docker build workflow for PR checks (airbytehq#385) fix: Update DEFAULT_ERROR_MAPPING for InvalidURL to RETRY (airbytehq#384) chore: use python base image 4.0.0 in SDM (airbytehq#382) ci: add check for 'do not merge' in PR titles (airbytehq#383) chore: remove embedded module (airbytehq#380) build: update pyarrow to ^19.0.0 (do not merge) (airbytehq#379)
Actionable comments posted: 1
🔭 Outside diff range comments (2)
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (2)
188-196: 🛠️ Refactor suggestion: Test function misses verification of service fields; should we extend the assertions?

The test function's implementation doesn't appear to verify the presence or behavior of service fields like `$root` that are mentioned in the PR objectives. Perhaps we could enhance this test function by:

- Checking that extracted records initially contain the `$root` service field
- Verifying that `exclude_service_keys(actual_records)` correctly removes these service fields
- Then comparing the result with `expected_records`

This would more thoroughly test the functionality described in the PR. What do you think about adding these additional assertions?

```diff
 def test_dpath_extractor(field_path: List, decoder: Decoder, body, expected_records: List):
     extractor = DpathExtractor(
         field_path=field_path, config=config, decoder=decoder, parameters=parameters
     )
     response = create_response(body)
     actual_records = list(extractor.extract_records(response))
+    # Verify presence of service fields
+    for record in actual_records:
+        assert "$root" in record, "Service field $root should be present in extracted records"
+
+    # Verify service fields are correctly excluded
+    cleaned_records = exclude_service_keys(actual_records)
+    assert cleaned_records == expected_records, "Records after excluding service fields should match expected records"
-    assert actual_records == expected_records
```
37-155: 💡 Verification agent, ❓ Verification inconclusive

Test cases expanded but missing verification of service fields; could we improve this?

You've added many helpful test cases covering edge cases like empty strings, objects, and arrays, as well as more complex nested structures. However, according to the PR objectives, a key feature being added is the `$root` service field for each record. None of these test cases seem to verify that:

- The `$root` service field is correctly added to each record
- The service field contains the original object reference
- The service fields can be properly excluded after transformation

Could we enhance some of these test cases to verify this core functionality of the PR? Perhaps comparing the records before and after applying `exclude_service_keys`?

🏁 Script executed:

```shell
#!/bin/bash
# Check for tests or implementations that handle $root service field
rg --type py "\$root" -A 3 -B 3 --no-heading
```

Length of output: 45
Action Required: Enhance tests to verify the `$root` service field functionality

The current test suite in unit_tests/sources/declarative/extractors/test_dpath_extractor.py covers many edge cases, but it doesn't verify that the new `$root` service field is correctly added, contains the original input data, or is properly excluded when `exclude_service_keys` is applied. Would you consider adding tests that:

- Verify each record has the `$root` service field when expected.
- Check that the `$root` field contains the original object reference.
- Compare the records before and after applying `exclude_service_keys` to ensure the service field is properly removed.

Wdyt?
🧹 Nitpick comments (5)
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)

40-40: Comment placement might cause confusion; should we adjust formatting?

The comment "The JSON contract is irregular, compare with JSONL" is placed at the end of the line rather than on a separate line above or below the test case. This is inconsistent with the comment style used on lines 127-128 and might make it harder to read. Would it make sense to move this comment to a new line for consistency and clarity?

```diff
-    ([], decoder_json, b"", [{}]),  # The JSON contract is irregular, compare with JSONL
+    # The JSON contract is irregular, compare with JSONL
+    ([], decoder_json, b"", [{}]),
```

unit_tests/sources/declarative/extractors/test_record_selector.py (1)
47-76: Good addition of nested record test cases

These new test cases extend coverage to include nested data structures, which aligns nicely with the PR's goal of enhancing filtering and transformation.

However, I'm curious if we should also explicitly verify that the `$root` service field is added to each record? That seems to be the primary objective mentioned in the PR. Perhaps we could add assertions to check for the presence and correctness of these service fields?

```diff
 def test_record_filter(test_name, field_path, filter_template, body, expected_data):
     # ... existing code ...
     actual_records = list(
         record_selector.select_records(
             response=response,
             records_schema=schema,
             stream_state=stream_state,
             stream_slice=stream_slice,
             next_page_token=next_page_token,
         )
     )
     assert actual_records == [
         Record(data=data, associated_slice=stream_slice, stream_name="")
         for data in expected_data
     ]
+
+    # For the nested record test cases, verify service fields
+    if test_name in [
+        "test_no_record_filter_returns_all_records_with_nested",
+        "test_true_record_filter_returns_all_records_with_nested",
+    ]:
+        for record in actual_records:
+            # Verify service fields are present
+            assert SERVICE_KEY_ROOT in record.data
+            # Verify root contains the original object
+            assert record.data[SERVICE_KEY_ROOT] == body
```

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
703-717: Is this a more functional version of DpathExtractor?

This new `DpathEnhancingExtractor` class appears to have the same structure as the existing `DpathExtractor` class. According to the PR description, this is meant to implement a feature to add service fields like `$root` to records during extraction.

Since both extractors have identical field structures, it might be worth clarifying in the documentation how specifically this enhances the standard extractor functionality, to make it clearer for future users and maintainers. Perhaps adding a brief note to highlight the service field functionality would be helpful, wdyt?
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
1995-2013: Implementation looks solid, but could we reduce duplication?

The factory method is correctly implemented and follows the same pattern as other factory methods, which is great for consistency. However, I notice it's nearly identical to the `create_dpath_extractor` method just above it.

Would it make sense to extract the common logic into a private helper method to avoid duplication? This would improve maintainability for future changes to both extractors. Wdyt?

```diff
-    def create_dpath_enhancing_extractor(
-        self,
-        model: DpathEnhancingExtractorModel,
-        config: Config,
-        decoder: Optional[Decoder] = None,
-        **kwargs: Any,
-    ) -> DpathEnhancingExtractor:
-        if decoder:
-            decoder_to_use = decoder
-        else:
-            decoder_to_use = JsonDecoder(parameters={})
-        model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path]
-        return DpathEnhancingExtractor(
-            decoder=decoder_to_use,
-            field_path=model_field_path,
-            config=config,
-            parameters=model.parameters or {},
-        )
+    def _prepare_dpath_extractor_args(
+        self,
+        model: Union[DpathExtractorModel, DpathEnhancingExtractorModel],
+        config: Config,
+        decoder: Optional[Decoder] = None,
+    ) -> Dict[str, Any]:
+        if decoder:
+            decoder_to_use = decoder
+        else:
+            decoder_to_use = JsonDecoder(parameters={})
+        model_field_path: List[Union[InterpolatedString, str]] = [x for x in model.field_path]
+        return {
+            "decoder": decoder_to_use,
+            "field_path": model_field_path,
+            "config": config,
+            "parameters": model.parameters or {},
+        }
+
+    def create_dpath_extractor(
+        self,
+        model: DpathExtractorModel,
+        config: Config,
+        decoder: Optional[Decoder] = None,
+        **kwargs: Any,
+    ) -> DpathExtractor:
+        args = self._prepare_dpath_extractor_args(model, config, decoder)
+        return DpathExtractor(**args)
+
+    def create_dpath_enhancing_extractor(
+        self,
+        model: DpathEnhancingExtractorModel,
+        config: Config,
+        decoder: Optional[Decoder] = None,
+        **kwargs: Any,
+    ) -> DpathEnhancingExtractor:
+        args = self._prepare_dpath_extractor_args(model, config, decoder)
+        return DpathEnhancingExtractor(**args)
```
1995-1995: Consider adding a docstring to explain the new extractor's purpose.

Adding a brief docstring explaining what the `DpathEnhancingExtractor` does and how it differs from the regular `DpathExtractor` would help future developers understand its purpose. Based on the PR description, it seems this extractor adds service fields like `$root` to the extracted records.

```diff
 def create_dpath_enhancing_extractor(
     self,
     model: DpathEnhancingExtractorModel,
     config: Config,
     decoder: Optional[Decoder] = None,
     **kwargs: Any,
 ) -> DpathEnhancingExtractor:
+    """
+    Creates a DpathEnhancingExtractor that extracts records while preserving the root object.
+
+    This extractor works similarly to DpathExtractor but also adds service fields to each record,
+    particularly a `$root` field containing a reference to the original object from which the
+    record was extracted. This enables more powerful filtering and transformation operations
+    that need access to data beyond the record itself.
+    """
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- `poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (10)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
- airbyte_cdk/sources/declarative/decoders/json_decoder.py (2 hunks)
- airbyte_cdk/sources/declarative/extractors/record_selector.py (3 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
- unit_tests/sources/declarative/decoders/test_json_decoder.py (1 hunks)
- unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (1 hunks)
- unit_tests/sources/declarative/extractors/test_dpath_extractor.py (4 hunks)
- unit_tests/sources/declarative/extractors/test_record_selector.py (2 hunks)
- unit_tests/sources/declarative/retrievers/test_simple_retriever.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- airbyte_cdk/sources/declarative/decoders/json_decoder.py
- unit_tests/sources/declarative/retrievers/test_simple_retriever.py
- airbyte_cdk/sources/declarative/extractors/record_selector.py
- unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py
🔇 Additional comments (14)
unit_tests/sources/declarative/decoders/test_json_decoder.py (6)
15-17: Parameter name change is more descriptive and aligns with implementation.

The renaming from `first_element` to `expected_json` makes the test clearer about what we're actually expecting from the decoder. The parameter name now better reflects that we're testing against the expected JSON output, not just the first element. This is a good improvement to the test clarity, wdyt?
18-20: Updated test expectations align with the new decoder behavior.

These changes to the expected outputs reflect the core functionality change where the decoder now returns a list of records instead of a single record. This is consistent with the PR objective of making service fields available for filtering and transformation, where each record in the list can maintain its own reference to the original object. The changes correctly handle:

- Empty object case: `[{}]` instead of `{}`
- Empty array case: `[]` instead of `{}`
- Regular JSON object case: `[{"healthcheck": {"status": "ok"}}]` instead of the object directly

The test cases now properly validate that the decoder always returns a list of elements.
23-23: Function parameter name updated to match the parameter definition.

The function parameter has been updated to use `expected_json` to be consistent with the parameter definition in the test parametrization. This maintains consistency throughout the test.

26-26: Assertion updated to use the new parameter name.

The assertion correctly references the renamed parameter `expected_json` instead of the old `first_element`.
32-33: Added test cases for edge conditions.

The addition of test cases for the empty object `("{}", [{}])` and updated empty list handling `("[]", [[]])` improves the test coverage. However, I'm curious about the empty list case: should it really be returning a nested list `[[]]` instead of just `[]`? This seems different from how the JSON decoder handles empty lists in the test above. Is this intentional behavior for the JSONL parser, or should it be consistent with the JSON decoder?
40-45: Test IDs updated to clearly describe test cases.

The test IDs have been updated to better describe what each test case is checking:
- "empty_object_response" for the empty object case
- "empty_list_response" for the empty list case
- Keeping "one_line_json" and "multi_line_json" for those respective cases
This improves the readability of test output and makes it clearer what functionality is being tested in each case.
unit_tests/sources/declarative/extractors/test_record_selector.py (1)
12-15: ❓ Verification inconclusive

New imports for service keys

I notice you've added imports for `SERVICE_KEY_PARENT` and `SERVICE_KEY_ROOT` from the dpath_enhancing_extractor module. These seem related to the service fields mentioned in the PR objectives, but I don't see them being explicitly used in the test cases. Are they used implicitly by the classes being tested, or are they intended for future use? A comment explaining their purpose might be helpful for future developers, wdyt?

🏁 Script executed:

```shell
#!/bin/bash
# Description: Check usage of SERVICE_KEY_PARENT and SERVICE_KEY_ROOT in the codebase
echo "Searching for usage of SERVICE_KEY_PARENT and SERVICE_KEY_ROOT in tests:"
rg "SERVICE_KEY_(PARENT|ROOT)" unit_tests/
```

Length of output: 821
Clarification Needed on Service Key Imports

I noticed that both `SERVICE_KEY_PARENT` and `SERVICE_KEY_ROOT` are imported in this test file, much like in test_dpath_enhancing_extractor.py where these keys are combined with `SERVICE_KEY_PREFIX`. However, their explicit use in test_record_selector.py isn't immediately clear. Are they intended to be used implicitly by the underlying classes or reserved for upcoming test enhancements? Perhaps adding an inline comment explaining their purpose could improve clarity for future developers. wdyt?

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
1672-1703: New Component Declaration: DpathEnhancingExtractor

You've added a new extractor component that clearly extends the built-in Dpath extraction functionality to include local service fields ($root and $parent) during transformation. The title, description, required fields, and property definitions are all consistent with existing components (e.g., DpathExtractor).
A couple of questions for you:
- Would you consider adding some documentation or inline examples regarding the expected types/contents of the $root and $parent service fields? Although these fields are added dynamically and later cleaned, a note here could help future maintainers understand that they are only available during filtering and transformation.
- Also, do you think it might be beneficial to explicitly flag (or document) that these service fields are protected against unintended transformation (for example, by the FlattenFields transformation) as described in the PR objectives?
Overall, the schema for DpathEnhancingExtractor is well organized and follows the established conventions. wdyt?
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3)
1779-1779: Looks good; union type properly updated.

You've correctly extended the `RecordSelector.extractor` field's union type to include the new `DpathEnhancingExtractor`, which means declarative configurations can now use this new extractor type.
703-717: Consider potential memory implications of storing root objects

Per the PR description, the implementation will add a `$root` service field that holds a reference to the original response object. This could potentially increase memory usage, especially with large API responses. I see you've noted that the implementation "avoids reusing maps or dictionaries to prevent cyclic structures", which is good. Have you also considered:

- Memory impact when processing large batches of records?
- Ensuring the service fields are properly cleaned after transformation as mentioned in the PR?
- Any edge cases with deeply nested objects?
703-717: ❓ Verification inconclusive

Verify that this schema change has a corresponding implementation

The file we're reviewing is a schema definition, but for this new extractor to work, there should be a corresponding implementation class elsewhere. Can you confirm that you've created the actual implementation of `DpathEnhancingExtractor` that adds the `$root` field to records?

🏁 Script executed:

```shell
#!/bin/bash
# Find the implementation file for DpathEnhancingExtractor
echo "Searching for DpathEnhancingExtractor implementation files..."
rg -l "class DpathEnhancingExtractor" | grep -v "declarative_component_schema.py"
echo "Checking imports of DpathEnhancingExtractor..."
rg "from .* import .*DpathEnhancingExtractor"
```

Length of output: 561
Implementation file located and imported; please confirm the $root logic?

I verified that an implementation of `DpathEnhancingExtractor` exists in airbyte_cdk/sources/declarative/extractors/dpath_enhancing_extractor.py and that it's properly imported in airbyte_cdk/__init__.py. Could you please confirm that the implementation in this file also includes the logic to add the `$root` field to records as intended? wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)
78-78: Looks like we're adding a new extractor type!

The addition of the `DpathEnhancingExtractor` import aligns well with the PR objective to add a service field (`$root`) to each record during extraction.
213-215: The model import looks good!

Properly importing the Pydantic model for the new extractor component.

582-582: Nice registration in the component mapping!

Adding the new extractor model to the `PYDANTIC_MODEL_TO_CONSTRUCTOR` dictionary ensures the factory will correctly instantiate it when encountered in a configuration.
```python
from airbyte_cdk.sources.declarative.extractors.record_extractor import (
    exclude_service_keys,
)
```
💡 Verification agent
🧩 Analysis chain
New import added without corresponding usage validation - should we check the functionality?
I see you've added an import for `exclude_service_keys`, but I don't notice any tests that verify how this function works with the extractor. Based on the PR objectives that mention adding a `$root` service field, shouldn't there be tests to verify:

- That service fields are properly added to records during extraction
- That `exclude_service_keys` correctly removes these service fields when needed

Would it make sense to add assertions that check for the presence of service fields in the extracted records, and then verify they can be properly excluded? wdyt?
🏁 Script executed:
#!/bin/bash
# Check if there are any tests for exclude_service_keys in the codebase
rg --type py "exclude_service_keys" -A 5 -B 5 --no-heading
Length of output: 5973
Action Required: Enhance Tests for Service Field Handling in dpath_extractor

Hey, I noticed that in unit_tests/sources/declarative/extractors/test_dpath_extractor.py there's an import for `exclude_service_keys` that isn't directly validated by any tests. We do have tests in test_record_extractor.py confirming that `exclude_service_keys` removes service fields as expected. However, given the PR's focus on introducing the `$root` service field, shouldn't we also add specific assertions in the dpath extractor tests to ensure that:

- Service fields are correctly added to records during extraction.
- `exclude_service_keys` properly removes these service fields when required.

Would you consider enhancing the tests in this file to cover these points, or do you plan to rely on the existing tests in test_record_extractor.py for this functionality? wdyt?
* main:
  - feat: add formatting for epoch_microseconds (airbytehq#388)
  - feat(low-code cdk): remove unnecessary manifest print (airbytehq#390)
Actionable comments posted: 1
🔭 Outside diff range comments (2)
unit_tests/sources/declarative/interpolation/test_macros.py (2)
1-190: ⚠️ Potential issue: Run ruff format to fix linting errors

The pipeline is failing with the following error:

[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.

Please run `ruff format` to automatically fix these style issues.

🧰 Tools
🪛 GitHub Actions: Linters
[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.
145-145: 💡 Verification agent, 🧩 Analysis chain

There's an existing test for timezone handling

I notice there's already a test at line 145 (`test_utc_datetime_to_local_timestamp_conversion`) that's specifically designed to ensure correct timezone handling. This test could serve as a model for how to handle the timezone issues in the parameterized tests. Perhaps we could modify the problematic tests to use a similar approach? wdyt?

🏁 Script executed:

```shell
#!/bin/bash
# Look for other timezone-related tests or utilities in the codebase
rg -i "timezone" --glob "*.py" . | grep -v test_macros.py
```

Length of output: 34489
Timezone Handling Consistency in Parameterized Tests

I noticed that our `test_utc_datetime_to_local_timestamp_conversion` (line 145 in test_macros.py) already demonstrates a solid approach to handling timezones. Given that many parts of the codebase follow similar patterns for timezone awareness, would you consider updating the parameterized tests to adopt this same technique for consistently addressing timezone issues? wdyt?
🧹 Nitpick comments (3)
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)

184-192: Should we enhance test assertions to validate service fields?

The current test function doesn't have any assertions specifically for validating the service fields mentioned in the PR (like `$root`), nor does it test the `exclude_service_keys` functionality. Would it make sense to enhance this test to either:

- Verify the service keys are present in the raw output before comparison, or
- Test that `exclude_service_keys` correctly removes these service fields?

A quick addition could look something like:

```diff
 response = create_response(body)
 actual_records = list(extractor.extract_records(response))
+
+# Verify service keys can be properly excluded
+filtered_records = [exclude_service_keys(record) for record in actual_records]
 assert actual_records == expected_records
```

What do you think?

unit_tests/sources/declarative/interpolation/test_macros.py (2)
unit_tests/sources/declarative/interpolation/test_macros.py (2)
78-78: Helpful comment about timezone dependency, but could we fix the test?

Good catch documenting that this test fails outside of the GMT timezone! This is valuable information for other developers. However, instead of just documenting the issue, we might want to make the test more robust.
Perhaps we could:
- Use timezone-aware assertions for these timestamp tests
- Mock the timezone to ensure consistent results
- Make the expected values dynamic based on the local timezone
What do you think about implementing one of these approaches instead? wdyt?
84-84: Same timezone issue as line 78; consider a comprehensive fix

This is the same timezone-dependency issue as the previous test case. Rather than documenting both failures, it would be better to implement a solution that works across all environments.
For the whole test suite, we could add a helper method that handles timezone differences, or patch datetime.timezone during tests to ensure consistent behavior. This would make the tests more reliable and eliminate the need for these comments.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
- unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (1 hunks)
- unit_tests/sources/declarative/extractors/test_dpath_extractor.py (4 hunks)
- unit_tests/sources/declarative/interpolation/test_macros.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
🧰 Additional context used
🪛 GitHub Actions: Linters
unit_tests/sources/declarative/interpolation/test_macros.py
[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
🔇 Additional comments (4)
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)
19-21: Unused import; should we add test verification for service keys?

I see you've imported `exclude_service_keys`, but I don't notice it being used in the test file. Based on the PR objectives about adding a `$root` service field, shouldn't we add assertions to verify that service fields are properly handled and can be excluded? This would address the previous review feedback as well.

```shell
#!/bin/bash
# Check how exclude_service_keys is used in other test files
rg --type py "exclude_service_keys" -A 2 -B 2 unit_tests/
```

unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (3)
1-24: Well-organized imports and constants for the new extractor test

Good job setting up the proper imports and constants for testing the new `DpathEnhancingExtractor`. The organization with clearly defined service key constants makes the test code more readable.

42-302: Comprehensive test cases for service fields

The test cases are thorough and cover a wide range of scenarios, effectively validating that both `$root` and `$parent` service fields are correctly added to extracted records in different JSON structures. This comprehensive approach will help ensure robustness of the feature.
33-34: ❓ Verification inconclusive

Unused decoder; is JsonlDecoder needed?

I notice that `JsonlDecoder` is imported on line 24, but neither a `decoder_jsonl` instance is created nor is it used in any of the test cases, unlike in the other test file. Is this intentional, or should there be tests using this decoder as well?

🏁 Script executed:

```shell
#!/bin/bash
# Check if JsonlDecoder is used elsewhere and how
rg --type py "JsonlDecoder" -A 2 -B 2
```

Length of output: 7189
Review Note: Unused JsonlDecoder Import in Test File

I noticed that in unit_tests/sources/declarative/extractors/test_dpath_enhancing_extractor.py (line 24), the `JsonlDecoder` is imported but not actually used in any test cases. Since we already cover `JsonlDecoder` functionality in unit_tests/sources/declarative/decoders/test_decoders_memory_usage.py, should we remove the unused import here to keep the file clean, or was this intended for potential future tests? wdyt?
```python
def test_dpath_extractor(field_path: List, decoder: Decoder, body, expected_records: List):
    extractor = DpathEnhancingExtractor(
        field_path=field_path, config=config, decoder=decoder, parameters=parameters
    )

    response = create_response(body)
    actual_records = list(extractor.extract_records(response))

    assert actual_records == expected_records
```
🛠️ Refactor suggestion
Consider adding tests for exclude_service_keys functionality
The test currently verifies that service fields are correctly added to records, but there's no verification that these service fields can be properly excluded using the exclude_service_keys
function. This was mentioned in the previous review comments.
Would it make sense to add additional assertions to verify this functionality? For example:
```diff
 response = create_response(body)
 actual_records = list(extractor.extract_records(response))
 assert actual_records == expected_records
+
+# Additionally verify service keys can be properly excluded
+from airbyte_cdk.sources.declarative.extractors.record_extractor import exclude_service_keys
+filtered_records = [exclude_service_keys(record) for record in actual_records]
+# Verify no service keys remain in filtered records
+for record in filtered_records:
+    assert all(not key.startswith(SERVICE_KEY_PREFIX) for key in record.keys())
```
What do you think?
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def test_dpath_extractor(field_path: List, decoder: Decoder, body, expected_records: List):
    extractor = DpathEnhancingExtractor(
        field_path=field_path, config=config, decoder=decoder, parameters=parameters
    )
    response = create_response(body)
    actual_records = list(extractor.extract_records(response))
    assert actual_records == expected_records

    # Additionally verify service keys can be properly excluded
    from airbyte_cdk.sources.declarative.extractors.record_extractor import exclude_service_keys

    filtered_records = [exclude_service_keys(record) for record in actual_records]
    # Verify no service keys remain in filtered records
    for record in filtered_records:
        assert all(not key.startswith(SERVICE_KEY_PREFIX) for key in record.keys())
```
Implement airbytehq/airbyte#50395 discussed in airbytehq/airbyte#49971
Implemented a new DpathEnhancingExtractor that navigates through a path of field names to reach the records to extract (exactly as the DpathExtractor does), and additionally adds service fields to each extracted record: `$root`, referencing the root response object, and `$parent`, referencing the record's parent object.
Published the DpathEnhancingExtractor to the Builder UI.
These service fields are accessible in the filter and transform steps, after which they are removed from the records.
This allows the filter and transformation steps to navigate to fields and structures that are left out of the selected records.
Examples:
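As an illustrative sketch of the idea (the `extract_with_root` helper, the `SERVICE_KEY_ROOT` constant, and the filter shown here are simplified stand-ins, not the CDK's actual API): each extracted record carries a reference to the root response object, so a filter can consult data that lies outside the selected records.

```python
# Hypothetical sketch: attach the root response object to each extracted
# record, then let a filter use it. Names are illustrative assumptions.
SERVICE_KEY_ROOT = "$root"


def extract_with_root(response_body: dict, field_path: list) -> list:
    """Navigate field_path to the records and attach the root object to each."""
    node = response_body
    for field in field_path:
        node = node[field]
    records = node if isinstance(node, list) else [node]
    # Build a fresh dict per record instead of mutating the originals,
    # so the response object never ends up containing itself (no cycles).
    return [{**record, SERVICE_KEY_ROOT: response_body} for record in records]


body = {"meta": {"region": "eu"}, "data": [{"id": 1}, {"id": 2}]}
records = extract_with_root(body, ["data"])

# A filter can now reach data outside the selected records:
kept = [r for r in records if r[SERVICE_KEY_ROOT]["meta"]["region"] == "eu"]
```

After filtering and transformation, the service field would be stripped so only the plain record fields reach the mapping and store steps.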
More service fields could be added in the future.
The service fields are available in the record's filtering and transform steps.
This design avoids reusing the maps/dictionaries produced (thus avoiding cyclic structures) and avoids transforming the service fields in the Flatten transformation.
Explicitly cleaning the service field(s) after the transform step makes them local to the filter and transform steps, not visible to the subsequent mapping and store steps, and not visible in tests beyond test_record_selector (as they should be). This also allows the record transformation logic to define its own "local variables" to reuse some interim calculations.
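The "local variables" idea can be sketched like this (the `$` prefix convention and the helper names here are assumptions for illustration, not the CDK's real implementation): a transformation stores an interim value in a service-prefixed field, reuses it, and the field is stripped before the record reaches the mapping and store steps.

```python
# Illustrative sketch of service-field "local variables"; names are assumptions.
SERVICE_KEY_PREFIX = "$"


def exclude_service_keys(record: dict) -> dict:
    """Drop every service-prefixed field from the record."""
    return {k: v for k, v in record.items() if not k.startswith(SERVICE_KEY_PREFIX)}


def transform(record: dict) -> dict:
    # "$net" is a local variable: computed once, reused twice below.
    record["$net"] = record["price"] - record["discount"]
    record["total"] = record["$net"] * record["quantity"]
    record["unit_margin"] = record["$net"] / record["price"]
    return record


final = exclude_service_keys(transform({"price": 10.0, "discount": 2.0, "quantity": 3}))
# "$net" never reaches the downstream mapping/store steps.
```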
The contract of body parsing is irregular in how it represents the cases of bad JSON, no JSON, and empty JSON.
It cannot be unified, because that irregularity is already relied upon.
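For reference, the JSON-decoding contract exercised by the updated decoder tests can be sketched as follows (a simplified stand-in, not the CDK's actual decoder): an empty body yields `[{}]`, a JSON object yields a one-element list, and a JSON array yields its elements.

```python
import json


def decode_json_body(body: bytes) -> list:
    """Simplified sketch of the decoder contract seen in the tests:
    empty body -> [{}], JSON object -> [object], JSON array -> its elements."""
    if not body:
        return [{}]  # the irregular case: no JSON still yields one empty record
    parsed = json.loads(body)
    return parsed if isinstance(parsed, list) else [parsed]
```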
Updated the development environment setup documentation to organize and present the setup steps explicitly and to avoid misunderstandings and wasted effort.
Updated CONTRIBUTING.md to collect and organize the knowledge on running the tests locally, state the actual testing steps, and make the procedures explicit.
The unit, integration, and acceptance tests in this exact version fail due to pipeline configuration issues unrelated to the changes themselves.
Summary by CodeRabbit

- New Features
  - `DpathEnhancingExtractor` for improved data extraction capabilities.
- Bug Fixes / Improvements
- Documentation
- Tests
  - Tests for the new `DpathEnhancingExtractor`.
- Chores