fix: (CDK) (AsyncRetriever) - Use the Nested Decoders to decode the streaming responses, instead of ResponseToFileExtractor #378
Conversation
📝 Walkthrough

The changes remove a placeholder comment in the file extractor that referred to planned JSON support. Additionally, the

Changes
Possibly related PRs
Suggested labels
Suggested reviewers
Wdyt about adding more reviewers to ensure a broader perspective on the changes?

✨ Finishing Touches
Actionable comments posted: 0
🧹 Nitpick comments (14)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

705-708: Consider adding field validation for `file_type`

The new `file_type` field looks good and provides the necessary flexibility to support both CSV and JSONL formats. I noticed it currently accepts any string though - would it make sense to restrict the values to just "csv" and "jsonl" using an Enum? This could prevent configuration errors and make the API more robust. Also, there's a small typo in the title: "storred" should be "stored".

```diff
-    file_type: Optional[str] = Field(
-        "csv",
-        title="The file type in which the response data is storred. Supported types are [csv, jsonl].",
-    )
+    class FileType(str, Enum):
+        CSV = "csv"
+        JSONL = "jsonl"
+
+    file_type: Optional[FileType] = Field(
+        FileType.CSV,
+        title="The file type in which the response data is stored. Supported types are [csv, jsonl].",
+    )
```

What do you think? This would make the allowed values explicit in the schema.
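To illustrate the Enum suggestion in isolation, here is a minimal stdlib-only sketch of how a `str`-backed Enum makes the allowed values explicit (the real model uses pydantic's `Field`; `validate_file_type` is a hypothetical helper, not CDK code):

```python
from enum import Enum


class FileType(str, Enum):
    CSV = "csv"
    JSONL = "jsonl"


def validate_file_type(value: str) -> FileType:
    # Enum lookup raises ValueError for anything outside {"csv", "jsonl"}
    try:
        return FileType(value)
    except ValueError:
        raise ValueError(
            f"Unsupported file_type {value!r}; expected one of {[m.value for m in FileType]}"
        )
```

With pydantic, declaring the field as `Optional[FileType]` would give the same rejection behavior for free during model construction.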
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

1681-1685: Suggestion: Fix Typo and Enhance Validation for `file_type`

I noticed a minor typo in the title ("storred" should be "stored") and would suggest adding an explicit enum constraint to ensure only the supported file types (`csv` and `jsonl`) are accepted. Would you be open to these adjustments, wdyt?

```diff
-      file_type:
-        title: The file type in which the response data is storred. Supported types are [csv, jsonl].
-        type: string
-        default: csv
+      file_type:
+        title: The file type in which the response data is stored. Supported types are [csv, jsonl].
+        type: string
+        enum: [csv, jsonl]
+        default: csv
```

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
1995-1997: Consider using the `FileTypes` enum for defaulting to CSV.

Currently, you default to `"csv"` via `model.file_type or "csv"`. Would you like to use `FileTypes.CSV.value` here for consistency with the new enum, or is the current approach already meeting your needs? wdyt?

airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py (11)
24-26: The `FileTypes` enum is well introduced.

Would you consider expanding it to handle other file formats in the future (e.g., Parquet)? wdyt?
32-43: Documentation is thorough.

It's great that you mention memory constraints, CSV streaming, and so on. Maybe add a tiny note to highlight new JSONL support in the docstring? wdyt?
47-47: Defaulting `file_type` to CSV is consistent.

Using the enum's default value (like `FileTypes.CSV.value`) might be safer. Would that help prevent typos if changed later? wdyt?
52-74: Docstring for `extract_records` is clear.

Would you consider adding a brief usage example to illustrate how to call `extract_records` with a JSONL file? wdyt?
118-135: Temporary file path generation.

Would you consider using Python's `tempfile` module (e.g., `tempfile.NamedTemporaryFile`) for more secure and robust temp file handling? wdyt?
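As a sketch of the `tempfile` suggestion (the helper name and signature are hypothetical, not the extractor's actual API):

```python
import os
import tempfile
from typing import Iterable


def write_response_to_temp_file(chunks: Iterable[bytes]) -> str:
    # delete=False so the file survives the context manager; the caller
    # is responsible for removing it once the records have been read.
    with tempfile.NamedTemporaryFile(mode="wb", suffix=".jsonl", delete=False) as tmp:
        for chunk in chunks:
            tmp.write(chunk)
        return tmp.name
```

`NamedTemporaryFile` picks a unique, permission-restricted path for you, which avoids both collisions and predictable-filename issues that hand-rolled path generation can introduce.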
175-205: Chunk reading from CSV.

Your approach with `pd.read_csv` in chunks is great for memory usage. Would you consider allowing users to configure `chunk_size` more dynamically? wdyt?
206-233: JSON lines chunk reading.

This mirrors the CSV strategy nicely. Maybe unify the chunk size handling or document it consistently for both CSV and JSONL? wdyt?
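For reference, the chunked-JSONL idea can be sketched with the stdlib alone (the extractor itself reads via pandas; the helper name and default `chunk_size` here are illustrative):

```python
import json
from itertools import islice
from typing import Any, Dict, Iterator, List


def read_jsonl_in_chunks(path: str, chunk_size: int = 2) -> Iterator[List[Dict[str, Any]]]:
    # Iterate the file lazily so only `chunk_size` records are in memory at once,
    # mirroring the chunked pd.read_csv strategy used for CSV files.
    with open(path, "r", encoding="utf-8") as fp:
        while True:
            chunk = [json.loads(line) for line in islice(fp, chunk_size)]
            if not chunk:
                return
            yield chunk
```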
235-253: Conditional approach to file types.

You might consider a more direct dispatch using the `FileTypes` enum instead of if-else checks. That could simplify or scale better if file types expand. wdyt?
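The if-else to dispatch-table idea, sketched with hypothetical reader callables standing in for the extractor's CSV/JSONL chunk readers:

```python
from enum import Enum
from typing import Callable, Dict


class FileTypes(str, Enum):
    CSV = "csv"
    JSONL = "jsonl"


# Placeholder readers; the real ones would yield record chunks from the file.
def _read_csv(path: str) -> str:
    return f"csv:{path}"


def _read_jsonl(path: str) -> str:
    return f"jsonl:{path}"


_READERS: Dict[FileTypes, Callable[[str], str]] = {
    FileTypes.CSV: _read_csv,
    FileTypes.JSONL: _read_jsonl,
}


def read_file(path: str, file_type: FileTypes) -> str:
    # Adding a new format becomes a one-line dict entry instead of a new branch.
    try:
        return _READERS[file_type](path)
    except KeyError:
        raise ValueError(f"Unsupported file type: {file_type}")
```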
256-258: Conditional for CSV is straightforward.

Are you comfortable returning an error if a non-enum file type is passed here, or did you want additional fallback logic? wdyt?
263-264: Switching to `logger.info`.

Previously, some might log empty data as a warning. Are you sure this info-level is enough to alert you in debugging? wdyt?
267-268: Typographical fix for 'occured'.

Would you like to correct "occured" to "occurred"? wdyt?

```diff
- message = "ResponseToFileExtractor._read_with_chunks(): The IO/Error occured while reading the data from file."
+ message = "ResponseToFileExtractor._read_with_chunks(): The IO/Error occurred while reading the data from file."
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1 hunks)
- airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py (6 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Pytest (Fast)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py (4)
10-10: Import of Enum looks good!

No issues spotted. The new enum usage is appropriate and improves code clarity.
153-154: Use of `_get_file_path()` is correct.

No issues spotted here; good job centralizing filename generation logic.
155-156: Snippet for writing to file is fine.

The context management and response closing follow best practices.
259-261: JSONL branch looks good.

No further concerns – the logic is consistent for reading JSONL.
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (outdated; resolved)
I just want to make sure we don't introduce something in the declarative language that we would need to revert, thereby creating a breaking change.
…port-to-response-to-file-extractor
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1)

53-59: Documentation improvement looks good. Thoughts on adding more context?

The updated documentation clearly explains the behavior when non-gzipped data is encountered. This matches the implementation and helps future developers understand the fallback mechanism.
What do you think about adding a brief note on why this fallback behavior is important for file extraction, particularly with respect to the new JSONL support mentioned in the PR? wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1 hunks)
- airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py (0 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (6 hunks)
💤 Files with no reviewable changes (1)
- airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2)
38-46: Clear and useful helper method to improve error handling. Nice addition!

The new `_reset_reader_pointer` method is well-documented and addresses a specific need: resetting the reader position after failed gzip decompression. This is essential for the improved error handling in the `parse` method.
61-67: Excellent error handling implementation!

This is a robust approach to handling potentially non-gzipped data:
- Attempts gzip decompression first
- Falls back gracefully on error
- Resets the pointer before sending to inner parser
- Logs an appropriate warning
This will make the parser more resilient when processing various input formats and prevent unnecessary failures when data isn't actually gzipped.
Could we consider adding a unit test that specifically verifies this fallback mechanism works with both gzipped and non-gzipped data? wdyt?
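That fallback can be exercised in isolation with the stdlib; a sketch (the function name is hypothetical, not the CDK's) that also doubles as the kind of unit test suggested above:

```python
import gzip
import io


def decompress_or_passthrough(data: bytes) -> bytes:
    """Try gzip decompression first; on failure, reset the pointer and pass through."""
    buffer = io.BytesIO(data)
    try:
        return gzip.GzipFile(fileobj=buffer).read()
    except (OSError, EOFError):
        # Reading consumed some bytes before failing, so rewind before
        # handing the raw stream to the inner parser.
        buffer.seek(0)
        return buffer.read()
```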
changed the title: "… `JsonL` output file-type support" → "… `Nested Decoders` to decode the streaming responses, instead of ResponseToFileExtractor"
A couple of concerns that I'm curious to have your input on
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (outdated; resolved)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (outdated; resolved)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (outdated; resolved)
…port-to-response-to-file-extractor
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1)
70-76: Remove debug print statement

There's a debugging print statement that should be removed before production.

```diff
- print(f"\n\nHERE\n\n")
```

This statement could clutter logs in production, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (7 hunks)
- airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (3 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (6 hunks)
- unit_tests/sources/declarative/decoders/test_composite_decoder.py (3 hunks)
- unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Analyze (python)
🔇 Additional comments (10)
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1)
46-51: Test now properly simulates compression headers

The addition of the `Content-Encoding` header is a good improvement to test the behavior of `ZipfileDecoder` with compressed responses. This aligns well with the new `is_compressed` method in the implementation.

airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (3)
16-16: Good import consolidation

Importing both `COMPRESSION_TYPES` and `Parser` from the same module is a clean approach, making the dependencies more explicit.
29-34: Well-documented `is_compressed` method

The new method clearly checks if the response is compressed based on the Content-Encoding header. The documentation explains the purpose well.
44-47: Pass compression status to parser

The update to pass the compression status to the parser's `parse` method ensures consistent handling of compressed data. This is a good implementation that aligns with changes in `CompositeRawDecoder`.

unit_tests/sources/declarative/decoders/test_composite_decoder.py (3)
71-72: Added Content-Encoding header for better test coverage

Good addition of the `Content-Encoding` header to test gzip parsing with the proper headers, which makes the test more realistic.
109-124: Enhanced test coverage for gzip parsing with JSON lines

The updated test properly verifies that the `GzipParser` with `JsonLineParser` can handle gzipped content with appropriate headers.
126-144: Added test for non-compressed data handling

Excellent addition of a test case that verifies the `GzipParser` can handle non-compressed data. This ensures the decoder is robust when encountering various response types.

airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (3)
25-34: Good list of compression types

Comprehensive list of compression types to check against. This improves the decoder's ability to handle various compression formats.
190-195: Well-implemented compression detection

The `is_compressed` method is clear and concise, checking the `Content-Encoding` header against the predefined compression types.
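A minimal sketch of that header-based detection (the compression-type list here is an illustrative subset, not the CDK's exact `COMPRESSION_TYPES`):

```python
from typing import Mapping

# Illustrative subset; the real COMPRESSION_TYPES list covers more variants.
COMPRESSION_TYPES = ["gzip", "x-gzip", "deflate", "br", "zstd"]


def is_compressed(headers: Mapping[str, str]) -> bool:
    # The Content-Encoding header decides whether the payload needs
    # decompression before it reaches the inner parser.
    return headers.get("Content-Encoding", "") in COMPRESSION_TYPES
```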
206-206: Pass compression status to parser

Good implementation of passing the compression status to the parser's `parse` method. This ensures consistent handling of compressed data.
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (outdated; resolved)
Actionable comments posted: 0
🧹 Nitpick comments (3)
airbyte_cdk/sources/declarative/decoders/decoder_parser.py (3)
1-13: Imports and logger setup look good!

The imports are comprehensive and the logger setup is appropriately named. One small suggestion - would adding a module-level docstring help explain the purpose of this new file in the decoder ecosystem? Something brief that explains how it relates to the nested decoders approach mentioned in the PR objectives? wdyt?
18-26: Parser abstract base class is well-structured.

The abstract base class with the `parse` method provides a good foundation for different parser implementations. The docstring is concise but informative.

Could we enhance the docstring to provide more details about expected input data formats and the structure of the yielded dictionaries? This would help implementers understand their responsibilities better. wdyt?
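The described base class could look roughly like this - a sketch reconstructed from the review's description, not the file's verbatim contents:

```python
import io
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, Generator

# Output shape shared by all parser implementations.
PARSER_OUTPUT_TYPE = Generator[Dict[str, Any], None, None]


class Parser(ABC):
    @abstractmethod
    def parse(self, data: io.BufferedIOBase) -> PARSER_OUTPUT_TYPE:
        """Parse the raw byte stream and yield one dict per record."""


class JsonLineParser(Parser):
    def parse(self, data: io.BufferedIOBase) -> PARSER_OUTPUT_TYPE:
        # One JSON document per line; skip blank lines.
        for line in data:
            if line.strip():
                yield json.loads(line)
```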
28-30: Consider adding more documentation for complex type definitions.

These type aliases define important structures but their purpose and composition might not be immediately clear to other developers.

Would it help to add comments explaining:

- What the three elements in the `PARSERS_TYPE` tuple represent (looks like content types, file extensions, and parser instances?)
- How `PARSERS_BY_HEADER_TYPE` is used to select parsers based on headers?

This would make the code more maintainable as the parser ecosystem grows. wdyt?
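Following the reviewer's reading of the tuple, the commented aliases might look like this (an interpretation, not the actual source):

```python
from typing import Any, Dict, List, Tuple

# Each entry: (matching content types, matching file extensions, parser instance).
PARSERS_TYPE = List[Tuple[List[str], List[str], Any]]

# Maps a header name to {header value -> parser}, so a parser can be
# selected from e.g. the Content-Type of a response.
PARSERS_BY_HEADER_TYPE = Dict[str, Dict[str, Any]]
```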
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (5 hunks)
- airbyte_cdk/sources/declarative/decoders/decoder.py (2 hunks)
- airbyte_cdk/sources/declarative/decoders/decoder_parser.py (1 hunks)
- airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (2 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
- unit_tests/sources/declarative/decoders/test_composite_decoder.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- airbyte_cdk/sources/declarative/decoders/decoder.py
- airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py
- unit_tests/sources/declarative/decoders/test_composite_decoder.py
- airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/decoders/decoder_parser.py (1)
15-15: Type alias for parser output looks good.

This type alias clearly defines the expected output format for parsers - a generator yielding dictionaries. It's well-defined and will help with type checking.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)
2196-2202: Changed from static to instance method, aligning with `create_csv_decoder` pattern.

The `create_jsonl_decoder` method has been modified from a static method to an instance method by adding the `self` parameter. This change enables access to instance variables such as `self._emit_connector_builder_messages`, allowing the method to configure the `stream_response` parameter consistently with the `create_csv_decoder` method.
2204-2230: Improved GzipDecoder to handle both compressed and uncompressed data.

The enhanced implementation now intelligently processes responses based on content encoding headers, with a fallback mechanism for uncompressed data. This aligns with the PR objectives of using nested decoders for better stream handling.
I especially like the detailed comments explaining the behavior with different response streaming approaches.
2897-2902: Method parameter formatting updated for consistency.

Parameters passed to `download_target_extractor` are now formatted with line breaks between each parameter, which improves readability and maintains consistency with other method calls in the file.
2778-2782: Method parameter formatting improved for readability.

Parameters for `model.download_paginator` are now consistently formatted with one parameter per line, which improves code readability and maintainability.
If CI passes, I'm all in on this change. Thanks for diving in the requests lib shenanigans with me ❤️
What

Resolving:

How

- Updated the `decoder` notation in `model_to_component_factory.py` to distinguish between the decoder used where the `ResponseToFileExtractor` support is terminated. This is the deprecated component that should be replaced by the "nested decoders" in the future. Please refer to the example below for more information.
- The `download_extractor` should be supplied as a part of the manifest only when we have to deal with the field pointer to the data (basically when the data is nested within the `response`), the example is here

Side-Fix:

- The `_join_url` method.
method.User Impact
No impact is expected. This is not a breaking change.
Before we supported the CSV-only:
After:
For JSON-Lines:
For CSV
For Gzipped JSON-Lines:
For Gzipped CSV:
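The manifest examples above were elided from this page; as an illustration of the nested-decoder shape (field and type names are assumptions and may differ from the current declarative_component_schema.yaml), a gzipped JSON-Lines download decoder might be declared as:

```yaml
# Hypothetical AsyncRetriever fragment - a GzipDecoder wrapping an inner
# JsonlDecoder, instead of routing through ResponseToFileExtractor.
download_decoder:
  type: GzipDecoder
  decoder:
    type: JsonlDecoder
```

Swapping the inner `JsonlDecoder` for a `CsvDecoder` would cover the gzipped-CSV case, and dropping the outer `GzipDecoder` covers the uncompressed variants.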
Summary by CodeRabbit
New Features
Refactor
Bug Fixes