
fix: prevent raw decoder from closing before sync completion #366

Merged
merged 12 commits into from
Mar 6, 2025

Conversation

ChristoGrab
Collaborator

@ChristoGrab ChristoGrab commented Feb 25, 2025

What

We've run into an issue with the CsvDecoder closing files prematurely during syncs, leading to crashes when syncing and/or setting up sources:

ValueError: I/O operation on closed file

The problem seems to be that the connection is automatically closed as the last chunk is processed, preventing the sync from completing gracefully. By explicitly closing the connection AFTER we have yielded the data, we ensure the connection remains open until all data has been parsed and read.

Note that the failure is not specific to the csv parser: pandas fails with the same error, which suggests the issue lies in how urllib3 handles closing the underlying IO rather than in Python's csv standard library.

Note

From the urllib3 docs:
Sometimes you want to use io.TextIOWrapper or similar objects like a CSV reader directly with HTTPResponse data. Making these two interfaces play nice together requires using the auto_close attribute by setting it to False. By default HTTP responses are closed after reading all bytes, this disables that behavior:
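The pattern described above can be sketched with a stand-in stream (FakeRawStream below is a hypothetical mock, not urllib3's class; real HTTPResponse objects default to auto_close=True). The key is the ordering the fix introduces: disable auto-closing before parsing, yield all rows, then close explicitly, with a try/finally so cleanup also happens on error:

```python
import csv
import io

class FakeRawStream(io.BytesIO):
    """Illustrative stand-in for urllib3's HTTPResponse; real responses
    default to auto_close=True and close once all bytes are read."""
    auto_close = True

def decode(raw):
    # The fix: keep the stream open while rows are still being yielded,
    # then close it explicitly once parsing is done.
    raw.auto_close = False
    try:
        yield from csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8"))
    finally:
        raw.close()

raw = FakeRawStream(b"col1,col2\nval1,val2")
rows = list(decode(raw))
print(rows)        # [{'col1': 'val1', 'col2': 'val2'}]
print(raw.closed)  # True
```

This is only a sketch of the ordering; the CDK's actual decode method lives in CompositeRawDecoder and operates on response.raw from requests.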

Summary by CodeRabbit

  • Bug Fixes

    • Improved streamed response handling to ensure that the data stream remains available during processing and is properly closed afterward, enhancing overall stability and resource management.
  • New Features

    • Introduced a new test server class for testing CSV parsing capabilities of the CompositeRawDecoder.
    • Added a new test function to validate the CSV parsing functionality in a controlled environment.
    • Added a utility function to find available ports for testing purposes.

@ChristoGrab ChristoGrab marked this pull request as ready for review February 25, 2025 19:09
@ChristoGrab ChristoGrab changed the title refactor: prevent raw decoder from closing before sync completion fix: prevent raw decoder from closing before sync completion Feb 25, 2025
Contributor

coderabbitai bot commented Feb 25, 2025

📝 Walkthrough

This pull request updates the decode method in the CompositeRawDecoder class to enhance resource management for streaming responses. It prevents the automatic closing of the raw response stream before parsing and ensures the stream is explicitly closed after parsing is complete. Additionally, it introduces new testing functionality to validate CSV parsing using a controlled server environment.

Changes

File Change Summary
airbyte_cdk/.../composite_raw_decoder.py Updated the decode method to set response.raw.auto_close to False before parsing and explicitly call response.raw.close() after parsing.
unit_tests/.../test_composite_decoder.py Added find_available_port function, TestServer class, and test_composite_raw_decoder_csv_parser_without_mocked_response function for CSV parsing validation.

Possibly related PRs

  • feat: add Composite Raw Decoder #179: The changes in the main PR, which modify the decode method of the CompositeRawDecoder class, are directly related to the functionality introduced in the retrieved PR that adds the CompositeRawDecoder and its associated parsers.
  • fix: update csv parser for builder compatibility #364: The enhancements to the decode method are related to modifications in the parse method of the CsvParser class, as both involve handling the response stream and parsing logic within the same decoder framework.

Suggested reviewers

  • bazarnov
  • artem1205

Wdyt about adding any additional details or clarifications for the reviewers?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1)

154-156: Good improvement to resource management!

This change properly addresses the premature closing issue by disabling auto_close and explicitly closing the response.raw stream after all parsing is complete. This should prevent the ValueError about I/O operations on closed files during sync operations.

Have you considered adding a try-finally block to ensure the raw stream always gets closed even if parsing raises an exception? Something like:

if self.is_stream_response():
    response.raw.auto_close = False
-    yield from self.parser.parse(data=response.raw)  # type: ignore[arg-type]
-    response.raw.close()
+    try:
+        yield from self.parser.parse(data=response.raw)  # type: ignore[arg-type]
+    finally:
+        response.raw.close()

This would ensure proper cleanup even in error scenarios. Wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4dbb6fe and 92c1914.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Check: 'source-shopify' (skip=false)

@maxi297 maxi297 requested a review from artem1205 March 4, 2025 20:32
@maxi297
Contributor

maxi297 commented Mar 4, 2025

@maxi297 to add a test on that

@github-actions github-actions bot added bug Something isn't working security labels Mar 4, 2025
@maxi297
Contributor

maxi297 commented Mar 4, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

1-258: ⚠️ Potential issue

Fix Ruff formatting issues in this file.

The pipeline is reporting Ruff formatting issues that need to be addressed:

[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.

Run ruff format on this file to fix the formatting issues and ensure the CI pipeline passes.

🧰 Tools
🪛 GitHub Actions: Linters

[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.

🧹 Nitpick comments (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

215-226: Consider making the test more robust against port conflicts, wdyt?

The test correctly sets up a real HTTP server to verify the decoder doesn't prematurely close the response stream. However, using a hardcoded port (8080) might cause test failures if that port is already in use on the CI system or developer machine.

Could you consider using port 0 to let the OS assign an available port, then retrieve the assigned port?

-    httpd = HTTPServer(('localhost', 8080), TestServer)
+    httpd = HTTPServer(('localhost', 0), TestServer)
+    port = httpd.server_port
     thread = Thread(target=httpd.serve_forever, args = ())
     thread.start()
 
-    response = requests.get("http://localhost:8080", stream=True)
+    response = requests.get(f"http://localhost:{port}", stream=True)
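The port-0 idea suggested above can be sketched end to end with only the standard library (the PR's actual test uses requests and a TestServer handler class; CsvHandler here is a hypothetical equivalent serving the same single-row CSV):

```python
import csv
import io
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class CsvHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the PR's TestServer: serves one CSV row."""

    def do_GET(self):
        body = b"col1,col2\nval1,val2"
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet

# Port 0 lets the OS pick a free port; server_port reports the choice,
# so no hardcoded port can collide with another process.
httpd = HTTPServer(("localhost", 0), CsvHandler)
port = httpd.server_port
thread = threading.Thread(target=httpd.serve_forever)
thread.start()
try:
    with urllib.request.urlopen(f"http://localhost:{port}") as resp:
        rows = list(csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8")))
finally:
    httpd.shutdown()
    httpd.server_close()
    thread.join(timeout=5)

print(rows)  # [{'col1': 'val1', 'col2': 'val2'}]
```

Because the server itself holds the port for the lifetime of the test, this approach avoids the race condition that a separate find-a-free-port helper can introduce.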
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 92c1914 and 06ff322.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
unit_tests/sources/declarative/decoders/test_composite_decoder.py

[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.

⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-amplitude' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (3)

7-9: New imports support the HTTP server test implementation, looks good.

The addition of BaseHTTPRequestHandler, HTTPServer, and Thread imports supports the new test case that verifies the fix for premature raw decoder closing.


207-213: Simple test server implementation that serves CSV data, looks good!

This TestServer class provides a minimal HTTP server that returns a simple CSV response. It's a good approach for testing the decoder with a real HTTP response stream rather than just using mocks.


215-226: This test directly addresses the PR objective - good work!

This new test case directly verifies the fix for the issue described in the PR: "prevent raw decoder from closing before sync completion". By setting up a real HTTP server and response stream, it tests the actual behavior of the CompositeRawDecoder with CSV data in a way that would reveal if the file is closed prematurely.

The test is simple but focused on the specific issue, which is exactly what was needed based on the reviewer's comment requesting a test.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (2)

214-228: Good test for reproducing the closed file issue

This test does a great job at simulating the real-world scenario where ValueError: I/O operation on closed file was occurring. Using a real HTTP server instead of mocks is an excellent approach to catch the actual behavior.

A couple of suggestions to make this even more robust:

  • Would it be helpful to use a random port instead of hardcoded 8080 to avoid potential conflicts during testing? httpd = HTTPServer(('localhost', 0), TestServer) and then get the port with port = httpd.server_port
  • Maybe add an assertion on the actual content of the parsed result to ensure data integrity? Something like: assert result[0] == {'col1': 'val1', 'col2': 'val2'}

What do you think? Particularly for CI environments where port conflicts might arise.


227-228: Clean server shutdown implementation

Good job on the proper cleanup with both httpd.shutdown() and thread.join(). This ensures the test doesn't leave hanging threads or ports in use.

You might consider adding a timeout to the thread.join() to prevent tests from hanging if the server doesn't shut down properly, wdyt?

- thread.join()
+ thread.join(timeout=5)  # Add timeout to prevent hanging tests
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06ff322 and f276a6e.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-amplitude' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (4)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (4)

207-212: Well-structured test server implementation

Nice implementation of the TestServer class. It's simple and focused on just what's needed to test the CSV parsing functionality.


215-217: Clear docstring explaining the test's purpose

Great job clearly documenting the issue this test is addressing. The docstring effectively explains why this test is needed beyond the existing mocked tests.


219-221: Thread management looks good

The server setup and thread management approach is solid. You're properly creating the server and starting it in a background thread.


223-226: Core test functionality looks good

This is testing the key functionality - ensuring that the CSV parser can properly process a streamed response without encountering closed file errors.

@artem1205
Contributor

We've run into an issue with the CsvDecoder closing files prematurely during syncs

during sync or during the test_read?

I can imagine that when we use test_read and iterate by records (using next() we get the first N records and then wait), our connection is still alive and may be closed by the server.

@ChristoGrab, could you share a link to this error occurring during a sync read?

@artem1205
Contributor

UPD:

Adding \n to Maxime's test allows the parser to work even without closing the response:

  • csv.DictReader relies on newline characters (\n) to delimit rows.
  • If the last row in the CSV data doesn't end with a newline, csv.DictReader might attempt to read past the end of the data, leading to errors.
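As a quick sanity check of the newline point: for plain in-memory buffers, csv.DictReader tolerates a missing trailing newline, so this minimal check parses both variants identically (it does not reproduce the streaming failure itself, which per this comment only surfaces when reading from a live response stream):

```python
import csv
import io

# Same CSV content, with and without a trailing newline.
with_nl = list(csv.DictReader(io.StringIO("col1,col2\nval1,val2\n")))
without_nl = list(csv.DictReader(io.StringIO("col1,col2\nval1,val2")))

print(with_nl == without_nl)  # True
print(without_nl)             # [{'col1': 'val1', 'col2': 'val2'}]
```

This supports the observation that the row-delimiting behavior interacts with how the underlying stream signals end-of-data, rather than being a parsing bug in csv itself.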

Contributor

@artem1205 artem1205 left a comment


LGTM!
Approving to unblock; we might need to introduce another buffer in the CSV parser to handle and check for \n at the end of the line.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

1-267: 💡 Verification agent

🧩 Analysis chain

Fix formatting issues flagged by the pipeline.

The pipeline reports that Ruff formatting check failed. Please run 'ruff format' to fix the code style issues in this file.


🏁 Script executed:

#!/bin/bash
# Run formatter on the file to see what would change
ruff format --diff unit_tests/sources/declarative/decoders/test_composite_decoder.py

Length of output: 837


Action: Please run ruff format to adjust the file's formatting.

The diff from the formatter indicates that one file (specifically, around line 222 in unit_tests/sources/declarative/decoders/test_composite_decoder.py) would be modified by removing an extra newline. Could you run ruff format on this file to fix the style issues as flagged by the pipeline? wdyt?

🧰 Tools
🪛 GitHub Actions: Linters

[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Please run 'ruff format' to fix code style issues in this file.

🧹 Nitpick comments (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

207-212: Simple server implementation looks good, but could potentially be enhanced.

The TestServer class implements a minimal HTTP server that returns a CSV string without a trailing newline character, which appears intentional based on the test's docstring. This is a good approach for reproducing the specific issue.

Would you consider adding some basic error handling or logging to make debugging easier if the test fails? Just a thought - this is fine as is for the specific use case.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f276a6e and 5364059.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1 hunks)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py
🧰 Additional context used
🪛 GitHub Actions: Linters
unit_tests/sources/declarative/decoders/test_composite_decoder.py

[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Please run 'ruff format' to fix code style issues in this file.

⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-amplitude' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Analyze (python)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: preview_docs
🔇 Additional comments (3)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (3)

7-9: Appropriate imports for the new test functionality.

The added imports for HTTP server functionality and threading are well-aligned with the new test requirements. These are necessary to create a real HTTP server for testing the CSV parser without mocked responses.


214-224: Well-documented test case explaining the issue being fixed.

Excellent documentation explaining the rationale for this test and the specific conditions needed to reproduce the issue. This context is very valuable for future maintainers.

The commented-out CSV content is useful for reference, though I wonder if we might want to try reproducing with that exact content in a future test?


227-229: Consider using a dynamic port instead of hardcoded 8080.

Using port 8080 hardcoded could cause test failures if that port is already in use by another process. This could lead to intermittent CI failures that are hard to debug.

Would you consider using a dynamic port allocation approach instead? Maybe something like:

with socket.socket() as s:
    s.bind(('localhost', 0))
    port = s.getsockname()[1]
httpd = HTTPServer(("localhost", port), TestServer)

This would help prevent port conflicts, wdyt?

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@maxi297
Contributor

maxi297 commented Mar 5, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (2)

231-234: Consider adding more assertions to validate CSV parsing.

The test currently only verifies that the result has a length of 1, but it doesn't check the actual content of the parsed CSV.

Would it be helpful to add more specific assertions to verify the parsed data structure? Perhaps:

  result = list(CompositeRawDecoder(parser=CsvParser()).decode(response))

  assert len(result) == 1
+ assert result[0] == {'col1': 'val1', 'col2': 'val2'}

This would ensure not just that some data was parsed, but that it was parsed correctly. wdyt?


225-237:

❓ Verification inconclusive

Consider adding a port finder to avoid conflicts.

Using a hardcoded port (8080) might lead to test failures if that port is already in use on the test system.

Would you consider implementing a dynamic port allocation? Something like:

- httpd = HTTPServer(("localhost", 8080), TestServer)
+ def find_free_port():
+     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+         s.bind(('', 0))
+         return s.getsockname()[1]
+ 
+ port = find_free_port()
+ httpd = HTTPServer(("localhost", port), TestServer)

This would make the test more reliable across different environments. What do you think?


🌐 Web query:

How to find a free port in Python for HTTP server testing?

💡 Result:

To find a free port in Python for HTTP server testing, there are several reliable methods depending on your use case:

1. Using sockets with OS-assigned ports

The most common method involves creating a socket and letting the OS assign a free port automatically via bind(('', 0)):

import socket

def get_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]
  • Advantages: Simple, returns an available port immediately[1][4]
  • Limitations: Port may be reclaimed by the OS after socket closure, creating a race condition[1][4]
  • Solution: Use the socket before closing it to reserve the port[1][7]

2. Atomic port allocation for servers

For HTTP servers, create and bind the socket first, then pass it to the server:

from http.server import HTTPServer

# Create and bind socket first
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(('', 0))
port = sock.getsockname()[1]

# Create server using existing socket
server = HTTPServer(('', port), YourHandlerClass, bind_and_activate=False)
server.socket = sock
server.server_bind = lambda: None  # Disable default binding
server.server_activate()

This prevents race conditions by keeping the port reserved[1][7].


3. Using testing frameworks

For pytest integration, pytest-httpserver automatically handles port allocation:

def test_server(httpserver):
    httpserver.expect_request("/test").respond_with_data("OK")
    # Access via httpserver.port
    assert requests.get(httpserver.url_for("/test")).status_code == 200
  • Automatically selects and reserves a port[5][10]
  • No manual port management required[10]

4. Built-in Python HTTP server

When using python -m http.server, specify 0 to get an OS-assigned port:

python -m http.server 0
Serving HTTP on 0.0.0.0 port 12345  # Random port shown here
  • Port is printed after server starts[8][9]
  • No race condition since server keeps port occupied[9]

Method               | Best Use Case            | Race Condition Risk
Socket binding       | Quick port checks        | High
Atomic server setup  | Production-grade servers | None
pytest-httpserver    | Test environments        | None
Built-in HTTP server | Manual testing           | None

For testing scenarios, Method 3 (pytest-httpserver) is recommended as it handles port management automatically. For manual server creation, Method 2 provides the safest approach by reserving the port before server initialization[1][5][9].


Dynamic Port Allocation Suggestion

Currently, the test uses a hardcoded port (8080) which could lead to conflicts if that port is already in use on the test system. Would you consider implementing a dynamic port finder to improve reliability? For example, you might replace the fixed port with something like:

-    httpd = HTTPServer(("localhost", 8080), TestServer)
+    def find_free_port():
+        import socket
+        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+            s.bind(('', 0))
+            return s.getsockname()[1]
+    
+    port = find_free_port()
+    httpd = HTTPServer(("localhost", port), TestServer)

This approach, as verified by current best practices, should help avoid potential test failures. Alternatively, if you’re concerned about race conditions (i.e., the port might be reclaimed before the server starts), you might consider binding the socket first and passing it to the server for an atomic allocation. What do you think?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5364059 and e4311af.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-amplitude' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Analyze (python)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (2)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (2)

207-212: Clean implementation of the test server.

The TestServer class is well-structured and focused on its single responsibility - providing a simple CSV response. Nice job!


214-224: Great documentation on the test case background and motivation.

The comprehensive docstring really helps explain why this test is necessary and the history of the issue. This context is invaluable for future maintainers.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

226-226: Consider implementing dynamic port allocation.

The hardcoded port (8080) might cause test failures if that port is already in use by another process. Would you consider implementing a utility to find an available port dynamically? This would make the test more robust in CI environments and when running locally.

- httpd = HTTPServer(("localhost", 8080), TestServer)
+ import socket
+ 
+ def find_available_port():
+     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+         s.bind(('localhost', 0))
+         return s.getsockname()[1]
+ 
+ port = find_available_port()
+ httpd = HTTPServer(("localhost", port), TestServer)

wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4311af and a79256f.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (python)
🔇 Additional comments (5)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (5)

7-9: Well-chosen imports for the new test.

The addition of BaseHTTPRequestHandler, HTTPServer, and Thread imports are appropriate for implementing the local HTTP server test. This approach allows for testing with actual HTTP responses rather than mocks.


207-212: Nice implementation of the TestServer class.

The TestServer implementation is focused and concise. I notice you're intentionally providing a CSV without a trailing newline character to reproduce the specific issue mentioned in the test docstring.


214-224: Excellent detailed docstring explaining the test purpose and issue context.

The docstring thoroughly explains the context, the problem being solved, and the reasoning behind the specific test implementation. This kind of documentation is invaluable for future maintainers who might need to understand why this test exists.


228-236: Great error handling and resource cleanup.

I appreciate the proper thread management with try-finally and thread.join(). This ensures resources are properly cleaned up even if an exception occurs during the test.


230-231: Ensure consistent port reference.

If you implement dynamic port allocation as suggested earlier, don't forget to update the port used in the request URL as well.

- response = requests.get(f"http://localhost:8080", stream=True)
+ response = requests.get(f"http://localhost:{port}", stream=True)

wdyt?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

1-1: 💡 Verification agent

🧩 Analysis chain

Fix formatting issue flagged by pipeline.

The CI pipeline indicates there's a formatting issue:

Ruff formatting check failed. 1 file would be reformatted.

Would you consider running ruff format on this file to fix the style issues?


🏁 Script executed:

#!/bin/bash
# Check what formatting changes ruff would make
ruff format --diff unit_tests/sources/declarative/decoders/test_composite_decoder.py

Length of output: 581


Please run ruff format on this file to update the quote style – wdyt?

I verified the formatting diff and it shows that the only change required is to switch the quotes in the s.bind call from single quotes to double quotes:

  • Before:
    s.bind(('localhost', 0))
  • After:
    s.bind(("localhost", 0))

Could you run ruff format unit_tests/sources/declarative/decoders/test_composite_decoder.py so that the file meets the CI formatting standards?

🧰 Tools
🪛 GitHub Actions: Linters

[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.

🧹 Nitpick comments (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

232-244: Well-structured test with proper resource cleanup.

The test correctly:

  1. Finds a dynamic port
  2. Starts the server in a thread
  3. Uses try-finally to ensure cleanup
  4. Joins the thread with a timeout

This addresses previous review feedback about thread cleanup and error handling. Good job!

One small suggestion though - would you consider adding a small sleep after starting the thread to ensure the server is fully up before making the request? This might prevent occasional flaky test failures, wdyt?

 thread.start()
+import time
+time.sleep(0.1)  # Short delay to ensure server is ready
 try:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a79256f and 2af9ed2.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (3 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
unit_tests/sources/declarative/decoders/test_composite_decoder.py

[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.

⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-amplitude' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (4)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (4)

7-10: Good addition of necessary imports for the server-based test.

These imports support the new HTTP server implementation and dynamic port finding - both needed for testing with a real server rather than mocked responses.


26-29: Nice implementation of dynamic port finding!

This is a clean solution for finding an available port dynamically, which prevents potential port conflicts during test runs. The type ignore comment is also properly used.


214-219: Simple yet effective test server implementation.

The test server correctly serves a minimal CSV payload. I notice the CSV intentionally doesn't end with a newline character - which helps reproduce the specific edge case you're testing for.


221-231: Thorough docstring explaining the test purpose and reproduction steps.

The docstring clearly explains why this test is needed and the specific issue it's designed to reproduce. The commented-out data sample is helpful for future reference.

@maxi297
Copy link
Contributor

maxi297 commented Mar 6, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (2)

228-228: Minor wording issue in docstring.

Looks like there's a small typo in the docstring - "this is the only we were able to reproduce locally" seems to be missing a word. Did you mean "the only way" perhaps? wdyt?


230-230: Consider externalizing the commented test data.

This commented line contains a large CSV string that might make the test harder to read. Would you consider moving it to a constant at the top of the file or to a separate fixture file to improve readability? wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2af9ed2 and a28c3b3.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-amplitude' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (4)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (4)

26-29: Nice implementation of port finder!

This is a clean and effective way to dynamically find an available port for testing. Using a context manager ensures proper socket cleanup.


214-219: Simple and effective test server implementation!

The test server implementation is concise and serves its purpose well. It correctly responds with a minimal CSV payload that helps reproduce the issue.


237-241: Excellent use of try/finally for cleanup!

Great job implementing the try/finally pattern to ensure proper resource cleanup. This addresses the previous review comment about thread cleanup and error handling.


232-244: Consider adding server readiness check.

The test starts a server in a separate thread but immediately makes a request without ensuring the server is ready. In rare cases, this might lead to intermittent test failures.

Would you consider adding a simple readiness check before making the request? Something like:

thread.start()
+# Ensure server is ready before making request
+def is_server_ready():
+    try:
+        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+            return s.connect_ex(('localhost', port)) == 0
+    except OSError:
+        return False
+
+# Wait for server to be ready with timeout
+import time
+start_time = time.time()
+while not is_server_ready() and time.time() - start_time < 3:
+    time.sleep(0.1)
+
try:
    response = requests.get(f"http://localhost:{port}", stream=True)

This would make the test more robust against timing issues, wdyt?

@maxi297 maxi297 merged commit fe2f9a5 into main Mar 6, 2025
24 checks passed
@maxi297 maxi297 deleted the christo/csv-sync-fix branch March 6, 2025 19:24
Labels: bug (Something isn't working), security
Projects: None yet
3 participants