🐛 Source S3: tab delimiter is not parsed properly #8789

Closed
alafanechere opened this issue Dec 14, 2021 · 0 comments · Fixed by #9163

alafanechere commented Dec 14, 2021

Environment

  • Source Connector and version: Source S3 0.1.7
  • Severity: Critical
  • Step where error happened: Sync job

Current Behavior

When dealing with a CSV file, the S3 source connector does not parse the \t character in the delimiter field properly.
This leads to the following error during sync:

ValueError: only single character unicode strings can be converted to Py_UCS4, got length 4
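The error text matches PyArrow, whose CSV reader accepts only a single-character delimiter. A minimal sketch reproducing the same failure outside the connector (assumes PyArrow is installed; the sample file contents are made up):

import io

import pyarrow.csv as pa_csv

data = io.BytesIO(b"filename\temail\nreport.tsv\tuser@example.com\n")

# A real tab character parses fine:
table = pa_csv.read_csv(data, parse_options=pa_csv.ParseOptions(delimiter="\t"))
print(table.column_names)  # ['filename', 'email']

# The escaped two-character string "\\t" (a backslash followed by "t") is
# rejected with a ValueError like the one quoted above:
pa_csv.ParseOptions(delimiter="\\t")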

The user provided the config blob as stored in Postgres after entering \t as the delimiter:

{
    "name": "s3_my_data_tsv",
    "sourceId": "3d7c5081-0624-47ab-8033-fd3e2dd0e5be",
    "tombstone": false,
    "workspaceId": "f5918909-d769-4482-b45b-d95e38befcc9",
    "configuration": {
        "format": {
            "filetype": "csv",
            "delimiter": "\\t",
            "block_size": 10000,
            "quote_char": "\"",
            "escape_char": "\\",
            "double_quote": true,
            "advanced_options": "{\"column_names\": [\"filename\", \"email\", \"locale\", \"updated_ts\"]}",
            "newlines_in_values": false,
            "additional_reader_options": "{}"
        },
        "schema": "{\"filename\": \"string\", \"email\": \"string\", \"locale\": \"string\", \"updated_ts\": \"string\"}",
        "dataset": "my_data_tsv",
        "provider": {
            "bucket": "my_bucket",
            "endpoint": "",
            "path_prefix": "manual/2021-12-03/",
            "aws_access_key_id": "***",
            "aws_secret_access_key": "***"
        },
        "path_pattern": "manual/2021-12-03/*.tsv"
    },
    "sourceDefinitionId": "69589781-7828-43c5-9f63-8925b1c1ccc2"
}

The delimiter is stored as \\t (an escaped backslash followed by t) rather than \t (a single tab character).
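This is standard JSON string escaping; a quick generic illustration in Python (not connector code):

import json

# In JSON, "\\t" decodes to two characters (a backslash and a "t"),
# while "\t" decodes to a single tab character.
print(len(json.loads('"\\\\t"')))  # 2 -- backslash + "t", not a tab
print(len(json.loads('"\\t"')))    # 1 -- a real tab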

Manually editing the config blob in Postgres to set \t solved the issue.
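A possible fix at the connector level (a sketch only; not necessarily what PR #9163 implements) is to unescape the configured delimiter before handing it to the parser:

import codecs

def unescape_delimiter(delimiter: str) -> str:
    # Turn an escaped spelling such as "\\t" into the real control
    # character; single-character delimiters pass through unchanged.
    if len(delimiter) > 1:
        return codecs.decode(delimiter, "unicode_escape")
    return delimiter

assert unescape_delimiter("\\t") == "\t"  # escaped form becomes a tab
assert unescape_delimiter(",") == ","     # ordinary delimiters untouched

Note that unicode_escape decoding would also translate other escape sequences such as "\\n", which is the desired behavior for a delimiter field.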

Expected Behavior

Tab-separated value files should be correctly parsed by the S3 source connector.

Steps to Reproduce

  1. Upload a .tsv file to S3
  2. Set up an S3 source with a \t delimiter value
  3. Set up a connection between this source and a destination
  4. Run the sync