-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Fix normalization issue with quoted & case sensitive columns #9317
Conversation
/test connector=bases/base-normalization
|
/test connector=bases/base-normalization
|
"User id", | ||
"user id", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that I don't understand how the first User id
is not converted to lower case. The main change in the Python file is this one I think:
- if not is_quoted and not self.needs_quotes(input_name):
- result = input_name.lower()
+ result = input_name.lower()
which seems to universally convert names to lower case. Can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file you commented on is the output for Postgres, not mssql (i think the mssql one is not versioned in git).
- In Postgres, when quoted, the identifier's case matters (so "User Id", "User id" and "user id" are different column names). So nothing has changed here.
- But as the integration test reveals and as the reported On-Call issue reports, MSSQL does not make a difference with these 3 columns names with different cases and considers them all to be equal, and thus conflicting.
So, to mirror this behavior with MSSQL in normalization code, I universally lowercase all column identifiers for MSSQL only even if they are quoted. As a consequence, normalization now detects column name conflicts and will resolve it by renaming conflicting names with extra '_1', '_2' etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm going to change the PR and isolate it to normalize this sort of casing only when looking for column name conflicts...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thanks for the explanation.
/test connector=bases/base-normalization
|
json_value(_airbyte_data, ''$."User Id"'') as "User Id", | ||
json_value(_airbyte_data, ''$."user_id"'') as user_id, | ||
json_value(_airbyte_data, ''$."User id"'') as "User id_1", | ||
json_value(_airbyte_data, ''$."user id"'') as "user id_2", | ||
json_value(_airbyte_data, ''$."UserId"'') as userid, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tuliren here is the output of the new test for MSSQL
What
Relates to https://github.com/airbytehq/oncall/issues/84
How
It seems MSSQL does care about case sensitivity of columns (even quoted ones).
(This might be depending on a "collation" settings of the server too)
But this PR makes it so we always resolve column conflicts by lowercasing all identifiers.
When I added the new test case, it surfaced similar behavior/errors on other destinations, so I'm fixing it there too.
(bigquery/mysql)
Recommended reading order
x.java
y.python