source-mssql: Duplicate rows for same LSN #24206
@grishick can you check this contribution and add it to the source team backlog for future review?
@grishick @prateekmukhedkar any update on this?
@marcosmarxm @grishick @prateekmukhedkar Any updates on this?
@marcosmarxm did you get a chance to look into this?
/test connector=connectors/source-mssql
@sashaNeshcheret Could you take a look at this PR?
@marcosmarxm it has been more than 5 months. Could someone take a look at this PR?
@akashkulk Could you review this PR?
@sivankumar86 I acknowledge the delay on this PR. The reason is that we changed how cursor fields are defined for CDC-related syncs so that, when data is published to the destination, the destination connector can use this cursor field to de-duplicate rows with the same LSN. The change brings CDC syncs in line with the Airbyte protocol. We made the same change for the Postgres source connector in #27442. This change specifies
@prateekmukhedkar Thanks for taking a look. We may need to combine 2 fields from SQL Server CDC to create the _ab_cdc_lsn column, or change the deduplication logic to use more than one field. I am taking the second approach, since my custom dbt deduplication job uses 2 columns.
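The first option mentioned above, folding two CDC fields into a single cursor value, could look roughly like the sketch below. It is an illustration only, not the connector's code: the function names are hypothetical, and it assumes LSNs arrive as colon-separated hexadecimal strings such as `00000027:00000758:0003`.

```python
from typing import Tuple


def parse_lsn(lsn: str) -> Tuple[int, ...]:
    """Split a colon-separated hex LSN string into integer parts (assumed format)."""
    return tuple(int(part, 16) for part in lsn.split(":"))


def combined_cursor(commit_lsn: str, change_lsn: str) -> str:
    """Build one string that sorts by commit LSN first, then by change LSN.

    Each part is zero-padded to 8 hex digits so that plain string
    comparison gives the same order as numeric comparison.
    """
    parts = parse_lsn(commit_lsn) + parse_lsn(change_lsn)
    return "".join(f"{part:08x}" for part in parts)


# Two changes in the same transaction share a commit LSN but differ in change LSN.
first = combined_cursor("00000027:00000758:0003", "00000027:00000758:0002")
second = combined_cursor("00000027:00000758:0003", "00000027:00000758:0005")
```

With a value like this, a destination that de-duplicates on a single cursor field would still see the two changes in their original order.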
What
Describe what the change is solving
The commit LSN can be the same for multiple rows, which makes it hard to determine which row is the latest.
How
Describe the solution
There is an extra column that identifies the change sequence, but it was not included in the Airbyte output.
https://debezium.io/documentation/reference/stable/connectors/sqlserver.html
"The connector sorts the changes that it reads in ascending order, based on the values of their commit LSN and change LSN. This sorting order ensures that the changes are replayed by Debezium in the same order in which they occurred in the database."
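To illustrate why the second column matters, here is a minimal sketch of downstream de-duplication that orders rows by commit LSN and then change LSN and keeps the last row per key. The field names `commit_lsn`, `change_lsn`, and `id`, and the function `latest_rows`, are assumptions made for this example, not the connector's actual output schema. It also assumes both LSNs are fixed-width strings, so string comparison matches their real order.

```python
from typing import Any, Dict, Iterable, List


def latest_rows(rows: Iterable[Dict[str, Any]], key_field: str = "id") -> List[Dict[str, Any]]:
    """Keep only the most recent row per key, ordered by (commit_lsn, change_lsn)."""
    latest: Dict[Any, Dict[str, Any]] = {}
    # Sort ascending so that later changes overwrite earlier ones for the same key.
    for row in sorted(rows, key=lambda r: (r["commit_lsn"], r["change_lsn"])):
        latest[row[key_field]] = row
    return list(latest.values())


# Two updates to id=1 in one transaction: same commit LSN, different change LSN.
rows = [
    {"id": 1, "commit_lsn": "00000027:00000758:0005", "change_lsn": "00000027:00000758:0004", "name": "second"},
    {"id": 1, "commit_lsn": "00000027:00000758:0005", "change_lsn": "00000027:00000758:0002", "name": "first"},
    {"id": 2, "commit_lsn": "00000027:00000700:0003", "change_lsn": "00000027:00000700:0002", "name": "other"},
]
deduplicated = latest_rows(rows)
```

Sorting on the commit LSN alone would leave the two `id=1` rows tied, and which one survives would depend on input order.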
Recommended reading order
x.java
y.python
🚨 User Impact 🚨
Are there any breaking changes? What is the end result perceived by the user? If yes, please merge this PR with the 🚨🚨 emoji so changelog authors can further highlight this if needed.
Community member or Airbyter
https://discuss.airbyte.io/t/duplicated-registries-when-syncing-from-mssql-cdc/3752/7