Skip to content

Commit

Permalink
📝 Postgres source: document occasional full refresh under cdc mode (a…
Browse files Browse the repository at this point in the history
…irbytehq#17705)

* Update postgres doc about full refresh in cdc mode

* Update format
  • Loading branch information
tuliren authored and jhammarstedt committed Oct 31, 2022
1 parent c829e16 commit 6120ff8
Showing 1 changed file with 12 additions and 0 deletions.
12 changes: 12 additions & 0 deletions docs/integrations/sources/postgres.md
Original file line number Diff line number Diff line change
Expand Up @@ -382,6 +382,18 @@ Possible solutions include:
- [Recommended] Sync data when there is no update running in the primary server, or sync data from the primary server.
- [Not Recommended] Increase [`max_standby_archive_delay`](https://www.postgresql.org/docs/14/runtime-config-replication.html#GUC-MAX-STANDBY-ARCHIVE-DELAY) and [`max_standby_streaming_delay`](https://www.postgresql.org/docs/14/runtime-config-replication.html#GUC-MAX-STANDBY-STREAMING-DELAY) to be larger than the amount of time needed to complete the data sync. However, it is usually hard to tell how much time it will take to sync all the data. This approach is not very practical.

### Under CDC incremental mode, there are still full refresh syncs

Normally under the CDC mode, the Postgres source will first run a full refresh sync to read the snapshot of all the existing data, and all subsequent runs will only be incremental syncs reading from the write-ahead logs (WAL). However, occasionally, you may see full refresh syncs after the initial run. When this happens, you will see the following log:

> Saved offset is before Replication slot's confirmed_flush_lsn, Airbyte will trigger sync from scratch
The root causes is that the WALs needed for the incremental sync has been removed by Postgres. This can occur under the following scenarios:
- When there are lots of database updates resulting in more WAL files than allowed in the `pg_wal` directory, Postgres will purge or archive the WAL files. This scenario is preventable. Possible solutions include:
- Sync the data source more frequently. The downside is that more computation resources will be consumed, leading to a higher Airbyte bill.
- Set a higher `wal_keep_size`. If no unit is provided, it is in megabytes, and the default is `0`. See detailed documentation [here](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-WAL-KEEP-SIZE). The downside of this approach is that more disk space will be needed.
- When the Postgres connector successfully reads the WAL and acknowledges it to Postgres, but the destination connector fails to consume the data, the Postgres connector will try to read the same WAL again, which may have been removed by Postgres, since the WAL record is already acknowledged. This scenario is rare, because it can happen, and currently there is no way to prevent it. The correct behavior is to perform a full refresh.

## Changelog

| Version | Date | Pull Request | Subject |
Expand Down

0 comments on commit 6120ff8

Please sign in to comment.