📝 Postgres source: document occasional full refresh under cdc mode (a…

…irbytehq#17705)
* Update postgres doc about full refresh in cdc mode
* Update format
jhammarstedt · Oct 31, 2022 · 6120ff8
1 parent c829e16
Showing 1 changed file with 12 additions and 0 deletions.
diff --git a/docs/integrations/sources/postgres.md b/docs/integrations/sources/postgres.md
@@ -382,6 +382,18 @@ Possible solutions include:
 - [Recommended] Sync data when there is no update running in the primary server, or sync data from the primary server.
 - [Not Recommended] Increase [`max_standby_archive_delay`](https://www.postgresql.org/docs/14/runtime-config-replication.html#GUC-MAX-STANDBY-ARCHIVE-DELAY) and [`max_standby_streaming_delay`](https://www.postgresql.org/docs/14/runtime-config-replication.html#GUC-MAX-STANDBY-STREAMING-DELAY) to be larger than the amount of time needed to complete the data sync. However, it is usually hard to tell how much time it will take to sync all the data. This approach is not very practical.
 
+### Under CDC incremental mode, there are still full refresh syncs
+
+Normally, under CDC mode, the Postgres source first runs a full refresh sync to read a snapshot of all existing data, and all subsequent runs are incremental syncs that read from the write-ahead log (WAL). Occasionally, however, you may see a full refresh sync after the initial run. When this happens, the sync log contains the following message:
+
+> Saved offset is before Replication slot's confirmed_flush_lsn, Airbyte will trigger sync from scratch
+
+The root cause is that the WAL files needed for the incremental sync have been removed by Postgres. This can occur in the following scenarios:
+- When there are lots of database updates resulting in more WAL files than allowed in the `pg_wal` directory, Postgres will purge or archive the WAL files. This scenario is preventable. Possible solutions include:
+  - Sync the data source more frequently. The downside is that more computation resources will be consumed, leading to a higher Airbyte bill.
+  - Set a higher `wal_keep_size`. If no unit is provided, it is in megabytes, and the default is `0`. See detailed documentation [here](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-WAL-KEEP-SIZE). The downside of this approach is that more disk space will be needed.
+- When the Postgres connector successfully reads the WAL and acknowledges it to Postgres, but the destination connector fails to consume the data, the Postgres connector will try to read the same WAL again. Because that WAL record has already been acknowledged, Postgres may have removed it. This scenario is rare, but it can happen, and currently there is no way to prevent it. When it does, performing a full refresh is the correct behavior.
+
 ## Changelog
 
 | Version | Date       | Pull Request                                             | Subject                                                                                                                                                                    |