DiscoverSchema endpoints calculates diff and breaking change #18571

alovew · 2022-10-27T19:19:20Z

The catalog diff calculation moves to the DiscoverSchema endpoint. If a breaking change is detected, the connection is updated to have breakingChange=true.

API changes:

SourceDiscoverSchemaRequestBody, the input to the DiscoverSchema endpoint, now has a connectionId
SourceDiscoverSchemaRead, the response from the DiscoverSchema endpoint, now returns a catalogDiff
The ConnectionUpdate object now has breakingChange, so connections can be updated with this data
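
As a rough illustration of the flow described above, here is a minimal sketch. All types in it (FieldChange, DiscoverRequest, DiscoverResponse) and the in-memory map are hypothetical stand-ins, not the generated Airbyte API models, and returning an empty diff when no connectionId is supplied is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

final class DiscoverSchemaFlowSketch {

  // Hypothetical, simplified stand-ins for the generated API models.
  record FieldChange(String fieldName, boolean breaking) {}
  record DiscoverRequest(UUID sourceId, UUID connectionId) {} // connectionId may be null
  record DiscoverResponse(List<FieldChange> catalogDiff) {}

  // In-memory stand-in for the breakingChange flag stored on each connection.
  static final Map<UUID, Boolean> BREAKING_CHANGE_BY_CONNECTION = new HashMap<>();

  // detectedChanges stands in for the diff between the stored and the freshly discovered catalog.
  static DiscoverResponse discover(final DiscoverRequest request, final List<FieldChange> detectedChanges) {
    if (request.connectionId() == null) {
      // No connection to compare against: return an empty diff (an assumption of this sketch).
      return new DiscoverResponse(List.of());
    }
    if (detectedChanges.stream().anyMatch(FieldChange::breaking)) {
      // Mirrors "the connection is updated to have breakingChange=true".
      BREAKING_CHANGE_BY_CONNECTION.put(request.connectionId(), true);
    }
    return new DiscoverResponse(detectedChanges);
  }

  public static void main(final String[] args) {
    final UUID connectionId = UUID.randomUUID();
    final DiscoverResponse response = discover(
        new DiscoverRequest(UUID.randomUUID(), connectionId),
        List.of(new FieldChange("id", true)));
    System.out.println(response.catalogDiff().size() + " change(s), breaking="
        + BREAKING_CHANGE_BY_CONNECTION.getOrDefault(connectionId, false));
  }
}
```

The point of the sketch is only that the diff and the flag are computed in one place, so callers no longer have to compute the diff themselves.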

alovew · 2022-10-27T19:33:15Z

Currently writing tests for SchedulerHandler

benmoriceau · 2022-10-27T22:45:11Z

airbyte-server/src/main/java/io/airbyte/server/apis/ConfigurationApi.java

@@ -202,14 +208,9 @@ public ConfigurationApi(final ConfigRepository configRepository,
        jobPersistence,
        workerEnvironment,
        logConfigs,
-        eventRunner);
+        eventRunner, connectionsHandler);


Nit, this addition should be on its own line.

benmoriceau · 2022-10-27T22:47:30Z

airbyte-server/src/main/java/io/airbyte/server/handlers/SchedulerHandler.java

-      return retrieveDiscoveredSchema(persistedCatalogId);
+      final SourceDiscoverSchemaRead discoveredSchema = retrieveDiscoveredSchema(persistedCatalogId);
+
+      if (discoverSchemaRequestBody.getConnectionId() != null) {


When can this be null? Should we log an error if so?

I believe this endpoint is also called to discover schemas when a connection is being created. In that case we don't need to pass in the connection id, since there is no existing sync catalog to compare against.

benmoriceau · 2022-10-27T22:51:05Z

airbyte-server/src/main/java/io/airbyte/server/handlers/SchedulerHandler.java

+      if (streamTransform.getTransformType() != TransformTypeEnum.UPDATE_STREAM) {
+        return;
+      }
+      streamTransform.getUpdateStream().stream().forEach(fieldTransform -> {


Nit: we should consider takeWhile here instead of forEach so we can break out of the loop as soon as possible. I wonder whether it could have an impact on big catalogs.

I'm not sure exactly how this would work. I think takeWhile keeps operating on each item until it reaches one that does not match the predicate, and it does not operate on that item, so I'm not sure we could actually set isBreaking to true in that case. But maybe you're thinking of a different way of doing this?

I just updated it to use anyMatch instead, which is probably cleaner and, I think, does break out early.
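
To make the difference between the two operations concrete, here is a small sketch. FieldChange is a hypothetical stand-in for the field transform type, and the peek counter exists only to show how many elements a sequential stream inspects before anyMatch returns.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class AnyMatchShortCircuitSketch {

  // Hypothetical stand-in for a field-level transform.
  record FieldChange(String fieldName, boolean breaking) {}

  // Returns true if any change is breaking; 'inspected' counts the elements the stream looked at.
  static boolean hasBreaking(final List<FieldChange> changes, final AtomicInteger inspected) {
    return changes.stream()
        .peek(change -> inspected.incrementAndGet()) // instrumentation for this demo only
        .anyMatch(FieldChange::breaking);
  }

  // takeWhile keeps the leading non-breaking elements and never passes on the first breaking one.
  static long nonBreakingPrefixLength(final List<FieldChange> changes) {
    return changes.stream().takeWhile(change -> !change.breaking()).count();
  }

  public static void main(final String[] args) {
    final List<FieldChange> changes = List.of(
        new FieldChange("a", false),
        new FieldChange("b", true),
        new FieldChange("c", false));
    final AtomicInteger inspected = new AtomicInteger();
    System.out.println("hasBreaking=" + hasBreaking(changes, inspected) + ", inspected=" + inspected.get());
    System.out.println("nonBreakingPrefixLength=" + nonBreakingPrefixLength(changes));
  }
}
```

anyMatch is a short-circuiting terminal operation, so it stops at the first breaking change. takeWhile also stops there, but the breaking element itself is dropped, which matches the concern raised above about not being able to act on it.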

benmoriceau · 2022-10-27T22:55:12Z

airbyte-server/src/main/java/io/airbyte/server/handlers/SchedulerHandler.java

@@ -409,4 +434,19 @@ private JobInfoRead readJobFromResult(final ManualOperationResult manualOperatio
    return jobConverter.getJobInfoRead(job);
  }

+  private boolean containsBreakingChange(final CatalogDiff diff) {
+    AtomicBoolean isBreaking = new AtomicBoolean(false);
+    diff.getTransforms().stream().forEach(streamTransform -> {


Nit: this looks more like a reduce operation than a map. In Java it would be written like

.reduce(false, (acc, value) -> acc || value.getBreaking(), (left, right) -> left || right)

I find it more explicit about what this is doing, but I don't know what the rest of the team thinks.

I find this more confusing

Not sure if that's better...

diff.getTransforms().stream()
    .filter(streamTransform -> streamTransform.getTransformType() == TransformTypeEnum.UPDATE_STREAM)
    .flatMap(streamTransform -> streamTransform.getUpdateStream().stream().map(FieldTransform::getBreaking))
    .anyMatch(Boolean::booleanValue)

Looks better to me. What I would like to avoid is having state within the function used in the stream API.

@benmoriceau why do we want to avoid that?

I generally think of stream notation as a more functional approach; in that paradigm, mutable state and side effects are generally something to avoid.

What jimmy is saying. Ideally, when using this, we should be able to take the anonymous function in the forEach call and move it to its own method, which is not the case here. I would personally prefer a for loop like for (A a : as) if the anonymous functions are not stateless.

Synced up on Zoom about that
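
For reference, the two stateless options discussed in this thread can be sketched side by side. The types below (TransformType, FieldTransform, StreamTransform) are simplified, hypothetical stand-ins for the generated models, and this is not the code that was merged, only an illustration that both forms avoid a mutable AtomicBoolean captured by a lambda.

```java
import java.util.List;

final class ContainsBreakingChangeSketch {

  // Hypothetical, simplified stand-ins for the generated diff models.
  enum TransformType { ADD_STREAM, REMOVE_STREAM, UPDATE_STREAM }
  record FieldTransform(String fieldName, boolean breaking) {}
  record StreamTransform(TransformType type, List<FieldTransform> updateStream) {}

  // Plain for loop: returns as soon as the first breaking field transform is found.
  static boolean containsBreakingChangeLoop(final List<StreamTransform> transforms) {
    for (final StreamTransform streamTransform : transforms) {
      if (streamTransform.type() != TransformType.UPDATE_STREAM) {
        continue; // only updated streams carry field-level transforms
      }
      for (final FieldTransform fieldTransform : streamTransform.updateStream()) {
        if (fieldTransform.breaking()) {
          return true;
        }
      }
    }
    return false;
  }

  // Stream pipeline with no captured state; anyMatch short-circuits on the first match.
  static boolean containsBreakingChangeStream(final List<StreamTransform> transforms) {
    return transforms.stream()
        .filter(streamTransform -> streamTransform.type() == TransformType.UPDATE_STREAM)
        .flatMap(streamTransform -> streamTransform.updateStream().stream())
        .anyMatch(FieldTransform::breaking);
  }

  public static void main(final String[] args) {
    final List<StreamTransform> transforms = List.of(
        new StreamTransform(TransformType.ADD_STREAM, List.of()),
        new StreamTransform(TransformType.UPDATE_STREAM,
            List.of(new FieldTransform("a", false), new FieldTransform("b", true))));
    System.out.println(containsBreakingChangeLoop(transforms));
    System.out.println(containsBreakingChangeStream(transforms));
  }
}
```

Either form can be extracted into its own method without carrying state along, which is the property the reviewers were asking for.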

gosusnp

LGTM, some nits and questions.

gosusnp · 2022-10-28T17:29:44Z

airbyte-server/src/main/java/io/airbyte/server/handlers/WebBackendConnectionsHandler.java

@@ -368,12 +367,14 @@ public WebBackendConnectionRead webBackendGetConnection(final WebBackendConnecti
    return buildWebBackendConnectionRead(connection, currentSourceCatalogId).catalogDiff(diff);
  }

-  private Optional<SourceDiscoverSchemaRead> getRefreshedSchema(final UUID sourceId)
+  private Optional<SourceDiscoverSchemaRead> getRefreshedSchema(final UUID sourceId, final UUID connectionId)


nit: since this function is actually going to fetch the schema every time, refreshSchema feels like a more explicit name. A bit out of scope; I'll probably rename this at some point.

gosusnp · 2022-10-28T17:32:39Z

docs/reference/api/generated-api-html/index.html

+      "transformType" : "add_stream",
+      "updateStream" : [ {
+        "updateFieldSchema" : { },
+        "fieldName" : [ "fieldName", "fieldName" ],


Discovering the JSON here: why is "fieldName" repeated twice?

alovew · 2022-10-28T20:18:22Z

@gosusnp I made a somewhat significant change after you approved: the SourceDiscoverSchemaRead object now has a breakingChange field, so we can set it on the connectionRead object in the WebBackendConnectionsHandler without refetching the connection object after updating it in the DiscoverSchema endpoint.

benmoriceau

LGTM overall. Got some comments about code organization that might not need to be addressed in this review.

benmoriceau · 2022-10-28T20:55:04Z

airbyte-server/src/main/java/io/airbyte/server/handlers/SchedulerHandler.java

+      throws JsonValidationException, ConfigNotFoundException, IOException {
+    final Optional<io.airbyte.api.model.generated.AirbyteCatalog> catalogUsedToMakeConfiguredCatalog = connectionsHandler
+        .getConnectionAirbyteCatalog(discoverSchemaRequestBody.getConnectionId());
+    io.airbyte.api.model.generated.@NotNull AirbyteCatalog currentAirbyteCatalog =


Nit: should this be final (same for the diff)? If you had to reset your IntelliJ workspace recently, it is likely that the Save Actions plugin configuration has been lost.

benmoriceau · 2022-10-28T21:02:11Z

airbyte-server/src/main/java/io/airbyte/server/handlers/SchedulerHandler.java

@@ -116,7 +124,8 @@ public SchedulerHandler(final ConfigRepository configRepository,
                   final JsonSchemaValidator jsonSchemaValidator,
                   final JobPersistence jobPersistence,
                   final EventRunner eventRunner,
-                   final JobConverter jobConverter) {
+                   final JobConverter jobConverter,
+                   final ConnectionsHandler connectionsHandler) {


Not in scope for this review, but to keep in mind for later: I think discoverSchemaForSourceFromSourceId should be moved to the SourceHandler. We should get rid of the scheduler handler at some point (we don't have a scheduler anymore).

benmoriceau · 2022-10-28T21:07:11Z

airbyte-server/src/main/java/io/airbyte/server/handlers/SchedulerHandler.java

+        .getConnectionAirbyteCatalog(discoverSchemaRequestBody.getConnectionId());
+    io.airbyte.api.model.generated.@NotNull AirbyteCatalog currentAirbyteCatalog =
+        connectionsHandler.getConnection(discoverSchemaRequestBody.getConnectionId()).getSyncCatalog();
+    CatalogDiff diff = connectionsHandler.getDiff(catalogUsedToMakeConfiguredCatalog.orElse(currentAirbyteCatalog), discoveredSchema.getCatalog(),


Not for this review: getDiff could be part of the CatalogHelper. It doesn't interact with the persistence layer. The goal would be to avoid having one handler depend on another.

gosusnp · 2022-10-28T22:44:19Z

airbyte-server/src/main/java/io/airbyte/server/handlers/SchedulerHandler.java

+        .getConnectionAirbyteCatalog(discoverSchemaRequestBody.getConnectionId());
+    io.airbyte.api.model.generated.@NotNull AirbyteCatalog currentAirbyteCatalog =
+        connectionsHandler.getConnection(discoverSchemaRequestBody.getConnectionId()).getSyncCatalog();
+    CatalogDiff diff = connectionsHandler.getDiff(catalogUsedToMakeConfiguredCatalog.orElse(currentAirbyteCatalog), discoveredSchema.getCatalog(),


What's the use case for having a catalog fallback? This feels like something that could be worth documenting here.

I just copy-pasted this to move it from WebBackendConnectionsHandler, so I'm not 100% sure. I think it covers the case where we can't find a source catalog for a particular connection, in which case we fall back on the configured Airbyte catalog that exists for that connection, but I'm not sure when that would actually happen.
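
The fallback in question is just Optional.orElse. A minimal sketch, with a hypothetical Catalog record and a hypothetical helper name (baselineForDiff), of how the baseline for the diff is chosen:

```java
import java.util.Optional;

final class CatalogFallbackSketch {

  // Hypothetical stand-in for an Airbyte catalog.
  record Catalog(String description) {}

  // Prefer the source catalog that the configured catalog was built from;
  // if none was found, fall back to the catalog stored on the connection.
  static Catalog baselineForDiff(final Optional<Catalog> catalogUsedToMakeConfiguredCatalog,
                                 final Catalog connectionSyncCatalog) {
    return catalogUsedToMakeConfiguredCatalog.orElse(connectionSyncCatalog);
  }

  public static void main(final String[] args) {
    final Catalog syncCatalog = new Catalog("connection sync catalog");
    System.out.println(baselineForDiff(Optional.of(new Catalog("source catalog")), syncCatalog));
    System.out.println(baselineForDiff(Optional.empty(), syncCatalog));
  }
}
```

Note that orElse evaluates its argument eagerly, so the connection's catalog is fetched even when the source catalog is present.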

In this case, can we move this code to some shared helper somewhere? Spreading unclear logic feels like a trap for our future selves.

@benmoriceau made a comment above to move getDiff to the CatalogsHelper:

"Not for this review: getDiff could be part of the CatalogHelper. It doesn't interact with the persistence layer. The goal would be to avoid having one handler depend on another."

Could we do that in a follow-up, since I think it will be a larger change? I could add it here, though; it will just be a big PR.

I'm OK with the follow-up. Do we have a ticket for it?

benmoriceau

LGTM, just replied to some open discussions.

colesnodgrass

Usage of Optional.get() should be avoided.

colesnodgrass · 2022-11-01T19:44:50Z

airbyte-server/src/main/java/io/airbyte/server/handlers/WebBackendConnectionsHandler.java

+      diff = refreshedCatalog.get().getCatalogDiff();
+      connection.setBreakingChange(refreshedCatalog.get().getBreakingChange());


Calling .get() on an Optional.empty() will throw a NoSuchElementException. What is the expected behavior here if refreshedCatalog = Optional.empty(); (from line 333) was the path taken?

On line 339 we have if(refreshedCatalog.isPresent()), so all of this is nested under that. I think that should make this OK?
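
One way to sidestep the question entirely is to let Optional do the guarding. The sketch below uses hypothetical stand-in types (Refreshed, Connection) and stores the diff on the connection object rather than in a local variable, which differs from the handler code under review.

```java
import java.util.Optional;

final class OptionalAccessSketch {

  // Hypothetical stand-ins for the refreshed schema result and the connection read object.
  record Refreshed(String catalogDiff, boolean breakingChange) {}

  static final class Connection {
    String catalogDiff;
    boolean breakingChange;
  }

  // ifPresent runs the lambda only when a value exists, so no explicit get() is needed.
  static void applyRefreshed(final Optional<Refreshed> refreshedCatalog, final Connection connection) {
    refreshedCatalog.ifPresent(refreshed -> {
      connection.catalogDiff = refreshed.catalogDiff();
      connection.breakingChange = refreshed.breakingChange();
    });
  }

  public static void main(final String[] args) {
    final Connection connection = new Connection();
    applyRefreshed(Optional.empty(), connection); // nothing happens, no exception
    applyRefreshed(Optional.of(new Refreshed("diff", true)), connection);
    System.out.println(connection.catalogDiff + " " + connection.breakingChange);
  }
}
```

Guarding get() with isPresent(), as the existing code does, is also safe; the ifPresent form simply makes it impossible to call get() outside the guard by accident.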

github-actions bot added the area/api, area/documentation, area/platform, and area/server labels Oct 27, 2022

alovew assigned benmoriceau, jdpgrailsdev, colesnodgrass and gosusnp Oct 27, 2022

alovew requested review from colesnodgrass, benmoriceau, gosusnp and jdpgrailsdev October 27, 2022 19:45

benmoriceau reviewed Oct 27, 2022

alovew requested a review from benmoriceau October 27, 2022 23:48

gosusnp approved these changes Oct 28, 2022

benmoriceau reviewed Oct 28, 2022

gosusnp approved these changes Oct 28, 2022

alovew added 10 commits October 31, 2022 13:10

dc61e07 update discover schema endpoint to calculate diff
80b5767 remove logs
74353b8 add connection id
0855b5f tests
73d8d01 one liner for checking field transforms
f47d0ae pmd
feb2b52 add breakingChange to SourceDiscoverSchemaRead object
2e2ddde test for connection with breaking change after discovery
badd99c format
697f202 final variable

alovew force-pushed the anne/discover-schema-endpoint branch from c37cb18 to 697f202 October 31, 2022 20:11

alovew requested a review from gosusnp October 31, 2022 22:45

benmoriceau reviewed Nov 1, 2022

9b47e6a use for loop

alovew requested a review from benmoriceau November 1, 2022 19:39

colesnodgrass requested changes Nov 1, 2022

alovew requested a review from colesnodgrass November 1, 2022 22:45

gosusnp approved these changes Nov 2, 2022

312e792 Merge branch 'master' into anne/discover-schema-endpoint

colesnodgrass approved these changes Nov 2, 2022

alovew merged commit d26e5bc into master Nov 2, 2022

alovew deleted the anne/discover-schema-endpoint branch November 2, 2022 21:10

github-actions bot mentioned this pull request Nov 15, 2022

Bump helm chart version reference to 0.40.43 #19453

Closed

octavia-squidington-iii mentioned this pull request Nov 18, 2022

Bump Airbyte version from 0.40.18 to 0.40.19 #19579

Merged
