
Handle unchanged status on schema dry-run / diff #341

Closed
AlexisSouquiere opened this issue Nov 15, 2023 · 2 comments
Labels
enhancement This issue or pull request improves a feature

@AlexisSouquiere
Collaborator

Problem

Suppose we have a subject whose v1 schema was created with ns4kafka.
When doing a dry-run / diff with the exact same schema, Kafkactl returns a CHANGED status.
This is because ns4kafka never compares the submitted schema value with the latest registered schema definition; it only checks whether the subject already exists to decide between CHANGED and CREATED.

Doing an apply without dry-run, however, won't create a new version since the schemas are equal, so it returns an UNCHANGED status. The behaviour therefore differs between a dry-run and an apply without dry-run.

Suggestion

A simple string comparison is not sufficient, because it would not take into account formatting changes that may have been made (spaces, tabs, reordering of JSON properties, etc.).

The kafka-schema-registry-client dependency provides an AvroSchema class that we can instantiate with either the input schema string or the latest registered schema definition. The constructor normalizes the schema, which lets us compare the two schemas even if users made formatting changes.

My suggestion would be to improve the dry-run comparison (https://github.com/michelin/ns4kafka/blob/master/src/main/java/com/michelin/ns4kafka/controllers/SchemaController.java#L121) by instantiating two AvroSchema instances for the two definitions and comparing their AvroSchema.canonicalString() outputs.
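A minimal sketch of that comparison, assuming the io.confluent:kafka-schema-registry-client dependency is on the classpath (the class and method names below are illustrative, not ns4kafka's actual code):

```java
import io.confluent.kafka.schemaregistry.avro.AvroSchema;

public class SchemaComparator {

    /**
     * Returns true when the submitted schema and the latest registered
     * schema are identical once normalized to Avro canonical form,
     * regardless of whitespace or JSON property ordering.
     */
    public static boolean isUnchanged(String submittedSchema, String latestRegisteredSchema) {
        // The AvroSchema constructor parses the JSON definition
        AvroSchema submitted = new AvroSchema(submittedSchema);
        AvroSchema latest = new AvroSchema(latestRegisteredSchema);

        // canonicalString() strips formatting differences, so a plain
        // string equality check on the results is meaningful
        return submitted.canonicalString().equals(latest.canonicalString());
    }
}
```

The dry-run path could then return UNCHANGED when this check passes, matching the behaviour of a real apply.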

Input schema

{"namespace":"com.michelin.kafka.producer.showcase.avro","type":"record","name":"PersonAvro","fields":[{"name":"firstName","type":["null","string"],"default":null,"doc":"First name of the person"},{"name":"lastName","type":["null","string"],"default":null,"doc":"Last name of the person"},{"name":"dateOfBirth","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"default":null,"doc":"Date of birth of the person"}]}

Normalized schema

{"type":"record","name":"PersonAvro","namespace":"com.michelin.kafka.producer.showcase.avro","fields":[{"name":"firstName","type":["null","string"],"doc":"First name of the person","default":null},{"name":"lastName","type":["null","string"],"doc":"Last name of the person","default":null},{"name":"dateOfBirth","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"doc":"Date of birth of the person","default":null}]}

Alternatives Considered

N/A

Additional Context

N/A

@AlexisSouquiere AlexisSouquiere added the enhancement This issue or pull request improves a feature label Nov 15, 2023
@loicgreffier (Collaborator) commented Nov 15, 2023

Scenario to be validated before implementation:

Schema v1

{"namespace":"com.michelin.kafka.producer.showcase.avro","type":"record","name":"PersonAvro","fields":[{"name":"firstName","type":["null","string"],"default":null,"doc":"First name of the person"},{"name":"lastName","type":["null","string"],"default":null,"doc":"Last name of the person"},{"name":"dateOfBirth","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"default":null,"doc":"Date of birth of the person"}]}

and we just swap the order of the firstName and lastName fields.

Schema v2

{"namespace":"com.michelin.kafka.producer.showcase.avro","type":"record","name":"PersonAvro","fields":[{"name":"lastName","type":["null","string"],"default":null,"doc":"Last name of the person"},{"name":"firstName","type":["null","string"],"default":null,"doc":"First name of the person"},{"name":"dateOfBirth","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"default":null,"doc":"Date of birth of the person"}]}

Based on canonical strings, the schemas are different. But when applying v2 to the Schema Registry, is the Schema Registry smart enough to detect that it is the exact same schema (just with fields in a different order), or does it create a new schema?

➡️ If a new schema is created, the scenario is OK (and it seems to be, according to my quick tests).
➡️ If no new schema is created and v1 is returned from the API call, the scenario is invalid.
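The canonical-string side of this scenario can be checked directly, since Avro's canonical form preserves field order. A quick sketch using trimmed-down versions of the schemas above (same assumed kafka-schema-registry-client dependency):

```java
import io.confluent.kafka.schemaregistry.avro.AvroSchema;

public class FieldOrderCheck {
    public static void main(String[] args) {
        // Simplified v1 and v2: same fields, swapped order
        String v1 = "{\"type\":\"record\",\"name\":\"PersonAvro\","
                + "\"fields\":[{\"name\":\"firstName\",\"type\":\"string\"},"
                + "{\"name\":\"lastName\",\"type\":\"string\"}]}";
        String v2 = "{\"type\":\"record\",\"name\":\"PersonAvro\","
                + "\"fields\":[{\"name\":\"lastName\",\"type\":\"string\"},"
                + "{\"name\":\"firstName\",\"type\":\"string\"}]}";

        // Canonical form keeps field order, so these canonical strings
        // differ: swapping fields really is a different schema
        System.out.println(new AvroSchema(v1).canonicalString()
                .equals(new AvroSchema(v2).canonicalString())); // prints false
    }
}
```

This confirms the dry-run comparison would report CHANGED for swapped fields, which is consistent with the Schema Registry registering a new version in that case.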

A full integration test to validate this could be:
➡️ Create a schema v1 through ns4kafkaClient
➡️ Check created status + created on SR through SR client
➡️ Create the schema v2 (swap firstName and lastName) through ns4kafkaClient
➡️ Check changed status + schema v2 created on SR through SR client
➡️ Create the same schema v2 (swap namespace and type fields) through ns4kafkaClient. It should return unchanged.
➡️ Create the same schema v2 (swap namespace and type fields) through SR client. It should return schema v1 ID.


💡 The check avroSchemaV1.canonicalString().equals(avroSchemaV2.canonicalString()) can be run before validating the schema compatibility. I think there's nothing wrong with doing this, and it will save a lot of HTTP calls when the schema is unchanged.

💡 Think about adding the Confluent repository when using implementation('io.confluent:kafka-schema-registry-client:7.5.1'):

repositories {
    mavenCentral()
    maven {
        url "https://packages.confluent.io/maven"
    }
}

@loicgreffier
Collaborator

#344
