
Handle unchanged status on schema dry-run / diff #341

Closed
AlexisSouquiere opened this issue Nov 15, 2023 · 2 comments
Labels
enhancement This issue or pull request improves a feature

@AlexisSouquiere
Collaborator

Problem

Suppose we have a subject whose v1 schema was created with ns4kafka.
When doing a dry-run / diff with the exact same schema, Kafkactl returns a CHANGED status.
This is because ns4kafka never compares the submitted schema value with the latest registered schema definition; it only checks whether the subject already exists to decide between CHANGED and CREATED.

Doing an apply without dry-run, however, won't create a new version since the schemas are equal, so it returns an UNCHANGED status. The behaviour therefore differs between a dry-run and an apply without dry-run.

Suggestion

A simple string comparison is not sufficient, because it would not take into account formatting changes that may have been made (spaces, tabs, reordering of JSON properties, etc.).

The kafka-schema-registry-client dependency provides an AvroSchema class that we can instantiate with either the input schema string or the latest registered schema definition. The constructor normalizes the schema, which lets us compare the two schemas even if users made formatting changes.

My suggestion would be to improve the dry-run comparison (https://github.com/michelin/ns4kafka/blob/master/src/main/java/com/michelin/ns4kafka/controllers/SchemaController.java#L121) by instantiating two AvroSchema instances for the two definitions and comparing their AvroSchema.canonicalString() outputs.
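A minimal sketch of that comparison, assuming the io.confluent:kafka-schema-registry-client dependency is on the classpath (the class and method names below are illustrative, not ns4kafka's actual code):

```java
import io.confluent.kafka.schemaregistry.avro.AvroSchema;

public class SchemaComparator {

    /**
     * Returns true when the submitted schema and the latest registered
     * schema are identical once normalized to Avro canonical form,
     * regardless of whitespace or JSON property ordering.
     */
    public static boolean isUnchanged(String submittedSchema, String latestRegisteredSchema) {
        // The AvroSchema constructor parses the JSON definition
        AvroSchema submitted = new AvroSchema(submittedSchema);
        AvroSchema latest = new AvroSchema(latestRegisteredSchema);

        // canonicalString() strips formatting differences, so a plain
        // string equality check on the results is meaningful
        return submitted.canonicalString().equals(latest.canonicalString());
    }
}
```

The dry-run path could then return UNCHANGED when this check passes, matching the behaviour of a real apply.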

Input schema

{"namespace":"com.michelin.kafka.producer.showcase.avro","type":"record","name":"PersonAvro","fields":[{"name":"firstName","type":["null","string"],"default":null,"doc":"First name of the person"},{"name":"lastName","type":["null","string"],"default":null,"doc":"Last name of the person"},{"name":"dateOfBirth","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"default":null,"doc":"Date of birth of the person"}]}

Normalized schema

{"type":"record","name":"PersonAvro","namespace":"com.michelin.kafka.producer.showcase.avro","fields":[{"name":"firstName","type":["null","string"],"doc":"First name of the person","default":null},{"name":"lastName","type":["null","string"],"doc":"Last name of the person","default":null},{"name":"dateOfBirth","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"doc":"Date of birth of the person","default":null}]}

Alternatives Considered

N/A

Additional Context

N/A

@AlexisSouquiere AlexisSouquiere added the enhancement This issue or pull request improves a feature label Nov 15, 2023
@loicgreffier (Collaborator) commented Nov 15, 2023

Scenario to be validated before implementation:

Schema v1

{"namespace":"com.michelin.kafka.producer.showcase.avro","type":"record","name":"PersonAvro","fields":[{"name":"firstName","type":["null","string"],"default":null,"doc":"First name of the person"},{"name":"lastName","type":["null","string"],"default":null,"doc":"Last name of the person"},{"name":"dateOfBirth","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"default":null,"doc":"Date of birth of the person"}]}

and we just swap the order of the firstName and lastName fields.

Schema v2

{"namespace":"com.michelin.kafka.producer.showcase.avro","type":"record","name":"PersonAvro","fields":[{"name":"lastName","type":["null","string"],"default":null,"doc":"Last name of the person"},{"name":"firstName","type":["null","string"],"default":null,"doc":"First name of the person"},{"name":"dateOfBirth","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"default":null,"doc":"Date of birth of the person"}]}

Based on canonical strings, the schemas are different. But when applying v2 to the Schema Registry, is the Schema Registry smart enough to detect that it is the exact same schema (just with fields in a different order), or does it create a new schema?

➡️ If a new schema is created, the scenario is OK (and it seems to be, according to my quick tests).
➡️ If no new schema is created and v1 is returned from the API call, the scenario is invalid.
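The canonical-string side of this scenario can be checked directly, since Avro's canonical form preserves field order. A quick sketch using trimmed-down versions of the schemas above (same assumed kafka-schema-registry-client dependency):

```java
import io.confluent.kafka.schemaregistry.avro.AvroSchema;

public class FieldOrderCheck {
    public static void main(String[] args) {
        // Simplified v1 and v2: same fields, swapped order
        String v1 = "{\"type\":\"record\",\"name\":\"PersonAvro\","
                + "\"fields\":[{\"name\":\"firstName\",\"type\":\"string\"},"
                + "{\"name\":\"lastName\",\"type\":\"string\"}]}";
        String v2 = "{\"type\":\"record\",\"name\":\"PersonAvro\","
                + "\"fields\":[{\"name\":\"lastName\",\"type\":\"string\"},"
                + "{\"name\":\"firstName\",\"type\":\"string\"}]}";

        // Canonical form keeps field order, so these canonical strings
        // differ: swapping fields really is a different schema
        System.out.println(new AvroSchema(v1).canonicalString()
                .equals(new AvroSchema(v2).canonicalString())); // prints false
    }
}
```

This confirms the dry-run comparison would report CHANGED for swapped fields, which is consistent with the Schema Registry registering a new version in that case.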

A full integration test to validate this could be:
➡️ Create a schema v1 through ns4kafkaClient
➡️ Check created status + created on SR through SR client
➡️ Create the schema v2 (swap firstName and lastName) through ns4kafkaClient
➡️ Check changed status + schema v2 created on SR through SR client
➡️ Create the same schema v2 (swap namespace and type fields) through ns4kafkaClient. It should return unchanged.
➡️ Create the same schema v2 (swap namespace and type fields) through SR client. It should return schema v1 ID.


💡 The check avroSchemaV1.canonicalString().equals(avroSchemaV2.canonicalString()) can be run before validating the schema compatibility. I think there's nothing wrong with doing this, and it will save a lot of HTTP calls when the schema is unchanged.

💡 Think about adding the Confluent repository when using implementation('io.confluent:kafka-schema-registry-client:7.5.1'):

repositories {
    mavenCentral()
    maven {
        url "https://packages.confluent.io/maven"
    }
}

@loicgreffier
Collaborator

#344
