Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generated output differs depending on order of input files #952

Open
wchargin opened this issue Jan 15, 2025 · 6 comments
Open

Generated output differs depending on order of input files #952

wchargin opened this issue Jan 15, 2025 · 6 comments

Comments

@wchargin
Copy link

If you pass multiple file names on the command line to protoc while using protoc-gen-dart, the contents of the generated files may depend on the order in which you listed the inputs.

This is problematic because it makes it harder to run automated checks like "test that the generated outputs are up to date" with simple invocations like find . -name '*.proto' -exec protoc ... {} +. This can be mitigated by sorting the file list, but (a) it took me a while to figure out what was going on here and come up with that solution, and (b) it should be the duty of the plugin to produce deterministic output.

I also generate bindings for C++, Go, Python, and TypeScript, both with and without gRPC support, and none of those other language bindings exhibits this issue.

Repro

Construct a few protos that depend on each other:

root@6ff5b1d60b11:/data# tail *.proto
==> a.proto <==
syntax = "proto3";
message A {}

==> b.proto <==
syntax = "proto3";
message B {}

==> has_a.proto <==
syntax = "proto3";

import "a.proto";

message HasA {
  A a = 1;
}

==> has_ab.proto <==
syntax = "proto3";

import "a.proto";
import "b.proto";

message HasAB {
  A a = 1;
  B b = 2;
}

==> has_b.proto <==
syntax = "proto3";

import "b.proto";

message HasB {
  B b = 1;
}

Compile them in two different orders:

root@6ff5b1d60b11:/data# rm -rf out_ab out_ba && mkdir out_ab out_ba
root@6ff5b1d60b11:/data# protoc --dart_out=out_ab has_a.proto has_b.proto has_ab.proto
root@6ff5b1d60b11:/data# protoc --dart_out=out_ba has_b.proto has_a.proto has_ab.proto

Actual behavior

Note that the generated code differs between the two versions:

root@6ff5b1d60b11:/data# diff -u out_ab/has_ab.pb.dart out_ba/has_ab.pb.dart | head
--- out_ab/has_ab.pb.dart       2025-01-15 20:42:37.746806003 +0000
+++ out_ba/has_ab.pb.dart       2025-01-15 20:42:43.621806005 +0000
@@ -9,20 +9,20 @@

 import 'package:protobuf/protobuf.dart' as $pb;

-import 'a.pb.dart' as $0;
-import 'b.pb.dart' as $1;
+import 'a.pb.dart' as $1;
+import 'b.pb.dart' as $0;
root@6ff5b1d60b11:/data# diff -u out_ab/has_a.pb.dart out_ba/has_a.pb.dart | head
--- out_ab/has_a.pb.dart        2025-01-15 20:42:37.745806003 +0000
+++ out_ba/has_a.pb.dart        2025-01-15 20:42:43.621806005 +0000
@@ -9,17 +9,17 @@

 import 'package:protobuf/protobuf.dart' as $pb;

-import 'a.pb.dart' as $0;
+import 'a.pb.dart' as $1;

 class HasA extends $pb.GeneratedMessage {

Expected behavior

The output should be bitwise identical.

Version info

root@6ff5b1d60b11:/data# dart pub global list
protoc_plugin 20.0.1
@wchargin
Copy link
Author

This can be mitigated by sorting the file list

This is the mitigation that I've been using, but I want to emphasize that it's still far from optimal. A 2-line change to one file can cause thousands of lines of changes to generated dart bindings in totally unrelated files, just because some internal structure in the compiler changed. Diffs like this:

 ///
 //  Generated code. Do not modify.
 //  source: aaa/bbb.proto
 //
 // @dart = 2.12
 // ignore_for_file: annotate_overrides,camel_case_types,constant_identifier_names,directives_ordering,library_prefixes,non_constant_identifier_names,prefer_final_fields,return_of_invalid_type,unnecessary_const,unnecessary_import,unnecessary_this,unused_import,unused_shown_name

 import 'dart:core' as $core;

 import 'package:protobuf/protobuf.dart' as $pb;

-import 'xxx.pb.dart' as $62;
+import 'xxx.pb.dart' as $63;
 import '../yyy/zzz.pb.dart' as $1;

It would be ideal to see the $123 import symbols in each file depend only on that file—maybe as simple as "the first one is $0 and they go from there" in each file, rather than using some global shared state.

@osa1
Copy link
Member

osa1 commented Feb 25, 2025

This is trivial to fix (just one line to sort arguments), but I'm unable to reproduce this in a unit test. So far I have this:

Unit test
// Copyright (c) 2013, the Dart project authors.  Please see the AUTHORS file
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.

import 'package:protoc_plugin/indenting_writer.dart';
import 'package:protoc_plugin/protoc.dart';
import 'package:protoc_plugin/src/generated/descriptor.pb.dart';
import 'package:protoc_plugin/src/generated/plugin.pb.dart';
import 'package:protoc_plugin/src/linker.dart';
import 'package:protoc_plugin/src/options.dart';
import 'package:protoc_plugin/src/output_config.dart';
import 'package:test/test.dart';

import 'dart:io';

void main() {
  test(
    'CodeGenerator generates imports in the same order regardless of argument order (#952)',
    () {
      // message A {}
      final messageA = DescriptorProto()..name = 'A';
      final fileA =
          FileDescriptorProto()
            ..name = 'a.proto'
            ..messageType.add(messageA)
            ..syntax = 'proto3';

      // message B {}
      final messageB = DescriptorProto()..name = 'B';
      final fileB =
          FileDescriptorProto()
            ..name = 'b.proto'
            ..messageType.add(messageB)
            ..syntax = 'proto3';

      // message HasA {
      //   A a = 1;
      // }
      final messageHasA =
          DescriptorProto()
            ..name = 'HasA'
            ..field.add(
              FieldDescriptorProto()
                ..name = 'a'
                ..number = 1
                ..type = FieldDescriptorProto_Type.TYPE_MESSAGE
                ..typeName = '.A'
                ..jsonName = 'a',
            );
      final fileHasA =
          FileDescriptorProto()
            ..name = 'has_a.proto'
            ..dependency.addAll(['a.proto'])
            ..messageType.add(messageHasA)
            ..syntax = 'proto3';

      // message HasB {
      //   B b = 1;
      // }
      final messageHasB =
          DescriptorProto()
            ..name = 'HasB'
            ..field.add(
              FieldDescriptorProto()
                ..name = 'b'
                ..number = 1
                ..type = FieldDescriptorProto_Type.TYPE_MESSAGE
                ..typeName = '.B'
                ..jsonName = 'b',
            );
      final fileHasB =
          FileDescriptorProto()
            ..name = 'has_b.proto'
            ..dependency.addAll(['b.proto'])
            ..messageType.add(messageHasB)
            ..syntax = 'proto3';

      // message HasAB {
      //   A a = 1;
      //   B b = 2;
      // }
      final messageHasAB =
          DescriptorProto()
            ..name = 'HasAB'
            ..field.addAll([
              FieldDescriptorProto()
                ..name = 'a'
                ..number = 1
                ..type = FieldDescriptorProto_Type.TYPE_MESSAGE
                ..typeName = '.A'
                ..jsonName = 'a',
              FieldDescriptorProto()
                ..name = 'b'
                ..number = 2
                ..type = FieldDescriptorProto_Type.TYPE_MESSAGE
                ..typeName = '.B'
                ..jsonName = 'b',
            ]);
      final fileHasAB =
          FileDescriptorProto()
            ..name = 'has_ab.proto'
            ..dependency.addAll(['a.proto', 'b.proto'])
            ..messageType.add(messageHasAB)
            ..syntax = 'proto3';

      final request1 =
          CodeGeneratorRequest()
            ..protoFile.addAll([fileA, fileHasA, fileB, fileHasB, fileHasAB])
            ..fileToGenerate.addAll([
              'has_a.proto',
              'has_b.proto',
              'has_ab.proto',
            ]);

      File('request1.txt').writeAsStringSync(request1.toString());

      final request2 =
          CodeGeneratorRequest()
            ..protoFile.addAll([fileB, fileHasB, fileA, fileHasA, fileHasAB])
            ..fileToGenerate.addAll([
              'has_b.proto',
              'has_a.proto',
              'has_ab.proto',
            ]);

      File('request2.txt').writeAsStringSync(request2.toString());

      final generatedFiles1 = generateFiles(request1);
      final hasAB1 = generatedFiles1['has_ab.pb.dart']!;
      final imports1 = importLines(
        hasAB1,
      ).where((s) => s.contains('a.pb.dart') || s.contains('b.pb.dart'));
      print(imports1);

      final generatedFiles2 = generateFiles(request2);
      final hasAB2 = generatedFiles2['has_ab.pb.dart']!;
      final imports2 = importLines(
        hasAB2,
      ).where((s) => s.contains('a.pb.dart') || s.contains('b.pb.dart'));
      print(imports2);
    },
  );
}

Map<String, CodeGeneratorResponse_File> generateFiles(
  CodeGeneratorRequest request,
) {
  final response = CodeGeneratorResponse();
  final options = parseGenerationOptions(request, response)!;
  final generators = <FileGenerator>[];
  for (final file in request.protoFile) {
    generators.add(FileGenerator(file, options));
  }
  link(options, generators);
  final files = <String, CodeGeneratorResponse_File>{};
  for (final gen in generators) {
    final name = gen.descriptor.name;
    if (request.fileToGenerate.contains(name)) {
      for (final genFile in gen.generateFiles(
        const DefaultOutputConfiguration(),
      )) {
        assert(!files.containsKey(genFile.name));
        files[genFile.name] = genFile;
      }
    }
  }
  return files;
}

List<String> importLines(CodeGeneratorResponse_File file) =>
    file.content.split('\n').where((s) => s.startsWith('import')).toList();

This should run the code generator with the same input as the original repro in this issue, but somehow the imports are always in the same order in generated code.

I've run out of time for now, I'll have a look later. In the meantime if anyone knows how to test this the fix is just this:

diff --git a/protoc_plugin/lib/src/code_generator.dart b/protoc_plugin/lib/src/code_generator.dart
index 9f324c6..340c439 100644
--- a/protoc_plugin/lib/src/code_generator.dart
+++ b/protoc_plugin/lib/src/code_generator.dart
@@ -105,6 +105,9 @@ class CodeGenerator {
       request.mergeFromCodedBufferReader(reader, extensions);
       reader.checkLastTagWas(0);

+      request.protoFile
+          .sort((desc1, desc2) => desc1.name.compareTo(desc2.name));
+
       final response = CodeGeneratorResponse();

       // Parse the options in the request. Return the errors if any.

@wchargin
Copy link
Author

This is trivial to fix (just one line to sort arguments)

Thanks for looking into it. It looks like this is effectively sorting the list of command-line arguments, right? If so, perhaps please read my comment above. This mitigation would make the output deterministic, but still means that if I change one line in one file I can have thousands of lines of diffs in unrelated files because some globally shared indices have changed. It would be ideal for the indices to be local to each file.

I'm unable to reproduce this in a unit test

Curious. I'll meet you halfway with a self-contained (Dockerized) version of my initial report that hopefully you can at least reproduce:
https://gist.github.com/wchargin/fac7e3bad80babdd2f1099c4fbc4c6aa

You can clone the Gist and run ./run.sh:

$ git clone git@gist.github.com:fac7e3bad80babdd2f1099c4fbc4c6aa.git gist
Cloning into 'gist'...
$ cd gist
$ ./run.sh
[+] Building 0.0s (8/8) FINISHED                                         docker:desktop-linux
 => => naming to docker.io/library/dart-protobuf-952                                     0.0s

Dart SDK version: 3.7.0 (stable) (Wed Feb 5 04:53:58 2025 -0800) on "linux_arm64"
protoc_plugin 21.1.2
libprotoc 3.21.12

--- out_ab/has_ab.pb.dart       2025-02-25 15:16:36.223334000 +0000
+++ out_ba/has_ab.pb.dart       2025-02-25 15:16:36.295334001 +0000
@@ -13,13 +13,13 @@

 import 'package:protobuf/protobuf.dart' as $pb;

-import 'a.pb.dart' as $0;
-import 'b.pb.dart' as $1;
+import 'a.pb.dart' as $1;
+import 'b.pb.dart' as $0;

 class HasAB extends $pb.GeneratedMessage {
 ...

@wchargin
Copy link
Author

Here's a self-contained example to demonstrate what I mean about outputs changing even when their inputs didn't change, even when the file names are always passed in sorted order on the command line: https://gist.github.com/wchargin/ae49e838b5537eb391af306d8a42c03c

(Run as above: on a machine with Docker, clone the Gist and run ./run.sh.)

Salient contents:

// has_bc.proto
syntax = "proto3";

import "element_b.proto";
import "element_c.proto";

message HasBC {
  ElementB b = 1;
  ElementC c = 2;
}
// anomaly.proto, version 1
syntax = "proto3";

message Anomaly {
}
// anomaly.proto, version 2
syntax = "proto3";

import "element_a.proto";

message Anomaly {
  ElementA a = 1;
}

Suppose that these files are compiled as protoc --dart_out=out anomaly.proto has_bc.proto. With the initial version of anomaly, the global import order will be:

  • 'element_b.pb.dart' as $0
  • 'element_c.pb.dart' as $1

But when anomaly is changed to import element_a, then the global import order changes to:

  • 'element_a.pb.dart' as $0
  • 'element_b.pb.dart' as $1
  • 'element_c.pb.dart' as $2

This means that making a two-line change to anomaly.proto, without changing any other input files or command-line invocations, has caused changes to the generated has_bc.pb.dart:

--- out1/has_bc.pb.dart 2025-02-25 15:56:17.613794005 +0000
+++ out2/has_bc.pb.dart 2025-02-25 15:56:17.686794005 +0000
@@ -13,13 +13,13 @@

 import 'package:protobuf/protobuf.dart' as $pb;

-import 'element_b.pb.dart' as $0;
-import 'element_c.pb.dart' as $1;
+import 'element_b.pb.dart' as $1;
+import 'element_c.pb.dart' as $2;

 class HasBC extends $pb.GeneratedMessage {
 ...

And that's why I think that just sorting input filenames doesn't really suffice to fix the underlying issue.

@osa1
Copy link
Member

osa1 commented Feb 25, 2025

This mitigation would make the output deterministic, but still means that if I change one line in one file I can have thousands of lines of diffs in unrelated files because some globally shared indices have changed

We should confirm (I haven't implemented this so I don't know) but my understanding is each generated file gets a unique index. So blah.pb.dart will always have the same as $... in all of the outputs.

I agree that sorting imports withing libraries and giving them local numbers would make sense as it would generate the same code in more cases.

Curious. I'll meet you halfway with a self-contained (Dockerized) version of my initial report that hopefully you can at least reproduce

We don't need docker or anything that's more complicated than a Dart unit test, we should be able to test this as a unit test. I'll investigate why my test above doesn't reproduce this. Maybe some internal map or set iteration order is different in my test vs. the command-line invocation of protoc with the Dart plugin.

@osa1
Copy link
Member

osa1 commented Feb 26, 2025

This should run the code generator with the same input as the original repro in this issue, but somehow the imports are always in the same order in generated code.

I also tried to dump CodeGeneratorRequests passed to Dart protoc plugin and loaded them directly in the tests, but the output is still the same in both invocations.

I checked maps and sets in protoc_plugin, they're either LinkedHash... or SplayTree..., so no non-determinism there.

The only other source of non-determinism I can think of is object identities in combination of splay tree maps and sets, but our SplayTreeSets have String keys, which have stable identities.

So I'm out of ideas as to what may be causing this non-determinism when the same compilation is done in a unit test vs. via protoc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants