Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Neo DateTime FFI #5940

Open
sffc opened this issue Dec 20, 2024 · 13 comments
Open

Neo DateTime FFI #5940

sffc opened this issue Dec 20, 2024 · 13 comments
Assignees
Labels
C-datetime Component: datetime, calendars, time zones S-large Size: A few weeks (larger feature, major refactoring)

Comments

@sffc
Copy link
Member

sffc commented Dec 20, 2024

Some thoughts on the structure of DateTime FFI.

Types: we might not need very many

The Rust type DateTimeFormatter<CompositeFieldSet> does everything. We can have many constructors that load different subsets of data. The reason to have multiple types is to reduce stack size (memory usage in FFI) and make the Drop impl a bit smaller, which I think is valuable.

It probably makes sense to start with these types:

FFI type Rust type Approximate stack size
TimeFormatter TimeFormatter<TimeFieldSet> 272
DateFormatter DateTimeFormatter<DateFieldSet> 440
DateTimeFormatter DateTimeFormatter<CompositeDateTimeFieldSet> 480
ZonedDateTimeFormatter DateTimeFormatter<CompositeFieldSet> 1472
GregorianDateTimeFormatter FixedCalendarDateTimeFormatter<Gregorian, CompositeDateTimeFieldSet> 424
GregorianZonedDateTimeFormatter FixedCalendarDateTimeFormatter<Gregorian, CompositeFieldSet> 1416
size_test!(TimeFormatter<fieldsets::enums::TimeFieldSet>, time_formatter_size, 272);
size_test!(DateTimeFormatter<fieldsets::enums::DateFieldSet>, date_formatter_size, 440);
size_test!(DateTimeFormatter<fieldsets::enums::CompositeDateTimeFieldSet>, date_time_formatter_size, 480);
size_test!(DateTimeFormatter<fieldsets::enums::CompositeFieldSet>, zoned_date_time_formatter_size, 1472);
size_test!(FixedCalendarDateTimeFormatter<icu_calendar::Gregorian, fieldsets::enums::CompositeDateTimeFieldSet>, gregorian_date_time_formatter_size, 424);
size_test!(FixedCalendarDateTimeFormatter<icu_calendar::Gregorian, fieldsets::enums::CompositeFieldSet>, gregorian_zoned_date_time_formatter_size, 1416);

Maybe we don't need the Gregorian types. The only difference is the presence of the AnyCalendar field and its associated destructor, which seems to be fairly small in the grand scheme. We can still of course have the Gregorian constructors, but they can resolve to these types.

I think we can also have just DateTimeFormatter and not DateFormatter.

Basically: time zone data is way bigger than date data which is way bigger than time data. So if you already have date data, removing time data is mostly in the noise, and if you already have zone data, removing date data is mostly in the noise.

Static Field Set Constructors

For dates, times, and datetimes, I think we can just have all the constructors. There are 20ish.

I think we should flatten the field set options into positional arguments, like we do in some other types. This avoids having a bunch of tiny field set structs like we do in Rust.

For time zones, I see a few options…

  1. Maybe we could have a fn to "add" time zone data to a DateTimeFormatter, transforming it into a ZonedDateTimeFormatter. This would require a bit of work on the Rust side, but I think it's doable.
  2. Or we do a hybrid approach where we have time zone constructors for each style and they take a dynamic datetime field set as an argument. You won't get slicing of date data, but date data is small relative to time zone data.
  3. Anything else?

Dynamic Field Set Constructors

The Rust approach is infeasible over FFI, since it is built on the idea of a bunch of little structs that compose into bigger structs and eventually into an all-encompassing dataful enum.

I think we instead want to introduce a builder API. I think we should start in Rust and then port it to FFI. I've had in mind for a while that the builder API would work similarly to the Serde impl.

There are a few approaches for specifying the field sets:

  1. Enums and structs: specify all date, calendar period, time, and zone fields in plain enums. Then you put the ones you want into a struct. The constructor is fallible if you put together an invalid combination of field sets, but the field sets individually are always valid. This is slightly more type-safe.
  2. Set of fields: export a single enum with possible fields, and the field set is a vector of that enum. Then we check to see if the fields you put in the vector compose a valid field set. This is what the Serde impl does.
  3. String: specify the semantic field set as a string, perhaps in JSON syntax, and use the Serde impl directly.

I lean toward option 2 because it is a smaller API surface, but I can be easily persuaded. If option 1 can be implemented more efficiently since it is structs and enums instead of a vector, that would convince me.

Alternatively, we could say that we don't export dynamic field sets over FFI. Clients in each language can build their own delegation mechanism that sits on top of the static constructors. This would be annoying for clients, but it is totally feasible. And if CLDR adds a string semantic skeleton syntax, we can add a single narrow constructor for that.

@Manishearth

@sffc sffc added the C-datetime Component: datetime, calendars, time zones label Dec 20, 2024
@sffc sffc added this to the ICU4X 2.0 ⟨P1⟩ milestone Dec 20, 2024
@Manishearth
Copy link
Member

Manishearth commented Dec 20, 2024

The only difference is the presence of the AnyCalendar field and its associated destructor, which seems to be fairly small in the grand scheme

AnyCalendar pulls in a lot of data statically, no? We could add try_new_gregorian() (etc) functions to it to reduce it a bit, but we'd still be stuck with tons of symbols data for the formatters.

I like your first set of six types.

Maybe we could have a fn to "add" time zone data to a DateTimeFormatter, transforming it into a ZonedDateTimeFormatter. This would require a bit of work on the Rust side, but I think it's doable.

I'd prefer to avoid doing mutation/moving over FFI if possible, there are safety concerns there. Eventually I'd like diplomat to start enforcing these things which would make me a bit more relaxed about it but right now our enforcement basically is "we have very few mutable types and we know how they get used".

In particular, I see hazards of mixing this with parts formatting since that involves persisting borrows.

Not opposed, but disprefer it.

I think we instead want to introduce a builder API. I think we should start in Rust and then port it to FFI. I've had in mind for a while that the builder API would work similarly to the Serde impl.

Quite interested.

(Yes, builder APIs also involve mutation, but this is a bit more relaxed: it's abnormal to ever borrow from a builder, builder types exist to be mutated and then consumed, and there are no other operations people typically add)

I lean toward option 2 because it is a smaller API surface, but I can be easily persuaded. If option 1 can be implemented more efficiently since it is structs and enums instead of a vector, that would convince me.

Requiring additional allocation for each DTF construction doesn't seem great. I do like option 1 and think that validation won't be too hard since Rust exhaustive matching will force us to handle everything. Might be messy, I'm not sure, if I recall correctly the "valid" fieldset combinations are actually pretty easily described.

String: specify the semantic field set as a string, perhaps in JSON syntax, and use the Serde impl directly.

I think this would be a fine API to have for 2.0. People probably will be wanting this anyway. It's not going to be efficient, but it's a nice API to have and gives us breathing room to design something good if we want.

Edit: Shane mentioned that semantic skeleton string formats are not specified in UTS 35, so this probably needs more work and is not an easy way out.

May have more thoughts later.

@sffc
Copy link
Member Author

sffc commented Dec 21, 2024

I put up a new builder API in #5944.

We now have several ways of specifying skeletons:

2.0-BETA1 Serde

{
  "fieldSet": ["year","month","day","weekday","time","zoneGeneric"],
  "length":"medium",
  "alignment":"column",
  "yearStyle":"always",
  "timePrecision":"secondF3"
}

PR #5944

{
  "dateFields":"YMDE",
  "zoneStyle":"Zs",
  "length":"medium",
  "alignment":"column",
  "yearStyle":"always",
  "timePrecision":"secondF3"
}

Rust API

fieldsets::YMDET::medium()
  .with_alignment(Alignment::Column)
  .with_year_style(YearStyle::Always)
  .with_time_precision(TimePrecision::SecondF3)
  .with_zone_generic()

ECMA-402 (for comparison)

{
  "era": "short",
  "year": "numeric",
  "month": "short",
  "day": "2-digit",
  "hour": "2-digit",
  "minute": "numeric",
  "second": "numeric",
  "fractionalSecondDigits": 3,
  "timeZoneName": "shortGeneric",
}

Comparing and contrasting:

2.0-BETA1 Serde PR #5944 Rust API ECMA-402
Type Safety Enums for options, but allows invalid sets of fields and/or options. Enums for options and fields, but allows invalid combinations of options. Fully type-safe Allows invalid sets of fields and/or options, including options not allowed by semantic skeletons.
FFI Compatibility Medium; needs a vector Easy Difficult Easy
ECMA-402 Compatibility Conversion is relatively straightforward Conversion is possible but takes a bit of work and a bunch of matching in userland Conversion is possible but likely requires even more work than PR #5944 N/A
Readability @sffc thinks this is the most readable since it avoids jargon like "YMDET" @Manishearth thinks this is the most readable since it is easier to parse, and people should know what "YMDET" means Mostly readable, aside from the YMDET Mostly readable, though verbose

What I think we should do, after a brief conversation with @Manishearth:

  1. Adopt PR Add new datetime field set builder #5944 for FFI
  2. Export the builder to Rust users as an alternative to the type-safe API
  3. Change the Serde impl to use the new schema
  4. Export a standalone API to map from ECMA-402 skeletons to ICU4X skeletons (instead of doing this in userland)
  5. Ask CLDR-TC to adopt this new schema as the standard serialized form in UTS 35

Thoughts? @zbraniecki

sffc added a commit that referenced this issue Dec 21, 2024
@sffc
Copy link
Member Author

sffc commented Dec 28, 2024

AnyCalendar pulls in a lot of data statically, no?

Only if the AnyCalendar constructor is used. I propose using the Gregorian constructor and then casting the resulting FixedCalendarDateTimeFormatter into a DateTimeFormatter via into_formatter.

I really want a way to allow for time zone selection over FFI without including all time zone data, but I guess for 2.0 we can just say, here's the builder API, and if you want less data, upvote this issue and write a custom FFI fn in the interim.

@sffc
Copy link
Member Author

sffc commented Feb 7, 2025

A few more thoughts on ZonedDateTimeFormatter FFI:

I'm only considering options that allow for DCE-based data slicing. Here are some options with C++ pseudocode:

// Option 1: build a DTF first and move out of it to get a ZDTF
auto formatter = DateTimeFormatter::create_mdet(locale).with_zone_specific_short();

// Option 1a: don't automatically retain the locale
auto formatter = DateTimeFormatter::create_mdet(locale).with_zone_specific_short(locale);

// Option 2: builder type specifically for ZDTF
auto builder = ZonedDateTimeFormatterBuilder::new(locale);
builder.load_mdet();
builder.load_zone_specific_short();
auto formatter = builder.build();

// Option 2a: similar API to option 1
auto builder = ZonedDateTimeFormatterBuilder::create_mdet(locale);
auto formatter = builder.build_with_zone_specific_short();

// Option 3: builder type for all datetime formatters, replacing type-specific constructors
auto builder = DateTimeFormatterBuilder::new(locale);
builder.load_mdet();
builder.load_zone_specific_short();
auto formatter = builder.build_zoned_datetime();

// Option 3a: a more type-safe version
DateTimeFormatterBuilder::new(locale)
    .with_mdet()
    .with_zone_specific_short()
    .build();

A bit more of a description of each:

  • Option 1: Move out of DTF
    • Pro: Relatively fewer APIs
    • Con: DTF format function can start throwing an "invalid state" exception
    • Con: DTF needs to retain the locale in case someone tries to call a ZDTF constructor later
    • Con: Prevents us from in the future adding APIs that borrow the DTF since we want to avoid both-borrow-and-mutate types in FFI
  • Option 1a: Move out of DTF and do not retain the locale
    • Pro: Regular DTF has almost no penalty
    • Con: The locales could end up being mismatched
  • Option 2: Builder for ZDTF
    • Pro: Decouples DTF from ZDTF
    • Con: Lots of code duplication; for example, we already have 4 create_mdet variants (anycal, fixedcal, anycal-buffer, fixedcal-buffer), and this creates 4 more
    • Con: This would likely be an FFI-only abstraction
  • Option 2a: Builder for ZDTF with API more like option 1
    • Pros and cons similar to option 1
  • Option 3: Builder type for all datetime formatters
    • Pro: Single centralized builder that we can document clearly
    • Pro: Could also be motivated as a Rust abstraction, keeping better alignment between Rust and FFI
    • Pro: No duplicate functions
    • Con: DTF loses its nice constructors
    • Con: User is required to call the correct build function
  • Option 3a: Builder type for all datetime formatters with more type safety
    • Pro: Looks a lot like the Rust static fieldset API, although it is implemented differently (each with function performs a data load)
    • Con: DTF loses its nice constructors

For any of these options, there are a few mechanical ways to implement the "build" function:

  1. Builder function takes &mut self, moves out of Option fields, and returns an opaque
  2. Builder function takes &self, moves out of RefCell or Mutex fields (possibly depending on a sync feature), and returns an opaque
    • Con: Need to think about sync; slightly more expensive than regular mutation
  3. Builder function takes &self, clones the data (which in most cases is a cheap clone), and returns an opaque
    • Con: The clone might not be cheap, and it is probaly more expensive than moving

I am currently thinking of exploring Option 3a with RefCell/Mutex gated on a sync feature added to icu_capi (which will also imply other sync features such as icu_provider/sync).

@Manishearth @robertbastian

@sffc
Copy link
Member Author

sffc commented Feb 11, 2025

@Manishearth The &mut self is still useful to Diplomat to indicate to users that the method is a mutating method. Should we do both &mut self and RefCell/Mutex for visibility and safety together?

It also means we could remove the RefCell/Mutex in a stable way if Diplomat solves the mutation problem.

@Manishearth
Copy link
Member

Oh, that's an interesting idea. Then we don't have to worry about people misusing it in a multithreaded context1.

Yeah, go for it!

Footnotes

  1. It's not possible in a singlethreaded context: nothing here is reentrant because we don't have callbacks.

@Manishearth
Copy link
Member

I am currently thinking of exploring Option 3a with RefCell/Mutex gated on a sync feature added to icu_capi (which will also imply other sync features such as icu_provider/sync).

Also I think I said this in person but I am happy with this plan, and do not have strong opinions on the choice of sync mitigation strategy though your latest idea seems liek the best.

@robertbastian
Copy link
Member

Builder function takes &self, clones the data (which in most cases is a cheap clone), and returns an opaque
Con: The clone might not be cheap, and it is probaly more expensive than moving

Which clones here will be expensive? The options are all Copy, the Locale is borrowed.

@sffc
Copy link
Member Author

sffc commented Feb 11, 2025

Builder function takes &self, clones the data (which in most cases is a cheap clone), and returns an opaque
Con: The clone might not be cheap, and it is probaly more expensive than moving

Which clones here will be expensive? The options are all Copy, the Locale is borrowed.

Each step of the builder loads data and returns DataErrors (or DateTimeLoadErrors). It's the only way to make data slicing work.

@robertbastian
Copy link
Member

How would that work? You'd have to poke a lot of holes in to the datetime crate. What are the intermediate builder types?

Cloning DataPayloads should be cheap, it's either a ref copy, or an Rc bump, which is still cheaper than a RefCell.

@sffc
Copy link
Member Author

sffc commented Feb 11, 2025

How would that work? You'd have to poke a lot of holes in to the datetime crate. What are the intermediate builder types?

I'd probably build it in Rust first: "Could also be motivated as a Rust abstraction, keeping better alignment between Rust and FFI"

Cloning DataPayloads should be cheap, it's either a ref copy, or an Rc bump, which is still cheaper than a RefCell.

In Rust, I wouldn't want to rely on this, since people can make payloads that own their data, but maybe it's okay in FFI? And we tell people that if they are making custom payloads, then don't use this function and make your own FFI wrapper?

I'm maybe okay with that. If we do this, then I'd prefer to implement option 1 or 1a. @Manishearth?

@sffc
Copy link
Member Author

sffc commented Feb 11, 2025

From a brief discussion, @Manishearth is okay with this path, and @robertbastian prefers the ergonomics of option 1.

I also think option 1 has the best ergonomics, though I am a bit sad to be adding a DataLocale to the heap size of DateTimeFormatter even if you don't normally need it.

@sffc
Copy link
Member Author

sffc commented Feb 12, 2025

Been thinking a bunch about how to implement this in a way that avoids expanding the Rust API in undesirable ways.

Here's an idea I had:

impl<C, FSet> TypedDateTimeNames<C, FSet> {
    pub fn from_formatter(
        self,
        preferences: DateTimeFormatterPreferences,
        formatter: FixedCalendarDateTimeFormatter<C, FSet>
    ) -> Self {
        // pull out the inner field of the formatter; throw away the pattern
    }
}

impl<C, FSet> TypedDateTimeNames<C, FSet> {
    pub fn try_into_formatter_unstable<P>(
        self,
        provider: &P,
        field_set: FSet,
    ) -> Result<FixedCalendarDateTimeFormatter<C, FSet>>
        /* where P contains FSet's Pattern-related data but not any names*/
    {
        // load the pattern and check that it matches
    }
}

This way, we can convert a formatter into a TypedDateTimeNames, load the time zone data, and then convert it back with a new pattern. Note that this is a low-level operation, so I put the APIs on TypedDateTimeNames and not on FixedCalendarDateTimeFormatter.

With DateTimeFormatter, I think we can go through TypedDateTimeNames<(), FSet>.

Manishearth pushed a commit that referenced this issue Feb 14, 2025
Toward #5940 

This allows a client such as FFI to add arbitrary data to a
DateTimeFormatter instance.

I put the conversion functions and docs on the DateTimeNames types
because this is a power-user feature.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-datetime Component: datetime, calendars, time zones S-large Size: A few weeks (larger feature, major refactoring)
Projects
None yet
Development

No branches or pull requests

3 participants