Update dumper to allow not force format #19

Closed
cschloer opened this issue Jun 5, 2019 · 8 comments

cschloer commented Jun 5, 2019

Currently the dump processors force a particular format for certain values (datetimes, for example). We need the ability to use the format value within the schema (saved in the datapackage.json) as the format used to dump datetime values. The suggestion was to just have a flag in dump_to_* that, when set to true, uses whatever is in the schema rather than a forced format.

This is higher priority than the rest, as we currently have to keep datetime fields as strings because their format is forced.
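
For concreteness, a minimal sketch of the behaviour being described, using dataflows with inline rows so it is self-contained (the resource name and output path are dataflows defaults):

from dataflows import Flow, set_type, dump_to_path

# Inline rows so the sketch runs as-is.
rows = [{'when': '2019-06-05 12:00:00'}]

Flow(
    rows,
    set_type('when', type='datetime', format='%Y-%m-%d %H:%M:%S'),
    dump_to_path('out'),
).process()

# The dumped out/res_1.csv serializes 'when' with the dumper's own fixed
# datetime format, not the '%Y-%m-%d %H:%M:%S' declared in the schema -
# the forced format this issue asks to make optional.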

roll commented Jul 7, 2019

@akariv
@cschloer
Here is a short summary:

  • BCO-DMO had been having problems with saving date/time data because of format forcing at the dump_to_path stage
  • we had a meeting with BCO-DMO and Adam and decided to try to provide an option to use user-defined date/time formats
  • then this issue was created with the goal of being able to:
    • set schema.fields[].format in some update_resource stage
    • use a flag telling dump_to_path to use this format for saving data and metadata
  • I've created a POC - Allow custom temporal formats for dump_to_path datahq/dataflows#100 - it works but is slightly hacky and requires setting schema.fields[].outputFormat in the update_resource stage instead of format, because of the way DPP works (the format of date/times cannot be changed, since that breaks parsing). See the PR's user interface for dataflows below.
  • Adam suggested adding the ability to provide a custom dialect and serializers for dumpers. I think that can make the PR's code better, but it cannot resolve the initial issue on its own; otherwise the solution (how to use and handle format etc.) would have to be written entirely on the client side
  • At this point I don't see how I can contribute much more, because we need to understand:
    • (first) what user API we're looking for (e.g. an exact datapackage-pipelines.yml specification for some exemplar data) from BCO-DMO's side AND
    • (second) what kind of API Adam is willing to add to DPP, based on the answer to the first question

def test_force_temporal_format():
    import datetime
    from dataflows import Flow, load, update_resource, dump_to_path

    # Dump
    Flow(
        load('data/temporal.csv', name='temporal'),
        update_resource(['temporal'], **{
            'schema': {
                'fields': [
                    {'name': 'date', 'type': 'date', 'format': '%Y-%m-%d', 'outputFormat': '%d/%m/%y'},
                    {'name': 'event', 'type': 'string'},
                ]
            }
        }),
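        # force_temporal_format=False (the POC's flag): serialize temporal
        # values using the schema's outputFormat instead of the forced format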
        dump_to_path('data/force_temporal_format', force_temporal_format=False)
    ).process()

    # Load
    flow = Flow(
        load('data/force_temporal_format/datapackage.json')
    )
    data, dp, stats = flow.results()

    # Assert
    assert dp.descriptor == {
        'profile': 'data-package',
        'resources': [{
            'dialect': {
                'caseSensitiveHeader': False,
                'delimiter': ',',
                'doubleQuote': True,
                'header': True,
                'lineTerminator': '\r\n',
                'quoteChar': '"',
                'skipInitialSpace': False
            },
            'encoding': 'utf-8',
            'format': 'csv',
            'name': 'temporal',
            'path': 'temporal.csv',
            'profile': 'tabular-data-resource',
            'schema': {
                'fields': [
                    {'format': '%d/%m/%y', 'name': 'date', 'type': 'date'},
                    {'format': 'default', 'name': 'event', 'type': 'string'}
                ],
                'missingValues': ['']
            }
        }]
    }
    assert data == [[
        {'date': datetime.date(2015, 1, 2), 'event': 'start'},
        {'date': datetime.date(2016, 6, 25), 'event': 'finish'}
    ]]
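
Note the round trip in the test: with force_temporal_format=False the dumper writes the date column using outputFormat ('%d/%m/%y') and records that pattern as the field's format in the dumped datapackage.json, which is why the reload still parses the values back into datetime.date objects.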

roll assigned cschloer and unassigned roll on Jul 7, 2019
roll removed the "enhancement" (New feature or request) label on Jul 7, 2019

akariv (Collaborator) commented Jul 16, 2019

@roll

I think we should make a distinction between dataflows and datapackage-pipelines here. Since dataflows has a Python-native interface, it would be better IMO to have a more flexible serializer/dialect kind of API.

On datapackage-pipelines, with the YAML restrictions, we could have a few 'preset' configurations - e.g. by using the force_temporal_format parameter.

wdyt?
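
To make that concrete, one possible shape for such an API - the serializers parameter below is hypothetical, sketched only for discussion, not an existing dataflows option:

from dataflows import Flow, load, dump_to_path

Flow(
    load('data/temporal.csv', name='temporal'),
    dump_to_path(
        'out',
        # hypothetical: per-type serializer callables consulted by the dumper,
        # each receiving the value and the field descriptor
        serializers={
            'date': lambda value, field: value.strftime(
                field.get('outputFormat', '%Y-%m-%d')),
        },
    ),
).process()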

roll commented Jul 16, 2019

It makes sense 👍

@cschloer is on vacation now but when he's back we can take a look at it from the "api client" perspective.

roll commented Aug 16, 2019

@cschloer
Please take a look at #19 (comment). For now, that's the point where we are for #19 and #22. A working prototype of what I've done so far is here - datahq/dataflows#100

cschloer (Author) commented

Hey, sorry for the late reply!

From my understanding of what you said, you could add a flag to dump_to_path that uses the format specified in schema.fields[].format, but the problem you are anticipating is that we cannot update schema.fields[].format in the update_resource processor, so your workaround is to add a new schema.fields[].outputFormat field that we would update instead.

The only place where we currently interact with the schema.fields[].format field is in a custom processor I built called convert_date. That processor creates an entirely new field, and it doesn't seem to have any problems setting the value of schema.fields[].format when the field is created. I therefore think your original solution will work for us: a flag in dump_to_path that ensures the value in schema.fields[].format is what gets dumped. We currently don't have any use cases where we want to update the value in schema.fields[].format, so the workaround is not necessary. Maybe, if you want to support both our method and the workaround, the parameter in dump_to_path could take a string that is the key in schema.fields[].$KEY - see the sketch below. Up to you, either would work for us!
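
A sketch of that string-keyed variant (the parameter name temporal_format_property is illustrative here, not an agreed API):

from dataflows import Flow, load, dump_to_path

Flow(
    load('data/temporal.csv', name='temporal'),
    # illustrative: read each temporal field's dump format from
    # schema.fields[].$KEY, with the key name passed by the caller
    dump_to_path('data/output', temporal_format_property='outputFormat'),
).process()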

@roll

roll commented Aug 20, 2019

Thanks @cschloer

So let me verify how I see it:

(1) we're going to add a config option to DPP's dump_to_path, something like datetime-format-field:

  - run: dump.to_path
    parameters:
      resources: [duplicate]
      out-path: 'output'
      pretty-descriptor: true
      datetime-format-field: outputFormat

(2) Then we have two options:
(2a) DPP has to pass it through to dataflows to be processed (rough sketch below), OR
(2b) DPP has to implement a custom dumper to use with dataflows (feels more complicated)

(3) It will resolve #19 and #22 altogether

cc @akariv
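
A rough sketch of option (2a), written as a DPP processor in the dataflows style; the parameter plumbing is illustrative, not existing code:

# dump_with_format.py - illustrative; DPP can run a processor module exposing
# flow(parameters), so the proposed datetime-format-field parameter can be
# forwarded straight to the (then-proposed) dataflows dumper option.
from dataflows import Flow, dump_to_path


def flow(parameters):
    return Flow(
        dump_to_path(
            parameters['out-path'],
            # proposed dataflows option from the POC discussion
            temporal_format_property=parameters.get('datetime-format-field'),
        )
    )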

roll self-assigned this and unassigned cschloer on Aug 22, 2019

roll commented Aug 22, 2019

Ok, thanks, I'll focus on finishing it then

roll commented Oct 3, 2019

It's done in dataflows@0.0.60 (please try datahq/dataflows#100 (comment))
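
A minimal usage sketch against the released version, assuming the temporal_format_property option that datahq/dataflows#100 added:

from dataflows import Flow, load, update_resource, dump_to_path

Flow(
    load('data/temporal.csv', name='temporal'),
    update_resource(['temporal'], **{'schema': {'fields': [
        {'name': 'date', 'type': 'date', 'format': '%Y-%m-%d', 'outputFormat': '%d/%m/%y'},
        {'name': 'event', 'type': 'string'},
    ]}}),
    # temporal values are written using schema.fields[].outputFormat
    dump_to_path('data/output', temporal_format_property='outputFormat'),
).process()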

roll closed this as completed on Oct 3, 2019