Update dumper to allow not force format #19

Closed
cschloer opened this issue Jun 5, 2019 · 8 comments

cschloer commented Jun 5, 2019

Currently the dump processors force a particular format for certain values (datetimes, for example). We need the ability to use the format value within the schema (saved in the datapackage.json) as the format used to dump datetime values. The suggestion was to just have a flag in dump_to_* that, when set to true, uses whatever is in the schema rather than a forced format.

This is higher priority than the rest, as we currently have to keep datetime fields as strings because their format is forced.
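
For concreteness, a minimal sketch of the behaviour being described, using dataflows with inline rows so it is self-contained (the resource name and output path are dataflows defaults):

from dataflows import Flow, set_type, dump_to_path

# Inline rows so the sketch runs as-is.
rows = [{'when': '2019-06-05 12:00:00'}]

Flow(
    rows,
    set_type('when', type='datetime', format='%Y-%m-%d %H:%M:%S'),
    dump_to_path('out'),
).process()

# The dumped out/res_1.csv serializes 'when' with the dumper's own fixed
# datetime format, not the '%Y-%m-%d %H:%M:%S' declared in the schema -
# the forced format this issue asks to make optional.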

roll commented Jul 7, 2019

@akariv
@cschloer
Here is a short summary:

  • BCO-DMO had been having problems with saving date/time data because of format forcing at the dump_to_path stage
  • we had a meeting with BCO-DMO and Adam and decided to try to provide an option to use user-defined date/time formats
  • then this issue was created with the goal of being able to:
    • set schema.fields[].format in some update_resource stage
    • use a flag telling dump_to_path to use this format for saving data and metadata
  • I've created a POC - Allow custom temporal formats for dump_to_path datahq/dataflows#100 - it works but is slightly hacky and requires setting schema.fields[].outputFormat in the update_resource stage instead of format, because of the way DPP works (the format of date/times cannot be changed, since that breaks parsing). See the PR's user interface for dataflows below.
  • Adam suggested adding the ability to provide a custom dialect and serializers for dumpers. I think that can make the PR's code better, but it cannot resolve the initial issue on its own; otherwise the solution (how to use and handle format etc.) would have to be written entirely on the client side
  • At this point I don't see how I can contribute much more, because we need to understand:
    • (first) what user API we're looking for (e.g. an exact datapackage-pipelines.yml specification for some exemplar data) from BCO-DMO's side AND
    • (second) what kind of API Adam is willing to add to DPP, based on the answer to the first question

def test_force_temporal_format():
    import datetime
    from dataflows import Flow, load, update_resource, dump_to_path

    # Dump
    Flow(
        load('data/temporal.csv', name='temporal'),
        update_resource(['temporal'], **{
            'schema': {
                'fields': [
                    {'name': 'date', 'type': 'date', 'format': '%Y-%m-%d', 'outputFormat': '%d/%m/%y'},
                    {'name': 'event', 'type': 'string'},
                ]
            }
        }),
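        # force_temporal_format=False (the POC's flag): serialize temporal
        # values using the schema's outputFormat instead of the forced format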
        dump_to_path('data/force_temporal_format', force_temporal_format=False)
    ).process()

    # Load
    flow = Flow(
        load('data/force_temporal_format/datapackage.json')
    )
    data, dp, stats = flow.results()

    # Assert
    assert dp.descriptor == {
        'profile': 'data-package',
        'resources': [{
            'dialect': {
                'caseSensitiveHeader': False,
                'delimiter': ',',
                'doubleQuote': True,
                'header': True,
                'lineTerminator': '\r\n',
                'quoteChar': '"',
                'skipInitialSpace': False
            },
            'encoding': 'utf-8',
            'format': 'csv',
            'name': 'temporal',
            'path': 'temporal.csv',
            'profile': 'tabular-data-resource',
            'schema': {
                'fields': [
                    {'format': '%d/%m/%y', 'name': 'date', 'type': 'date'},
                    {'format': 'default', 'name': 'event', 'type': 'string'}
                ],
                'missingValues': ['']
            }
        }]
    }
    assert data == [[
        {'date': datetime.date(2015, 1, 2), 'event': 'start'},
        {'date': datetime.date(2016, 6, 25), 'event': 'finish'}
    ]]
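
Note the round trip in the test: with force_temporal_format=False the dumper writes the date column using outputFormat ('%d/%m/%y') and records that pattern as the field's format in the dumped datapackage.json, which is why the reload still parses the values back into datetime.date objects.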

roll assigned cschloer and unassigned roll on Jul 7, 2019
roll removed the "enhancement" (New feature or request) label on Jul 7, 2019

akariv (Collaborator) commented Jul 16, 2019

@roll

I think we should make a distinction between dataflows and datapackage-pipelines here. Since dataflows has a Python-native interface, it would be better IMO to have a more flexible serializer/dialect kind of API.

On datapackage-pipelines, with the YAML restrictions, we could have a few 'preset' configurations - e.g. by using the force_temporal_format parameter.

wdyt?
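
To make that concrete, one possible shape for such an API - the serializers parameter below is hypothetical, sketched only for discussion, not an existing dataflows option:

from dataflows import Flow, load, dump_to_path

Flow(
    load('data/temporal.csv', name='temporal'),
    dump_to_path(
        'out',
        # hypothetical: per-type serializer callables consulted by the dumper,
        # each receiving the value and the field descriptor
        serializers={
            'date': lambda value, field: value.strftime(
                field.get('outputFormat', '%Y-%m-%d')),
        },
    ),
).process()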

roll commented Jul 16, 2019

It makes sense 👍

@cschloer is on vacation now but when he's back we can take a look at it from the "api client" perspective.

roll commented Aug 16, 2019

@cschloer
Please take a look at #19 (comment). For now, that's the point where we are for #19 and #22. A working prototype of what I've done so far is here - datahq/dataflows#100

cschloer (Author) commented

Hey, sorry for the late reply!

From my understanding of what you said, you could add a flag to dump_to_path that uses the format specified in schema.fields[].format, but the problem you are anticipating is that we cannot update schema.fields[].format in the update_resource processor, so your workaround is to add a new schema.fields[].outputFormat field that we would update instead.

The only place where we currently interact with the schema.fields[].format field is in a custom processor I built called convert_date. That processor creates an entirely new field, and it doesn't seem to have any problems setting the value of schema.fields[].format when the field is created. I therefore think your original solution will work for us: a flag in dump_to_path that ensures the value in schema.fields[].format is what gets dumped. We currently don't have any use cases where we want to update the value in schema.fields[].format, so the workaround is not necessary. Maybe, if you want to support both our method and the workaround, the parameter in dump_to_path could take a string that is the key in schema.fields[].$KEY - see the sketch below. Up to you, either would work for us!
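
A sketch of that string-keyed variant (the parameter name temporal_format_property is illustrative here, not an agreed API):

from dataflows import Flow, load, dump_to_path

Flow(
    load('data/temporal.csv', name='temporal'),
    # illustrative: read each temporal field's dump format from
    # schema.fields[].$KEY, with the key name passed by the caller
    dump_to_path('data/output', temporal_format_property='outputFormat'),
).process()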

@roll

roll commented Aug 20, 2019

Thanks @cschloer

So let me verify how I see it:

(1) we're going to add a config option to DPP's dump_to_path, something like datetime-format-field:

  - run: dump.to_path
    parameters:
      resources: [duplicate]
      out-path: 'output'
      pretty-descriptor: true
      datetime-format-field: outputFormat

(2) Then we have two options:
(2a) DPP has to pass it through to dataflows to be processed (rough sketch below), OR
(2b) DPP has to implement a custom dumper to use with dataflows (feels more complicated)

(3) It will resolve #19 and #22 altogether

cc @akariv
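
A rough sketch of option (2a), written as a DPP processor in the dataflows style; the parameter plumbing is illustrative, not existing code:

# dump_with_format.py - illustrative; DPP can run a processor module exposing
# flow(parameters), so the proposed datetime-format-field parameter can be
# forwarded straight to the (then-proposed) dataflows dumper option.
from dataflows import Flow, dump_to_path


def flow(parameters):
    return Flow(
        dump_to_path(
            parameters['out-path'],
            # proposed dataflows option from the POC discussion
            temporal_format_property=parameters.get('datetime-format-field'),
        )
    )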

roll self-assigned this and unassigned cschloer on Aug 22, 2019

roll commented Aug 22, 2019

Ok, thanks, I'll focus on finishing it then

roll commented Oct 3, 2019

It's done in dataflows@0.0.60 (please try datahq/dataflows#100 (comment))
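
A minimal usage sketch against the released version, assuming the temporal_format_property option that datahq/dataflows#100 added:

from dataflows import Flow, load, update_resource, dump_to_path

Flow(
    load('data/temporal.csv', name='temporal'),
    update_resource(['temporal'], **{'schema': {'fields': [
        {'name': 'date', 'type': 'date', 'format': '%Y-%m-%d', 'outputFormat': '%d/%m/%y'},
        {'name': 'event', 'type': 'string'},
    ]}}),
    # temporal values are written using schema.fields[].outputFormat
    dump_to_path('data/output', temporal_format_property='outputFormat'),
).process()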

roll closed this as completed on Oct 3, 2019