Create DJL Time Series Dataset #1590

Closed
zachgk opened this issue Apr 20, 2022 · 7 comments · Fixed by #1667
Labels
Call for Contribution · enhancement (New feature or request) · good first issue (Good for newcomers)

Comments

@zachgk
Contributor

zachgk commented Apr 20, 2022

Description

A time series dataset contains a sequence of events recorded over time. Examples include climate measurements, stock prices, and other data used for forecasting. This issue is to add any time series dataset to the DJL basicdatasets.

References

@zachgk added the enhancement (New feature or request), good first issue (Good for newcomers), and Call for Contribution labels Apr 20, 2022
@WHALEEYE
Contributor

Hello, I would like to work on this issue. Could you assign it to me? Thanks!

@lanking520
Contributor

@WHALEEYE just did, thanks for your contribution

@WHALEEYE
Contributor

@lanking520 @zachgk Hello, I'm trying to add the Daily Climate time series dataset to the project, but since the dataset is hosted on Kaggle, users may need to log in before they can download it. Should the dataset be stored somewhere else so that users can download it automatically when they use DJL?

@zachgk
Contributor Author

zachgk commented May 21, 2022

Yeah @WHALEEYE, it would be good to not have to log in to get the dataset. What we can do is store the dataset along with the metadata file in S3 and distribute it that way.

Of course, the prerequisite is that the license permits us to redistribute the dataset. Some licenses, such as those for MNIST or ImageNet, do not. Fortunately, the climate dataset is released under the CC0 license, which does allow us to distribute it.

Here's what you should do. When you are creating the metadata file and testing it locally, set up a directory structure something like this:

/path/to/climate/dataset
    metadata.json (here is the metadata file for the climate dataset)
    1.0 (assuming dataset version is 1.0)
        datasetFile1
        datasetFile2
        ...

Then, in the metadata, you can use a relative URI such as "1.0/datasetFile1". An example of this is the banana dataset. Once it is working and you open the PR, don't add the dataset files to git. Just leave a reminder in the PR message for us to upload the files along with the metadata, and we will try to get it to match the metadata format. You can also compress any of the dataset files individually with gzip, and they will be extracted automatically.
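
The local layout described above can be sketched with the Java standard library alone. This is only a rough illustration and not part of DJL: the class name `DatasetLayout`, both helper methods, and the placeholder CSV contents are hypothetical, and only the `1.0/datasetFile1` layout is taken from the comment above. It creates the version directory, gzips one dataset file, and prints the relative path that could be used as the URI in the metadata file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

public class DatasetLayout {

    /** Creates root/version (for example root/1.0) and returns that directory. */
    static Path createVersionDir(Path root, String version) throws IOException {
        return Files.createDirectories(root.resolve(version));
    }

    /**
     * Writes a gzip copy of the file next to the original and returns the
     * path of the compressed file relative to the dataset root.
     */
    static String gzipAndRelativize(Path root, Path file) throws IOException {
        Path gz = file.resolveSibling(file.getFileName() + ".gz");
        try (InputStream in = Files.newInputStream(file);
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
            in.transferTo(out);
        }
        // Forward slashes keep the value usable as a relative URI on any OS.
        return root.relativize(gz).toString().replace('\\', '/');
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for /path/to/climate/dataset.
        Path root = Files.createTempDirectory("climate-dataset");
        Path versionDir = createVersionDir(root, "1.0");

        // Placeholder contents; the real dataset file would be copied here.
        Path dataFile = versionDir.resolve("datasetFile1");
        Files.writeString(dataFile, "date,value\n");

        System.out.println(gzipAndRelativize(root, dataFile));
    }
}
```

The printed path is relative to the directory that holds metadata.json, which is what lets the same metadata work both locally and after the files are uploaded.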

@WHALEEYE
Contributor

Thanks! I also have one question: this dataset does not seem to have labels, but the Record class is expected to contain both features and labels. What should I put in the labels position of the returned records?

@zachgk
Contributor Author

zachgk commented May 21, 2022

For unlabeled data like time series, you can just use an empty NDList for the labels.
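
The shape of that answer can be sketched without the DJL dependency. In DJL the types involved are `ai.djl.training.dataset.Record` and `ai.djl.ndarray.NDList`; in the sketch below, `SketchRecord` and the `List<float[]>` fields are hypothetical stand-ins for them, and the sample values are placeholders rather than real climate data. Each time step becomes one record whose label list is simply empty.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UnlabeledRecordSketch {

    /** Hypothetical stand-in for a DJL Record: feature arrays plus label arrays. */
    static final class SketchRecord {
        final List<float[]> data;
        final List<float[]> labels;

        SketchRecord(List<float[]> data, List<float[]> labels) {
            this.data = data;
            this.labels = labels;
        }
    }

    /** Builds one record per time step; the labels list is left empty. */
    static List<SketchRecord> toRecords(float[][] series) {
        List<SketchRecord> records = new ArrayList<>();
        for (float[] step : series) {
            List<float[]> data = Collections.singletonList(step);
            // Plays the role of the empty NDList suggested above.
            List<float[]> labels = Collections.emptyList();
            records.add(new SketchRecord(data, labels));
        }
        return records;
    }

    public static void main(String[] args) {
        // Placeholder values: two time steps with two features each.
        float[][] series = {{10.0f, 84.5f}, {7.4f, 92.0f}};
        List<SketchRecord> records = toRecords(series);
        System.out.println(records.size() + " records, labels in first record: "
                + records.get(0).labels.size());
    }
}
```

Consumers of such records can then treat a zero-length label list as "no labels" instead of needing a special null case.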

@WHALEEYE
Contributor

Ok, I got it.
