
Feature subsample #34

Open · wants to merge 44 commits into develop
Conversation

clegaard
Collaborator

Summary:

  1. Added a subsampling transform which allows the user to "divide" each sample into one or more smaller sub-samples. The current implementation requires the user to specify the number of sub-samples produced by each sample, which limits it to subsampling processes that always produce a fixed number of sub-samples.

  2. Updated the documentation accordingly.
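To make the described behavior concrete, here is a minimal sketch of such a subsampling transform. The names (`SubsampleDataset`, `subsample_func`, `sampling_ratio`) are taken from the later review discussion, but the implementation itself is an assumption, not the library's actual code:

```python
class SubsampleDataset:
    """Sketch: each parent sample yields a fixed number of sub-samples."""

    def __init__(self, parent, subsample_func, sampling_ratio):
        self._parent = parent              # wrapped dataset
        self._subsample_func = subsample_func  # sample -> list of sub-samples
        self._sampling_ratio = sampling_ratio  # fixed sub-samples per sample

    def __len__(self):
        return len(self._parent) * self._sampling_ratio

    def __getitem__(self, idx):
        # Map a flat index to (parent sample, sub-sample within it)
        parent_idx, sub_idx = divmod(idx, self._sampling_ratio)
        subs = self._subsample_func(self._parent[parent_idx])
        return subs[sub_idx]


# Illustrative usage: split each sample into two halves
data = [[0, 1, 2, 3], [4, 5, 6, 7]]
ds = SubsampleDataset(data, lambda s: [s[:2], s[2:]], sampling_ratio=2)
```

Because `sampling_ratio` is fixed up front, `__len__` is computable without touching the data, which is exactly the limitation noted above.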

@LukasHedegaard
Owner

@clegaard , now that the caching branch has been merged into develop, there are some merge conflicts that need fixing before the merge. Also, more severe linting checks were added, so the branch needs to be checked for these.

@LukasHedegaard
Owner

LukasHedegaard commented Apr 16, 2020

Hey, a couple of things:

  1. Cool that you remembered to include a test for caching! 👍
  2. Your commit naming doesn't follow our standard (https://chris.beams.io/posts/git-commit/) - please use it in the future. And what kind of commit message is more push? 😂
  3. You have snuck in the csv loader, I see - how about canceling the other PR and just including it here?
  4. You're returning a named tuple "Row" from the csv loader - this doesn't comply with our internal structure and should be changed to a regular tuple. If you believe you can infer names, how about setting the dataset names instead?
  5. The naming of internal variables should start with '_' if they are not supposed to be used directly by the end-user. Most of the internal variables you introduce don't follow this convention. Moreover, you've made some of them a tad brief: n_ss, for instance, should be written out to make it easier to understand.
  6. It seems there is a bit of confusion in the class implementation. For instance, SubsampleDataset takes a dataset parameter and saves it as self.dataset. This is exactly the same as _downstream_getter (now _parent) used in the regular Dataset, so why save two versions?
  7. I'm not that big a fan of overloading the dataset. You get a lot of extra state keeping, which you also expose to the user here. I think we should at least wrap the SubsampleDataset returned from subsample in another Dataset to limit the exposure of our guts.
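Point 4 above is easy to illustrate. The field names below are hypothetical (the actual csv columns aren't shown in this thread), but the pattern is: convert the named tuple the loader builds into a plain tuple, and keep the column names on the dataset instead:

```python
from collections import namedtuple

# What the review objects to: the csv loader returning a named tuple
Row = namedtuple("Row", ["name", "age"])  # hypothetical fields
named = Row("Alice", 30)

# A plain tuple complies with the internal structure...
plain = tuple(named)

# ...and the inferred names can live on the dataset instead, e.g.:
dataset_names = list(Row._fields)
```

This keeps the per-item type uniform across loaders while still preserving the inferred column names.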

Aside from this, I noticed and fixed some typing problems and snuck in a few issue fixes:
#41
#31

Also, I've renamed cachable (which was misspelled) to _cacheable (assuming we want it to be private)

SubsampleDataset and SuperSampleDataset
- Prefixed private attributes with an underscore
- Changed n_ss to sampling_ratio
- Changed name of argument func to subsample_func and supersample_func
- Use the _parent reference instead of self.dataset

Dataset
- Added free-function version of supersample

Documentation
- Doctest snippets modified to reflect new argument names
test_dataset
- test now uses indices to refer to attributes rather than fields

test_loaders
- test now uses indices to refer to attributes rather than fields
@LukasHedegaard
Owner

Hey, slightly better commit names. However, you're still not following the standard (did you actually read the document?). The names should be:
"Change attribute names and remove dataset reference"
"Change from_csv to return a plain tuple"

@LukasHedegaard LukasHedegaard self-requested a review April 17, 2020 10:44

@LukasHedegaard LukasHedegaard left a comment


I can see you have added slicing. While it seems to work for Dataset, it doesn't for all the child classes that overload __getitem__ (StreamDataset, SubsampleDataset, SupersampleDataset). Moreover, the ItemGetter should also be updated to take slices if you want to be able to generically use self._parent[1:2] within the dataset class.

Built-in Python sequences all support slicing, e.g.
>>> [1, 2, 3][0:2]
[1, 2]
Implementation of Dataset.__getitem__ has changed:
- add index out of bounds check
- improve conversion from slice to indices using slice.indices(), now
the following is possible:
>>> ds = from_iterable(list(range(10)))
>>> ds[:]
[0, 1, 2, ..., 9]
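The bounds check and slice handling described in this commit message can be sketched as follows. This is an illustrative reimplementation under assumed names, not the library's actual Dataset code; the key piece is slice.indices(), which clamps start/stop/step to valid bounds so ds[:] works without explicit endpoints:

```python
class Dataset:
    """Sketch of a dataset whose __getitem__ accepts ints and slices."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, key):
        n = len(self._items)
        if isinstance(key, slice):
            # slice.indices(n) resolves None/negative/out-of-range
            # start, stop, step into concrete in-bounds values
            return [self._items[i] for i in range(*key.indices(n))]
        # index-out-of-bounds check for plain integer access
        if not -n <= key < n:
            raise IndexError(f"index {key} out of bounds for length {n}")
        return self._items[key]


ds = Dataset(range(10))
```

With this in place, `ds[:]` returns all ten items and `ds[1:3]` returns `[1, 2]`, matching the doctest behavior shown above.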

Add slicing support for SubsampleDataset and SupersampleDataset

Testing:
- add test case for Dataset.__getitem__
- add test case for SubsampleDataset.__getitem__
- add test case for SupersampleDataset.__getitem__
Update __getitem__ typehints to reflect slicing support

Update typehints for bound transform methods of Dataset using
forward declaration of types. See:
https://www.python.org/dev/peps/pep-0484/#forward-references
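For reference, the PEP 484 forward-reference pattern mentioned above looks like this. The class and method names here are illustrative; the point is only that the return annotation is written as the string "Dataset" because the class is not yet fully defined when the signature is evaluated:

```python
class Dataset:
    def __init__(self, items):
        self._items = list(items)

    # Forward reference: "Dataset" as a string, since the class
    # object doesn't exist yet at annotation-evaluation time
    def subsample(self, ratio: int) -> "Dataset":
        # Illustrative bound transform returning a new Dataset;
        # here it simply keeps every `ratio`-th item
        return Dataset(self._items[::ratio])
```

From Python 3.7 onward, `from __future__ import annotations` makes all annotations lazily evaluated strings, which removes the need for manual quoting.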