
Feature subsample #34

Open · wants to merge 44 commits into develop
Conversation

clegaard
Collaborator

Summary:

  1. Added a subsampling transform which allows the user to "divide" each sample into one or more smaller sub-samples. The current implementation requires the user to specify the number of sub-samples produced by each sample, which limits it to subsampling processes that always produce a fixed number of sub-samples.

  2. Updated the documentation accordingly.
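To make the described behavior concrete, here is a minimal sketch of such a subsampling transform. The names (`SubsampleDataset`, `subsample_func`, `sampling_ratio`) are taken from the later review discussion, but the implementation itself is an assumption, not the library's actual code:

```python
class SubsampleDataset:
    """Sketch: each parent sample yields a fixed number of sub-samples."""

    def __init__(self, parent, subsample_func, sampling_ratio):
        self._parent = parent              # wrapped dataset
        self._subsample_func = subsample_func  # sample -> list of sub-samples
        self._sampling_ratio = sampling_ratio  # fixed sub-samples per sample

    def __len__(self):
        return len(self._parent) * self._sampling_ratio

    def __getitem__(self, idx):
        # Map a flat index to (parent sample, sub-sample within it)
        parent_idx, sub_idx = divmod(idx, self._sampling_ratio)
        subs = self._subsample_func(self._parent[parent_idx])
        return subs[sub_idx]


# Illustrative usage: split each sample into two halves
data = [[0, 1, 2, 3], [4, 5, 6, 7]]
ds = SubsampleDataset(data, lambda s: [s[:2], s[2:]], sampling_ratio=2)
```

Because `sampling_ratio` is fixed up front, `__len__` is computable without touching the data, which is exactly the limitation noted above.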

@LukasHedegaard
Owner

@clegaard , now that the caching branch has been merged into develop, there are some merge conflicts that need fixing before the merge. Also, more severe linting checks were added, so the branch needs to be checked for these.

@LukasHedegaard
Owner

LukasHedegaard commented Apr 16, 2020

Hey, a couple of things:

  1. Cool that you remembered to include a test for caching! 👍
  2. Your commit naming doesn't follow our standard (https://chris.beams.io/posts/git-commit/) - please use it in the future. And what kind of commit message is more push? 😂
  3. You have snuck in the csv loader, I see - how about canceling the other PR and just including it here?
  4. You're returning a named tuple "Row" from the csv loader - this doesn't comply with our internal structure and should be changed to a regular tuple. If you believe you can infer names, how about setting the dataset names instead?
  5. The naming of internal variables should start with '_' if they are not supposed to be used directly by the end-user. Most of the internal variables you introduce don't follow this convention. Moreover, you've made some of them a tad brief: n_ss, for instance, should be written out to make it easier to understand.
  6. It seems there is a bit of confusion in the class implementation. For instance, SubsampleDataset takes a dataset parameter and saves it as self.dataset. This is exactly the same as _downstream_getter (now _parent) used in the regular Dataset, so why save two versions?
  7. I'm not that big a fan of overloading the dataset. You get a lot of extra state keeping, which you also expose to the user here. I think we should at least wrap the SubsampleDataset returned from subsample in another Dataset to limit the exposure of our guts.
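Point 4 above is easy to illustrate. The field names below are hypothetical (the actual csv columns aren't shown in this thread), but the pattern is: convert the named tuple the loader builds into a plain tuple, and keep the column names on the dataset instead:

```python
from collections import namedtuple

# What the review objects to: the csv loader returning a named tuple
Row = namedtuple("Row", ["name", "age"])  # hypothetical fields
named = Row("Alice", 30)

# A plain tuple complies with the internal structure...
plain = tuple(named)

# ...and the inferred names can live on the dataset instead, e.g.:
dataset_names = list(Row._fields)
```

This keeps the per-item type uniform across loaders while still preserving the inferred column names.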

Aside from this, I noticed and fixed some typing problems and snuck in a few issue fixes:
#41
#31

Also, I've renamed cachable (which was misspelled) to _cacheable (assuming we want it to be private)

SubsampleDataset and SuperSampleDataset
- Prefixed private attributes with an underscore
- Changed n_ss to sampling_ratio
- Changed name of argument func to subsample_func and supersample_func
- Use the _parent reference instead of self.dataset

Dataset
- Added free-function version of supersample

Documentation
- Doctest snippets modified to reflect new argument names
test_dataset
- test now uses indices to refer to attributes rather than fields

test_loaders
- test now uses indices to refer to attributes rather than fields
@LukasHedegaard
Owner

Hey, slightly better commit names. However, you're still not following the standard (did you actually read the document?). The names should be:
"Change attribute names and remove dataset reference"
"Change from_csv to return a plain tuple"

@LukasHedegaard LukasHedegaard self-requested a review April 17, 2020 10:44

@LukasHedegaard LukasHedegaard left a comment


I can see you have added slicing. While it seems to work for Dataset, it doesn't for all the child classes that overload __getitem__ (StreamDataset, SubsampleDataset, SupersampleDataset). Moreover, the ItemGetter should also be updated to take slices if you want to be able to generically use self._parent[1:2] within the dataset class.

Built-in Python sequences all support slicing, e.g.
>>> [1, 2, 3][0:2]
[1, 2]
Implementation of Dataset.__getitem__ has changed:
- add index out of bounds check
- improve conversion from slice to indices using slice.indices(), now
the following is possible:
>>> ds = from_iterable(list(range(10)))
>>> ds[:]
[0, 1, 2, ..., 9]
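The bounds check and slice handling described in this commit message can be sketched as follows. This is an illustrative reimplementation under assumed names, not the library's actual Dataset code; the key piece is slice.indices(), which clamps start/stop/step to valid bounds so ds[:] works without explicit endpoints:

```python
class Dataset:
    """Sketch of a dataset whose __getitem__ accepts ints and slices."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, key):
        n = len(self._items)
        if isinstance(key, slice):
            # slice.indices(n) resolves None/negative/out-of-range
            # start, stop, step into concrete in-bounds values
            return [self._items[i] for i in range(*key.indices(n))]
        # index-out-of-bounds check for plain integer access
        if not -n <= key < n:
            raise IndexError(f"index {key} out of bounds for length {n}")
        return self._items[key]


ds = Dataset(range(10))
```

With this in place, `ds[:]` returns all ten items and `ds[1:3]` returns `[1, 2]`, matching the doctest behavior shown above.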

Add slicing support for SubsampleDataset and SupersampleDataset

Testing:
- add test case for Dataset.__getitem__
- add test case for SubsampleDataset.__getitem__
- add test case for SupersampleDataset.__getitem__
Update __getitem__ typehints to reflect slicing support

Update typehints for bound transform methods of Dataset using
forward declaration of types. See:
https://www.python.org/dev/peps/pep-0484/#forward-references
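For reference, the PEP 484 forward-reference pattern mentioned above looks like this. The class and method names here are illustrative; the point is only that the return annotation is written as the string "Dataset" because the class is not yet fully defined when the signature is evaluated:

```python
class Dataset:
    def __init__(self, items):
        self._items = list(items)

    # Forward reference: "Dataset" as a string, since the class
    # object doesn't exist yet at annotation-evaluation time
    def subsample(self, ratio: int) -> "Dataset":
        # Illustrative bound transform returning a new Dataset;
        # here it simply keeps every `ratio`-th item
        return Dataset(self._items[::ratio])
```

From Python 3.7 onward, `from __future__ import annotations` makes all annotations lazily evaluated strings, which removes the need for manual quoting.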