Feature subsample #34

Open · wants to merge 44 commits into base: develop

Commits (44)
e8600c8
getting started
clegaard Apr 8, 2020
95b7e96
more push
clegaard Apr 8, 2020
5bac86f
docs docs docs
clegaard Apr 9, 2020
4af5876
more docs
clegaard Apr 10, 2020
6fdbc35
added more documentation for loaders
clegaard Apr 10, 2020
2ad5f42
added test data
clegaard Apr 10, 2020
7af735b
added file structure snipper
clegaard Apr 10, 2020
9c0ce1a
renamed example dataset and implemented loader
clegaard Apr 11, 2020
f1a6b02
added definition of slow maker to supress warning
clegaard Apr 11, 2020
18c8c60
updated tests
clegaard Apr 11, 2020
0b650c3
function for getting test example paths now returns a Path object rat…
clegaard Apr 11, 2020
a4f9630
improved loaders
clegaard Apr 11, 2020
4007600
added option to pass arguments to pandas read_csv through kwargs
clegaard Apr 11, 2020
ea949cf
now with data
clegaard Apr 11, 2020
23d4d53
added figure to subsample documentation
clegaard Apr 11, 2020
f509553
Replaced code-fences (```) with code-block directive
clegaard Apr 11, 2020
b5e60cc
added defintion of subsample-transform
clegaard Apr 12, 2020
ddc10d3
Merge branch 'develop' of https://github.com/LukasHedegaard/datasetop…
clegaard Apr 12, 2020
061809e
updated documentation related to subsampling
clegaard Apr 12, 2020
70ae6ab
initial implmentation of subsampling and formatted using black
clegaard Apr 12, 2020
111bcc6
added caching for subsampling
clegaard Apr 12, 2020
862e63c
minor correction in test
clegaard Apr 12, 2020
e173d62
fixed bug in subsample caching
clegaard Apr 13, 2020
d82b012
fixed incorrect index
clegaard Apr 13, 2020
72884b3
added declaration of supersample transform
clegaard Apr 13, 2020
6520910
added loader from generating a dataset from iterables and started wor…
clegaard Apr 14, 2020
b5c7af2
Added caching for dataset's shape property
clegaard Apr 14, 2020
0c6b59a
implemented slicing for Dataset.__getitem__ and cached shape
clegaard Apr 14, 2020
3177861
Merge branch 'develop' of https://github.com/LukasHedegaard/datasetop…
clegaard Apr 15, 2020
3a7868c
fixed linting issues
clegaard Apr 15, 2020
4139996
migration to pytest for running doctest
clegaard Apr 15, 2020
83fad3a
disabled/commented doctests
clegaard Apr 16, 2020
c0c5bec
defined testpaths in pytest config to stop it from recursing
clegaard Apr 16, 2020
56a1cee
removed external dataset
clegaard Apr 16, 2020
4049791
renamed pytest doctest setup function
clegaard Apr 16, 2020
91a392c
Update Loader (rm extend) and add range ids
LukasHedegaard Apr 16, 2020
a5c3b2d
Fix typing
LukasHedegaard Apr 16, 2020
8e98d5f
Fix naming for DATASET_PATHS
LukasHedegaard Apr 16, 2020
493a06d
Silence intentional type error. Add comments
LukasHedegaard Apr 16, 2020
930a4a8
Rename _downstream_getter and cachable
LukasHedegaard Apr 16, 2020
4b334e4
Attribute names and removed dataset reference
clegaard Apr 17, 2020
d881128
from_csv now returns a plain tuple
clegaard Apr 17, 2020
9ddc088
Improve behavior of slicing in __getitem__
clegaard Apr 18, 2020
afc2ca3
Merge abstract.py into types.py and update typing
clegaard Apr 19, 2020
2 changes: 1 addition & 1 deletion .gitignore
@@ -140,4 +140,4 @@ dmypy.json
.DS_Store

tests/resources/
.datasetops_cache
.datasetops_cache
12 changes: 12 additions & 0 deletions conftest.py
@@ -0,0 +1,12 @@
import pytest

import datasetops as do
from datasetops.loaders import from_iterable, from_recursive_files

# see http://doc.pytest.org/en/latest/doctest.html (doctest_namespace fixture)
@pytest.fixture(autouse=True)
def setup_doctest_namespace(doctest_namespace):

doctest_namespace["do"] = do
doctest_namespace["from_iterable"] = from_iterable
doctest_namespace["load_files_recursive"] = from_recursive_files
9 changes: 2 additions & 7 deletions docs/conf.py
@@ -35,13 +35,8 @@
"recommonmark",
"sphinx_rtd_theme",
]
doctest_global_setup = """
try:
import datasetops as do
except ImportError:
do = None
"""

numfig = True

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
@@ -78,4 +73,4 @@
extensions.append("autoapi.extension")
autoapi_type = "python"
autoapi_dirs = ["../src/"]
autoapi_keep_files = True
autoapi_keep_files = False
27 changes: 0 additions & 27 deletions docs/getting_started.rst

This file was deleted.

20 changes: 20 additions & 0 deletions docs/getting_started/getting_started.rst
@@ -0,0 +1,20 @@
Getting started
===============

Before getting started with loading and processing datasets it is useful to have an overview of what the framework provides and its intended workflow.
As depicted in :numref:`fig_pipeline`, the framework provides a pipeline for processing the data by composing a chain of operations applied to the dataset.

.. _fig_pipeline:
.. figure:: ../pics/pipeline.svg
:figwidth: 600
:align: center
:alt: Dataset Ops pipeline

Dataset Ops Pipeline.

At the beginning of this chain is a *loader*, which implements the process of reading a dataset stored in some specific file format.
Following this, the raw data can be processed into a desired form by applying a number of transformations, independently of the underlying storage format.
After applying the transformations, the dataset can be used as is or converted into a type compatible with either PyTorch or TensorFlow.
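A minimal sketch of such a chain is shown below. The loader and transform calls (``from_recursive_files``, ``cache``, ``split``) mirror examples found elsewhere in these docs, while ``read_patient_file`` is a hypothetical user-defined callback; treat the snippet as illustrative rather than a definitive API reference.

.. code-block:: python

    import numpy as np
    import datasetops as do

    def read_patient_file(path):
        # Hypothetical callback mapping a file path to a sample (data, label).
        return (np.loadtxt(path), path.parent.name)

    # Loader step: read raw samples from disk.
    ds = do.loaders.from_recursive_files("patients", read_patient_file)

    # Transform step: operations are applied independently of the storage format,
    # here caching intermediate results and splitting into train/validation sets.
    train, val = ds.cache().split((0.7, 0.3))

    # Export step: the resulting datasets could then be converted into a type
    # compatible with PyTorch or TensorFlow (conversion helpers not shown here).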


An overview of the available loaders and transforms can be found in:
File renamed without changes.
9 changes: 8 additions & 1 deletion docs/howto/custom_loader.rst
@@ -1,2 +1,9 @@
Implementing A New Loader
=========================
=========================

In case the format of your dataset does not fit any of the standard loaders, it is possible to define your own custom loader.
By defining a custom loader, your dataset can be integrated with the framework, allowing transformations to be applied to its data just like with a standard loader.

To define a new loader, a new class must be created that implements the interface declared by :class:`AbstractDataset <datasetops.abstract.AbstractDataset>`.
In the context of the library a dataset is an
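As a rough illustration only, such a class could look like the sketch below. The exact methods required by the interface are not listed in this excerpt, so ``__len__`` and ``__getitem__`` are assumptions based on how datasets are indexed elsewhere in the documentation.

.. code-block:: python

    from pathlib import Path

    import numpy as np

    # Import path follows the class reference above; verify against the API docs.
    from datasetops.abstract import AbstractDataset


    class NpyFolderDataset(AbstractDataset):
        """Hypothetical loader yielding one sample per ``.npy`` file in a folder."""

        def __init__(self, root):
            self._paths = sorted(Path(root).glob("*.npy"))

        def __len__(self):
            return len(self._paths)

        def __getitem__(self, idx):
            # Each sample is a (data, label) tuple; the label is taken from the file name.
            return (np.load(self._paths[idx]), self._paths[idx].stem)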

44 changes: 21 additions & 23 deletions docs/index.rst
@@ -3,54 +3,52 @@ Dataset Ops documentation
Friendly dataset operations for your data science needs.
Dataset Ops provides declarative loading, sampling, splitting and transformation operations for datasets, alongside export options for easy integration with Tensorflow and PyTorch.

.. figure:: pics/pipeline.svg
:figwidth: 600
:align: center
:alt: Dataset Ops pipeline
.. .. figure:: pics/pipeline.svg
.. :figwidth: 600
.. :align: center
.. :alt: Dataset Ops pipeline

Illustration Dataset Ops Pipeline.
Several built-in loaders makes it possible to load datasets stored in various formats.
Several operators are provided that provide common pre-processing steps to be applied to the data quickly.
Finally, the processed data can be used as is or exported in a format to be used with ML frameworks.
.. Illustration Dataset Ops Pipeline.
.. Several built-in loaders makes it possible to load datasets stored in various formats.
.. Several operators are provided that provide common pre-processing steps to be applied to the data quickly.
.. Finally, the processed data can be used as is or exported in a format to be used with ML frameworks.

First Steps
-----------
Are you looking for ways to install the framework
or are you looking for inspiration to get started?

* **Installing**: :doc:`Installing <installing>`
* **Installing**: :doc:`Installing <getting_started/installing>`

* **Getting Started**: :doc:`Getting started <getting_started>`
* **Getting Started**: :doc:`Getting started <getting_started/getting_started>`


.. toctree::
:maxdepth: 2
:hidden:
:caption: Getting Started:
:caption: Getting Started

installing
getting_started
getting_started/installing
getting_started/getting_started

Loaders and Transforms
----------------------

Get an overview of the available loaders and transforms that can be used with your dataset.

* **Loaders**: :doc:`Standard loaders <loaders/standard>`
* **Loaders**: :doc:`Loaders <overview/loaders>`

* **Transforms**: :doc:`General <transforms/common>` | :doc:`Image <transforms/images>` | :doc:`Time-series <transforms/timeseries>`
* **Transforms**: :doc:`Transforms <overview/transforms>`

It is also possible to implement your own loaders and transforms.

.. toctree::
:maxdepth: 2
:hidden:
:caption: Loaders and Transforms
:caption: Overview

loaders/standard
transforms/common
transforms/images
transforms/timeseries
overview/loaders
overview/transforms

Custom Loaders and Transforms
-----------------------------
@@ -65,7 +63,7 @@ For how-to guides on how to do this see:
.. toctree::
:maxdepth: 2
:hidden:
:caption: How-to guides:
:caption: How-to guides

howto/custom_loader
howto/custom_transform
@@ -106,7 +104,7 @@ See the example section:

.. toctree::
:maxdepth: 1
:caption: Examples:
:caption: Examples
:glob:

examples/*
@@ -127,7 +125,7 @@ Information on how the codebase is tested, how it is published, and how to ad
.. toctree::
:maxdepth: 2
:hidden:
:caption: Contributing:
:caption: Contributing

development/communication
development/codebase
12 changes: 0 additions & 12 deletions docs/loaders/standard.rst

This file was deleted.

17 changes: 8 additions & 9 deletions docs/optimization/caching.rst
@@ -1,3 +1,4 @@
.. _sec_caching:
Caching
=======

@@ -21,10 +22,9 @@ To cache some combination of dataset and transformations the *cache* function is

.. doctest::

>>> kernel = np.ones((5,5))*(5**2)
>>> train, val = do.load_mnist().whiten().image_filter(kernel).cache().split((0.7,0.3))
>>> # TODO
False
>>> # kernel = np.ones((5,5))*(5**2)
>>> # train, val = do.load_mnist().whiten().image_filter(kernel).cache().split((0.7,0.3))
... # doctest: +SKIP

The library keeps track of what values are available in the cache and ensures that the cache is recalculated when necessary.
For example, the cache will be updated when a new operation is introduced before the cache operator, or when the parameters of one or more transforms are modified.
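For reference, a sketch of where ``cache`` sits in such a chain is shown below; it simply mirrors the commented-out example above, and ``load_mnist``, ``whiten``, ``image_filter`` and ``split`` are taken from that example rather than verified against the final API.

.. code-block:: python

    import numpy as np
    import datasetops as do

    kernel = np.ones((5, 5)) * (5 ** 2)

    # Everything up to and including ``image_filter`` is computed once and cached;
    # changing a parameter upstream of ``cache()`` invalidates the cached values.
    train, val = do.load_mnist().whiten().image_filter(kernel).cache().split((0.7, 0.3))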
@@ -37,8 +37,7 @@ To ensure that the size of the cache does not grow indefinitely it is possible t

.. doctest::

>>> do.set_caching_cleanup_strategy("clean_unused")
>>> do.set_caching_cleanup_strategy("never")
>>> do.clear_cache()
>>> #TODO
False
>>> # do.set_caching_cleanup_strategy("clean_unused")
>>> # do.set_caching_cleanup_strategy("never")
>>> # do.clear_cache()
... # doctest: +SKIP
136 changes: 136 additions & 0 deletions docs/overview/loaders.rst
@@ -0,0 +1,136 @@
Loaders
=======

Dataset Ops provides a set of standard loaders that cover the most frequently exchanged formats.

PyTorch
-------

Tensorflow
----------

Recursive File Loader
---------------------
Provides functionality for loading files stored in a tree structure, recursively and in a generic manner.
A callback function must be specified, which is invoked with the `path <https://docs.python.org/3/library/pathlib.html#pathlib.Path>`__ of each file.
When called, this function should return a sample corresponding to the contents of the file.
Specific files may be skipped by returning ``None`` from the callback.

.. code-block::

patients
├── control
│   ├── somefile.csv
│   ├── subject_a.txt
│   └── subject_b.txt
└── experimental
├── subject_c.txt
└── subject_d.txt


.. doctest::

>>> import numpy as np
>>> def func(path):
...     if path.suffix != ".txt":
...         return None
...     data = np.loadtxt(path)
...     blood_pressure = data[:, 0]
...     # Files under the "control" folder are control subjects.
...     is_control = path.parent.name == "control"
...     return (blood_pressure, is_control)
>>>
>>> # ds = do.loaders.from_recursive_files("patients", func)
>>> # len(ds)
>>> 4
4
>>> #ds[0][0].shape
>>> (270, 1)
(270, 1)


Comma-separated values (CSV)
----------------------------

CSV is a format commonly used by spreadsheet editors to store tabular data.
Consider the scenario where the data describes the correlation between speed and vibration
under some specific load for two different car models, referred to as *car1* and *car2*.

For two experiments the folder structure may look like:

.. code-block::

cars_csv
├── car1
│   ├── load_1000.csv
│   └── load_2000.csv
└── car2
├── load_1000.csv
└── load_2000.csv

The contents of each file may look like:

.. code-block::

speed,vibration
1,0.5
2,1
3,1.5

The :func:`load_csv <datasetops.loaders.load_csv>` function allows either a single CSV file or multiple CSV files to be loaded.

.. note::

CSV is not standardized, rather it refers to a *family* of related formats, each differing slightly and with their own quirks.
Under the hood the framework relies on Pandas's `read_csv <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html>`__ implementation.

Single File
~~~~~~~~~~~
To load a single CSV file, the path of the file is passed to the function.

.. .. doctest::

.. >>> ds = do.from_csv("car1/load_1000.csv")
.. >>> len(ds)
.. 3
.. >>> ds[0]
.. Empty DataFrame
.. Columns: []
.. Index: []
.. >>> ds[0].shape
.. (1,2)

Finally, it is possible to pass a function to transform the raw data into a sample.
The function must take the path and the raw data as arguments and in turn return a new sample:

.. .. doctest::

.. >>> def func(path, data):
.. ...     load = int(path.stem.split("_")[-1])
.. ...     return (data, load)
.. >>> ds = do.load_csv("car1/load_1000.csv",func)
.. >>> ds[0][1]
.. 1000

This is useful for converting the data into other formats or for extracting labels from the name of the CSV file.

Multiple Files
~~~~~~~~~~~~~~
The process of loading multiple files is similar.
However, instead of specifying a single CSV file, a directory containing the CSV files is specified.
This will search recursively for CSV files, creating a sample for each file.

.. .. doctest::

.. >>> ds = do.load_csv("cars_csv")
.. >>> len(ds)
.. 4
.. >>> ds[0].shape
.. (3,2)

Similar to before, it is possible to supply a callback function for transforming the data.
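As a sketch, and assuming the ``func(path, data)`` callback signature shown in the single-file example, the car model and load could be recovered from each file's location and name:

.. code-block:: python

    import datasetops as do

    def func(path, data):
        # e.g. cars_csv/car1/load_1000.csv -> car="car1", load=1000
        car = path.parent.name
        load = int(path.stem.split("_")[-1])
        return (data, car, load)

    ds = do.load_csv("cars_csv", func)
    # One sample per CSV file found recursively under "cars_csv",
    # i.e. len(ds) == 4 for the folder structure shown earlier.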

Data format
~~~~~~~~~~~
It is possible to control the format of the data read from the CSV files by specifying the *data_format* parameter.
The two options are a tuple or a Pandas `DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame>`__.
If the column names defined in the CSV are valid attribute names, a named tuple will be returned; otherwise, a plain tuple is returned.
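A sketch of how this might be used is given below; the accepted values of *data_format* are not listed in this excerpt, so the strings used here are hypothetical placeholders.

.. code-block:: python

    import datasetops as do

    # "dataframe" and "tuple" are hypothetical values -- consult the API reference.
    ds_frames = do.load_csv("car1/load_1000.csv", data_format="dataframe")
    ds_tuples = do.load_csv("car1/load_1000.csv", data_format="tuple")

    # With valid column names ("speed", "vibration") the second variant yields
    # named tuples; otherwise plain tuples are returned.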

Loading