Feature subsample #34

Open · wants to merge 44 commits into base: develop

Commits (44)
e8600c8
getting started
clegaard Apr 8, 2020
95b7e96
more push
clegaard Apr 8, 2020
5bac86f
docs docs docs
clegaard Apr 9, 2020
4af5876
more docs
clegaard Apr 10, 2020
6fdbc35
added more documentation for loaders
clegaard Apr 10, 2020
2ad5f42
added test data
clegaard Apr 10, 2020
7af735b
added file structure snipper
clegaard Apr 10, 2020
9c0ce1a
renamed example dataset and implemented loader
clegaard Apr 11, 2020
f1a6b02
added definition of slow maker to supress warning
clegaard Apr 11, 2020
18c8c60
updated tests
clegaard Apr 11, 2020
0b650c3
function for getting test example paths now returns a Path object rat…
clegaard Apr 11, 2020
a4f9630
improved loaders
clegaard Apr 11, 2020
4007600
added option to pass arguments to pandas read_csv through kwargs
clegaard Apr 11, 2020
ea949cf
now with data
clegaard Apr 11, 2020
23d4d53
added figure to subsample documentation
clegaard Apr 11, 2020
f509553
Replaced code-fences (```) with code-block directive
clegaard Apr 11, 2020
b5e60cc
added defintion of subsample-transform
clegaard Apr 12, 2020
ddc10d3
Merge branch 'develop' of https://github.com/LukasHedegaard/datasetop…
clegaard Apr 12, 2020
061809e
updated documentation related to subsampling
clegaard Apr 12, 2020
70ae6ab
initial implmentation of subsampling and formatted using black
clegaard Apr 12, 2020
111bcc6
added caching for subsampling
clegaard Apr 12, 2020
862e63c
minor correction in test
clegaard Apr 12, 2020
e173d62
fixed bug in subsample caching
clegaard Apr 13, 2020
d82b012
fixed incorrect index
clegaard Apr 13, 2020
72884b3
added declaration of supersample transform
clegaard Apr 13, 2020
6520910
added loader from generating a dataset from iterables and started wor…
clegaard Apr 14, 2020
b5c7af2
Added caching for dataset's shape property
clegaard Apr 14, 2020
0c6b59a
implemented slicing for Dataset.__getitem__ and cached shape
clegaard Apr 14, 2020
3177861
Merge branch 'develop' of https://github.com/LukasHedegaard/datasetop…
clegaard Apr 15, 2020
3a7868c
fixed linting issues
clegaard Apr 15, 2020
4139996
migration to pytest for running doctest
clegaard Apr 15, 2020
83fad3a
disabled/commented doctests
clegaard Apr 16, 2020
c0c5bec
defined testpaths in pytest config to stop it from recursing
clegaard Apr 16, 2020
56a1cee
removed external dataset
clegaard Apr 16, 2020
4049791
renamed pytest doctest setup function
clegaard Apr 16, 2020
91a392c
Update Loader (rm extend) and add range ids
LukasHedegaard Apr 16, 2020
a5c3b2d
Fix typing
LukasHedegaard Apr 16, 2020
8e98d5f
Fix naming for DATASET_PATHS
LukasHedegaard Apr 16, 2020
493a06d
Silence intentional type error. Add comments
LukasHedegaard Apr 16, 2020
930a4a8
Rename _downstream_getter and cachable
LukasHedegaard Apr 16, 2020
4b334e4
Attribute names and removed dataset reference
clegaard Apr 17, 2020
d881128
from_csv now returns a plain tuple
clegaard Apr 17, 2020
9ddc088
Improve behavior of slicing in __getitem__
clegaard Apr 18, 2020
afc2ca3
Merge abstract.py into types.py and update typing
clegaard Apr 19, 2020
2 changes: 1 addition & 1 deletion .gitignore
@@ -140,4 +140,4 @@ dmypy.json
.DS_Store

tests/resources/
.datasetops_cache
.datasetops_cache
12 changes: 12 additions & 0 deletions conftest.py
@@ -0,0 +1,12 @@
import pytest

import datasetops as do
from datasetops.loaders import from_iterable, from_recursive_files

# see http://doc.pytest.org/en/latest/doctest.html (doctest_namespace fixture)
@pytest.fixture(autouse=True)
def setup_doctest_namespace(doctest_namespace):

doctest_namespace["do"] = do
doctest_namespace["from_iterable"] = from_iterable
doctest_namespace["load_files_recursive"] = from_recursive_files
9 changes: 2 additions & 7 deletions docs/conf.py
@@ -35,13 +35,8 @@
"recommonmark",
"sphinx_rtd_theme",
]
doctest_global_setup = """
try:
import datasetops as do
except ImportError:
do = None
"""

numfig = True

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
@@ -78,4 +73,4 @@
extensions.append("autoapi.extension")
autoapi_type = "python"
autoapi_dirs = ["../src/"]
autoapi_keep_files = True
autoapi_keep_files = False
27 changes: 0 additions & 27 deletions docs/getting_started.rst

This file was deleted.

20 changes: 20 additions & 0 deletions docs/getting_started/getting_started.rst
@@ -0,0 +1,20 @@
Getting started
===============

Before getting started with loading and processing datasets it is useful to have an overview of what the framework provides and its intended workflow.
As depicted in :numref:`fig_pipeline`, the framework provides a pipeline for processing the data by composing a chain of operations applied to the dataset.

.. _fig_pipeline:
.. figure:: ../pics/pipeline.svg
:figwidth: 600
:align: center
:alt: Dataset Ops pipeline

Dataset Ops Pipeline.

At the beginning of this chain is a *loader*, which implements the process of reading a dataset stored in some specific file format.
Following this, the raw data can be processed into a desired form by applying a number of transformations, independently of the underlying storage format.
After applying the transformations, the dataset can be used as is or converted into a type compatible with either PyTorch or TensorFlow.
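A minimal sketch of such a chain is shown below. The loader and transform calls (``from_recursive_files``, ``cache``, ``split``) mirror examples found elsewhere in these docs, while ``read_patient_file`` is a hypothetical user-defined callback; treat the snippet as illustrative rather than a definitive API reference.

.. code-block:: python

    import numpy as np
    import datasetops as do

    def read_patient_file(path):
        # Hypothetical callback mapping a file path to a sample (data, label).
        return (np.loadtxt(path), path.parent.name)

    # Loader step: read raw samples from disk.
    ds = do.loaders.from_recursive_files("patients", read_patient_file)

    # Transform step: operations are applied independently of the storage format,
    # here caching intermediate results and splitting into train/validation sets.
    train, val = ds.cache().split((0.7, 0.3))

    # Export step: the resulting datasets could then be converted into a type
    # compatible with PyTorch or TensorFlow (conversion helpers not shown here).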


An overview of the available loaders and transforms can be found in:
File renamed without changes.
9 changes: 8 additions & 1 deletion docs/howto/custom_loader.rst
@@ -1,2 +1,9 @@
Implementing A New Loader
=========================
=========================

In case the format of your dataset does not fit any of the standard loaders, it is possible to define your own custom loader.
By defining a custom loader, your dataset can be integrated with the framework, allowing transformations to be applied to its data just like with a standard loader.

To define a new loader, a new class must be created that implements the interface declared by :class:`AbstractDataset <datasetops.abstract.AbstractDataset>`.
In the context of the library a dataset is an
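As a rough illustration only, such a class could look like the sketch below. The exact methods required by the interface are not listed in this excerpt, so ``__len__`` and ``__getitem__`` are assumptions based on how datasets are indexed elsewhere in the documentation.

.. code-block:: python

    from pathlib import Path

    import numpy as np

    # Import path follows the class reference above; verify against the API docs.
    from datasetops.abstract import AbstractDataset


    class NpyFolderDataset(AbstractDataset):
        """Hypothetical loader yielding one sample per ``.npy`` file in a folder."""

        def __init__(self, root):
            self._paths = sorted(Path(root).glob("*.npy"))

        def __len__(self):
            return len(self._paths)

        def __getitem__(self, idx):
            # Each sample is a (data, label) tuple; the label is taken from the file name.
            return (np.load(self._paths[idx]), self._paths[idx].stem)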

44 changes: 21 additions & 23 deletions docs/index.rst
@@ -3,54 +3,52 @@ Dataset Ops documentation
Friendly dataset operations for your data science needs.
Dataset Ops provides declarative loading, sampling, splitting and transformation operations for datasets, alongside export options for easy integration with Tensorflow and PyTorch.

.. figure:: pics/pipeline.svg
:figwidth: 600
:align: center
:alt: Dataset Ops pipeline
.. .. figure:: pics/pipeline.svg
.. :figwidth: 600
.. :align: center
.. :alt: Dataset Ops pipeline

Illustration Dataset Ops Pipeline.
Several built-in loaders makes it possible to load datasets stored in various formats.
Several operators are provided that provide common pre-processing steps to be applied to the data quickly.
Finally, the processed data can be used as is or exported in a format to be used with ML frameworks.
.. Illustration Dataset Ops Pipeline.
.. Several built-in loaders makes it possible to load datasets stored in various formats.
.. Several operators are provided that provide common pre-processing steps to be applied to the data quickly.
.. Finally, the processed data can be used as is or exported in a format to be used with ML frameworks.

First Steps
-----------
Are you looking for ways to install the framework
or are you looking for inspiration to get started?

* **Installing**: :doc:`Installing <installing>`
* **Installing**: :doc:`Installing <getting_started/installing>`

* **Getting Started**: :doc:`Getting started <getting_started>`
* **Getting Started**: :doc:`Getting started <getting_started/getting_started>`


.. toctree::
:maxdepth: 2
:hidden:
:caption: Getting Started:
:caption: Getting Started

installing
getting_started
getting_started/installing
getting_started/getting_started

Loaders and Transforms
----------------------

Get an overview of the available loaders and transforms that can be used with your dataset.

* **Loaders**: :doc:`Standard loaders <loaders/standard>`
* **Loaders**: :doc:`Loaders <overview/loaders>`

* **Transforms**: :doc:`General <transforms/common>` | :doc:`Image <transforms/images>` | :doc:`Time-series <transforms/timeseries>`
* **Transforms**: :doc:`Transforms <overview/transforms>`

It is also possible to implement your own loaders and transforms.

.. toctree::
:maxdepth: 2
:hidden:
:caption: Loaders and Transforms
:caption: Overview

loaders/standard
transforms/common
transforms/images
transforms/timeseries
overview/loaders
overview/transforms

Custom Loaders and Transforms
-----------------------------
@@ -65,7 +63,7 @@ For how-to guides on how to do this see:
.. toctree::
:maxdepth: 2
:hidden:
:caption: How-to guides:
:caption: How-to guides

howto/custom_loader
howto/custom_transform
@@ -106,7 +104,7 @@ See the example section:

.. toctree::
:maxdepth: 1
:caption: Examples:
:caption: Examples
:glob:

examples/*
@@ -127,7 +125,7 @@ Information on how the codebase is tested, how it is published, and how to ad
.. toctree::
:maxdepth: 2
:hidden:
:caption: Contributing:
:caption: Contributing

development/communication
development/codebase
12 changes: 0 additions & 12 deletions docs/loaders/standard.rst

This file was deleted.

17 changes: 8 additions & 9 deletions docs/optimization/caching.rst
@@ -1,3 +1,4 @@
.. _sec_caching:
Caching
=======

@@ -21,10 +22,9 @@ To cache some combination of dataset and transformations the *cache* function is

.. doctest::

>>> kernel = np.ones((5,5))*(5**2)
>>> train, val = do.load_mnist().whiten().image_filter(kernel).cache().split((0.7,0.3))
>>> # TODO
False
>>> # kernel = np.ones((5,5))*(5**2)
>>> # train, val = do.load_mnist().whiten().image_filter(kernel).cache().split((0.7,0.3))
... # doctest: +SKIP

The library keeps track of what values are available in the cache and ensures that the cache is recalculated when necessary.
For example, the cache will be updated when a new operation is introduced before the cache operator, or when the parameters of one or more transforms are modified.
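For reference, a sketch of where ``cache`` sits in such a chain is shown below; it simply mirrors the commented-out example above, and ``load_mnist``, ``whiten``, ``image_filter`` and ``split`` are taken from that example rather than verified against the final API.

.. code-block:: python

    import numpy as np
    import datasetops as do

    kernel = np.ones((5, 5)) * (5 ** 2)

    # Everything up to and including ``image_filter`` is computed once and cached;
    # changing a parameter upstream of ``cache()`` invalidates the cached values.
    train, val = do.load_mnist().whiten().image_filter(kernel).cache().split((0.7, 0.3))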
@@ -37,8 +37,7 @@ To ensure that the size of the cache does not grow indefinitely it is possible t

.. doctest::

>>> do.set_caching_cleanup_strategy("clean_unused")
>>> do.set_caching_cleanup_strategy("never")
>>> do.clear_cache()
>>> #TODO
False
>>> # do.set_caching_cleanup_strategy("clean_unused")
>>> # do.set_caching_cleanup_strategy("never")
>>> # do.clear_cache()
... # doctest: +SKIP
136 changes: 136 additions & 0 deletions docs/overview/loaders.rst
@@ -0,0 +1,136 @@
Loaders
=======

Dataset Ops provides a set of standard loaders that cover the most frequently exchanged formats.

PyTorch
-------

Tensorflow
----------

Recursive File Loader
---------------------
Provides functionality for loading files stored in a tree structure, recursively and in a generic manner.
A callback function must be specified, which is invoked with the `path <https://docs.python.org/3/library/pathlib.html#pathlib.Path>`__ of each file.
When called, this function should return a sample corresponding to the contents of the file.
Specific files may be skipped by returning ``None`` from the callback.

.. code-block::

patients
├── control
│   ├── somefile.csv
│   ├── subject_a.txt
│   └── subject_b.txt
└── experimental
├── subject_c.txt
└── subject_d.txt


.. doctest::

>>> import numpy as np
>>> def func(path):
...     if path.suffix != ".txt":
...         return None
...     data = np.loadtxt(path)
...     blood_pressure = data[:, 0]
...     # Files under the "control" folder are control subjects.
...     is_control = path.parent.name == "control"
...     return (blood_pressure, is_control)
>>>
>>> # ds = do.loaders.from_recursive_files("patients", func)
>>> # len(ds)
>>> 4
4
>>> #ds[0][0].shape
>>> (270, 1)
(270, 1)


Comma-separated values (CSV)
----------------------------

CSV is a format commonly used by spreadsheet editors to store tabular data.
Consider the scenario where the data describes the correlation between speed and vibration
under some specific load for two different car models, referred to as *car1* and *car2*.

For two experiments the folder structure may look like:

.. code-block::

cars_csv
├── car1
│   ├── load_1000.csv
│   └── load_2000.csv
└── car2
├── load_1000.csv
└── load_2000.csv

The contents of each file may look like:

.. code-block::

speed,vibration
1,0.5
2,1
3,1.5

The :func:`load_csv <datasetops.loaders.load_csv>` function allows either a single CSV file or multiple CSV files to be loaded.

.. note::

CSV is not standardized, rather it refers to a *family* of related formats, each differing slightly and with their own quirks.
Under the hood the framework relies on Pandas's `read_csv <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html>`__ implementation.

Single File
~~~~~~~~~~~
To load a single CSV file, the path of the file is passed to the function.

.. .. doctest::

.. >>> ds = do.from_csv("car1/load_1000.csv")
.. >>> len(ds)
.. 3
.. >>> ds[0]
.. Empty DataFrame
.. Columns: []
.. Index: []
.. >>> ds[0].shape
.. (1,2)

Finally, it is possible to pass a function to transform the raw data into a sample.
The function must take the path and the raw data as arguments and in turn return a new sample:

.. .. doctest::

.. >>> def func(path, data):
.. ...     load = int(path.stem.split("_")[-1])
.. ...     return (data, load)
.. >>> ds = do.load_csv("car1/load_1000.csv",func)
.. >>> ds[0][1]
.. 1000

This is useful for converting the data into other formats or for extracting labels from the name of the CSV file.

Multiple Files
~~~~~~~~~~~~~~
The process of loading multiple files is similar.
However, instead of specifying a single CSV file, a directory containing the CSV files is specified.
This will search recursively for CSV files, creating a sample for each file.

.. .. doctest::

.. >>> ds = do.load_csv("cars_csv")
.. >>> len(ds)
.. 4
.. >>> ds[0].shape
.. (3,2)

Similar to before, it is possible to supply a callback function for transforming the data.
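As a sketch, and assuming the ``func(path, data)`` callback signature shown in the single-file example, the car model and load could be recovered from each file's location and name:

.. code-block:: python

    import datasetops as do

    def func(path, data):
        # e.g. cars_csv/car1/load_1000.csv -> car="car1", load=1000
        car = path.parent.name
        load = int(path.stem.split("_")[-1])
        return (data, car, load)

    ds = do.load_csv("cars_csv", func)
    # One sample per CSV file found recursively under "cars_csv",
    # i.e. len(ds) == 4 for the folder structure shown earlier.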

Data format
~~~~~~~~~~~
It is possible to control the format of the data read from the CSV files by specifying the *data_format* parameter.
The two options are a tuple or a Pandas `DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame>`__.
If the column names defined in the CSV are valid attribute names, a named tuple will be returned; otherwise, a plain tuple is returned.
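A sketch of how this might be used is given below; the accepted values of *data_format* are not listed in this excerpt, so the strings used here are hypothetical placeholders.

.. code-block:: python

    import datasetops as do

    # "dataframe" and "tuple" are hypothetical values -- consult the API reference.
    ds_frames = do.load_csv("car1/load_1000.csv", data_format="dataframe")
    ds_tuples = do.load_csv("car1/load_1000.csv", data_format="tuple")

    # With valid column names ("speed", "vibration") the second variant yields
    # named tuples; otherwise plain tuples are returned.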

Loading