Merge branch 'staging' into jumin/update_o16n
loomlike committed Sep 20, 2020
2 parents d387202 + d49665d commit ed707e3
Showing 21 changed files with 1,503 additions and 631 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -77,6 +77,7 @@ The table below lists the recommender algorithms currently available in the repo
| LightFM/Hybrid Matrix Factorization | [Python CPU](examples/02_model_hybrid/lightfm_deep_dive.ipynb) | Hybrid | Hybrid matrix factorization algorithm for both implicit and explicit feedback |
| LightGBM/Gradient Boosting Tree<sup>*</sup> | [Python CPU](examples/00_quick_start/lightgbm_tinycriteo.ipynb) / [PySpark](examples/02_model_content_based_filtering/mmlspark_lightgbm_criteo.ipynb) | Content-Based Filtering | Gradient Boosting Tree algorithm for fast training and low memory usage in content-based problems |
| LightGCN | [Python CPU / Python GPU](examples/02_model_collaborative_filtering/lightgcn_deep_dive.ipynb) | Collaborative Filtering | Deep learning algorithm that simplifies the design of GCN for predicting implicit feedback |
| GeoIMC | [Python CPU](examples/00_quick_start/geoimc_movielens.ipynb) | Hybrid | Matrix completion algorithm that takes into account user and item features, using Riemannian conjugate gradient optimization and following a geometric approach |
| GRU4Rec | [Python CPU / Python GPU](examples/00_quick_start/sequential_recsys_amazondataset.ipynb) | Collaborative Filtering | Sequential-based algorithm that aims to capture both long- and short-term user preferences using recurrent neural networks |
| Neural Recommendation with Long- and Short-term User Representations (LSTUR)<sup>*</sup> | [Python CPU / Python GPU](examples/00_quick_start/lstur_MIND.ipynb) | Content-Based Filtering | Neural recommendation algorithm with long- and short-term user interest modeling |
| Neural Recommendation with Attentive Multi-View Learning (NAML)<sup>*</sup> | [Python CPU / Python GPU](examples/00_quick_start/naml_MIND.ipynb) | Content-Based Filtering | Neural recommendation algorithm with attentive multi-view learning |
@@ -110,7 +111,7 @@ We provide a [benchmark notebook](examples/06_benchmarks/movielens.ipynb) to ill
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [ALS](examples/00_quick_start/als_movielens.ipynb) | 0.004732 | 0.044239 | 0.048462 | 0.017796 | 0.965038 | 0.753001 | 0.255647 | 0.251648 |
| [SVD](examples/02_model_collaborative_filtering/surprise_svd_deep_dive.ipynb) | 0.012873 | 0.095930 | 0.091198 | 0.032783 | 0.938681 | 0.742690 | 0.291967 | 0.291971 |
- | [SAR](examples/00_quick_start/sar_movielens.ipynb) | 0.113028 | 0.388321 | 0.333828 | 0.183179 | N/A | N/A | N/A | N/A |
+ | [SAR](examples/00_quick_start/sar_movielens.ipynb) | 0.110591 | 0.382461 | 0.330753 | 0.176385 | 1.253805 | 1.048484 | -0.569363 | 0.030474 |
| [NCF](examples/02_model_hybrid/ncf_deep_dive.ipynb) | 0.107720 | 0.396118 | 0.347296 | 0.180775 | N/A | N/A | N/A | N/A |
| [BPR](examples/02_model_collaborative_filtering/cornac_bpr_deep_dive.ipynb) | 0.105365 | 0.389948 | 0.349841 | 0.181807 | N/A | N/A | N/A | N/A |
| [FastAI](examples/00_quick_start/fastai_movielens.ipynb) | 0.025503 | 0.147866 | 0.130329 | 0.053824 | 0.943084 | 0.744337 | 0.285308 | 0.287671 |
32 changes: 16 additions & 16 deletions SETUP.md
@@ -51,13 +51,13 @@ conda update anaconda # use 'conda install anaconda' if the package is no
We provide a script, [generate_conda_file.py](tools/generate_conda_file.py), to generate a conda-environment yaml file which you can use to create the target environment with Python 3.6 and all the correct dependencies.

**NOTE**: the `xlearn` package has a dependency on `cmake`. If you use the `xlearn`-related notebooks or scripts, make sure `cmake` is installed on the system. The easiest way to install it on Linux is with apt-get: `sudo apt-get install -y build-essential cmake`. Detailed instructions for installing `cmake` from source can be found [here](https://cmake.org/install/).

Assuming the repo is cloned as `Recommenders` in the local system, to install **a default (Python CPU) environment**:

cd Recommenders
python tools/generate_conda_file.py
conda env create -f reco_base.yaml

You can specify the environment name as well with the flag `-n`.

@@ -70,7 +70,7 @@ Assuming that you have a GPU machine, to install the Python GPU environment:

cd Recommenders
python tools/generate_conda_file.py --gpu
conda env create -f reco_gpu.yaml

</details>

@@ -85,7 +85,7 @@ To install the PySpark environment:

> Additionally, if you want to test a particular version of Spark, you may pass the `--pyspark-version` argument:
>
- > python tools/generate_conda_file.py --pyspark-version 2.4.0
+ > python tools/generate_conda_file.py --pyspark-version 2.4.5
Then, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.

@@ -94,29 +94,29 @@ Click on the following menus to see details:
<summary><strong><em>Set PySpark environment variables on Linux or MacOS</em></strong></summary>

To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux).

First, get the path where the environment `reco_pyspark` is installed:

RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}')
mkdir -p $RECO_ENV/etc/conda/activate.d
mkdir -p $RECO_ENV/etc/conda/deactivate.d

You also need to find where Spark is installed and set the `SPARK_HOME` variable; on the DSVM, `SPARK_HOME=/dsvm/tools/spark/current`.

Then, create the file `$RECO_ENV/etc/conda/activate.d/env_vars.sh` and add:

#!/bin/sh
RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}')
export PYSPARK_PYTHON=$RECO_ENV/bin/python
export PYSPARK_DRIVER_PYTHON=$RECO_ENV/bin/python
export SPARK_HOME_BACKUP=$SPARK_HOME
unset SPARK_HOME
export SPARK_HOME=/dsvm/tools/spark/current

This will export the variables every time we do `conda activate reco_pyspark`. To unset these variables when we deactivate the environment, create the file `$RECO_ENV/etc/conda/deactivate.d/env_vars.sh` and add:

#!/bin/sh
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON
export SPARK_HOME=$SPARK_HOME_BACKUP
unset SPARK_HOME_BACKUP


</details>

@@ -128,7 +128,7 @@ First, get the path of the environment `reco_pyspark` is installed:
for /f "delims=" %A in ('conda env list ^| grep reco_pyspark ^| awk "{print $NF}"') do set "RECO_ENV=%A"

Then, create the file `%RECO_ENV%\etc\conda\activate.d\env_vars.bat` and add:

@echo off
for /f "delims=" %%A in ('conda env list ^| grep reco_pyspark ^| awk "{print $NF}"') do set "RECO_ENV=%%A"
set PYSPARK_PYTHON=%RECO_ENV%\python.exe
@@ -149,7 +149,7 @@ create the file `%RECO_ENV%\etc\conda\deactivate.d\env_vars.bat` and add:
set SPARK_HOME_BACKUP=
set PYTHONPATH=%PYTHONPATH_BACKUP%
set PYTHONPATH_BACKUP=

</details>

</details>
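
As an optional sanity check, the following minimal sketch (assuming `pyspark` is installed in the `reco_pyspark` environment and the environment is activated) confirms that the variables point at the conda interpreter and that a local Spark session starts:

```python
# Minimal sanity check: run inside the activated reco_pyspark environment.
import os
import sys

# Both should point at the conda environment's Python interpreter.
print("sys.executable:", sys.executable)
print("PYSPARK_PYTHON:", os.environ.get("PYSPARK_PYTHON"))
print("SPARK_HOME:    ", os.environ.get("SPARK_HOME"))  # e.g. /dsvm/tools/spark/current on the DSVM

# Starting a local session verifies that pyspark picks up the variables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("env_check").getOrCreate()
print("Spark version: ", spark.version)
spark.stop()
```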
@@ -176,7 +176,7 @@ We can register our created conda environment to appear as a kernel in the Jupyt

conda activate my_env_name
python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"

If you are using the DSVM, you can [connect to JupyterHub](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro#jupyterhub-and-jupyterlab) by browsing to `https://your-vm-ip:8000`.
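
To double-check that the kernel was registered, a short sketch using `jupyter_client` (assumed to be available in the active environment) can list the installed kernel specs:

```python
# Optional check: list registered Jupyter kernel specs and confirm my_env_name is present.
from jupyter_client.kernelspec import KernelSpecManager

specs = KernelSpecManager().find_kernel_specs()  # {kernel_name: resource_dir}
for name, path in sorted(specs.items()):
    print(f"{name}: {path}")

assert "my_env_name" in specs, "kernel was not registered for the current user"
```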

### Troubleshooting for the DSVM
@@ -204,7 +204,7 @@ sudo update-alternatives --config java

### Requirements of Azure Databricks

- * Databricks Runtime version 4.3 (Apache Spark 2.3.1, Scala 2.11) or greater
+ * Databricks Runtime version >= 4.3 (Apache Spark 2.3.1, Scala 2.11) and <= 5.5 (Apache Spark 2.4.3, Scala 2.11)
* Python 3

An example of how to create an Azure Databricks workspace and an Apache Spark cluster within the workspace can be found [here](https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal). To utilize deep learning models and GPUs, you may set up a GPU-enabled cluster. For more details about this topic, please see the [Azure Databricks deep learning guide](https://docs.azuredatabricks.net/applications/deep-learning/index.html).
@@ -242,7 +242,7 @@ The installation script has a number of options that can also deal with differen
python tools/databricks_install.py -h
```
Once you have confirmed the Databricks cluster is *RUNNING*, install the modules within this repository with the following commands.

```{shell}
cd Recommenders
@@ -339,7 +339,7 @@ Additionally, you must install the [spark-cosmosdb connector](https://docs.datab

## Install the utilities via PIP

A [setup.py](setup.py) file is provided to simplify the installation of the utilities in this repo from the main directory.

This still requires the conda environment to be installed as described above. Once the necessary dependencies are installed, you can use the following command to install `reco_utils` as a Python package.

2 changes: 1 addition & 1 deletion contrib/sarplus/python/tests/test_pyspark_sar.py
@@ -331,7 +331,7 @@ def test_sar_item_similarity(
.reset_index(drop=True)
)

- if similarity_type is "cooccurrence":
+ if similarity_type == "cooccurrence":
assert (item_similarity_ref == item_similarity).all().all()
else:
assert (
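
For context on the one-line change above: `is` compares object identity while `==` compares values, so an identity check against a string literal only passes when the interpreter happens to reuse the same object (recent Python versions even emit a `SyntaxWarning` for it). A minimal illustration, independent of the test suite:

```python
# `is` checks identity, `==` checks equality; equal strings need not be the same object.
a = "cooccurrence"
b = "".join(["co", "occurrence"])  # equal value, but a distinct object built at runtime

print(a == b)  # True  -- the comparison the test intends
print(a is b)  # False -- identity check; depends on string interning, so it is unreliable
```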