
Commit e3afb7e

Release v1.8.1

1 parent 3a695c4

10 files changed: +188 −15 lines

CONTRIBUTE.md (+1 −5)

@@ -43,12 +43,11 @@ Run the commands (or follow the MkDocs documentation to locally pip install MkDocs)
 # requirements.txt
 
 ## using pip
-python -m pip install -r requirements.txt
+pip install -r requirements.txt
 
 ## using Conda
 conda create --name <env_name> --file requirements.txt
 ```
-For best practices on creating virtual environments, please refer to the [RAP Community of Practice training resources](./docs/training_resources/python/virtual-environments/why-use-virtual-environments.md).
 
 ### Hosting
 
@@ -57,9 +56,6 @@ To host the website locally to view the live changes, run the command:
 ```bash
 mkdocs serve
 ```
-Open up http://127.0.0.1:8000/ in your browser, and you'll see the [RAP Community of Practice home page](https://nhsdigital.github.io/rap-community-of-practice/) being displayed with your updates applied.
-
-Read more: [Getting Started with MkDocs](https://www.mkdocs.org/getting-started/#getting-started-with-mkdocs)
 
 ### Editing the contents

docs/implementing_RAP/tools.md (+4)

@@ -235,8 +235,12 @@ To work with this integration, you must install the Jupyter package in your base environment
 You can then create and run Jupyter-like code cells, defined within Python code using a `# %%` comment:
 
 ```Python
+# %%
+msg = "Hello World"
 print(msg)
 
+# %%
+msg = "Hello again"
 print(msg)
 ```
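The `# %%` marker is all VS Code needs to treat the lines that follow as a runnable cell. As a slightly fuller sketch (illustrative only, not part of the commit), here is a script with two code cells and a rendered text cell; the `# %% [markdown]` variant is a VS Code convention beyond what the diff shows:

```Python
# %%
# first cell: runs on its own in the interactive window
msg = "Hello World"
print(msg)

# %% [markdown]
# Lines in a markdown cell render as formatted notes.

# %%
# second cell: can be re-run independently of the first
msg = "Hello again"
print(msg)
```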

docs/training_resources/pyspark/pyspark-style-guide.md (+6 −2)

@@ -237,14 +237,18 @@ We can create functions like in python, same logic applies, but instead using PySpark
 
 ```python
 def group_by_and_count_column(df: DataFrame, column_name: str):
-    “”
+    """
     Groups by the specified column, returns the count per grouping, and sorts in descending order.
 
     Args: dataset we are reading from & the column we wish to group by
 
     Returns: groups from column and count
-    “””
+    """
+    # group by the specified column and count the number of records per group
+    df_count = df.groupBy(column_name).count()
 
+    # sort the counts in descending order
+    result = df_count.sort(desc("count"))
 
     return result
 ```
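For context, a minimal usage sketch of the function above (an illustration, not part of the commit; the session setup and example data are hypothetical, and the imports the snippet relies on are shown explicitly):

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import desc  # used inside group_by_and_count_column

spark = SparkSession.builder.getOrCreate()

# hypothetical example data
df = spark.createDataFrame(
    [("A1", 10), ("A1", 20), ("B2", 30)],
    ["ccg_code", "value"],
)

result = group_by_and_count_column(df, "ccg_code")
result.show()
# +--------+-----+
# |ccg_code|count|
# +--------+-----+
# |      A1|    2|
# |      B2|    1|
# +--------+-----+
```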

docs/training_resources/python/basic-python-data-analysis-operations.md (+61)

@@ -28,6 +28,8 @@ df - dataframe
 ```py
 df = pd.read_csv('your_file.csv')
 
+# or if required to edit headers for example
+
 df = pd.read_csv('your_file.csv', header=..., na_values=..., sep=..., etc)
 ```
 
@@ -84,48 +86,74 @@ You will soon notice after importing your data from the .sav file that the columns
 
 ```py
 df.columns = df.columns.str.lower()
+# or
+df.columns = df.columns.str.upper()
 ```
 
 ### Extracting the required columns
 
 To select a column:
 
 ```py
+# columns to keep
+to_keep = ["column 1", "column 2", "column 3", ...]
 
+# create the new table
+filtered_df = df[to_keep]
 ```
 
 ### Filter where a variable is not null/missing
 
 To filter rows based on some specific criteria:
 
 ```py
+# not null
+new_df = df[df["my_column"].notnull()]
 ```
 
 ### Joins
 
 ```py
+# a left join, one column to join on
+joined_df = df.merge(other_df, how="left", on="my_column")
 
+# inner join, on multiple columns
+joined_df = df.merge(other_df, how="inner", on=["column 1", "column 2"])
 ```
 
 ### Add a new column
 
 ```py
+# create new table with a new column that adds 5 to each value of another selected column
+new_df = df.assign(new_column=df["my column"] + 5)
 ```
 
 ### Sorting variables
 
 ```py
+# ascending order can be False or True
+df.sort_values(by="my column", ascending=False)
 
+# if you want to see missing values first, assign na_position
+df.sort_values(by="my column", ascending=False, na_position="first")
 
+# sort by multiple columns
+df.sort_values(by=["my column 1", "my column 2", ...])
 ```
 
 ### Transposing columns
 
 There are a few ways to transpose columns:
 
 ```py
+# set the index of columns
+df.set_index(["my column 1", "my column 2", "my column 3", ...], inplace=True)
 
+# using pandas transpose to swap rows and columns
+df_transposed = df.T
 
+# using pandas stack() to transpose non-index columns into a single new column
+df = df.stack().reset_index()
 ```
 
 To set the name of the axis for the index or columns you can use `rename_axis()`:
 
@@ -137,10 +165,17 @@ df = df.stack().rename_axis().reset_index()
 ### Grouping by variables
 
 ```py
+# group by one column
+new_df = df.groupby("my_column")
+
+# group by multiple columns
 
 # list of columns to group by
 grouped = ["column 1", "column 2", "column 3", ...]
 
+# return new table with grouped columns
+new_df = df.groupby(grouped)
+
 ```
 
 ### Aggregations
 
@@ -154,8 +189,12 @@ new_df = df.groupby(grouped).agg(total_sum=("column to be summarised", "sum"), t
 ### Creating totals per row and per column
 
 ```py
+# total per column, adds a new row "Column Total"
+# this will sum all numeric row values for each column
 df.loc["Column Total"] = df.sum(numeric_only=True, axis=0)
 
+# total per row, creates a new column "Row Total"
+# this will sum all numeric column values for each row
 df.loc[:, "Row Total"] = df.sum(numeric_only=True, axis=1)
 ```
 
@@ -164,15 +203,23 @@ df.loc[:, "Row Total"] = df.sum(numeric_only=True, axis=1)
 When creating different aggregations/groupings which are saved in different dataframes, you can then combine these aggregations into one table. For example, suppose you have calculated the totals for age and gender in different dataframes and you wish to append these results to the final output dataframe.
 
 ```py
+# list the final output dataframe to store its aggregations
+list_df = [df]
 
+# append the calculated totals
+list_df.append(calc_totals_df)
 
+# concatenate into a single dataframe
+output_df = pd.concat(list_df, axis=0)
 ```
 
 ### Creating derivations
 
 To create a derivation based on the equivalent CASE WHEN SQL operation, there are several ways to do this in python:
 
 ```py
+# pandas package CASE WHEN
+# create the age 11 to 15 derivation
 df.loc[df["age"] < 0, "age11_15"] = df["age"]
 df.loc[(df["age"] > 0) & (df["age"] < 11), "age11_15"] = 11
 df.loc[(df["age"] > 10) & (df["age"] < 16), "age11_15"] = df["age"]
 
@@ -182,6 +229,8 @@ df.loc[df["age"] > 14, "age11_15"] = 15
 This results in creating a new column "age11_15" in the existing dataframe, based on the CASE WHEN conditions we applied for the new derivation.
 
 ```py
+# NumPy package CASE WHEN
+# create the age 11 to 15 derivation
 age11_15 = np.select(
     [
         df['age'] == 10,  # WHEN
 
@@ -194,6 +243,8 @@ age11_15 = np.select(
     default=df['age']  # ELSE assign "age" column values
 )
 
+# assign the result to a new column
+df["age11_15"] = age11_15
 ```
 
 In the first bracket you assign the "WHEN" part of the condition, second bracket the "THEN", and "default=..." represents the "ELSE" part.
 
@@ -203,14 +254,24 @@ The NumPy option is faster and more efficient whereas Pandas is user friendlier
 ### Apply a column order
 
 ```py
+# create a list of the column headers in a specific order
+column_order = ["column 1", "column 2", "column 3", ...]
 
+# apply list to dataframe
+df = df[column_order]
 ```
 
 ### Exporting the output
 
 ```py
+# write output to a .csv
+df.to_csv("output.csv", ... <multiple parameters that can be inserted>)
 
+# write output to an excel workbook
+df.to_excel("output.xlsx", sheet_name="Sheet_name_1", ... <multiple parameters that can be inserted>)
 
+# write multiple sheets from different dataframes
+with pd.ExcelWriter("output.xlsx") as writer:
     df1.to_excel(writer, sheet_name="Sheet_name_1")
     df2.to_excel(writer, sheet_name="Sheet_name_2")
 ```
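Read in order, the snippets this commit annotates are the steps of one small pipeline. As a closing illustration (added here for context, not part of the commit; the file, column, and derivation names are made up), the same operations chained together:

```py
import numpy as np
import pandas as pd

# import the data and normalise the headers
df = pd.read_csv("pupil_data.csv")
df.columns = df.columns.str.lower()

# keep only the rows and columns of interest
df = df[df["age"].notnull()]
df = df[["age", "gender", "score"]]

# derive an age band with the NumPy CASE WHEN
df["age11_15"] = np.select(
    [(df["age"] > 10) & (df["age"] < 16)],  # WHEN
    [df["age"]],                            # THEN
    default=np.nan,                         # ELSE
)

# aggregate and export
output_df = df.groupby("gender").agg(total_score=("score", "sum"))
output_df.to_csv("output.csv")
```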

docs/training_resources/python/config-files.md (+2)

@@ -188,6 +188,8 @@ elif config['report_type'] == 'monthly':
     df_report_data = get_monthly_data(config)
 ```
 
+## Over to you
+
 As you were reading through this, did any ideas pop into your head about your own projects? Any values you keep having (or forgetting!) to change? Any bits of code you sometimes need to comment out? If so, you've got a prime candidate for using a config file! So give it a try - implement the above steps in your project and see what you think.
 
 Good luck!
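The hunk context shows the pattern the page builds towards: behaviour switched by a config value rather than by edited code. A minimal sketch of the full round trip (an illustration, not text from the page; it assumes a YAML config read with PyYAML, and `get_weekly_data` / `get_monthly_data` are the hypothetical helpers named in the hunk):

```python
import yaml  # PyYAML

# config.yml might contain a single line:  report_type: monthly
with open("config.yml") as f:
    config = yaml.safe_load(f)

if config['report_type'] == 'weekly':
    df_report_data = get_weekly_data(config)
elif config['report_type'] == 'monthly':
    df_report_data = get_monthly_data(config)
```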

docs/training_resources/python/handling-file-paths.md (+10 −1)

@@ -56,15 +56,20 @@ operations
 For example, you can access the current working directory with the `cwd` attribute.
 
 ```python
+# Print the current working directory (cwd)
+print("CWD:", pathlib.Path.cwd())
 ```
 
 Pass strings to the Path constructor to create a Path object
 
 ```python
+# . is the current directory
+cwd_path = pathlib.Path(".")
 print("CWD (again):", cwd_path)
 
+# Use resolve to get the absolute path!
+cwd_abspath = cwd_path.resolve()
 print("Absolute CWD:", cwd_abspath)
-
 ```
 
 ### Path attributes
 
@@ -74,6 +79,8 @@ The following examples show how pathlib makes it easier to extract specific attributes
 #### Example: absolute path to the current file
 
 ```python
+# Note: __file__ is a global Python variable
+this_file_path = pathlib.Path(__file__)
 print("Path to file:", this_file_path)
 ```
 
@@ -155,6 +162,8 @@ import pandas as pd
 import pyreadstat  # needed to parse sav files in spss
 import pathlib2  # This is just a backwards compatible pathlib!
 
+# https://realpython.com/python-pathlib/
+
 # Add parameters
 BASE_DIR = pathlib2.Path(r"\\<path>\Publication\RAP")
 PUPIL_DIR = BASE_DIR / "Inputs" / "PupilData"
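As a taste of the path attributes the page goes on to cover (an illustrative sketch, not part of the commit; the file name is made up):

```python
import pathlib

p = pathlib.Path("Inputs") / "PupilData" / "pupils_2021.sav"

print(p.name)    # pupils_2021.sav
print(p.stem)    # pupils_2021
print(p.suffix)  # .sav
print(p.parent)  # Inputs/PupilData
```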

docs/training_resources/python/logging-and-error-handling.md (+10)

@@ -113,6 +113,8 @@ This is bad practice as instead of handling the specific errors the code could throw
 
 ```python
 try:
+    # Some problematic code that could raise different kinds of exceptions
+except ValueError as e:
     print('Found a value error!')
     print(repr(e))
     exit()
 
@@ -130,6 +132,8 @@ Alternatively if we really did want to handle all of those exceptions in the same way
 
 ```python
 try:
+    # Some problematic code that could raise different kinds of exceptions
+except (ValueError, ZeroDivisionError, KeyError) as e:
     print('Found an error!')
     print(repr(e))
     exit()
 
@@ -172,6 +176,8 @@ As a general rule of thumb avoid using the generic `Exception` class at all. It
 
 ```python
 try:
+    # Some problematic code that could raise different kinds of exceptions
+except Exception:
     print('Found an error!')
     exit()
 ```
 
@@ -205,6 +211,8 @@ def divide_two_numbers(a: float, b: float) -> float:
         print('Division failed because of: ' + repr(e))
         raise ZeroDivisionError
 
+# In use:
+a = 1.0
 b = 0
 try:
     result = divide_two_numbers(a, b)
 
@@ -223,6 +231,8 @@ Doing this raises a new ZeroDivisionError, which loses the stack trace of the original
 
 ```python
 except ZeroDivisionError:
+    # Do stuff
+    raise
 ```
 
 #### Don't let the program continue if it can't
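To make the page's point about re-raising concrete, here is a small runnable sketch (an editor's illustration, not part of the commit): catch the specific exception, log it, then use a bare `raise` so the original traceback is preserved.

```python
def divide_two_numbers(a: float, b: float) -> float:
    try:
        return a / b
    except ZeroDivisionError as e:
        print('Division failed because of: ' + repr(e))
        raise  # bare raise keeps the original stack trace

try:
    result = divide_two_numbers(1.0, 0)
except ZeroDivisionError:
    print('Cannot continue without a result - exiting')
    exit()
```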
