
Commit e3afb7e

Release v1.8.1

1 parent 3a695c4

10 files changed: +188 −15 lines

CONTRIBUTE.md (+1 −5)

@@ -43,12 +43,11 @@ Run the commands (or follow the MkDocs documentation to locally pip install MkDocs)
 # requirements.txt
 
 ## using pip
-python -m pip install -r requirements.txt
+pip install -r requirements.txt
 
 ## using Conda
 conda create --name <env_name> --file requirements.txt
 ```
-For best practices on creating virtual environments, please refer to the [RAP Community of Practice training resources](./docs/training_resources/python/virtual-environments/why-use-virtual-environments.md).
 
 ### Hosting
 
@@ -57,9 +56,6 @@ To host the website locally to view the live changes, run the command:
 ```bash
 mkdocs serve
 ```
-Open up http://127.0.0.1:8000/ in your browser, and you'll see the [RAP Community of Practice home page](https://nhsdigital.github.io/rap-community-of-practice/) being displayed with your updates applied.
-
-Read more: [Getting Started with MkDocs](https://www.mkdocs.org/getting-started/#getting-started-with-mkdocs)
 
 ### Editing the contents

docs/implementing_RAP/tools.md (+4)

@@ -235,8 +235,12 @@ To work with this integration, you must install the Jupyter package in your base environment
 You can then create and run Jupyter-like code cells, defined within Python code using a `# %%` comment:
 
 ```Python
+# %%
+msg = "Hello World"
 print(msg)
 
+# %%
+msg = "Hello again"
 print(msg)
 ```
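The `# %%` marker is all VS Code needs to treat the lines that follow as a runnable cell. As a slightly fuller sketch (illustrative only, not part of the commit), here is a script with two code cells and a rendered text cell; the `# %% [markdown]` variant is a VS Code convention beyond what the diff shows:

```Python
# %%
# first cell: runs on its own in the interactive window
msg = "Hello World"
print(msg)

# %% [markdown]
# Lines in a markdown cell render as formatted notes.

# %%
# second cell: can be re-run independently of the first
msg = "Hello again"
print(msg)
```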

docs/training_resources/pyspark/pyspark-style-guide.md (+6 −2)

@@ -237,14 +237,18 @@ We can create functions like in python, same logic applies, but instead using PySpark
 
 ```python
 def group_by_and_count_column(df: DataFrame, column_name: str):
-    “”
+    """
     Groups by the specified column, returns the count per grouping, and sorts in descending order.
 
     Args: dataset we are reading from & the column we wish to group by
 
     Returns: groups from column and count
-    “””
+    """
+    # group by the specified column and count the number of records per group
+    df_count = df.groupBy(column_name).count()
 
+    # sort the counts in descending order
+    result = df_count.sort(desc("count"))
 
     return result
 ```
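For context, a minimal usage sketch of the function above (an illustration, not part of the commit; the session setup and example data are hypothetical, and the imports the snippet relies on are shown explicitly):

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import desc  # used inside group_by_and_count_column

spark = SparkSession.builder.getOrCreate()

# hypothetical example data
df = spark.createDataFrame(
    [("A1", 10), ("A1", 20), ("B2", 30)],
    ["ccg_code", "value"],
)

result = group_by_and_count_column(df, "ccg_code")
result.show()
# +--------+-----+
# |ccg_code|count|
# +--------+-----+
# |      A1|    2|
# |      B2|    1|
# +--------+-----+
```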

docs/training_resources/python/basic-python-data-analysis-operations.md (+61)

@@ -28,6 +28,8 @@ df - dataframe
 ```py
 df = pd.read_csv('your_file.csv')
 
+# or if required to edit headers for example
+
 df = pd.read_csv('your_file.csv', header=..., na_values=..., sep=..., etc)
 ```
 
@@ -84,48 +86,74 @@ You will soon notice after importing your data from the .sav file that the columns
 
 ```py
 df.columns = df.columns.str.lower()
+# or
+df.columns = df.columns.str.upper()
 ```
 
 ### Extracting the required columns
 
 To select a column:
 
 ```py
+# columns to keep
+to_keep = ["column 1", "column 2", "column 3", ...]
 
+# create the new table
+filtered_df = df[to_keep]
 ```
 
 ### Filter where a variable is not null/missing
 
 To filter rows based on some specific criteria:
 
 ```py
+# not null
+new_df = df[df["my_column"].notnull()]
 ```
 
 ### Joins
 
 ```py
+# a left join, one column to join on
+joined_df = df.merge(other_df, how="left", on="my_column")
 
+# inner join, on multiple columns
+joined_df = df.merge(other_df, how="inner", on=["column 1", "column 2"])
 ```
 
 ### Add a new column
 
 ```py
+# create new table with a new column that adds 5 to each value of another selected column
+new_df = df.assign(new_column=df["my column"] + 5)
 ```
 
 ### Sorting variables
 
 ```py
+# ascending order can be False or True
+df.sort_values(by="my column", ascending=False)
 
+# if you want to see missing values first, assign na_position
+df.sort_values(by="my column", ascending=False, na_position="first")
 
+# sort by multiple columns
+df.sort_values(by=["my column 1", "my column 2", ...])
 ```
 
 ### Transposing columns
 
 There are a few ways to transpose columns:
 
 ```py
+# set the index of columns
+df.set_index(["my column 1", "my column 2", "my column 3", ...], inplace=True)
 
+# using pandas transpose to swap rows and columns
+df_transposed = df.T
 
+# using pandas stack() to transpose non-index columns into a single new column
+df = df.stack().reset_index()
 ```
 
 To set the name of the axis for the index or columns you can use `rename_axis()`:
 
@@ -137,10 +165,17 @@ df = df.stack().rename_axis().reset_index()
 ### Grouping by variables
 
 ```py
+# group by one column
+new_df = df.groupby("my_column")
+
+# group by multiple columns
 
 # list of columns to group by
 grouped = ["column 1", "column 2", "column 3", ...]
 
+# return new table with grouped columns
+new_df = df.groupby(grouped)
+
 ```
 
 ### Aggregations
 
@@ -154,8 +189,12 @@ new_df = df.groupby(grouped).agg(total_sum=("column to be summarised", "sum"), t
 ### Creating totals per row and per column
 
 ```py
+# total per column, adds a new row "Column Total"
+# this will sum all numeric row values for each column
 df.loc["Column Total"] = df.sum(numeric_only=True, axis=0)
 
+# total per row, creates a new column "Row Total"
+# this will sum all numeric column values for each row
 df.loc[:, "Row Total"] = df.sum(numeric_only=True, axis=1)
 ```
 
@@ -164,15 +203,23 @@ df.loc[:, "Row Total"] = df.sum(numeric_only=True, axis=1)
 When creating different aggregations/groupings which are saved in different dataframes, you can then combine these aggregations into one table. For example, suppose you have calculated the totals for age and gender in different dataframes and you wish to append these results to the final output dataframe.
 
 ```py
+# list the final output dataframe to store its aggregations
+list_df = [df]
 
+# append the calculated totals
+list_df.append(calc_totals_df)
 
+# concatenate into a single dataframe
+output_df = pd.concat(list_df, axis=0)
 ```
 
 ### Creating derivations
 
 To create a derivation based on the equivalent CASE WHEN SQL operation, there are several ways to do this in python:
 
 ```py
+# pandas package CASE WHEN
+# create the age 11 to 15 derivation
 df.loc[df["age"] < 0, "age11_15"] = df["age"]
 df.loc[(df["age"] > 0) & (df["age"] < 11), "age11_15"] = 11
 df.loc[(df["age"] > 10) & (df["age"] < 16), "age11_15"] = df["age"]
 
@@ -182,6 +229,8 @@ df.loc[df["age"] > 14, "age11_15"] = 15
 This results in creating a new column "age11_15" in the existing dataframe, based on the CASE WHEN conditions we applied for the new derivation.
 
 ```py
+# NumPy package CASE WHEN
+# create the age 11 to 15 derivation
 age11_15 = np.select(
     [
         df['age'] == 10,  # WHEN
 
@@ -194,6 +243,8 @@ age11_15 = np.select(
     default=df['age']  # ELSE assign "age" column values
 )
 
+# assign the result to a new column
+df["age11_15"] = age11_15
 ```
 
 In the first bracket you assign the "WHEN" part of the condition, second bracket the "THEN", and "default=..." represents the "ELSE" part.
 
@@ -203,14 +254,24 @@ The NumPy option is faster and more efficient whereas Pandas is user friendlier
 ### Apply a column order
 
 ```py
+# create a list of the column headers in a specific order
+column_order = ["column 1", "column 2", "column 3", ...]
 
+# apply list to dataframe
+df = df[column_order]
 ```
 
 ### Exporting the output
 
 ```py
+# write output to a .csv
+df.to_csv("output.csv", ... <multiple parameters that can be inserted>)
 
+# write output to an excel workbook
+df.to_excel("output.xlsx", sheet_name="Sheet_name_1", ... <multiple parameters that can be inserted>)
 
+# write multiple sheets from different dataframes
+with pd.ExcelWriter("output.xlsx") as writer:
     df1.to_excel(writer, sheet_name="Sheet_name_1")
     df2.to_excel(writer, sheet_name="Sheet_name_2")
 ```
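Read in order, the snippets this commit annotates are the steps of one small pipeline. As a closing illustration (added here for context, not part of the commit; the file, column, and derivation names are made up), the same operations chained together:

```py
import numpy as np
import pandas as pd

# import the data and normalise the headers
df = pd.read_csv("pupil_data.csv")
df.columns = df.columns.str.lower()

# keep only the rows and columns of interest
df = df[df["age"].notnull()]
df = df[["age", "gender", "score"]]

# derive an age band with the NumPy CASE WHEN
df["age11_15"] = np.select(
    [(df["age"] > 10) & (df["age"] < 16)],  # WHEN
    [df["age"]],                            # THEN
    default=np.nan,                         # ELSE
)

# aggregate and export
output_df = df.groupby("gender").agg(total_score=("score", "sum"))
output_df.to_csv("output.csv")
```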

docs/training_resources/python/config-files.md (+2)

@@ -188,6 +188,8 @@ elif config['report_type'] == 'monthly':
     df_report_data = get_monthly_data(config)
 ```
 
+## Over to you
+
 As you were reading through this, did any ideas pop into your head about your own projects? Any values you keep having (or forgetting!) to change? Any bits of code you sometimes need to comment out? If so, you've got a prime candidate for using a config file! So give it a try - implement the above steps in your project and see what you think.
 
 Good luck!
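The hunk context shows the pattern the page builds towards: behaviour switched by a config value rather than by edited code. A minimal sketch of the full round trip (an illustration, not text from the page; it assumes a YAML config read with PyYAML, and `get_weekly_data` / `get_monthly_data` are the hypothetical helpers named in the hunk):

```python
import yaml  # PyYAML

# config.yml might contain a single line:  report_type: monthly
with open("config.yml") as f:
    config = yaml.safe_load(f)

if config['report_type'] == 'weekly':
    df_report_data = get_weekly_data(config)
elif config['report_type'] == 'monthly':
    df_report_data = get_monthly_data(config)
```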

docs/training_resources/python/handling-file-paths.md (+10 −1)

@@ -56,15 +56,20 @@ operations
 For example, you can access the current working directory with the `cwd` attribute.
 
 ```python
+# Print the current working directory (cwd)
+print("CWD:", pathlib.Path.cwd())
 ```
 
 Pass strings to the Path constructor to create a Path object
 
 ```python
+# . is the current directory
+cwd_path = pathlib.Path(".")
 print("CWD (again):", cwd_path)
 
+# Use resolve to get the absolute path!
+cwd_abspath = cwd_path.resolve()
 print("Absolute CWD:", cwd_abspath)
-
 ```
 
 ### Path attributes
 
@@ -74,6 +79,8 @@ The following examples show how pathlib makes it easier to extract specific attributes
 #### Example: absolute path to the current file
 
 ```python
+# Note: __file__ is a global Python variable
+this_file_path = pathlib.Path(__file__)
 print("Path to file:", this_file_path)
 ```
 
@@ -155,6 +162,8 @@ import pandas as pd
 import pyreadstat  # needed to parse sav files in spss
 import pathlib2  # This is just a backwards compatible pathlib!
 
+# https://realpython.com/python-pathlib/
+
 # Add parameters
 BASE_DIR = pathlib2.Path(r"\\<path>\Publication\RAP")
 PUPIL_DIR = BASE_DIR / "Inputs" / "PupilData"
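As a taste of the path attributes the page goes on to cover (an illustrative sketch, not part of the commit; the file name is made up):

```python
import pathlib

p = pathlib.Path("Inputs") / "PupilData" / "pupils_2021.sav"

print(p.name)    # pupils_2021.sav
print(p.stem)    # pupils_2021
print(p.suffix)  # .sav
print(p.parent)  # Inputs/PupilData
```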

docs/training_resources/python/logging-and-error-handling.md (+10)

@@ -113,6 +113,8 @@ This is bad practice as instead of handling the specific errors the code could throw
 
 ```python
 try:
+    # Some problematic code that could raise different kinds of exceptions
+except ValueError as e:
     print('Found a value error!')
     print(repr(e))
     exit()
 
@@ -130,6 +132,8 @@ Alternatively if we really did want to handle all of those exceptions in the same way
 
 ```python
 try:
+    # Some problematic code that could raise different kinds of exceptions
+except (ValueError, ZeroDivisionError, KeyError) as e:
     print('Found an error!')
     print(repr(e))
     exit()
 
@@ -172,6 +176,8 @@ As a general rule of thumb avoid using the generic `Exception` class at all. It
 
 ```python
 try:
+    # Some problematic code that could raise different kinds of exceptions
+except Exception:
     print('Found an error!')
     exit()
 ```
 
@@ -205,6 +211,8 @@ def divide_two_numbers(a: float, b: float) -> float:
         print('Division failed because of: ' + repr(e))
         raise ZeroDivisionError
 
+# In use:
+a = 1.0
 b = 0
 try:
     result = divide_two_numbers(a, b)
 
@@ -223,6 +231,8 @@ Doing this raises a new ZeroDivisionError, which loses the stack trace of the original
 
 ```python
 except ZeroDivisionError:
+    # Do stuff
+    raise
 ```
 
 #### Don't let the program continue if it can't
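To make the page's point about re-raising concrete, here is a small runnable sketch (an editor's illustration, not part of the commit): catch the specific exception, log it, then use a bare `raise` so the original traceback is preserved.

```python
def divide_two_numbers(a: float, b: float) -> float:
    try:
        return a / b
    except ZeroDivisionError as e:
        print('Division failed because of: ' + repr(e))
        raise  # bare raise keeps the original stack trace

try:
    result = divide_two_numbers(1.0, 0)
except ZeroDivisionError:
    print('Cannot continue without a result - exiting')
    exit()
```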
