certification Python project from freeCodeCamp
-
using Python to explore the relationship between cardiac disease, body measurements, blood markers, and lifestyle choices.
-
visualize and make calculations from medical examination data using matplotlib, seaborn, and pandas. The dataset values were collected during medical examinations.
-
import libraries
libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
-
import data
read data
df = pd.read_csv("medical_examination.csv")
-
add 'overweight' column
overweight
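# body mass index: weight (kg) divided by the square of height (m)
df['overweight'] = df["weight"] / (df["height"] / 100) ** 2
# BMI above 25 becomes 1 (overweight), everything else becomes 0
df.loc[df["overweight"] > 25, "overweight"] = 1
df.loc[df["overweight"] != 1, "overweight"] = 0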
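The BMI rule can be checked on a few made-up rows. This is only a sketch: the `toy` frame and its values are invented for illustration and do not come from medical_examination.csv, and the single comparison with `astype(int)` is an alternative way to reach the same 0/1 flag.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the real dataset).
toy = pd.DataFrame({"height": [170, 160, 180], "weight": [60.0, 80.0, 75.0]})

# BMI = weight in kg divided by the square of height in metres.
bmi = toy["weight"] / (toy["height"] / 100) ** 2

# Flag BMI strictly above 25 as overweight (1), everything else as 0.
toy["overweight"] = (bmi > 25).astype(int)

print(toy)
```

Only the second row (80 kg at 160 cm, BMI about 31) is flagged here.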
-
normalize data by making 0 always good and 1 always bad. If the value of 'cholesterol' or 'gluc' is 1, make the value 0. If the value is more than 1, make the value 1.
normalize
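# assign the result back: calling replace(..., inplace=True) on a column
# selection may not update df in newer pandas versions
df["cholesterol"] = df["cholesterol"].replace({1: 0, 2: 1, 3: 1})
df["gluc"] = df["gluc"].replace({1: 0, 2: 1, 3: 1})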
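The normalization rule (1 becomes 0, anything above 1 becomes 1) can also be written as a comparison. A minimal sketch on an invented `toy` frame, assuming the columns only contain the codes 1, 2, and 3:

```python
import pandas as pd

# Made-up values for illustration; the codes 1, 2, 3 are assumed to be the only ones present.
toy = pd.DataFrame({"cholesterol": [1, 2, 3], "gluc": [3, 1, 2]})

for col in ["cholesterol", "gluc"]:
    # 1 (good) -> 0, anything above 1 (bad) -> 1
    toy[col] = (toy[col] > 1).astype(int)

print(toy)
```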
-
create categorical plot
draw_cat_plot
def draw_cat_plot():
    # create DataFrame for cat plot using `pd.melt` with just the values from
    # 'cholesterol', 'gluc', 'smoke', 'alco', 'active', and 'overweight'
    df_cat = df.copy(deep=True)
    df_cat = pd.melt(df_cat, id_vars="cardio", value_vars=["active", "alco", "cholesterol", "gluc", "overweight", "smoke"])

    # group and reformat the data to split it by 'cardio'; show the counts of each feature
    df_cat = df_cat.groupby(["cardio", "variable", "value"]).agg(total=("value", "count"))
    df_cat = pd.DataFrame(df_cat)
    df_cat.reset_index(inplace=True)

    # draw plot
    fig = sns.catplot(data=df_cat, x="variable", y="total", hue="value", col="cardio", kind="bar").fig

    fig.savefig('catplot.png')
    return fig
-
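To see what the melt and group steps of the categorical plot produce, here is a small sketch on an invented frame with two features. The `toy`, `long_form`, and `counts` names are hypothetical, and `size()` is used here as a simple way to count rows per group.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "cardio": [0, 0, 1, 1],
    "smoke": [0, 1, 1, 1],
    "active": [1, 1, 0, 1],
})

# Wide to long: one row per (patient, feature) pair, keeping 'cardio' as the id.
long_form = pd.melt(toy, id_vars="cardio", value_vars=["active", "smoke"])

# Count how often each feature value occurs within each 'cardio' group.
counts = (
    long_form.groupby(["cardio", "variable", "value"])
    .size()
    .reset_index(name="total")
)

print(counts)
```

The resulting frame has one row per observed (cardio, variable, value) combination, which is the shape `sns.catplot` needs for `x="variable"`, `hue="value"`, and `col="cardio"`.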
clean the data by filtering out the following patient segments that represent incorrect data:
criteria
- diastolic pressure is higher than systolic (Keep the correct data with (df['ap_lo'] <= df['ap_hi']))
- height is less than the 2.5th percentile (Keep the correct data with (df['height'] >= df['height'].quantile(0.025)))
- height is more than the 97.5th percentile
- weight is less than the 2.5th percentile
- weight is more than the 97.5th percentile
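The criteria above can be tried on a small invented frame. In this sketch the helper `within_percentiles` and the `toy` data are hypothetical; only the blood pressure and height conditions are shown, and weight would be handled the same way as height.

```python
import numpy as np
import pandas as pd

# Made-up data: 101 heights from 100 to 200 cm and constant blood pressure.
toy = pd.DataFrame({
    "height": np.arange(100, 201),
    "ap_hi": 120,
    "ap_lo": 80,
})
# One deliberately wrong row: diastolic higher than systolic.
toy.loc[50, "ap_lo"] = 130


def within_percentiles(s, low=0.025, high=0.975):
    """Boolean mask of values between the given percentiles (inclusive)."""
    return (s >= s.quantile(low)) & (s <= s.quantile(high))


# All conditions are evaluated on the unfiltered frame, then combined with &.
keep = (toy["ap_lo"] <= toy["ap_hi"]) & within_percentiles(toy["height"])
clean = toy[keep]

print(len(toy), "->", len(clean))
```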
draw_heat_map
def draw_heat_map():
    # clean the data
    df_heat = df.copy(deep=True)
    df_heat = df_heat[
        (df_heat['ap_lo'] <= df_heat['ap_hi'])
        & (df_heat['height'] >= df_heat['height'].quantile(0.025))
        & (df_heat['height'] <= df_heat['height'].quantile(0.975))
        & (df_heat['weight'] >= df_heat['weight'].quantile(0.025))
        & (df_heat['weight'] <= df_heat['weight'].quantile(0.975))
    ]

    # correlation matrix
    corr = df_heat.corr(method="pearson")

    # boolean mask for the upper triangle of the heat map; built from the shape
    # so that cells with a correlation of exactly 0 are masked as well
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # draw heat map
    fig, ax = plt.subplots(figsize=(12, 12))
    ax = sns.heatmap(data=corr, mask=mask, annot=True, cmap="cubehelix", fmt=".1f", annot_kws={"fontsize": 8}, linewidths=1)
    fig.savefig('heatmap.png')
    return fig
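A note on the upper-triangle mask: a mask built directly from the correlation values, as in `np.triu(corr)`, leaves any cell whose correlation is exactly 0 unmasked, whereas a boolean mask built from the matrix shape hides the whole triangle regardless of the values. A minimal sketch on a made-up correlation matrix (the labels `a`, `b`, `c` and the numbers are invented):

```python
import numpy as np
import pandas as pd

# Made-up 3x3 correlation matrix with an exact 0.0 between 'a' and 'b'.
corr = pd.DataFrame(
    [[1.0, 0.0, 0.3], [0.0, 1.0, -0.2], [0.3, -0.2, 1.0]],
    index=list("abc"),
    columns=list("abc"),
)

# Mask derived from the values: zero entries count as False (not masked).
value_mask = np.triu(corr).astype(bool)

# Mask derived from the shape only: every upper-triangle cell is True (masked).
bool_mask = np.triu(np.ones_like(corr, dtype=bool))

print(value_mask)
print(bool_mask)
```

Cells where the mask is True are hidden by `sns.heatmap`, so the second form keeps only the lower triangle below the diagonal.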
-
put them together in the medical_data_visualizer.py file
-
call via main.py