Certification Python project from freeCodeCamp.
This project uses Python to explore the relationship between cardiac disease, body measurements, blood markers, and lifestyle choices.
It visualizes and makes calculations from medical examination data using matplotlib, seaborn, and pandas. The dataset values were collected during medical examinations.
import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import data
read data
df = pd.read_csv("medical_examination.csv")
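add 'overweight' column: compute BMI as weight divided by the square of height in metres (the height column is divided by 100, which treats it as centimetres). A BMI above 25 is marked 1 (overweight), anything else 0.
bmi = df["weight"] / (df["height"] / 100) ** 2
df["overweight"] = (bmi > 25).astype(int)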
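As a quick check of the BMI rule above, here is a minimal sketch on a small made-up DataFrame. The three rows are hypothetical and not taken from medical_examination.csv, and the units (height in cm, weight in kg) are an assumption based on the division by 100.

```python
import pandas as pd

# Hypothetical sample rows; heights in cm and weights in kg (assumed units).
sample = pd.DataFrame({"height": [170, 160, 180], "weight": [60.0, 80.0, 75.0]})

# BMI = weight divided by the square of height in metres.
bmi = sample["weight"] / (sample["height"] / 100) ** 2

# 1 marks a BMI above 25, 0 marks everything else.
sample["overweight"] = (bmi > 25).astype(int)

print(sample)
```

Only the second row (160 cm, 80 kg) has a BMI above 25, so it is the only one flagged.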
normalize data by making 0 always good and 1 always bad. If the value of 'cholesterol' or 'gluc' is 1, make the value 0. If the value is more than 1, make the value 1.
df["cholesterol"] = df["cholesterol"].replace({1: 0, 2: 1, 3: 1})
df["gluc"] = df["gluc"].replace({1: 0, 2: 1, 3: 1})
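The same normalization rule can also be written as a comparison instead of a lookup table. This is a minimal sketch on hypothetical values, not the project's own code; it assumes the two columns hold only the integer levels 1, 2, and 3.

```python
import pandas as pd

# Hypothetical values covering each level of both columns.
sample = pd.DataFrame({"cholesterol": [1, 2, 3], "gluc": [3, 1, 2]})

for column in ["cholesterol", "gluc"]:
    # A value of 1 becomes 0 (good); any value above 1 becomes 1 (bad).
    sample[column] = (sample[column] > 1).astype(int)

print(sample)
```

Assigning the result back to the column, rather than calling `replace(..., inplace=True)` on a selected column, avoids depending on whether the selection is a view or a copy.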
create categorical plot
def draw_cat_plot():
    # Create a DataFrame for the cat plot using `pd.melt`, with just the values from
    # 'cholesterol', 'gluc', 'smoke', 'alco', 'active', and 'overweight'.
    df_cat = df.copy(deep=True)
    df_cat = pd.melt(df_cat, id_vars="cardio", value_vars=["active", "alco", "cholesterol", "gluc", "overweight", "smoke"])

    # Group and reformat the data to split it by 'cardio'. Show the counts of each feature.
    df_cat = df_cat.groupby(["cardio", "variable", "value"]).agg(total=("value", "count"))
    df_cat = df_cat.reset_index()

    # Draw the plot.
    fig = sns.catplot(data=df_cat, x="variable", y="total", hue="value", col="cardio", kind="bar").fig

    fig.savefig('catplot.png')
    return fig
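To make the reshaping step above easier to follow, here is a minimal sketch of the melt-then-count pattern on a tiny hypothetical table with only two features. It uses `size()` as one possible way to count rows, which is an alternative to the `agg` call above rather than the project's own method, and it skips the plotting.

```python
import pandas as pd

# Hypothetical wide-format rows: two features plus the 'cardio' outcome.
wide = pd.DataFrame({
    "cardio": [0, 0, 1, 1],
    "smoke": [0, 1, 1, 1],
    "active": [1, 1, 0, 1],
})

# Wide to long: one row per (patient, feature) pair.
long_df = pd.melt(wide, id_vars="cardio", value_vars=["active", "smoke"])

# Count the rows for each observed (cardio, variable, value) combination.
counts = (
    long_df.groupby(["cardio", "variable", "value"])
    .size()
    .reset_index(name="total")
)

print(counts)
```

The resulting `counts` frame has one row per combination, which is the shape `sns.catplot` needs for `x="variable"`, `y="total"`, `hue="value"`, and `col="cardio"`.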
clean the data by filtering out the following patient segments that represent incorrect data:
- diastolic pressure is higher than systolic (Keep the correct data with (df['ap_lo'] <= df['ap_hi']))
- height is less than the 2.5th percentile (Keep the correct data with (df['height'] >= df['height'].quantile(0.025)))
- height is more than the 97.5th percentile
- weight is less than the 2.5th percentile
- weight is more than the 97.5th percentile
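The filtering rules listed above can be sketched on a few hypothetical rows. This example only covers the blood pressure and height rules; the weight rules follow the same pattern. The values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; the second one has diastolic above systolic pressure.
sample = pd.DataFrame({
    "ap_hi": [120, 80, 130, 110],
    "ap_lo": [80, 120, 85, 70],
    "height": [165, 170, 175, 160],
})

# Keep rows where diastolic pressure does not exceed systolic pressure.
valid_pressure = sample["ap_lo"] <= sample["ap_hi"]

# Keep rows whose height lies between the 2.5th and 97.5th percentiles.
low, high = sample["height"].quantile([0.025, 0.975])
valid_height = sample["height"].between(low, high)

cleaned = sample[valid_pressure & valid_height]
print(cleaned)
```

The percentiles are computed on the unfiltered column before any rows are dropped, because all conditions are evaluated first and combined with `&` into a single boolean mask.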
def draw_heat_map():
    df_heat = df.copy(deep=True)
    df_heat = df_heat[
        (df_heat['ap_lo'] <= df_heat['ap_hi'])
        & (df_heat['height'] >= df_heat['height'].quantile(0.025))
        & (df_heat['height'] <= df_heat['height'].quantile(0.975))
        & (df_heat['weight'] >= df_heat['weight'].quantile(0.025))
        & (df_heat['weight'] <= df_heat['weight'].quantile(0.975))
    ]

    # Correlation matrix.
    corr = df_heat.corr(method="pearson")

    # Boolean mask for the upper triangle of the heat map (True hides a cell).
    # Building it from ones, not from corr itself, also hides cells whose correlation is exactly 0.
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Draw the heat map.
    fig, ax = plt.subplots(figsize=(12, 12))
    ax = sns.heatmap(data=corr, mask=mask, annot=True, cmap="cubehelix", fmt=".1f", annot_kws={"fontsize": 8}, linewidths=1)
    fig.savefig('heatmap.png')
    return fig
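To show what the upper-triangle mask does, here is a minimal sketch on a small hypothetical correlation matrix. The three columns `a`, `b`, and `c` are made up, and the example inspects the mask directly instead of drawing a plot.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data, used only to produce a 3x3 correlation matrix.
sample = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 1, 4, 3], "c": [4, 3, 2, 1]})
corr = sample.corr(method="pearson")

# True marks the cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Cells that would stay visible in the heat map (strictly lower triangle).
visible = corr.mask(mask)

print(mask)
print(visible)
```

Because a correlation matrix is symmetric, hiding the diagonal and the upper triangle removes only repeated or trivial values.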
put both functions together in the medical_data_visualizer.py file
call them via main.py