Skip to content

Utilizing classification models and ensemble methods to achieve high accuracy in determining if an individual has heart disease or not.

Notifications You must be signed in to change notification settings

mkosaka1/HeartDisease_Classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Heart Disease Classification

Cardiovascular disease or heart disease is the leading cause of death amongst women and men and amongst most racial/ethnic groups in the United States. Heart disease describes a range of conditions that affect your heart. Diseases under the heart disease umbrella include blood vessel diseases, such as coronary artery disease. From the CDC, roughly every 1 in 4 deaths each year are due to heart disease. The WHO states that human life style is the main reason behind this heart problem. Apart from this there are many key factors which warns that the person may/may not getting chance of heart disease.

Using UCI’s heart disease dataset found here, I build classification models and used ensemble methods to produce a highly accurate model. In the current dataset, publications of this study included 303 patients and chose 14 out of 76 features that are relevant in predicting heart disease.

Variable Descriptions

  • age: age in years
  • sex: sex (1 = male; 0 = female)
  • cp: chest pain type — Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  • chol: serum cholestoral in mg/dl
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg: resting electrocardiographic results- Value 0: normal, Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • thalach: maximum heart rate achieved
  • exang: exercise induced angina (1 = yes; 0 = no)
  • oldpeak = ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment- Value 1: upsloping, Value 2: flat, Value 3: downsloping
  • ca: number of major vessels (0–3) colored by flourosopy
  • thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
  • target: 1 = disease, 0 = no disease

Variable Types

  • Continuous — age, trestbps, chol, thalach, oldpeak
  • Binary — sex, fbs, exang, target
  • Categorical — cp, restecg, slope, ca, thal

Exploratory Data Analysis (code found in 1. EDA.ipynb)

There is a slight class imbalance, but not severe enough to require upsampling/downsampling methods.

Now let’s see how fasting blood sugar (fbs) varies amongst the target variable. Fbs is a diabetes indicator with fbs >120 mg/d is considered diabetic (True):

Here, we see that the number for class true, is lower compared to class false. This provides an indication that fbs might not be a strong feature differentiating between heart disease an non-disease patient.

Next, we'll look at ca which is the number of major blood vessels (0-3) colored by a flourosopy.

We can clearly see that a large proportion of those with heart disease were likely to have 0 major blood vessels to the heart.

Classification (code found in 2. Classification.ipynb

A simple dummy classifier resulted in an accuracy score of 49.18%.

Next I took different machine learning models to find which algorithm has the best accuracy score.

Then I selected the top 3 performing classification models (SVC,Logistic Regression, and Naive Bayes)and inputed them into a StackingCVClassifier. This resulted in an accuracy score of ~92%, outperforming all individual classification models.

Variable ca was the most important feature in predicting the target variable

Grid Search + Ensemble

Using grid search methods on all individual classification models then utilizing the Stacking CVClassifier resulted in the same ensemble accuracy without using grid search.

Summary

From this classification project, we can see that using a stacking cv classifier improves the overall accuracy compared to using individual classification models. Our base model performed at ~49% while our ensemble model performed at ~92% accuracy. Using the grid search plus ensemble methods did not improve the overall accuracy.

About

Utilizing classification models and ensemble methods to achieve high accuracy in determining if an individual has heart disease or not.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published