---
title: "Using a Machine Learning Ensemble to Make Breast Cancer Predictions"
author: "Jarred Priester"
date: "1/11/2022"
output:
  pdf_document: default
  html_document: default
---
1. Overview
+ 1.1 description of dataset
+ 1.2 goal of project
+ 1.3 steps to achieve goal
2. Data Cleaning
+ 2.1 downloading the data
+ 2.2 missing information
+ 2.3 scaling the matrix
3. Exploratory Data Analysis
+ 3.1 exploring the data
+ 3.2 visualization
4. Models
+ 4.1 setting up the models
+ 4.2 logistic regression
+ 4.3 random forest
+ 4.4 K nearest neighbors
+ 4.5 linear discriminant analysis
+ 4.6 Neural Network
+ 4.7 Ensemble
5. Results
+ 5.1 results table
+ 5.2 best model
6. Conclusion
+ 6.1 summary
+ 6.2 limitations
+ 6.3 future work
# 1. Overview
Something that we would all like to eradicate from our society is cancer. Unfortunately, cancer has affected our lives far too often. Thankfully, cancer research has advanced considerably in recent decades, thanks in large part to advances in technology and, in particular, machine learning. Hopefully this project will shine some light on the framework of how machine learning can be used to further cancer research.
# 1.1 Description of dataset
The Breast Cancer Wisconsin (Diagnostic) Data Set is a popular data set from the University of California Irvine Machine Learning Repository. The data set consists of 569 rows and 32 columns. Each row represents a tumor sample and each column an attribute of that sample (an ID, the diagnosis, and 30 measured features); more details are below.
The following is from UCI's Machine Learning Repository website:
*Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.*
*Separating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.*
*The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].*
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32)
Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
# 1.2 Goal of the project
The goal of this project is to create a model that can classify the given tumor samples as malignant or benign. The metric we will use to determine success is the F1 score, and the target is a model that achieves an F1 score of 0.9 or higher.
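For reference, the F1 score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$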
# 1.3 Steps to achieve this goal
To achieve this goal we will first download, clean, and analyze the data set. We will then run five different algorithms, each producing binary classification predictions of malignant or benign. We will then combine them into an ensemble that takes a majority vote of the five predictions for each sample. Lastly, we will create a table of results and find the model with the highest F1 score.
# 2. Data Cleaning
# 2.1 downloading the data
```{r,results='hide',message=FALSE}
#installing required packages if not previously installed
if(!require(matrixStats)) install.packages("matrixStats")
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(caret)) install.packages("caret")
if(!require(dslabs)) install.packages("dslabs")
if(!require(dplyr)) install.packages("dplyr")
if(!require(tidyr)) install.packages("tidyr")
if(!require(ggthemes)) install.packages("ggthemes")
if(!require(knitr)) install.packages("knitr")
#setting digits to 3 places
options(digits = 3)
#loading the libraries
library(matrixStats)
library(tidyverse)
library(caret)
library(dslabs)
library(dplyr)
library(tidyr)
library(ggthemes)
library(knitr)
#loading the data from the dslabs package
data(brca)
```
The data come as a list with two elements: a matrix of predictors, `x`, and a factor of outcomes, `y`. Let's take a look at the size of each.
```{r}
dim(brca$x)
```
```{r}
#y is a vector, so we check its length rather than its dimensions
length(brca$y)
head(brca$y)
```
taking a look at the brca$x data
```{r}
head(brca$x)
```
changing brca$x to just x
```{r}
x <- brca$x
```
changing brca$y to just y
```{r}
y <- brca$y
```
taking a look at the variables in x
```{r}
colnames(x)
```
structure of x
```{r}
str(x)
```
summary of x
```{r}
summary(x)
```
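To get a quick sense of how different those scales are (a small illustrative addition, using `colSds()` from the matrixStats package loaded above), we can compare the smallest and largest column standard deviations:

```{r}
#smallest and largest column standard deviations in x
range(colSds(x))
```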
# 2.2 missing information
taking a look to see if there are any NAs or blank cells
```{r}
colSums(is.na(x))
```
```{r}
sum(x == "")
```
There is no missing information so we now move on to the next step.
# 2.3 scaling the matrix
After looking at the summary of x we can see that the features do not all have the same ranges; in fact, some ranges are far larger than others. To keep any one feature from having an outsized influence on the models, we will now scale the matrix.
```{r}
x_centered <- sweep(x, 2, colMeans(x))
x_scaled <- sweep(x_centered, 2, colSds(x), FUN = "/")
```
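As a side note, base R's `scale()` should produce the same centered and scaled matrix, which makes for a quick sanity check (a sketch, assuming both use the n-1 standard deviation):

```{r}
#scale() centers each column by its mean and divides by its standard deviation
x_scaled_alt <- scale(x)
#comparing values only, since scale() attaches extra attributes
all.equal(as.numeric(x_scaled_alt), as.numeric(x_scaled))
```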
checking the first column's standard deviation, which should now be exactly 1 since we scaled the matrix
```{r}
sd(x_scaled[,1])
```
checking the first column's median value, should be close to 0 after scaling
```{r}
median(x_scaled[,1])
```
# 3. Exploratory Data Analysis
# 3.1 exploring the data
Are our outcomes balanced?
They are not: around 2/3 of the samples are benign (not cancerous).
```{r}
mean(y == "M")
mean(y == "B")
```
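For the raw counts behind those proportions:

```{r}
table(y)
```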
# 3.2 Visualization
Next we will create a heatmap in order to visualize how the features correlate with each other, if at all.
```{r}
d_features <- dist(t(x_scaled))
heatmap(as.matrix(d_features), labRow = NA, labCol = NA)
```
We can see that there is correlation throughout the data set, which offers some promise that we will be able to classify the data accurately.
Hierarchical clustering
```{r}
h <- hclust(d_features)
groups <- cutree(h, k = 5)
groups
split(names(groups), groups)
plot(h)
```
PCA: proportion of variance explained by each principal component
```{r}
pc <- prcomp(x_scaled)
summary(pc)
```
Plot of the first two principal components with color representing tumor type
```{r}
#(benign/malignant)
data.frame(pc$x[,1:2], tumor = brca$y) %>%
  ggplot(aes(PC1, PC2, fill = tumor, color = tumor)) +
  geom_point() +
  labs(title = "first two principal components with color representing tumor type") +
  theme_economist()
```
We can see a clear separation between the first two components by tumor type. This tells us that we should be able to classify this data into malignant and benign with high accuracy.
plot showing the density of the first principal component
```{r}
data.frame(pc$x[,1:2], tumor = brca$y) %>%
  ggplot(aes(PC1, fill = tumor)) +
  geom_density() +
  labs(title = "first principal component density with color representing tumor type") +
  theme_economist()
```
Same information as the scatter plot, but this time as a density plot. Again, you can see the separation between the two tumor types.
boxplot of first ten principal components
```{r}
data.frame(type = brca$y, pc$x[,1:10]) %>%
  gather(key = "PC", value = "value", -type) %>%
  ggplot(aes(PC, value, fill = type)) +
  geom_boxplot() +
  ggtitle("boxplot of first ten principal components") +
  theme_economist()
```
Here we can see that, for the first principal component, the malignant and benign interquartile ranges do not overlap, meaning there is separation in the data. That separation will help the models classify the data.
# 4. Models
# 4.1 setting up the models
Creating the training and test sets
```{r,message=FALSE,warning=FALSE}
# set.seed(1) if using R 3.5 or earlier
set.seed(30, sample.kind = "Rounding") # if using R 3.6 or later
test_index <- createDataPartition(brca$y, times = 1, p = 0.2, list = FALSE)
test_x <- x_scaled[test_index,]
test_y <- y[test_index]
train_x <- x_scaled[-test_index,]
train_y <- y[-test_index]
```
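As a quick sanity check (an illustrative addition, not part of the original analysis), `createDataPartition` with p = 0.2 should put roughly 20% of the samples in the test set:

```{r}
#rows in the training and test sets
nrow(train_x)
nrow(test_x)
```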
What proportion of the training set is benign?
```{r}
mean(train_y == "B")
```
What proportion of the test set is benign?
```{r}
mean(test_y == "B")
```
We will be using k-fold cross-validation on all of the algorithms.
creating the cross-validation parameters (k = 10)
```{r,message=FALSE,warning=FALSE}
set.seed(30, sample.kind = "Rounding")
control <- trainControl(method = "cv", number = 10, p = .9)
```
# 4.2 logistic regression
training the model using the training set
```{r,message=FALSE,warning=FALSE,results='hide'}
set.seed(9, sample.kind = "Rounding")
train_glm <- train(train_x, as.factor(train_y),
                   method = "glm",
                   family = "binomial",
                   trControl = control)
```
creating the predictions
```{r}
glm_preds <- predict(train_glm, test_x)
```
creating a confusion matrix
```{r}
cm_glm <- confusionMatrix(glm_preds,test_y, positive = "M")
cm_glm
```
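The confusion matrix output above is verbose; since our success metric is the F1 score, we can also pull it directly out of caret's `byClass` vector:

```{r}
cm_glm$byClass["F1"]
```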
# 4.3 random forest
training the model using the training set
```{r,message=FALSE,warning=FALSE,results='hide'}
set.seed(9, sample.kind = "Rounding")
train_rf <- train(train_x, train_y,
                  method = "rf",
                  #mtry cannot exceed the number of predictors (30)
                  tuneGrid = data.frame(mtry = seq(2, 30, 2)),
                  importance = TRUE,
                  trControl = control)
```
best tune
```{r}
train_rf$bestTune
```
plot of training results
```{r,warning=FALSE}
plot(train_rf)
```
predictions
```{r}
rf_preds <- predict(train_rf, test_x)
```
variable importance
```{r,warning=FALSE,message=FALSE}
varImp(train_rf)
```
creating a confusion matrix
```{r,warning=FALSE,message=FALSE}
cm_rf <- confusionMatrix(rf_preds, test_y, positive = "M")
cm_rf
```
# 4.4 K Nearest Neighbors
setting up the tuning parameters
```{r,message=FALSE,warning=FALSE}
set.seed(7, sample.kind = "Rounding")
tuning <- data.frame(k = seq(1, 20, 1))
```
training the model
```{r,message=FALSE,warning=FALSE,results='hide'}
train_knn <- train(train_x, train_y,
                   method = "knn",
                   tuneGrid = tuning,
                   trControl = control)
```
best tune
```{r,warning=FALSE,message=FALSE}
train_knn$bestTune
```
plot of training model results
```{r,message=FALSE,warning=FALSE}
plot(train_knn)
```
predictions
```{r}
knn_preds <- predict(train_knn, test_x)
```
creating a confusion matrix
```{r,warning=FALSE}
cm_knn <- confusionMatrix(knn_preds, test_y, positive = "M")
cm_knn
```
# 4.5 Linear discriminant analysis
training the model using the training set
```{r,message=FALSE,warning=FALSE,results='hide'}
set.seed(7, sample.kind = "Rounding")
train_lda <- train(train_x, train_y,
                   method = "lda",
                   trControl = control)
```
predictions
```{r}
lda_preds <- predict(train_lda, test_x)
```
creating a confusion matrix
```{r,message=FALSE,warning=FALSE}
cm_LDA <- confusionMatrix(lda_preds, test_y,
positive = "M")
cm_LDA
```
# 4.6 Neural Network
setting the tuning parameters, size and decay
```{r,message=FALSE, warning=FALSE}
set.seed(7, sample.kind = "Rounding")
#note that the 10 decay values are recycled across the 100 size values
tuning <- data.frame(size = seq(100), decay = seq(.01, 1, .1))
```
training the model on the train set
```{r,message=FALSE,warning=FALSE,results='hide'}
train_nn <- train(train_x, train_y,
                  method = "nnet",
                  tuneGrid = tuning,
                  trControl = control)
```
creating a graph for the tuning results
```{r,message=FALSE,warning=FALSE}
ggplot(train_nn, highlight = TRUE) +
ggtitle("Neural Network")
```
best tune
```{r}
train_nn$bestTune
```
creating predictions
```{r}
nn_preds <- predict(train_nn, test_x)
```
creating a confusion matrix
```{r,message=FALSE,warning=FALSE}
cm_nn <- confusionMatrix(nn_preds, test_y, positive = "M")
```
viewing the confusion matrix
```{r,warning=FALSE}
cm_nn
```
# 4.7 Ensemble
creating a data frame of the prediction results of all the models
```{r}
preds <- data.frame(log_r = glm_preds,
                    rf = rf_preds,
                    knn = knn_preds,
                    lda = lda_preds,
                    nn = nn_preds)
preds
```
Now that we have a data frame with all the predictions, we will take the mode of each row (the majority vote of the five models) and use that as the ensemble's prediction for each sample. With five models and two classes, a tie is impossible.
```{r}
ensemble <- apply(preds,1,function(x) names(which.max(table(x))))
#converting the results to a factor
ensemble <- as.factor(ensemble)
```
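An equivalent way to compute the majority vote (a sketch for illustration, reusing the `preds` data frame above) is to count the malignant votes directly; with five models, three or more "M" votes means malignant:

```{r}
#count "M" votes per sample across the five models
m_votes <- rowSums(preds == "M")
ensemble_alt <- factor(ifelse(m_votes >= 3, "M", "B"), levels = levels(test_y))
#should agree with the table-based ensemble above
all(ensemble_alt == ensemble)
```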
creating a confusion matrix
```{r}
cm_ensemble <- confusionMatrix(ensemble,
                               test_y,
                               positive = "M")
cm_ensemble
```
# 5. Results
# 5.1 Results table
```{r}
cm_list <- list(log_r = cm_glm,
                rf = cm_rf,
                knn = cm_knn,
                lda = cm_LDA,
                nn = cm_nn,
                ensemble = cm_ensemble)
cm_results <- sapply(cm_list, function(x) x$byClass)
results_table <- kable(cm_results)
results_table
```
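The table above shows the per-class statistics from `byClass`. As a supplementary view, each confusion matrix also carries an overall accuracy that we can extract the same way:

```{r}
#overall accuracy for each model
sapply(cm_list, function(x) x$overall["Accuracy"])
```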
# 5.2 Best model
Which model had the highest Sensitivity?
```{r}
which.max(cm_results[1,])
```
Which model had the highest Specificity?
```{r}
which.max(cm_results[2,])
```
Which model had the highest F1 Score?
```{r}
which.max(cm_results[7,])
```
Which model had the highest Balanced Accuracy?
```{r}
which.max(cm_results[11,])
```
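To see these four metrics side by side rather than one `which.max` at a time (a convenience view using the row names caret assigns in `byClass`):

```{r}
kable(cm_results[c("Sensitivity", "Specificity", "F1", "Balanced Accuracy"), ])
```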
KNN is our best model by multiple performance measures
```{r}
cm_knn
```
# 6. Conclusion
# 6.1 summary
We were able to create six different models that classify the data set into malignant and benign, including an ensemble that combined the results of the first five models. Out of the six, the best performer was the K Nearest Neighbors model, with an F1 score of ***0.977***.
# 6.2 limitations
The main limitation of this project is the small size of the data set. Would these models maintain such high accuracy on a much larger data set?
# 6.3 future work
In my opinion this is a great starting place for predicting whether or not tumor samples are cancerous. In order to build on this model we would need to add tens of thousands more samples and possibly more features. There may be other factors that doctors and researchers have found to be important, such as family medical history, age, or drug and alcohol use; those might be relevant features to add to the data set. All in all, this model is a great starting point for continued breast cancer research.