mohammadmehdikeramati/A_data_science_project_from_the_scratch
Problem description (work in progress)

In this project, the objective is to find which of the factors recorded in our dataset (such as 'Eligibility', 'Participant age', 'Participant gender', etc.) most strongly affect the 'quality score'. A questionnaire was designed by an Iranian medical research institute (a cohort study) to collect data from their study group. The questionnaires were filled in during interviews, and at the end of each interview a score was assigned to describe the quality of the collected data. The researchers believe that factors such as the participants' education level and ethnicity, as well as the interviewers' work experience, age, and so on, affect the quality score of the collected data, but they do not know which factor or factors matter most. Hence, we investigate this question step by step, from early and simple data analysis methods such as visualization to more advanced approaches such as designing a neural network to find correlations. It is worth mentioning that this dataset is accessible only by emailing me (m.mehdi_kr@outlook.com).

Feature selection based on correlation matrix

As a first stage, to build intuition, we used both Pearson and Spearman correlation to detect the relationship between each factor and our target. The fundamental difference between the two coefficients is that the Pearson coefficient captures a linear relationship between two variables, whereas the Spearman coefficient captures monotonic relationships as well. The figures below illustrate this:

[Figures: States 1–5, example scatter plots contrasting linear and monotonic relationships]
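As a minimal sketch of this distinction (the DataFrame `df` and the commented column usage below are assumptions for illustration, not part of the repository's scripts), Spearman scores a monotonic nonlinear relationship as a perfect 1 while Pearson does not:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(1, 10, 100)
y = np.exp(x)  # monotonic but strongly nonlinear

print(pearsonr(x, y)[0])   # noticeably below 1: linearity is violated
print(spearmanr(x, y)[0])  # exactly 1: the relationship is monotonic

# On the real dataset, both matrices can be computed at once with pandas:
# pearson_corr = df.corr(method='pearson')
# spearman_corr = df.corr(method='spearman')
```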

The results of applying these two correlation metrics are shown below (the first is the Pearson and the second the Spearman correlation). The code for this part is uploaded as 'Correlation matrix':

[Figures: Pearson correlation matrix; Spearman correlation matrix]

The caveat with these metrics is that they are not suitable for detecting nonlinear relationships (the figure below makes this clearer). Therefore, to detect any nonlinearity we have to employ stronger strategies.

[Figure: State 6, an example of a nonlinear relationship]

Feature selection based on visualization

Visualizing each pair of a factor and our target makes the values in the previous step's matrices more interpretable. Indeed, if you compare the calculated correlation values between 'Month' and 'Quality Score' with their visualization, you can understand why both the Pearson and Spearman correlation values of this pair are much higher than those of the other pairs. The script for this part is uploaded as 'Visualization'.

[Figures: each feature plotted against Quality Score — Eligible, Intervention group, Interviewer Ethnicity, Interviewer Gender, Participant ethnicity, Participant gender, Village StateCode, Interviewer Education, City code, Village Cluster, Village Participant Number, Participant age, Interviewer WorkExperience, Interviewer_age, FU Month]

From the observations in this step, it can be concluded that the features fall into two main groups: numerical features (such as 'Interviewer_WorkExperience', 'Interviewer_age', and 'FU_Month') and categorical features (such as 'Eligible', 'Participant_gender', 'Participant_ethnicity', 'Intervention_group', 'City_code', 'Village_Cluster', 'Village_StateCode', 'Interviewer_Gender', 'Interviewer_Education', and 'Interviewer_Ethnicity'). It is therefore better to investigate each group separately.

Categorical features

The correlation results show that the relationship (based on Pearson and Spearman) between each categorical feature and our target is very weak. Closer investigation of the figures from the previous step shows that this is because the variation across classes within each feature is almost the same; in fact, the per-class means within each feature are really close to each other. A good strategy for improving the class variation of the categorical features is therefore to combine them with each other: the idea is to combine small differences into more noticeable variation. As an example, the combination of the three interviewer-related features ('Interviewer_Gender', 'Interviewer_Education', 'Interviewer_Ethnicity') is proposed in the picture below. Comparing the combined features' figure with each feature's figure separately shows the created variation clearly.

At first, I tried to create all possible combinations of these features: all possible ways of choosing two features from the ten and combining them, all possible ways of choosing three from the ten, all possible ways of choosing four, and so on. I applied this strategy for combinations of two, three, four, and five features and found it is not an efficient way to proceed: not only does inspecting all the combinations' figures take a huge amount of time, it is also nearly impossible because the variation of many combinations is really close, and choosing between them is not feasible. More importantly, combining features in this manner increases the number of classes exponentially, especially from four features onward. The code for this section is uploaded as 'Combining features and visualization'.
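A minimal sketch of this brute-force strategy (assuming a pandas DataFrame `df` holding the dataset; the helper `combine` is a hypothetical name, not a function from the repository's scripts):

```python
from itertools import combinations
import pandas as pd

categorical_cols = ['Eligible', 'Participant_gender', 'Participant_ethnicity',
                    'Intervention_group', 'City_code', 'Village_Cluster',
                    'Village_StateCode', 'Interviewer_Gender',
                    'Interviewer_Education', 'Interviewer_Ethnicity']

def combine(df, cols):
    """Merge the chosen columns into one composite class label per row."""
    return df[list(cols)].astype(str).agg('_'.join, axis=1)

# This is what explodes: C(10,2) + C(10,3) + C(10,4) + C(10,5)
# = 45 + 120 + 210 + 252 = 627 composite features to inspect.
for k in range(2, 6):
    for cols in combinations(categorical_cols, k):
        composite = combine(df, cols)  # one figure per composite feature
```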

Because of these issues, we have to find a way to achieve maximum variation while combining the fewest features. Hence, the 'best variation creation' method and the 'wisely combining features' method were proposed. In the former, the features with the biggest differences between their classes' means are eligible to be chosen for combination (a sketch of how this ranking can be computed follows the list). The eligibility list is:

1- City_code: it has three classes and the pairwise differences among classes are 0.08, 2.16, and 2.25.

2- Village_StateCode: it has two classes and the difference between them is 1.144.

3- Interviewer_Education: it has three classes and the pairwise differences among classes are 0.669, 0.911, and 0.212.

4- Participant_ethnicity: it has two classes and the difference between them is 0.682.

5- Participant_gender: it has two classes and the difference between them is 0.645.

6- Intervention_group: it has two classes and the difference between them is 0.590.

7- Eligible: it has two classes and the difference between them is 0.459.

8- Interviewer_Gender: it has two classes and the difference between them is 0.453.

9- Interviewer_Ethnicity: it has two classes and the difference between them is 0.433.
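As referenced above, here is a minimal sketch of how such a ranking could be computed (reusing `df` and `categorical_cols` from the previous sketch, and assuming a target column named 'Quality_Score'; none of these names come from the repository's scripts):

```python
def class_mean_spread(df, feature, target='Quality_Score'):
    """Largest pairwise gap between the per-class means of the target."""
    means = df.groupby(feature)[target].mean().tolist()
    return max(abs(a - b)
               for i, a in enumerate(means) for b in means[i + 1:])

# Rank the categorical features by how far apart their class means sit.
ranking = sorted(categorical_cols,
                 key=lambda f: class_mean_spread(df, f),
                 reverse=True)
```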

The combination figures for, respectively, the first two, three, four, five, and six features of the list above are presented below:

[Figures: Best two, Best three, Best four, Best five, Best six]

As for the latter method, it seems more rational to classify the categorical features into two groups, 'participant' and 'interviewer', and then remove features that only increase the number of classes without improving the variation. For the interviewer group, three features can be considered interviewer information ('Interviewer_Gender', 'Interviewer_Education', 'Interviewer_Ethnicity'), and the number of classes created from them is eleven. The combination's figure is shown below:

[Figure: Three interviewer]

For the participant group, six features can be considered participant information ('Eligible', 'Participant_gender', 'Participant_ethnicity', 'Intervention_group', 'City_code', 'Village_StateCode'). Because combining this many features increases the number of classes exponentially, we removed those which, according to the list proposed in the 'best variation creation' method, cannot improve the variation much.

Combination of all six features:

[Figure: Six participant]

Combination of five features (one removed):

[Figure: Five participant]

Combination of four features (two removed):

[Figure: Four participant]

It is worth mentioning that a closer look at our dataset indicates 'Village_Cluster' is not an independent feature. This kind of feature combination is uploaded as 'Combining features and visualization- wisely'.

To find the best created features (the ones with the strongest relation to the target), some supervised algorithms were implemented, since these algorithms can capture even nonlinear relations. Finding the relation between a categorical feature and a numerical target can be done in two ways: first, treating this as a regression problem and using regression algorithms; second, reversing the problem (treating our target as a feature and our feature as a target) and solving it as a classification problem.

Two regression solutions

For the regression algorithms, two different approaches were considered: linear regression and an Artificial Neural Network (ANN) based regression. In both methods, the algorithm tries to predict the target from the features, one feature at a time. The feature whose prediction (the result of running the regression algorithm on it) has the lower Mean Absolute Error (MAE) or Mean Squared Error (MSE) is selected. This part's code was uploaded as 'Regression categorical feature using linear regression'.
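A minimal sketch of this screening loop (a sketch only: the integer encoding, the train/test split, and the column name 'Quality_Score' are assumptions, not details from the repository's script):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

def screen_feature(df, feature, target='Quality_Score'):
    """Fit a one-feature linear regression and report MAE and MSE."""
    X = df[[feature]].astype('category').apply(lambda c: c.cat.codes)
    y = df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    return mean_absolute_error(y_te, pred), mean_squared_error(y_te, pred)
```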

The results of applying linear regression to the features of the 'best variation creation' group are presented below. In addition, to show how the regression prediction changes as new features are combined, we first give the result of applying linear regression to 'City_code', the feature with the maximum differences among its classes' means. Then, following the eligibility list presented earlier, features were combined (first the top two, then the top three, and so on), linear regression was applied, and the results are shown.

MAE of City_code: 11.863

MAE of Best_two: 11.862

MAE of Best_three: 11.863

MAE of Best_four: 11.870

MAE of Best_five: 11.870

MAE of Best_six: 11.870

[Figures: linear-regression predictions for each combination]

The results of applying ANN-based regression to the features of the 'best variation creation' group follow. The procedure is exactly the same as in the previous step, but before presenting the results, the applied architecture is shown. This part's code was uploaded as 'Regression categorical feature using an ANN'.
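The exact architecture was presented as a figure in the original; the following is only a representative Keras sketch (the layer sizes, optimizer, and training settings are assumptions, not the author's configuration):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_regressor(n_inputs):
    """A small fully connected regressor with a linear output."""
    model = keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=(n_inputs,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(1),  # linear output unit for regression
    ])
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

# history = build_regressor(X_train.shape[1]).fit(
#     X_train, y_train, epochs=100, batch_size=32, validation_split=0.2)
# The validation MAE values below would be read from history.history['val_mae'].
```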

Validation MAE of City_code: 0.7421693643199738

Validation MAE of Best_two: 0.7415406365835382

Validation MAE of Best_three: 0.742745038042128

Validation MAE of Best_four: 0.7431747175569832

Validation MAE of Best_five: 0.7433048499085942

Validation MAE of Best_six: 0.7441807791097875

[Figures: prediction and loss curves for each combination]

To sum up, the ANN-based regression performed better than linear regression overall (the figure below supports this), as is clear from comparing their prediction visualizations, MAE, and MSE. This shows that the relation between our categorical features and the target is estimated better by nonlinear functions than by linear ones. More specifically, both regression algorithms performed best on combinations of the first two or three features of the eligibility list, and combining more features reduced performance. This is because the class means are the key source of difference, but they are not distinct enough to separate the huge number of created classes.

[Figures: Linear regression vs ANN (Best three); City_code vs Best_two]

The same procedure was then applied to the 'wisely combining features' group. Notably, the algorithms performed better on the combination of the four participant-related features than on the combination of the three interviewer-related features. Also, combining more than four participant features decreased the network's performance, for the reason discussed earlier. It is worth mentioning that, as we expected, the results of the 'best variation creation' group were better than those of the 'wisely combining features' group, because its features were combined according to the eligibility list.

MAE of Four_participant (linear regression): 11.869

MAE of Five_participant (linear regression): 11.866

MAE of Six_participant (linear regression): 11.867

MAE of Three_interviewer (linear regression): 11.886

[Figures: linear-regression predictions for each combination]

Validation MAE of Four_participant (ANN-based regression): 0.7436238367965747

Validation MAE of Five_participant (ANN-based regression): 0.7436517098824937

Validation MAE of Six_participant (ANN-based regression): 0.7440071549946741

Validation MAE of Three_interviewer (ANN-based regression): 0.7436813963040475

[Figures: prediction and loss curves for each combination]

Two classification solutions

We also decided to solve the problem in reverse, to detect whether there was any stronger relation. For this purpose, two approaches were employed: K-Nearest Neighbours (KNN) and an ANN. The first is a simple but fast classifier, whereas the second is accurate but slow. The results of applying KNN to the 'best variation creation' group's features are presented below. This section's script was uploaded as 'KNN classifier'.
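A minimal sketch of the reversed setup (assumptions: the quality score becomes the single input, the composite categorical feature becomes the class label, and `combine` is the hypothetical helper from the earlier sketch):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = df[['Quality_Score']]                            # former target is now the input
y = combine(df, ('City_code', 'Village_StateCode'))  # e.g. the Best_two label
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print(accuracy_score(y_te, knn.predict(X_te)))
```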

[Figures: KNN accuracy and prediction plots for City_code, Best_two, Best_three, and Best_four]

Our ANN architecture was exactly the same as the one used for regression, with small changes to the hyperparameters: we used accuracy as the metric instead of MSE and MAE, and we added a sigmoid activation function to the output layer. The results of applying the ANN to the 'best variation creation' group's features are displayed below. This part's script was uploaded as 'ANN classifier'.
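A representative sketch of that classifier variant, mirroring the regression sketch above (the sigmoid output follows the text; the unit counts and the loss are assumptions — a softmax output would be the more conventional choice for multi-class labels):

```python
def build_classifier(n_inputs, n_classes):
    """Same assumed body as the regressor, with a classification head."""
    model = keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=(n_inputs,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(n_classes, activation='sigmoid'),  # as described in the text
    ])
    model.compile(optimizer='rmsprop',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```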

[Figures: ANN accuracy plots for City_code, Best_two, Best_three, and Best_four]

The results were not really good, and this led us not to continue with the other group or even the remaining combinations in the 'best variation creation' group. In fact, combining features not only failed to improve classification, it made it worse. The classifiers have to decide which class each value of our target belongs to; however, the target's values are very close to each other and carry no distinctive signature for exact classification, which caused the classifiers' poor performance. Therefore, reversing the problem to find the relation was not a good strategy here. Another interesting result, contrary to our expectation, was the better performance of KNN compared with the ANN. This shows that even though an ANN is a powerful algorithm, when the data is not rich enough it cannot be trained properly and its performance suffers.

Numerical features

Based on the correlation calculations presented earlier (Pearson and Spearman), the relation between numerical features such as 'Participant_age', 'Interviewer_WorkExperience', and 'Interviewer_age' and our target is very weak, whereas 'FU_Month' shows a strong relation with the target. In this part we investigate these relations more precisely. The assumptions of a linear and a nonlinear relation between our features and the target were investigated via linear regression and ANN-based regression, respectively. Because both the features and the target are numerical, the regression approach was chosen; and because there was no specific signature for turning our numerical features into categorical ones, classification approaches were not applied.
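A minimal sketch of the linear half of this comparison (the feature names follow the text, but the DataFrame `df`, the target name 'Quality_Score', and the split are assumptions; the ANN half would reuse `build_regressor` from the earlier sketch):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

for feature in ['FU_Month', 'Participant_age',
                'Interviewer_WorkExperience', 'Interviewer_age']:
    X, y = df[[feature]], df['Quality_Score']
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    print(feature, mean_absolute_error(y_te, pred))
```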

The results of applying linear regression and ANN-based regression, respectively, follow. The code for this section was uploaded as 'Regression numerical features using linear regression' and 'Regression numerical features using ANN'.

MAE of FU_Month (linear regression): 7.909

MAE of Interviewer_age (linear regression): 11.964

MAE of Interviewer_workExperience (linear regression): 11.962

MAE of Participant_age (linear regression): 11.958

MAE and MSE of FU_Month (ANN), respectively: 9.1214 and 211.0773

MAE and MSE of Interviewer_age (ANN), respectively: 12.3395 and 283.1067

MAE and MSE of Interviewer_workExperience (ANN), respectively: 12.3941 and 284.2482

MAE and MSE of Participant_age (ANN), respectively: 12.4319 and 288.4689

[Figures: MAE and loss curves for each feature]

An interesting point is the better performance of linear regression compared with the ANN here. This is because the relation between our numerical features and the target is essentially linear, with almost no nonlinear component.
