Host DataBalanceAnalysis-AdultCensusIncome cell outputs in blob instead of inline, use Interpretability-Image Explainers as outstanding notebook in features/responsible_ai/
ms-kashyap committed Nov 5, 2021
1 parent 70ee581 commit 7565de5
Showing 7 changed files with 29 additions and 65 deletions.
54 changes: 16 additions & 38 deletions notebooks/DataBalanceAnalysis - Adult Census Income.ipynb

Large diffs are not rendered by default.

@@ -158,9 +158,7 @@ fig.tight_layout()
plt.show()
```

-
-![png](DataBalanceAnalysis-AdultCensusIncome_files/DataBalanceAnalysis-AdultCensusIncome_13_0.png)
-
+![Demographic Parity of Races in Adult Dataset](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_AdultCensusIncome_RacesDP.png)

#### Interpret Feature Balance Measures

@@ -273,9 +271,7 @@ fig.tight_layout()
plt.show()
```

-
-![png](DataBalanceAnalysis-AdultCensusIncome_files/DataBalanceAnalysis-AdultCensusIncome_18_0.png)
-
+![Distribution Balance Measures of Sex and Race in Adult Dataset](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_AdultCensusIncome_DistributionMeasures.png)

#### Interpret Distribution Balance Measures

Binary file not shown.
Binary file not shown.
8 changes: 4 additions & 4 deletions website/docs/features/responsible_ai/Data Balance Analysis.md
@@ -175,22 +175,22 @@ This involves under-sampling from majority class and over-sampling from minority
1. Under-sampling may remove valuable information.
2. Over-sampling may cause overfitting and poor generalization on test set.

-![Bar chart undersampling and oversampling](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_SamplingBar.png)
+![Bar chart undersampling and oversampling](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_SamplingBar.png)
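
As a rough illustration of the random under-sampling and over-sampling described in this part of the documentation, the sketch below balances classes using only the standard library. The function name `random_resample` and its parameters are hypothetical and are not part of the commit or of any library.

```python
import random
from collections import defaultdict


def random_resample(rows, labels, mode="under", seed=0):
    """Balance classes by random under- or over-sampling (illustrative only)."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # Under-sampling shrinks every class to the smallest class size;
    # over-sampling grows every class to the largest class size.
    sizes = [len(group) for group in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        if len(group) >= target:
            # Keep a random subset of size `target` (information may be lost).
            chosen = rng.sample(group, target)
        else:
            # Keep everything and add random exact duplicates (may overfit).
            chosen = group + rng.choices(group, k=target - len(group))
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels
```

The two branches correspond to the two drawbacks listed above: the subset branch discards records, and the duplication branch repeats minority records verbatim.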

There are smarter techniques to under-sample and over-sample in literature and implemented in Python’s [imbalanced-learn](https://imbalanced-learn.org/stable/) package.

For example, we can cluster the records of the majority class, and do the under-sampling by removing records from each cluster, thus seeking to preserve information.

One technique of under-sampling is use of Tomek Links. Tomek links are pairs of very close instances but of opposite classes. Removing the instances of the majority class of each pair increases the space between the two classes, facilitating the classification process. A similar way to under-sample majority class is using Near-Miss. It first calculates the distance between all the points in the larger class with the points in the smaller class. When two points belonging to different classes are very close to each other in the distribution, this algorithm eliminates the datapoint of the larger class thereby trying to balance the distribution.

-![Tomek Links](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_TomekLinks.png)
+![Tomek Links](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_TomekLinks.png)
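
The Tomek-link idea above can be sketched in a few lines. This minimal version assumes the usual formal definition (two points of opposite classes that are each other's nearest neighbor) and uses brute-force Euclidean distance; the helper names are hypothetical, and it is not the imbalanced-learn implementation.

```python
import math


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Return (i, j) pairs, i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_majority_in_links(points, labels, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = set()
    for i, j in tomek_links(points, labels):
        drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]
```

The brute-force neighbor search is quadratic in the number of points, which is fine for a demonstration but not for large datasets.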

In over-sampling, instead of creating exact copies of the minority class records, we can introduce small variations into those copies, creating more diverse synthetic samples. This technique is called SMOTE (Synthetic Minority Oversampling Technique). It randomly picks a point from the minority class and computes the k-nearest neighbors for this point. The synthetic points are added between the chosen point and its neighbors.

-![Synthetic Samples](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_SyntheticSamples.png)
+![Synthetic Samples](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_SyntheticSamples.png)
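
The SMOTE procedure described above (pick a minority point, find its k nearest minority neighbors, place a synthetic point between the two) can be sketched as follows. This is a simplified illustration with a hypothetical function name, assuming numeric tuples and at least two minority points; it is not the imbalanced-learn `SMOTE` class.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Randomly pick a minority point.
        base = rng.choice(minority)
        # Find its k nearest minority-class neighbors (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Interpolate: a random point on the segment between base and neighbor.
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic
```

Because each synthetic point lies on a segment between two existing minority points, the new samples vary slightly instead of being exact copies.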

### Reweighting

There is an expected and observed value in each table cell. The weight is essentially expected / observed value. This is easy to extend to multiple features with more than 2 groups. The weights are then incorporated in loss function of model training.

-![Reweighting](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_Reweight.png)
+![Reweighting](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_Reweight.png)
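
To make the expected / observed ratio concrete, the sketch below computes one weight per (group, label) cell. The documentation does not define "expected"; this example assumes it means the count a cell would have if the feature and the label were independent (row total times column total divided by the number of records). The function name is hypothetical.

```python
from collections import Counter


def reweighting_weights(groups, labels):
    """Weight per (group, label) cell = expected count / observed count."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    observed = Counter(zip(groups, labels))
    return {
        # Expected count assumes independence between group and label.
        (g, y): (group_counts[g] * label_counts[y] / n) / count
        for (g, y), count in observed.items()
    }
```

Cells that are over-represented relative to independence get a weight below 1 and under-represented cells a weight above 1; each record's weight can then be passed to a training routine that accepts per-sample weights.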
24 changes: 7 additions & 17 deletions website/notebookconvert.py
@@ -14,23 +14,10 @@ def add_header_to_markdown(folder, md):
     f.close()


-def convert_notebook_to_markdown(folder, nb, outputdir):
-    file_path = os.path.join(folder, nb)
+def convert_notebook_to_markdown(file_path, outputdir):
     print(f"Converting {file_path} into markdown")
-
-    # If the notebook contains cell outputs such as figures, a folder containing cell output images is generated alongside the markdown file
-    # By default, both the folder and files contain the notebook name. But spaces in the notebook name create linking errors in the generated markdown
-    # Therefore, we first generate the markdown file, output folder, and output files with no spaces
-    nb_no_spaces = nb.replace(" ", "").replace(".ipynb", "")
-
-    convert_cmd = f'jupyter nbconvert --output-dir="{outputdir}" --NbConvertApp.output_base="{nb_no_spaces}" --to markdown "{file_path}"'
+    convert_cmd = f'jupyter nbconvert --output-dir="{outputdir}" --to markdown "{file_path}"'
     os.system(convert_cmd)
-
-    # Afterwards, we rename the generated markdown file to ensure that the markdown file has the same name as notebook
-    md_no_spaces = os.path.join(outputdir, f"{nb_no_spaces}.md")
-    md_final = os.path.join(outputdir, nb.replace(".ipynb", ".md"))
-    print(f"Renaming {md_no_spaces} to {md_final}")
-    os.rename(md_no_spaces, md_final)
     print()
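
As a side note on the simplified function in this hunk, the sketch below shows one possible variation that is not part of the commit: it passes the same `jupyter nbconvert` options as an argument list to `subprocess.run` rather than as a shell string to `os.system`, so paths containing spaces need no manual quoting. The helper `build_nbconvert_command` is a hypothetical name introduced for illustration.

```python
import subprocess


def build_nbconvert_command(file_path, outputdir):
    """Argument list equivalent to the nbconvert command string in this diff."""
    return [
        "jupyter",
        "nbconvert",
        f"--output-dir={outputdir}",
        "--to",
        "markdown",
        file_path,
    ]


def convert_notebook_to_markdown(file_path, outputdir):
    print(f"Converting {file_path} into markdown")
    # check=True raises CalledProcessError if nbconvert exits with a non-zero status.
    subprocess.run(build_nbconvert_command(file_path, outputdir), check=True)
    print()
```

Unlike `os.system`, which only returns the exit status, this variant fails loudly when the conversion does not succeed.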


@@ -42,7 +29,10 @@ def convert_allnotebooks_in_folder(folder, outputdir):
"CognitiveServices": os.path.join(outputdir, "examples", "cognitive_services"),
"DataBalanceAnalysis": os.path.join(outputdir, "examples", "responsible_ai"),
"DeepLearning": os.path.join(outputdir, "examples", "deep_learning"),
-"Interpretability": os.path.join(outputdir, "examples", "responsible_ai"),
+"Interpretability - Image Explainers": os.path.join(outputdir, "features", "responsible_ai"),
+"Interpretability - Explanation Dashboard": os.path.join(outputdir, "examples", "responsible_ai"),
+"Interpretability - Tabular SHAP explainer": os.path.join(outputdir, "examples", "responsible_ai"),
+"Interpretability - Text Explainers": os.path.join(outputdir, "examples", "responsible_ai"),
"ModelInterpretability": os.path.join(outputdir, "examples", "responsible_ai"),
"Regression": os.path.join(outputdir, "examples", "regression"),
"TextAnalytics": os.path.join(outputdir, "examples", "text_analytics"),
@@ -70,7 +60,7 @@ def convert_allnotebooks_in_folder(folder, outputdir):
if os.path.exists(os.path.join(finaldir, md)):
os.remove(os.path.join(finaldir, md))

-convert_notebook_to_markdown(folder, nb, finaldir)
+convert_notebook_to_markdown(os.path.join(folder, nb), finaldir)
add_header_to_markdown(finaldir, md)

