From 9f985b489a438bf350fd1d5bf8181690c40176ae Mon Sep 17 00:00:00 2001
From: Frits Hermans
Date: Mon, 27 Jan 2025 16:21:06 +0100
Subject: [PATCH 1/3] update explanation on nr of trees in GBDT

---
 notebooks/ensemble_ex_03.ipynb    | 15 ++++++++-------
 notebooks/ensemble_sol_03.ipynb   | 17 +++++++++--------
 python_scripts/ensemble_ex_03.py  | 15 ++++++++-------
 python_scripts/ensemble_sol_03.py | 17 +++++++++--------
 4 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/notebooks/ensemble_ex_03.ipynb b/notebooks/ensemble_ex_03.ipynb
index 895d786c5..f9d1e4590 100644
--- a/notebooks/ensemble_ex_03.ipynb
+++ b/notebooks/ensemble_ex_03.ipynb
@@ -101,20 +101,21 @@ "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Both gradient boosting and random forest models improve when increasing the\n",
-    "number of trees in the ensemble. However, the scores reach a plateau where\n",
-    "adding new trees just makes fitting and scoring slower.\n",
+    "Random forest models improve when increasing the number of trees in the\n",
+    "ensemble. However, the scores reach a plateau where adding new trees just\n",
+    "makes fitting and scoring slower.\n",
     "\n",
-    "To avoid adding new unnecessary tree, unlike random-forest gradient-boosting\n",
+    "Gradient boosting models overfit when the number of trees is too large. To\n",
+    "avoid adding unnecessary trees, gradient boosting (unlike random forests)\n",
     "offers an early-stopping option. Internally, the algorithm uses an\n",
     "out-of-sample set to compute the generalization performance of the model at\n",
     "each addition of a tree. Thus, if the generalization performance is not\n",
     "improving for several iterations, it stops adding trees.\n",
     "\n",
     "Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
-    "of trees is certainly too large. Change the parameter `n_iter_no_change` such\n",
-    "that the gradient boosting fitting stops after adding 5 trees that do not\n",
-    "improve the overall generalization performance."
+    "of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
+    "such that the gradient boosting fitting stops once 5 consecutive trees fail\n",
+    "to improve the overall generalization performance."
    ]
   },
   {
diff --git a/notebooks/ensemble_sol_03.ipynb b/notebooks/ensemble_sol_03.ipynb
index 7fc5dae16..fce47c9a2 100644
--- a/notebooks/ensemble_sol_03.ipynb
+++ b/notebooks/ensemble_sol_03.ipynb
@@ -129,20 +129,21 @@ "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Both gradient boosting and random forest models improve when increasing the\n",
-    "number of trees in the ensemble. However, the scores reach a plateau where\n",
-    "adding new trees just makes fitting and scoring slower.\n",
+    "Random forest models improve when increasing the number of trees in the\n",
+    "ensemble. However, the scores reach a plateau where adding new trees just\n",
+    "makes fitting and scoring slower.\n",
     "\n",
-    "To avoid adding new unnecessary tree, unlike random-forest gradient-boosting\n",
+    "Gradient boosting models overfit when the number of trees is too large. To\n",
+    "avoid adding unnecessary trees, gradient boosting (unlike random forests)\n",
     "offers an early-stopping option. Internally, the algorithm uses an\n",
     "out-of-sample set to compute the generalization performance of the model at\n",
     "each addition of a tree. Thus, if the generalization performance is not\n",
     "improving for several iterations, it stops adding trees.\n",
     "\n",
     "Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
-    "of trees is certainly too large. Change the parameter `n_iter_no_change` such\n",
-    "that the gradient boosting fitting stops after adding 5 trees that do not\n",
-    "improve the overall generalization performance."
+    "of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
+    "such that the gradient boosting fitting stops once 5 consecutive trees fail\n",
+    "to improve the overall generalization performance."
    ]
   },
   {
@@ -167,7 +168,7 @@
    "source": [
    "We see that the number of trees used is far below 1000 with the current\n",
    "dataset. Training the gradient boosting model with the entire 1000 trees would\n",
-    "have been useless."
+    "have been harmful."
    ]
   },
   {
diff --git a/python_scripts/ensemble_ex_03.py b/python_scripts/ensemble_ex_03.py
index 72f8f362c..cecb9484a 100644
--- a/python_scripts/ensemble_ex_03.py
+++ b/python_scripts/ensemble_ex_03.py
@@ -64,20 +64,21 @@
 # Write your code here.
 
 # %% [markdown]
-# Both gradient boosting and random forest models improve when increasing the
-# number of trees in the ensemble. However, the scores reach a plateau where
-# adding new trees just makes fitting and scoring slower.
+# Random forest models improve when increasing the number of trees in the
+# ensemble. However, the scores reach a plateau where adding new trees just
+# makes fitting and scoring slower.
 #
-# To avoid adding new unnecessary tree, unlike random-forest gradient-boosting
+# Gradient boosting models overfit when the number of trees is too large. To
+# avoid adding unnecessary trees, gradient boosting (unlike random forests)
 # offers an early-stopping option. Internally, the algorithm uses an
 # out-of-sample set to compute the generalization performance of the model at
 # each addition of a tree. Thus, if the generalization performance is not
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change` such
-# that the gradient boosting fitting stops after adding 5 trees that do not
-# improve the overall generalization performance.
+# of trees is certainly too large. Change the parameter `n_iter_no_change`
+# such that the gradient boosting fitting stops once 5 consecutive trees fail
+# to improve the overall generalization performance.
 
 # %%
 # Write your code here.
diff --git a/python_scripts/ensemble_sol_03.py b/python_scripts/ensemble_sol_03.py
index a72542464..2086ad366 100644
--- a/python_scripts/ensemble_sol_03.py
+++ b/python_scripts/ensemble_sol_03.py
@@ -86,20 +86,21 @@
 )
 
 # %% [markdown]
-# Both gradient boosting and random forest models improve when increasing the
-# number of trees in the ensemble. However, the scores reach a plateau where
-# adding new trees just makes fitting and scoring slower.
+# Random forest models improve when increasing the number of trees in the
+# ensemble. However, the scores reach a plateau where adding new trees just
+# makes fitting and scoring slower.
 #
-# To avoid adding new unnecessary tree, unlike random-forest gradient-boosting
+# Gradient boosting models overfit when the number of trees is too large. To
+# avoid adding unnecessary trees, gradient boosting (unlike random forests)
 # offers an early-stopping option. Internally, the algorithm uses an
 # out-of-sample set to compute the generalization performance of the model at
 # each addition of a tree. Thus, if the generalization performance is not
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change` such
-# that the gradient boosting fitting stops after adding 5 trees that do not
-# improve the overall generalization performance.
+# of trees is certainly too large. Change the parameter `n_iter_no_change`
+# such that the gradient boosting fitting stops once 5 consecutive trees fail
+# to improve the overall generalization performance.
 
 # %%
 # solution
@@ -110,7 +111,7 @@
 # %% [markdown] tags=["solution"]
 # We see that the number of trees used is far below 1000 with the current
 # dataset. Training the gradient boosting model with the entire 1000 trees would
-# have been useless.
+# have been harmful.
 
 # %% [markdown]
 # Estimate the generalization performance of this model again using the

From 49b85fa7bc2eb37bb98f2600b4c7d4f0279474f2 Mon Sep 17 00:00:00 2001
From: Frits Hermans
Date: Wed, 29 Jan 2025 16:01:59 +0100
Subject: [PATCH 2/3] Update ensemble_sol_03.py

Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
---
 python_scripts/ensemble_sol_03.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python_scripts/ensemble_sol_03.py b/python_scripts/ensemble_sol_03.py
index 2086ad366..55f882443 100644
--- a/python_scripts/ensemble_sol_03.py
+++ b/python_scripts/ensemble_sol_03.py
@@ -111,7 +111,7 @@
 # %% [markdown] tags=["solution"]
 # We see that the number of trees used is far below 1000 with the current
 # dataset. Training the gradient boosting model with the entire 1000 trees would
-# have been harmful.
+# have been detrimental.
 
 # %% [markdown]
 # Estimate the generalization performance of this model again using the

From fc5bc202b04a06ad201098d5dc42b4540c3117e6 Mon Sep 17 00:00:00 2001
From: Frits Hermans
Date: Wed, 29 Jan 2025 16:02:09 +0100
Subject: [PATCH 3/3] Update ensemble_sol_03.ipynb

Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
---
 notebooks/ensemble_sol_03.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notebooks/ensemble_sol_03.ipynb b/notebooks/ensemble_sol_03.ipynb
index fce47c9a2..4906e1b55 100644
--- a/notebooks/ensemble_sol_03.ipynb
+++ b/notebooks/ensemble_sol_03.ipynb
@@ -168,7 +168,7 @@
    "source": [
    "We see that the number of trees used is far below 1000 with the current\n",
    "dataset. Training the gradient boosting model with the entire 1000 trees would\n",
-    "have been harmful."
+    "have been detrimental."
    ]
   },
   {