From 9f985b489a438bf350fd1d5bf8181690c40176ae Mon Sep 17 00:00:00 2001
From: Frits Hermans
Date: Mon, 27 Jan 2025 16:21:06 +0100
Subject: [PATCH 1/3] update explanation on nr of trees in GBDT

---
 notebooks/ensemble_ex_03.ipynb    | 15 ++++++++-------
 notebooks/ensemble_sol_03.ipynb   | 17 +++++++++--------
 python_scripts/ensemble_ex_03.py  | 15 ++++++++-------
 python_scripts/ensemble_sol_03.py | 17 +++++++++--------
 4 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/notebooks/ensemble_ex_03.ipynb b/notebooks/ensemble_ex_03.ipynb
index 895d786c5..f9d1e4590 100644
--- a/notebooks/ensemble_ex_03.ipynb
+++ b/notebooks/ensemble_ex_03.ipynb
@@ -101,20 +101,21 @@ "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Both gradient boosting and random forest models improve when increasing the\n",
-    "number of trees in the ensemble. However, the scores reach a plateau where\n",
-    "adding new trees just makes fitting and scoring slower.\n",
+    "Random forest models improve when increasing the number of trees in the\n",
+    "ensemble. However, the scores reach a plateau where adding new trees just\n",
+    "makes fitting and scoring slower.\n",
     "\n",
-    "To avoid adding new unnecessary tree, unlike random-forest gradient-boosting\n",
+    "Gradient boosting models overfit when the number of trees is too large. To\n",
+    "avoid adding unnecessary trees, gradient boosting (unlike random forests)\n",
     "offers an early-stopping option. Internally, the algorithm uses an\n",
     "out-of-sample set to compute the generalization performance of the model at\n",
     "each addition of a tree. Thus, if the generalization performance is not\n",
     "improving for several iterations, it stops adding trees.\n",
     "\n",
     "Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
-    "of trees is certainly too large. Change the parameter `n_iter_no_change` such\n",
-    "that the gradient boosting fitting stops after adding 5 trees that do not\n",
-    "improve the overall generalization performance."
+    "of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
+    "such that the gradient boosting fitting stops once 5 consecutive trees fail\n",
+    "to improve the overall generalization performance."
    ]
   },
   {
diff --git a/notebooks/ensemble_sol_03.ipynb b/notebooks/ensemble_sol_03.ipynb
index 7fc5dae16..fce47c9a2 100644
--- a/notebooks/ensemble_sol_03.ipynb
+++ b/notebooks/ensemble_sol_03.ipynb
@@ -129,20 +129,21 @@ "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Both gradient boosting and random forest models improve when increasing the\n",
-    "number of trees in the ensemble. However, the scores reach a plateau where\n",
-    "adding new trees just makes fitting and scoring slower.\n",
+    "Random forest models improve when increasing the number of trees in the\n",
+    "ensemble. However, the scores reach a plateau where adding new trees just\n",
+    "makes fitting and scoring slower.\n",
     "\n",
-    "To avoid adding new unnecessary tree, unlike random-forest gradient-boosting\n",
+    "Gradient boosting models overfit when the number of trees is too large. To\n",
+    "avoid adding unnecessary trees, gradient boosting (unlike random forests)\n",
     "offers an early-stopping option. Internally, the algorithm uses an\n",
     "out-of-sample set to compute the generalization performance of the model at\n",
     "each addition of a tree. Thus, if the generalization performance is not\n",
     "improving for several iterations, it stops adding trees.\n",
     "\n",
     "Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
-    "of trees is certainly too large. Change the parameter `n_iter_no_change` such\n",
-    "that the gradient boosting fitting stops after adding 5 trees that do not\n",
-    "improve the overall generalization performance."
+    "of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
+    "such that the gradient boosting fitting stops once 5 consecutive trees fail\n",
+    "to improve the overall generalization performance."
    ]
   },
   {
@@ -167,7 +168,7 @@
    "source": [
    "We see that the number of trees used is far below 1000 with the current\n",
    "dataset. Training the gradient boosting model with the entire 1000 trees would\n",
-    "have been useless."
+    "have been harmful."
    ]
   },
   {
diff --git a/python_scripts/ensemble_ex_03.py b/python_scripts/ensemble_ex_03.py
index 72f8f362c..cecb9484a 100644
--- a/python_scripts/ensemble_ex_03.py
+++ b/python_scripts/ensemble_ex_03.py
@@ -64,20 +64,21 @@
 # Write your code here.
 
 # %% [markdown]
-# Both gradient boosting and random forest models improve when increasing the
-# number of trees in the ensemble. However, the scores reach a plateau where
-# adding new trees just makes fitting and scoring slower.
+# Random forest models improve when increasing the number of trees in the
+# ensemble. However, the scores reach a plateau where adding new trees just
+# makes fitting and scoring slower.
 #
-# To avoid adding new unnecessary tree, unlike random-forest gradient-boosting
+# Gradient boosting models overfit when the number of trees is too large. To
+# avoid adding unnecessary trees, gradient boosting (unlike random forests)
 # offers an early-stopping option. Internally, the algorithm uses an
 # out-of-sample set to compute the generalization performance of the model at
 # each addition of a tree. Thus, if the generalization performance is not
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change` such
-# that the gradient boosting fitting stops after adding 5 trees that do not
-# improve the overall generalization performance.
+# of trees is certainly too large. Change the parameter `n_iter_no_change`
+# such that the gradient boosting fitting stops once 5 consecutive trees fail
+# to improve the overall generalization performance.
 
 # %%
 # Write your code here.
diff --git a/python_scripts/ensemble_sol_03.py b/python_scripts/ensemble_sol_03.py
index a72542464..2086ad366 100644
--- a/python_scripts/ensemble_sol_03.py
+++ b/python_scripts/ensemble_sol_03.py
@@ -86,20 +86,21 @@
 )
 
 # %% [markdown]
-# Both gradient boosting and random forest models improve when increasing the
-# number of trees in the ensemble. However, the scores reach a plateau where
-# adding new trees just makes fitting and scoring slower.
+# Random forest models improve when increasing the number of trees in the
+# ensemble. However, the scores reach a plateau where adding new trees just
+# makes fitting and scoring slower.
 #
-# To avoid adding new unnecessary tree, unlike random-forest gradient-boosting
+# Gradient boosting models overfit when the number of trees is too large. To
+# avoid adding unnecessary trees, gradient boosting (unlike random forests)
 # offers an early-stopping option. Internally, the algorithm uses an
 # out-of-sample set to compute the generalization performance of the model at
 # each addition of a tree. Thus, if the generalization performance is not
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change` such
-# that the gradient boosting fitting stops after adding 5 trees that do not
-# improve the overall generalization performance.
+# of trees is certainly too large. Change the parameter `n_iter_no_change`
+# such that the gradient boosting fitting stops once 5 consecutive trees fail
+# to improve the overall generalization performance.
 
 # %%
 # solution
@@ -110,7 +111,7 @@
 # %% [markdown] tags=["solution"]
 # We see that the number of trees used is far below 1000 with the current
 # dataset. Training the gradient boosting model with the entire 1000 trees would
-# have been useless.
+# have been harmful.
 
 # %% [markdown]
 # Estimate the generalization performance of this model again using the

From 49b85fa7bc2eb37bb98f2600b4c7d4f0279474f2 Mon Sep 17 00:00:00 2001
From: Frits Hermans
Date: Wed, 29 Jan 2025 16:01:59 +0100
Subject: [PATCH 2/3] Update ensemble_sol_03.py

Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
---
 python_scripts/ensemble_sol_03.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python_scripts/ensemble_sol_03.py b/python_scripts/ensemble_sol_03.py
index 2086ad366..55f882443 100644
--- a/python_scripts/ensemble_sol_03.py
+++ b/python_scripts/ensemble_sol_03.py
@@ -111,7 +111,7 @@
 # %% [markdown] tags=["solution"]
 # We see that the number of trees used is far below 1000 with the current
 # dataset. Training the gradient boosting model with the entire 1000 trees would
-# have been harmful.
+# have been detrimental.
 
 # %% [markdown]
 # Estimate the generalization performance of this model again using the

From fc5bc202b04a06ad201098d5dc42b4540c3117e6 Mon Sep 17 00:00:00 2001
From: Frits Hermans
Date: Wed, 29 Jan 2025 16:02:09 +0100
Subject: [PATCH 3/3] Update ensemble_sol_03.ipynb

Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
---
 notebooks/ensemble_sol_03.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notebooks/ensemble_sol_03.ipynb b/notebooks/ensemble_sol_03.ipynb
index fce47c9a2..4906e1b55 100644
--- a/notebooks/ensemble_sol_03.ipynb
+++ b/notebooks/ensemble_sol_03.ipynb
@@ -168,7 +168,7 @@
    "source": [
    "We see that the number of trees used is far below 1000 with the current\n",
    "dataset. Training the gradient boosting model with the entire 1000 trees would\n",
-    "have been harmful."
+    "have been detrimental."
    ]
   },
   {