
Commit 1a6797e

Author: Emily Strong
Offline Evaluation Metrics Implementations (#9)
Adds inverse propensity scoring and doubly robust evaluation metrics
1 parent e7813ad commit 1a6797e

17 files changed: +536 -194 lines

CHANGELOG.txt

+8 -2

@@ -2,6 +2,12 @@
 CHANGELOG
 =========
 
+-------------------------------------------------------------------------------
+July 29, 2021 1.3.0
+-------------------------------------------------------------------------------
+
+- Added Inverse Propensity Scoring (IPS) and Doubly Robust Estimation (DR) CTR estimation methods.
+
 -------------------------------------------------------------------------------
 July 12, 2021 1.2.2
 -------------------------------------------------------------------------------
@@ -18,7 +24,7 @@ June 23, 2021 1.2.1
 April 16, 2021 1.2.0
 -------------------------------------------------------------------------------
 
-- Fixed deprecation warning of numpy 1.20 dtype
+- Fixed deprecation warning of numpy 1.20 dtype
 
 -------------------------------------------------------------------------------
 April 13, 2021 1.1.0
@@ -37,4 +43,4 @@ February 1, 2021 1.0.0
 December 1, 2020
 -------------------------------------------------------------------------------
 
-- Development starts.
+- Development starts.

README.md

+10 -4

@@ -23,10 +23,12 @@ Jurity is developed by the Artificial Intelligence Center of Excellence at Fidel
 ## Recommenders Metrics
 * [AUC: Area Under the Curve](https://fidelity.github.io/jurity/about_reco.html#auc-area-under-the-curve)
 * [CTR: Click-through rate](https://fidelity.github.io/jurity/about_reco.html#ctr-click-through-rate)
-* [Precision@K](https://fidelity.github.io/jurity/about_reco.html#precision)
-* [Recall@K](https://fidelity.github.io/jurity/about_reco.html#recall)
+* [DR: Doubly robust estimation](https://fidelity.github.io/jurity/about_reco.html#ctr-click-through-rate)
+* [IPS: Inverse propensity scoring](https://fidelity.github.io/jurity/about_reco.html#ctr-click-through-rate)
 * [MAP@K: Mean Average Precision](https://fidelity.github.io/jurity/about_reco.html#map-mean-average-precision)
 * [NDCG: Normalized discounted cumulative gain](https://fidelity.github.io/jurity/about_reco.html#ndcg-normalized-discounted-cumulative-gain)
+* [Precision@K](https://fidelity.github.io/jurity/about_reco.html#precision)
+* [Recall@K](https://fidelity.github.io/jurity/about_reco.html#recall)
 
 ## Classification Metrics
 * [Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
@@ -104,18 +106,22 @@ predicted = pd.DataFrame({"user_id": [1, 2, 3, 4], "item_id": [1, 2, 2, 3], "cli
 # Metrics
 auc = BinaryRecoMetrics.AUC(click_column="clicks")
 ctr = BinaryRecoMetrics.CTR(click_column="clicks")
+dr = BinaryRecoMetrics.CTR(click_column="clicks", estimation='dr')
+ips = BinaryRecoMetrics.CTR(click_column="clicks", estimation='ips')
+map_k = RankingRecoMetrics.MAP(click_column="clicks", k=2)
 ncdg_k = RankingRecoMetrics.NDCG(click_column="clicks", k=3)
 precision_k = RankingRecoMetrics.Precision(click_column="clicks", k=2)
 recall_k = RankingRecoMetrics.Recall(click_column="clicks", k=2)
-map_k = RankingRecoMetrics.MAP(click_column="clicks", k=2)
 
 # Scores
 print("AUC:", auc.get_score(actual, predicted))
 print("CTR:", ctr.get_score(actual, predicted))
+print("Doubly Robust:", dr.get_score(actual, predicted))
+print("IPS:", ips.get_score(actual, predicted))
+print("MAP@K:", map_k.get_score(actual, predicted))
 print("NCDG:", ncdg_k.get_score(actual, predicted))
 print("Precision@K:", precision_k.get_score(actual, predicted))
 print("Recall@K:", recall_k.get_score(actual, predicted))
-print("MAP@K:", map_k.get_score(actual, predicted))
 ```
 
 ## Quick Start: Classification Evaluation

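The hunk above shows only the metric definitions and the score calls. For context, here is a minimal, self-contained sketch of the new `estimation` options; the dataframe values below are made up for illustration (the README defines its own `actual` and `predicted` just above this hunk), and only the import path and calls already shown in the Quick Start are assumed.

```
import pandas as pd
from jurity.recommenders import BinaryRecoMetrics

# Made-up logged interactions (binary clicks) and recommendations with click scores
actual = pd.DataFrame({"user_id": [1, 2, 3, 4], "item_id": [1, 2, 0, 3], "clicks": [0, 1, 0, 0]})
predicted = pd.DataFrame({"user_id": [1, 2, 3, 4], "item_id": [1, 2, 2, 3], "clicks": [0.8, 0.7, 0.8, 0.7]})

# Direct ("matching") CTR alongside the two off-policy estimators added in this commit
ctr = BinaryRecoMetrics.CTR(click_column="clicks")
ips = BinaryRecoMetrics.CTR(click_column="clicks", estimation='ips')
dr = BinaryRecoMetrics.CTR(click_column="clicks", estimation='dr')

for name, metric in [("CTR", ctr), ("IPS", ips), ("Doubly Robust", dr)]:
    print(name, metric.get_score(actual, predicted))
```
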
docs/.buildinfo

+1 -1

@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 2e1ad5d9e655c410a8bf3d73cfd7b84d
+config: 01e4225d941cab66da9ab9ff047a8f1f
 tags: 645f666f9bcd5a90fca523b33c5a78b7

docs/_sources/about_reco.rst.txt

+31 -1

@@ -17,13 +17,43 @@ Binary recommender metrics directly measure the click interaction.
 CTR: Click-through Rate
 ^^^^^^^^^^^^^^^^^^^^^^^
 
-CTR measures the accuracy of the recommendations over the subset of user-item pairs that appear in both actual ratings and recommendations.
+CTR offers three reward estimation methods.
+
+Direct estimation ("matching") measures the accuracy of the recommendations over the subset of user-item pairs that appear in both actual ratings and recommendations.
 
 Let :math:`M` denote the set of user-item pairs that appear in both actual ratings and recommendations, and :math:`C(M_i)` be an indicator function that produces :math:`1` if the user clicked on the item, and :math:`0` if they didn't.
 
 .. math::
     CTR = \frac{1}{\left | M \right |}\sum_{i=1}^{\left | M \right |} C(M_i)
 
+Inverse propensity scoring (IPS) weights the items by how likely they were to be recommended by the historic policy
+if the user saw the item in the historic data. Due to the probability inversion, less likely items are given more weight.
+
+.. math::
+    IPS = \frac{1}{n} \sum r_a \times \frac{I(\hat{a} = a)}{P(a|x,h)}
+
+In this calculation: :math:`n` is the total size of the test data; :math:`r_a` is the observed reward;
+:math:`\hat{a}` is the recommended item; :math:`I(\hat{a} = a)` is an indicator of whether the user-item pair
+appears in the historic data; and :math:`P(a|x,h)` is the probability of the item being recommended for the test context given
+the historic data.
+
+Doubly robust estimation (DR) combines the directly predicted values with a correction based on how
+likely an item was to be recommended by the historic policy if the user saw the item in the historic data.
+
+.. math::
+    DR = \frac{1}{n} \sum \hat{r}_a + \frac{(r_a - \hat{r}_a) I(\hat{a} = a)}{P(a|x,h)}
+
+In this calculation, :math:`\hat{r}_a` is the predicted reward.
+
+At a high level, doubly robust estimation combines a direct estimate with an IPS-like correction if historic data is
+available. If historic data is not available, the second term is 0 and only the predicted reward is used for the
+user-item pair.
+
+
+The IPS and DR implementations are based on: Dudík, Miroslav, John Langford, and Lihong Li.
+"Doubly robust policy evaluation and learning." Proceedings of the 28th International Conference on International
+Conference on Machine Learning. 2011. Available as arXiv preprint arXiv:1103.4601
+
 AUC: Area Under the Curve
 ^^^^^^^^^^^^^^^^^^^^^^^^^

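To make the IPS formula added above concrete, here is a small NumPy sketch of the estimator. It is illustrative only, not the Jurity implementation: the rewards, match indicators, and propensities are made-up values.

```
import numpy as np

# Made-up logged data: one entry per impression in the test set (n = 5)
reward = np.array([1, 0, 1, 0, 1])                # r_a: observed click (0/1)
matches = np.array([1, 1, 0, 1, 0])               # I(a_hat = a): recommended item matches the historic item
propensity = np.array([0.5, 0.4, 0.8, 0.3, 0.4])  # P(a|x,h): historic policy's probability of recommending the item

# Rewards count only where historic data exists, weighted by inverse propensity,
# so rarely-recommended items contribute more per observation.
n = len(reward)
ips = np.sum(reward * matches / propensity) / n
print(f"IPS estimate of CTR: {ips:.3f}")
```
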
docs/_static/pygments.css

+1 -6

@@ -1,10 +1,5 @@
-pre { line-height: 125%; }
-td.linenos .normal { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
-span.linenos { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
-td.linenos .special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
-span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
 .highlight .hll { background-color: #ffffcc }
-.highlight { background: #f8f8f8; }
+.highlight { background: #f8f8f8; }
 .highlight .c { color: #408080; font-style: italic } /* Comment */
 .highlight .err { border: 1px solid #FF0000 } /* Error */
 .highlight .k { color: #008000; font-weight: bold } /* Keyword */

docs/about_reco.html

+21 -1

@@ -184,10 +184,30 @@ <h2>Binary Recommender Metrics<a class="headerlink" href="#binary-recommender-me
 <p>Binary recommender metrics directly measure the click interaction.</p>
 <div class="section" id="ctr-click-through-rate">
 <h3>CTR: Click-through Rate<a class="headerlink" href="#ctr-click-through-rate" title="Permalink to this headline"></a></h3>
-<p>CTR measures the accuracy of the recommendations over the subset of user-item pairs that appear in both actual ratings and recommendations.</p>
+<p>CTR offers three reward estimation methods.</p>
+<p>Direct estimation (“matching”) measures the accuracy of the recommendations over the subset of user-item pairs that appear in both actual ratings and recommendations.</p>
 <p>Let <span class="math notranslate nohighlight">\(M\)</span> denote the set of user-item pairs that appear in both actual ratings and recommendations, and <span class="math notranslate nohighlight">\(C(M_i)\)</span> be an indicator function that produces <span class="math notranslate nohighlight">\(1\)</span> if the user clicked on the item, and <span class="math notranslate nohighlight">\(0\)</span> if they didn’t.</p>
 <div class="math notranslate nohighlight">
 \[CTR = \frac{1}{\left | M \right |}\sum_{i=1}^{\left | M \right |} C(M_i)\]</div>
+<p>Inverse propensity scoring (IPS) weights the items by how likely they were to be recommended by the historic policy
+if the user saw the item in the historic data. Due to the probability inversion, less likely items are given more weight.</p>
+<div class="math notranslate nohighlight">
+\[IPS = \frac{1}{n} \sum r_a \times \frac{I(\hat{a} = a)}{P(a|x,h)}\]</div>
+<p>In this calculation: <span class="math notranslate nohighlight">\(n\)</span> is the total size of the test data; <span class="math notranslate nohighlight">\(r_a\)</span> is the observed reward;
+<span class="math notranslate nohighlight">\(\hat{a}\)</span> is the recommended item; <span class="math notranslate nohighlight">\(I(\hat{a} = a)\)</span> is an indicator of whether the user-item pair
+appears in the historic data; and <span class="math notranslate nohighlight">\(P(a|x,h)\)</span> is the probability of the item being recommended for the test context given
+the historic data.</p>
+<p>Doubly robust estimation (DR) combines the directly predicted values with a correction based on how
+likely an item was to be recommended by the historic policy if the user saw the item in the historic data.</p>
+<div class="math notranslate nohighlight">
+\[DR = \frac{1}{n} \sum \hat{r}_a + \frac{(r_a - \hat{r}_a) I(\hat{a} = a)}{P(a|x,h)}\]</div>
+<p>In this calculation, <span class="math notranslate nohighlight">\(\hat{r}_a\)</span> is the predicted reward.</p>
+<p>At a high level, doubly robust estimation combines a direct estimate with an IPS-like correction if historic data is
+available. If historic data is not available, the second term is 0 and only the predicted reward is used for the
+user-item pair.</p>
+<p>The IPS and DR implementations are based on: Dudík, Miroslav, John Langford, and Lihong Li.
+“Doubly robust policy evaluation and learning.” Proceedings of the 28th International Conference on International
+Conference on Machine Learning. 2011. Available as arXiv preprint arXiv:1103.4601</p>
 </div>
 <div class="section" id="auc-area-under-the-curve">
 <h3>AUC: Area Under the Curve<a class="headerlink" href="#auc-area-under-the-curve" title="Permalink to this headline"></a></h3>

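The doubly robust formula documented above adds an IPS-style correction to a predicted reward wherever historic data is available. A matching NumPy sketch, again with made-up values rather than Jurity's implementation:

```
import numpy as np

reward = np.array([1, 0, 1, 0, 1])                # r_a: observed click (0/1)
reward_hat = np.array([0.9, 0.3, 0.6, 0.2, 0.7])  # r_hat_a: model-predicted reward for the recommended item
matches = np.array([1, 1, 0, 1, 0])               # I(a_hat = a): user-item pair appears in the historic data
propensity = np.array([0.5, 0.4, 0.8, 0.3, 0.4])  # P(a|x,h): historic policy's probability of recommending the item

# Where there is no historic match the correction term is zero and only the
# predicted reward is used, as described in the documentation above.
correction = (reward - reward_hat) * matches / propensity
dr = np.mean(reward_hat + correction)
print(f"DR estimate of CTR: {dr:.3f}")
```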