
Statistics #36

Merged: 28 commits merged into mozilla:master on May 7, 2020

Conversation

@scholtzan (Collaborator) commented Apr 22, 2020:

This PR adds support for defining and processing statistics. It is currently still a WIP, but I think it makes sense to check whether this is going in the right direction.

@scholtzan (Collaborator, Author) commented Apr 23, 2020:

This is what the data looks like in BigQuery when running the integration test:
[Screenshot: BigQuery table produced by the integration test]

@scholtzan requested review from tdsmith and emtwo, April 23, 2020 22:59
@tdsmith (Contributor) left a comment:

I don't think I really understand the apply/transform methods; don't we just need one of them?

Pretreatments and statistics do different things, so maybe it makes sense for the methods to be named different things -- pretreatments transform dataframes to dataframes, and statistics transform dataframes to StatisticsResultCollections.
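A minimal sketch of that division of responsibilities (the class and method names here are illustrative assumptions, not necessarily what the PR ends up with):

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List

import pandas as pd


@dataclass
class StatisticResult:
    # One computed value for a (metric, statistic, branch) combination.
    metric: str
    statistic: str
    branch: str
    point: float


@dataclass
class StatisticResultCollection:
    data: List[StatisticResult] = field(default_factory=list)


class PreTreatment(ABC):
    """Transforms a per-client DataFrame into another DataFrame."""

    @abstractmethod
    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        ...


class Statistic(ABC):
    """Transforms a per-client DataFrame into a StatisticResultCollection."""

    @abstractmethod
    def transform(self, df: pd.DataFrame) -> StatisticResultCollection:
        ...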

@emtwo left a comment:

I like the design overall. I guess we still need to figure out more details about the pre-treatments, like what they are and whether they're always the same for a given statistic, as you mentioned.

@scholtzan marked this pull request as ready for review, April 28, 2020 23:55
@tdsmith
Copy link
Contributor

tdsmith commented Apr 30, 2020

It looks like we're hitting python/mypy#4717. We should just # type: ignore the call site, but I suspect we can just remove that code instead for other reasons.
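For reference, the suggested suppression is a one-line comment at the flagged call site; everything else in this sketch is a hypothetical stand-in, not code from the PR:

from typing import Type


class Statistic:  # hypothetical abstract base
    ...


class BootstrapMean(Statistic):  # hypothetical concrete statistic
    ...


def register_statistic(cls: Type[Statistic]) -> None:
    ...


# Silence the mypy false positive (python/mypy#4717) on this line only:
register_statistic(BootstrapMean)  # type: ignore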

@scholtzan (Collaborator, Author) commented Apr 30, 2020:

I made changes to represent statistics in the config for metrics as discussed:

[metrics]
weekly = ["active_ticks"]

[metrics.active_ticks.statistics.bootstrap_quantiles]
quantiles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

[metrics.active_ticks.statistics.statistic_without_arguments]

[metrics.active_ticks.statistics.bootstrap_means]
pre_treatments = ["trim", "log"]

[metrics.my_cool_metric]
select_expr = "1"
data_source = "whatever"
# This "inline table" syntax produces the same output as the explicit table syntax
statistics = {statistic_without_arguments={}, bootstrap_quantiles={}}

I'll add more tests if the current approach looks okay.
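As a rough illustration of how a config shaped like this can be consumed (the traversal below is an assumption made for this thread, not the PR's parser; it uses the third-party toml package):

import toml

config = toml.loads("""
[metrics]
weekly = ["active_ticks"]

[metrics.active_ticks.statistics.bootstrap_means]
pre_treatments = ["trim", "log"]

[metrics.my_cool_metric]
select_expr = "1"
data_source = "whatever"
statistics = {statistic_without_arguments={}, bootstrap_quantiles={}}
""")

for metric, settings in config["metrics"].items():
    if not isinstance(settings, dict):
        continue  # "weekly" is a list of metric names, not a metric table
    for statistic_name, params in settings.get("statistics", {}).items():
        print(metric, statistic_name, params)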

Summary(
    metric=mozanalysis.metrics.desktop.search_count,
    treatment=BootstrapMean(num_samples=1000),
),
@scholtzan (Collaborator, Author) commented:

We could also move these into a .toml config file that lives in the pensieve repository instead of hardcoding them. A separate file might be easier to find and change, and it could also serve as a configuration example.

@tdsmith (Contributor) left a comment:

This is looking good!

I think maybe we should drop "treatment" everywhere in favor of "statistic", just to avoid introducing a new noun. If you don't like "pretreatment" without a "treatment", then maybe they're "filters", but either seems fine to me.

    of the experiment.
    """

    ref_branch_label: str = "control"
@scholtzan (Collaborator, Author) commented:

Is it fair to assume that every statistic has a ref_branch_label?

@tdsmith (Contributor) replied:

Not every, but many statistics will be interested in making a comparison to the control group.
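To illustrate the point, a sketch under the assumption that comparative statistics carry the reference branch as a configurable attribute (none of this is the PR's final code):

from dataclasses import dataclass

import pandas as pd


@dataclass
class BootstrapMean:
    num_samples: int = 1000
    ref_branch_label: str = "control"

    def transform(self, df: pd.DataFrame) -> None:
        # A comparative statistic splits out the reference branch and compares
        # every other branch against it; a purely descriptive statistic would
        # not need ref_branch_label at all.
        reference = df[df["branch"] == self.ref_branch_label]
        for branch in df["branch"].unique():
            if branch == self.ref_branch_label:
                continue
            treatment = df[df["branch"] == branch]
            # ... bootstrap the difference in means between treatment and reference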

if statistic.name() == statistic_name:
    pre_treatments = [pt.resolve for pt in self.pre_treatments]

    if "ref_branch_label" not in params:
@scholtzan (Collaborator, Author) commented:

So unless ref_branch_label is explicitly set in the config, we'll use the control branch from Experimenter; otherwise it falls back to the default "control" branch.
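A small sketch of that precedence (the function and parameter names are assumptions made for this comment, not the PR's code):

from typing import Optional


def resolve_ref_branch_label(params: dict, experimenter_control_slug: Optional[str]) -> str:
    if "ref_branch_label" in params:
        return params["ref_branch_label"]  # explicitly set in the config
    if experimenter_control_slug is not None:
        return experimenter_control_slug   # control branch reported by Experimenter
    return "control"                       # fall back to the default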

"""Return snake-cased name of the statistic."""
return re.sub(r"(?<!^)(?=[A-Z])", "_", cls.__name__).lower()

#@abstractmethod
@scholtzan (Collaborator, Author) commented:

mypy doesn't agree with this...
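Setting the abstractmethod question aside, a quick check (not from the PR) of what the snake-casing regex in name() above produces:

import re


def snake_case(name: str) -> str:
    # Insert "_" before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


assert snake_case("BootstrapMean") == "bootstrap_mean"
assert snake_case("BootstrapQuantiles") == "bootstrap_quantiles"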

@scholtzan requested a review from tdsmith, May 1, 2020 23:11
@tdsmith (Contributor) left a comment:

I'm hype for this.

@scholtzan merged commit 02c16ca into mozilla:master, May 7, 2020