
Statistics #36

Merged: 28 commits merged into mozilla:master on May 7, 2020

Conversation

@scholtzan (Collaborator) commented Apr 22, 2020:

This PR adds support for defining and processing statistics. It is currently still a WIP, but I think it makes sense to check whether this is going in the right direction.

@scholtzan (Collaborator, Author) commented Apr 23, 2020:

This is what the data looks like in BigQuery when running the integration test:
[Screenshot: BigQuery table produced by the integration test]

@scholtzan requested review from tdsmith and emtwo, April 23, 2020 22:59
@tdsmith (Contributor) left a comment:

I don't think I really understand the apply/transform methods; don't we just need one of them?

Pretreatments and statistics do different things, so maybe it makes sense for the methods to be named different things -- pretreatments transform dataframes to dataframes, and statistics transform dataframes to StatisticsResultCollections.
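A minimal sketch of that division of responsibilities (the class and method names here are illustrative assumptions, not necessarily what the PR ends up with):

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List

import pandas as pd


@dataclass
class StatisticResult:
    # One computed value for a (metric, statistic, branch) combination.
    metric: str
    statistic: str
    branch: str
    point: float


@dataclass
class StatisticResultCollection:
    data: List[StatisticResult] = field(default_factory=list)


class PreTreatment(ABC):
    """Transforms a per-client DataFrame into another DataFrame."""

    @abstractmethod
    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        ...


class Statistic(ABC):
    """Transforms a per-client DataFrame into a StatisticResultCollection."""

    @abstractmethod
    def transform(self, df: pd.DataFrame) -> StatisticResultCollection:
        ...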

@emtwo left a comment:

I like the design overall. I guess we still need to figure out more details about the pre-treatments, like what they are and whether they're always the same for a given statistic, as you mentioned.

@scholtzan marked this pull request as ready for review, April 28, 2020 23:55
@tdsmith
Copy link
Contributor

tdsmith commented Apr 30, 2020

It looks like we're hitting python/mypy#4717. We should just # type: ignore the call site, but I suspect we can just remove that code instead for other reasons.
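For reference, the suggested suppression is a one-line comment at the flagged call site; everything else in this sketch is a hypothetical stand-in, not code from the PR:

from typing import Type


class Statistic:  # hypothetical abstract base
    ...


class BootstrapMean(Statistic):  # hypothetical concrete statistic
    ...


def register_statistic(cls: Type[Statistic]) -> None:
    ...


# Silence the mypy false positive (python/mypy#4717) on this line only:
register_statistic(BootstrapMean)  # type: ignore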

@scholtzan (Collaborator, Author) commented Apr 30, 2020:

I made changes to represent statistics in the config for metrics as discussed:

[metrics]
weekly = ["active_ticks"]

[metrics.active_ticks.statistics.bootstrap_quantiles]
quantiles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

[metrics.active_ticks.statistics.statistic_without_arguments]

[metrics.active_ticks.statistics.bootstrap_means]
pre_treatments = ["trim", "log"]

[metrics.my_cool_metric]
select_expr = "1"
data_source = "whatever"
# This "inline table" syntax produces the same output as the explicit table syntax
statistics = {statistic_without_arguments={}, bootstrap_quantiles={}}

I'll add more tests if the current approach looks okay.
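As a rough illustration of how a config shaped like this can be consumed (the traversal below is an assumption made for this thread, not the PR's parser; it uses the third-party toml package):

import toml

config = toml.loads("""
[metrics]
weekly = ["active_ticks"]

[metrics.active_ticks.statistics.bootstrap_means]
pre_treatments = ["trim", "log"]

[metrics.my_cool_metric]
select_expr = "1"
data_source = "whatever"
statistics = {statistic_without_arguments={}, bootstrap_quantiles={}}
""")

for metric, settings in config["metrics"].items():
    if not isinstance(settings, dict):
        continue  # "weekly" is a list of metric names, not a metric table
    for statistic_name, params in settings.get("statistics", {}).items():
        print(metric, statistic_name, params)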

Summary(
    metric=mozanalysis.metrics.desktop.search_count,
    treatment=BootstrapMean(num_samples=1000),
),
@scholtzan (Collaborator, Author) commented:

We could also move these into a .toml config file that lives in the pensieve repository instead of hardcoding them. A separate file might be easier to find and change, and it could also serve as a configuration example.

@tdsmith (Contributor) left a comment:

This is looking good!

I think maybe we should drop "treatment" everywhere in favor of "statistic", just to avoid introducing a new noun. If you don't like "pretreatment" without a "treatment", then maybe they're "filters", but either seems fine to me.

    of the experiment.
    """

    ref_branch_label: str = "control"
@scholtzan (Collaborator, Author) commented:

Is it fair to assume that every statistic has a ref_branch_label?

@tdsmith (Contributor) replied:

Not every, but many statistics will be interested in making a comparison to the control group.
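To illustrate the point, a sketch under the assumption that comparative statistics carry the reference branch as a configurable attribute (none of this is the PR's final code):

from dataclasses import dataclass

import pandas as pd


@dataclass
class BootstrapMean:
    num_samples: int = 1000
    ref_branch_label: str = "control"

    def transform(self, df: pd.DataFrame) -> None:
        # A comparative statistic splits out the reference branch and compares
        # every other branch against it; a purely descriptive statistic would
        # not need ref_branch_label at all.
        reference = df[df["branch"] == self.ref_branch_label]
        for branch in df["branch"].unique():
            if branch == self.ref_branch_label:
                continue
            treatment = df[df["branch"] == branch]
            # ... bootstrap the difference in means between treatment and reference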

if statistic.name() == statistic_name:
    pre_treatments = [pt.resolve for pt in self.pre_treatments]

    if "ref_branch_label" not in params:
@scholtzan (Collaborator, Author) commented:

So unless ref_branch_label is explicitly set in the config, we'll use the control branch from Experimenter; otherwise it falls back to the default "control" branch.
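A small sketch of that precedence (the function and parameter names are assumptions made for this comment, not the PR's code):

from typing import Optional


def resolve_ref_branch_label(params: dict, experimenter_control_slug: Optional[str]) -> str:
    if "ref_branch_label" in params:
        return params["ref_branch_label"]  # explicitly set in the config
    if experimenter_control_slug is not None:
        return experimenter_control_slug   # control branch reported by Experimenter
    return "control"                       # fall back to the default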

"""Return snake-cased name of the statistic."""
return re.sub(r"(?<!^)(?=[A-Z])", "_", cls.__name__).lower()

#@abstractmethod
@scholtzan (Collaborator, Author) commented:

mypy doesn't agree with this...
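Setting the abstractmethod question aside, a quick check (not from the PR) of what the snake-casing regex in name() above produces:

import re


def snake_case(name: str) -> str:
    # Insert "_" before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


assert snake_case("BootstrapMean") == "bootstrap_mean"
assert snake_case("BootstrapQuantiles") == "bootstrap_quantiles"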

@scholtzan requested a review from tdsmith, May 1, 2020 23:11
@tdsmith (Contributor) left a comment:

I'm hype for this.

@scholtzan merged commit 02c16ca into mozilla:master, May 7, 2020