Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds MAX and COUNT DISTINCT aggregations to mean Fact Metrics #3384

Merged
merged 14 commits into from
Dec 9, 2024

Conversation

lukesonnet
Copy link
Collaborator

@lukesonnet lukesonnet commented Dec 3, 2024

Features and Changes

Adds COUNT DISTINCT and MAX to mean fact metrics.

COUNT DISTINCT uses HLL estimation methods (or ClickHouse's uniq default estimation method) exclusively and is not available for: non-SQL warehouses (Mixpanel), Postgres, Vertica, MySQL, and MS SQL (SQL SERVER).

Does so by adding a new aggregation field to column ref, and default is SUM if undefined, otherwise one of defined aggregations.

TODO:

  • API endpoint updates
  • Documentation updates
  • Testing

Testing

Tested all but Presto/Athena with dummy queries.

  • Test in experiment with
    • Clickhouse
    • BigQuery
    • Snowflake
    • Databricks
    • Redshift
    • Presto
    • Athena (won't do, only Presto)
  • Test in metric analysis with
    • Clickhouse
    • BigQuery
    • Snowflake
    • Databricks
    • Redshift
    • Presto
    • Athena (won't do, only Presto)

Tested Create and Update API endpoints, as well as errors when trying to use the wrong aggregation.

Screenshots

https://www.loom.com/share/dc78491dbeff475da17d0d197d4eb989

Copy link

github-actions bot commented Dec 3, 2024

Deploy preview for docs ready!

✅ Preview
https://docs-c9z1gwbea-growthbook.vercel.app

Built with commit 4213412.
This pull request is being automatically deployed with vercel-action

}
abstract hllAggregate(col: string): string;
abstract hllReaggregate(col: string): string;
abstract hllCardinality(col: string): string;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would default hasCountDistinctHLL to return false and instead of abstract methods, implement default ones here that throw. So basically the same thing you're doing in Mysql, etc.. Then those data sources that don't support this will have no changes.

// TODO test null
return this.hllCardinality(this.hllAggregate(valueColumn));
} else if (columnRef?.aggregation === "max") {
return `MAX(COALESCE(${valueColumn}, 0))`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will technically break if the column has all negative values with at least 1 null, this will force the max value to be 0 instead of the true max value. Could swap it to COALESCE(MAX(${valueColumn}), 0). I don't think this is worth fixing, just calling it out as something interesting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that is interesting. I think the idea that NULL is 0 always made sense until now. I'll think about this some more.

): string {
if (columnRef?.aggregation === "count distinct") {
return this.hllCardinality(col);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably refactor some of the user filter code to use this since it's similar (selects a SUM or COUNT, but then does a CASE WHEN at the end to get the final value). Nothing needed in this PR, just making a note for later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in reality we are measuring the final temp value, which in most cases is just 1<->1 but here it's cardinality of the state and for user filter it's applying the threshold. The reason it didn't work to just duplicate it there is that I need to measure the state value before capping is applied, which isn't an issue for the threshold metrics.

So I either need to carry around the state AND its value, or I need to handle it in logical statements like the above.

@lukesonnet lukesonnet merged commit 7d509fc into main Dec 9, 2024
5 checks passed
@lukesonnet lukesonnet deleted the ls/new-agg branch December 9, 2024 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants