Skip to content

Commit 3a4cc94

Browse files
authored
Add deduplication macro (#512)
* Update README.md * Mutually excl range examples in disclosure triangle * Fix union_relations error when no include/exclude provided * Fix union_relations error when no include/exclude provided (#509) * Update CHANGELOG.md * Add dedupe macro * Add test for dedupe macro * Add documentation to README * Add entry to CHANGELOG * Implement review
1 parent 3e814b2 commit 3a4cc94

File tree

7 files changed

+86
-4
lines changed

7 files changed

+86
-4
lines changed

CHANGELOG.md

+5-1
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,7 @@
1+
# dbt-utils v0.8.3
2+
## New features
3+
- A macro for deduplicating data ([#335](https://github.com/dbt-labs/dbt-utils/issues/335), [#512](https://github.com/dbt-labs/dbt-utils/pull/512))
4+
15
# dbt-utils v0.8.2
26
## Fixes
37
- Fix union_relations error from [#473](https://github.com/dbt-labs/dbt-utils/pull/473) when no include/exclude parameters are provided ([#505](https://github.com/dbt-labs/dbt-utils/issues/505), [#509](https://github.com/dbt-labs/dbt-utils/pull/509))
@@ -32,7 +36,7 @@
3236

3337
# dbt-utils v0.8.0
3438
## 🚨 Breaking changes
35-
- dbt ONE POINT OH is here! This version of dbt-utils requires _any_ version (minor and patch) of v1, which means far less need for compatibility releases in the future.
39+
- dbt ONE POINT OH is here! This version of dbt-utils requires _any_ version (minor and patch) of v1, which means far less need for compatibility releases in the future.
3640
- The partition column in the `mutually_exclusive_ranges` test is now always called `partition_by_col`. This enables compatibility with `--store-failures` when multiple columns are concatenated together. If you have models built on top of the failures table, update them to reflect the new column name. ([#423](https://github.com/dbt-labs/dbt-utils/issues/423), [#430](https://github.com/dbt-labs/dbt-utils/pull/430))
3741

3842
## Contributors:

README.md

+16-1
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,7 @@ For compatibility details between versions of dbt-core and dbt-utils, [see this
3636

3737
- [SQL generators](#sql-generators)
3838
- [date_spine](#date_spine-source)
39+
- [dedupe](#dedupe-source)
3940
- [haversine_distance](#haversine_distance-source)
4041
- [group_by](#group_by-source)
4142
- [star](#star-source)
@@ -706,6 +707,20 @@ This macro returns the sql required to build a date spine. The spine will includ
706707
}}
707708
```
708709

710+
#### deduplicate ([source](macros/sql/deduplicate.sql))
711+
This macro returns the sql required to remove duplicate rows from a model or source.
712+
713+
**Usage:**
714+
715+
```
716+
{{ dbt_utils.deduplicate(
717+
relation=source('my_source', 'my_table'),
718+
group_by="user_id, cast(timestamp as day)",
719+
order_by="timestamp desc"
720+
)
721+
}}
722+
```
723+
709724
#### haversine_distance ([source](macros/sql/haversine_distance.sql))
710725
This macro calculates the [haversine distance](http://daynebatten.com/2015/09/latitude-longitude-distance-sql/) between a pair of x/y coordinates.
711726

@@ -748,7 +763,7 @@ group by 1,2,3
748763
```
749764

750765
#### star ([source](macros/sql/star.sql))
751-
This macro generates a comma-separated list of all fields that exist in the `from` relation, excluding any fields listed in the `except` argument. The construction is identical to `select * from {{ref('my_model')}}`, replacing star (`*`) with the star macro. This macro also has an optional `relation_alias` argument that will prefix all generated fields with an alias (`relation_alias`.`field_name`).
766+
This macro generates a comma-separated list of all fields that exist in the `from` relation, excluding any fields listed in the `except` argument. The construction is identical to `select * from {{ref('my_model')}}`, replacing star (`*`) with the star macro. This macro also has an optional `relation_alias` argument that will prefix all generated fields with an alias (`relation_alias`.`field_name`).
752767

753768
The macro also has optional `prefix` and `suffix` arguments. When one or both are provided, they will be concatenated onto each field's alias in the output (`prefix` ~ `field_name` ~ `suffix`). NB: This prevents the output from being used in any context other than a select statement.
754769

Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
user_id,event,version
2+
1,play,1
3+
1,play,2
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
user_id,event,version
2+
1,play,2

integration_tests/models/sql/schema.yml

+7-2
Original file line numberDiff line numberDiff line change
@@ -85,7 +85,7 @@ models:
8585
tests:
8686
- dbt_utils.equality:
8787
compare_model: ref('data_pivot_expected')
88-
88+
8989
- name: test_pivot_apostrophe
9090
tests:
9191
- dbt_utils.equality:
@@ -137,8 +137,13 @@ models:
137137
tests:
138138
- dbt_utils.equality:
139139
compare_model: ref('data_union_expected')
140-
140+
141141
- name: test_get_relations_by_pattern
142142
tests:
143143
- dbt_utils.equality:
144144
compare_model: ref('data_union_events_expected')
145+
146+
- name: test_dedupe
147+
tests:
148+
- dbt_utils.equality:
149+
compare_model: ref('data_deduplicate_expected')
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
with deduped as (
2+
3+
{{ dbt_utils.deduplicate(ref('data_deduplicate'), group_by='user_id', order_by='version desc') | indent }}
4+
5+
)
6+
7+
select * from deduped

macros/sql/deduplicate.sql

+46
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
{%- macro deduplicate(relation, group_by, order_by=none) -%}
2+
{{ return(adapter.dispatch('deduplicate', 'dbt_utils')(relation, group_by, order_by=order_by)) }}
3+
{% endmacro %}
4+
5+
{%- macro default__deduplicate(relation, group_by, order_by=none) -%}
6+
7+
select
8+
{{ dbt_utils.star(relation, relation_alias='deduped') | indent }}
9+
from (
10+
select
11+
_inner.*,
12+
row_number() over (
13+
partition by {{ group_by }}
14+
{% if order_by is not none -%}
15+
order by {{ order_by }}
16+
{%- endif %}
17+
) as rn
18+
from {{ relation }} as _inner
19+
) as deduped
20+
where deduped.rn = 1
21+
22+
{%- endmacro -%}
23+
24+
{#
25+
-- It is more performant to deduplicate using `array_agg` with a limit
26+
-- clause in BigQuery:
27+
-- https://github.com/dbt-labs/dbt-utils/issues/335#issuecomment-788157572
28+
#}
29+
{%- macro bigquery__deduplicate(relation, group_by, order_by=none) -%}
30+
31+
select
32+
{{ dbt_utils.star(relation, relation_alias='deduped') | indent }}
33+
from (
34+
select
35+
array_agg (
36+
original
37+
{% if order_by is not none -%}
38+
order by {{ order_by }}
39+
{%- endif %}
40+
limit 1
41+
)[offset(0)] as deduped
42+
from {{ relation }} as original
43+
group by {{ group_by }}
44+
)
45+
46+
{%- endmacro -%}

0 commit comments

Comments
 (0)