-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] agg_concat
doesn't work on strings
#2847
Conversation
CodSpeed Performance ReportMerging #2847 will not alter performanceComparing Summary
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great initial work @vicky1999! Left some comments regarding tests as well as how you can improve the performance of this.
let mut concat_result = String::new(); | ||
|
||
for idx in 0..self.len() { | ||
if self.get(idx).is_some() { | ||
let x = self.get(idx).unwrap().to_string(); | ||
concat_result.push_str(&x); | ||
} | ||
} | ||
let result_box = Box::new(arrow2::array::Utf8Array::<i64>::from_slice([concat_result])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, simply creating a String and calling push_str
can be inefficient due to potentially repeated heap allocations.
Instead, we only want one heap allocation. The easy way would be to do String::with_capacity
, but, if you take a look at arrow2::array::Utf8Array, you should notice that it stores values as a Buffer<u8>
. Therefore, all you really need to do to is to clone the buffer and modify the offsets to reflect the whole buffer as a single string, e.g. let new_offsets = OffsetsBuffer::<i64>::try_from(vec![0, *self.as_arrow.offsets().last()])?;
Note: make sure to account for cases where the whole array may be null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@colin-ho In case of all nulls, the datatype was not Utf8. In that case, what can we do? can we return an empty string or [None]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can make a UTF8 array with all nulls via casting: df.select(daft.col('data').cast(daft.DataType.String()))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which case, the output of agg_concat
should be [None]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. I will look into it. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@colin-ho I made the changes as per the comments for both concat and grouped concat on string and also with null list. Please let me know if any other changes were needed.
tests/table/test_table_aggs.py
Outdated
def test_agg_concat_on_string_null() -> None: | ||
df3 = from_pydict({"a": ["the", " quick", None, " fox"]}) | ||
res = df3.agg(col("a").agg_concat()).to_pydict() | ||
expected = ["the quick fox"] | ||
for txt in expected: | ||
assert txt in res["a"] | ||
|
||
|
||
def test_agg_concat_on_string_groupby_null() -> None: | ||
df3 = from_pydict({"a": ["the", " quick", None, " fox"], "b": [1, 2, 1, 2]}) | ||
res = df3.groupby("b").agg_concat("a").to_pydict() | ||
expected = ["the", " quick fox"] | ||
for txt in expected: | ||
assert txt in res["a"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add tests for data with all nulls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separate testcases were added to check the data with all nulls.
for group in groups { | ||
let mut group_res = String::new(); | ||
for idx in group { | ||
let ind: usize = idx.to_usize(); | ||
if self.get(ind).is_some() { | ||
let x = self.get(ind).unwrap().to_string(); | ||
group_res.push_str(&x); | ||
} | ||
} | ||
result_data.push(group_res); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above, let's not do push_str
. You can see grouped_sum: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/array/ops/sum.rs#L23 for inspiration on how to do this for sum via trusted_len_iter
. But instead of summing, you would be concating the strings. You can try: https://doc.rust-lang.org/std/primitive.slice.html#method.join
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Thanks for taking this on @vicky1999 ! My last few suggestions were pretty simple, so I just merged them in for you |
No problem. Thanks. |
Solves #2768