Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] agg_concat doesn't work on strings #2847

Merged
merged 8 commits into from
Sep 25, 2024

Conversation

vicky1999
Copy link
Contributor

Solves #2768

@github-actions github-actions bot added the enhancement New feature or request label Sep 16, 2024
Copy link

codspeed-hq bot commented Sep 16, 2024

CodSpeed Performance Report

Merging #2847 will not alter performance

Comparing vicky1999:vicky1999/#2768 (48bfa25) with main (195dd00)

Summary

✅ 17 untouched benchmarks

@jaychia jaychia requested a review from colin-ho September 16, 2024 19:18
Copy link
Contributor

@colin-ho colin-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great initial work @vicky1999! Left some comments regarding tests as well as how you can improve the performance of this.

Comment on lines 156 to 164
let mut concat_result = String::new();

for idx in 0..self.len() {
if self.get(idx).is_some() {
let x = self.get(idx).unwrap().to_string();
concat_result.push_str(&x);
}
}
let result_box = Box::new(arrow2::array::Utf8Array::<i64>::from_slice([concat_result]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, simply creating a String and calling push_str can be inefficient due to potentially repeated heap allocations.

Instead, we only want one heap allocation. The easy way would be to do String::with_capacity, but, if you take a look at arrow2::array::Utf8Array, you should notice that it stores values as a Buffer<u8>. Therefore, all you really need to do to is to clone the buffer and modify the offsets to reflect the whole buffer as a single string, e.g. let new_offsets = OffsetsBuffer::<i64>::try_from(vec![0, *self.as_arrow.offsets().last()])?;

Note: make sure to account for cases where the whole array may be null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@colin-ho In case of all nulls, the datatype was not Utf8. In that case, what can we do? can we return an empty string or [None]?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make a UTF8 array with all nulls via casting: df.select(daft.col('data').cast(daft.DataType.String()))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case, the output of agg_concat should be [None]

Copy link
Contributor Author

@vicky1999 vicky1999 Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. I will look into it. Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@colin-ho I made the changes as per the comments for both concat and grouped concat on string and also with null list. Please let me know if any other changes were needed.

Comment on lines 893 to 906
def test_agg_concat_on_string_null() -> None:
df3 = from_pydict({"a": ["the", " quick", None, " fox"]})
res = df3.agg(col("a").agg_concat()).to_pydict()
expected = ["the quick fox"]
for txt in expected:
assert txt in res["a"]


def test_agg_concat_on_string_groupby_null() -> None:
df3 = from_pydict({"a": ["the", " quick", None, " fox"], "b": [1, 2, 1, 2]})
res = df3.groupby("b").agg_concat("a").to_pydict()
expected = ["the", " quick fox"]
for txt in expected:
assert txt in res["a"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add tests for data with all nulls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate testcases were added to check the data with all nulls.

Comment on lines 171 to 181
for group in groups {
let mut group_res = String::new();
for idx in group {
let ind: usize = idx.to_usize();
if self.get(ind).is_some() {
let x = self.get(ind).unwrap().to_string();
group_res.push_str(&x);
}
}
result_data.push(group_res);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, let's not do push_str. You can see grouped_sum: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/array/ops/sum.rs#L23 for inspiration on how to do this for sum via trusted_len_iter. But instead of summing, you would be concating the strings. You can try: https://doc.rust-lang.org/std/primitive.slice.html#method.join

@vicky1999 vicky1999 marked this pull request as draft September 18, 2024 02:17
@vicky1999 vicky1999 marked this pull request as ready for review September 24, 2024 20:44
Copy link
Contributor

@colin-ho colin-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@colin-ho colin-ho merged commit d57433a into Eventual-Inc:main Sep 25, 2024
36 checks passed
@colin-ho
Copy link
Contributor

Thanks for taking this on @vicky1999 ! My last few suggestions were pretty simple, so I just merged them in for you

@vicky1999
Copy link
Contributor Author

No problem. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants