
BUG: Series.combine() fails with ExtensionArray inside of Series #21183

Merged (18 commits) on Jun 8, 2018
Changes from 7 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -174,6 +174,7 @@ Reshaping
Other
^^^^^

- :meth:`Series.combine()` no longer fails with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`)
Contributor: be positive. say it works!

Contributor Author: fixed

-
-
-
52 changes: 43 additions & 9 deletions pandas/core/series.py
@@ -2185,18 +2185,33 @@ def _binop(self, other, func, level=None, fill_value=None):

this_vals, other_vals = ops.fill_binop(this.values, other.values,
fill_value)

with np.errstate(all='ignore'):
result = func(this_vals, other_vals)
name = ops.get_op_result_name(self, other)

if (is_extension_array_dtype(this_vals) or
Contributor: Stepping back a bit here, why does ExtensionArray get a fallback to [func(a, b) for a, b in zip(...)] when ndarray doesn't? This confuses the expected signature for func: does it take two arrays a and b, or two scalars?
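The ambiguity being described can be sketched outside pandas. This is a standalone illustration of the try-the-arrays-then-zip fallback pattern under discussion; `combine_values` is a hypothetical helper name, not pandas' actual code:

```python
# Sketch of the fallback pattern under review: first call func on whole
# containers; if that raises TypeError, retry pair-by-pair. This is exactly
# why func's expected signature becomes ambiguous: it may receive two
# sequences or two scalars depending on which path runs.
def combine_values(func, left, right):
    try:
        return func(left, right)  # func applied to the containers
    except TypeError:
        # element-wise fallback: func applied to one scalar pair at a time
        return [func(a, b) for a, b in zip(left, right)]
```

With plain lists, `lambda a, b: a - b` raises TypeError on the containers and so runs element-wise, while `lambda a, b: a + b` concatenates the lists on the first path and never reaches the fallback, which is the kind of surprise the reviewer is pointing at.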

Member: Indeed. And I also don't really understand how this is related to combine, as _binop does not seem to be used there. I think _binop is for implementing the flex arithmetic ops like .add() etc. For that we should do this together with the actual ops change, and we would probably end up with the same conclusion (no explicit element-wise fallback; rely on the ExtensionArray implementation to do this).

Contributor: Agree here. I am puzzled why this change is needed at all. If the extension array supports ops, this would work as is; if not, it would raise a TypeError. So I'm not sure what you are trying to do here.

Contributor Author: My mistake. When I separated this PR out from #20889, I should have put back the old _binop. Fixed in my next commit.

is_extension_array_dtype(other_vals)):
try:
result = func(this_vals, other_vals)
except TypeError:
result = NotImplemented

if result is NotImplemented:
result = [func(a, b) for a, b in zip(this_vals, other_vals)]
if is_extension_array_dtype(this_vals):
excons = type(this_vals)._from_sequence
else:
excons = type(other_vals)._from_sequence
result = excons(result)
else:
with np.errstate(all='ignore'):
result = func(this_vals, other_vals)
result = self._constructor(result, index=new_index, name=name)
result = result.__finalize__(self)
if name is None:
# When name is None, __finalize__ overwrites current name
result.name = None
return result

def combine(self, other, func, fill_value=np.nan):
def combine(self, other, func, fill_value=None):
"""
Perform elementwise binary operation on two Series using given function
with optional fill value when an index is missing from one Series or
@@ -2208,6 +2223,9 @@ def combine(self, other, func, fill_value=np.nan):
func : function
Function that takes two scalars as inputs and return a scalar
fill_value : scalar value
The default specifies to use np.nan unless self is
backed by ExtensionArray, in which case the ExtensionArray
na_value is used.

Returns
-------
@@ -2227,20 +2245,36 @@ def combine(self, other, func, fill_value=np.nan):
Series.combine_first : Combine Series values, choosing the calling
Series's values first
"""
self_is_ext = is_extension_array_dtype(self.values)
if fill_value is None:
if self_is_ext:
Contributor: I think na_value_for_dtype is appropriate here, probably with compat=False.

Contributor Author: I made that change.

fill_value = self.dtype.na_value
else:
fill_value = np.nan
if isinstance(other, Series):
new_index = self.index.union(other.index)
new_name = ops.get_op_result_name(self, other)
new_values = np.empty(len(new_index), dtype=self.dtype)
for i, idx in enumerate(new_index):
new_values = []
for idx in new_index:
lv = self.get(idx, fill_value)
rv = other.get(idx, fill_value)
with np.errstate(all='ignore'):
new_values[i] = func(lv, rv)
new_values.append(func(lv, rv))
else:
new_index = self.index
with np.errstate(all='ignore'):
new_values = func(self._values, other)
if not self_is_ext:
with np.errstate(all='ignore'):
new_values = func(self._values, other)
Member: I also don't really understand why we don't do it element-wise here (no loop over the values, as there is when other is a Series). This relates to the current implementation rather than your changes, but to me it looks like a bug in the current implementation.

Contributor Author: @jorisvandenbossche You're correct. I created a new issue #21248 and will fix that here.

else:
new_values = [func(lv, other) for lv in self._values]
new_name = self.name

Contributor: can you put a comment on what is going on here

Contributor Author: done

if self_is_ext and not is_categorical_dtype(self.values):
Contributor: I am puzzled by all of these element-by-element operations. Is there a reason you don't simply define .combine on an EA array?

Member: @jreback combine is by definition an element-by-element operation; we do exactly the same for other dtypes, see 10 lines above. This check only tries to convert the scalars back to an ExtensionArray.

Member: The not is_categorical_dtype(self.values) check signals again that we need a way to pass more information on the dtype to _from_sequence. xref #20747, but not for this PR!

Contributor Author: sounds good to me

Contributor: please don't use self_is_ext; use is_extension_type and write this out so it's readable:

if is_categorical_dtype(...):
    pass
elif is_extension_type(...):
    ...

Also, why is the try/except here? That is very odd.

Contributor Author: I changed the if test as you suggested. As for the try/except: the idea is to first try to coerce to the same type as the extension dtype, and if that doesn't work, just fall back to the regular Series constructor.
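That coerce-then-fall-back idea can be sketched in isolation. Both names below are made up for illustration; `int_only_sequence` is a toy stand-in for a strict `_from_sequence`, not part of pandas:

```python
# Try to rebuild an extension-array-like container from element-wise
# results; if the results don't fit the extension dtype, a TypeError is
# raised and the raw values are kept for the regular constructor.
def coerce_to_extension(results, from_sequence):
    try:
        return from_sequence(results)
    except TypeError:
        return results  # caller falls back to the plain Series constructor

def int_only_sequence(values):
    # toy stand-in for a strict ExtensionArray._from_sequence
    if not all(isinstance(v, int) for v in values):
        raise TypeError("not all ints")
    return tuple(values)
```

Here `coerce_to_extension([1, 2], int_only_sequence)` keeps the extension-like container, while a float among the results trips the TypeError and hands back the raw list.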

try:
Contributor: So what kind of op hits the TypeError here? I find it a bit disconcerting that you need to catch a TypeError. Is this the case where the combine op returns a result which is not an extension type (e.g. an int or something)? If so, please indicate via a comment.

Contributor Author: @jreback Here is an example:

In [1]: import pandas as pd

In [2]: from pandas.tests.extension.decimal.array import DecimalArray, make_data

In [3]: s = pd.Series(DecimalArray(make_data())).head()

In [4]: s.combine(0.5, lambda x, y: x < y)
Out[4]:
0    False
1    False
2    False
3    False
4    False
dtype: bool

Since the function passed to combine is arbitrary, it can return a result of any type, which may not be a type that fits in the EA. I'll add a comment.

new_values = self._values._from_sequence(new_values)
except TypeError:
pass

return self._constructor(new_values, index=new_index, name=new_name)

def combine_first(self, other):
14 changes: 14 additions & 0 deletions pandas/tests/extension/category/test_categorical.py
@@ -2,6 +2,9 @@

import pytest
import numpy as np
import pandas as pd

import pandas.util.testing as tm

from pandas.api.types import CategoricalDtype
from pandas import Categorical
@@ -154,6 +157,17 @@ class TestMethods(base.BaseMethodsTests):
def test_value_counts(self, all_data, dropna):
pass

def test_combine(self):
# GH 20825
orig_data1 = make_data()
orig_data2 = make_data()
s1 = pd.Series(Categorical(orig_data1, ordered=True))
s2 = pd.Series(Categorical(orig_data2, ordered=True))
result = s1.combine(s2, lambda x1, x2: x1 <= x2)
expected = pd.Series([a <= b for (a, b) in
zip(orig_data1, orig_data2)])
tm.assert_series_equal(result, expected)


class TestCasting(base.BaseCastingTests):
pass
4 changes: 3 additions & 1 deletion pandas/tests/extension/decimal/array.py
@@ -28,7 +28,9 @@ class DecimalArray(ExtensionArray):
dtype = DecimalDtype()

def __init__(self, values):
assert all(isinstance(v, decimal.Decimal) for v in values)
for val in values:
if not isinstance(val, self.dtype.type):
raise TypeError
values = np.asarray(values, dtype=object)

self._data = values
16 changes: 16 additions & 0 deletions pandas/tests/extension/decimal/test_decimal.py
@@ -138,6 +138,22 @@ def test_value_counts(self, all_data, dropna):

tm.assert_series_equal(result, expected)

def test_combine(self):
Member: Can you put this in the base class (and then, if needed, skip it for the JSON tests)? And make use of the fixtures, passing all_data as an argument to the test function instead of calling DecimalArray(make_data()) inside it?

Contributor Author: That took a bit of work, since I need two vectors that are different. Found a discussion at pytest-dev/pytest#2703 about how to do it.

Member: that's clever .. :-)

# GH 20825
orig_data1 = make_data()
orig_data2 = make_data()
s1 = pd.Series(DecimalArray(orig_data1))
s2 = pd.Series(DecimalArray(orig_data2))
result = s1.combine(s2, lambda x1, x2: x1 <= x2)
expected = pd.Series([a <= b for (a, b) in
zip(orig_data1, orig_data2)])
tm.assert_series_equal(result, expected)

result = s1.combine(s2, lambda x1, x2: x1 + x2)
expected = pd.Series(DecimalArray([a + b for (a, b) in
zip(orig_data1, orig_data2)]))
tm.assert_series_equal(result, expected)


class TestCasting(BaseDecimal, base.BaseCastingTests):
pass