-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Series.combine() fails with ExtensionArray inside of Series #21183
Conversation
doc/source/whatsnew/v0.23.1.txt
Outdated
@@ -47,6 +47,7 @@ Bug Fixes | |||
~~~~~~~~~ | |||
|
|||
- tab completion on :class:`Index` in IPython no longer outputs deprecation warnings (:issue:`21125`) | |||
- Series.combine() no longer fails with ExtensionArray inside of Series (:issue:`20825`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Series.combine() --> :meth:`Series.combine`
- ExtensionArray --> :class:`~pandas.api.extensions.ExtensionArray`
- Series --> :class:`Series`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
pandas/core/series.py
Outdated
except TypeError: | ||
result = NotImplemented | ||
except Exception as e: | ||
raise e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this last except
block is redundant, since non-TypeError
exceptions will be raised regardless of if it that block is present or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
pandas/core/series.py
Outdated
if isinstance(other, Series): | ||
new_index = self.index.union(other.index) | ||
new_name = ops.get_op_result_name(self, other) | ||
new_values = np.empty(len(new_index), dtype=self.dtype) | ||
new_values = [] | ||
for i, idx in enumerate(new_index): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the enumerate
is no longer necessary since i
isn't used, so you should be able to directly iterate over new_index
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -157,3 +160,13 @@ def test_value_counts(self, all_data, dropna): | |||
|
|||
class TestCasting(base.BaseCastingTests): | |||
pass | |||
|
|||
|
|||
def test_combine(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be moved to be within the TestMethods
class? And can you reference the GitHub issue number in a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Codecov Report
@@ Coverage Diff @@
## master #21183 +/- ##
==========================================
+ Coverage 91.85% 91.85% +<.01%
==========================================
Files 153 153
Lines 49549 49573 +24
==========================================
+ Hits 45512 45537 +25
+ Misses 4037 4036 -1
Continue to review full report at Codecov.
|
pandas/core/series.py
Outdated
new_name = self.name | ||
|
||
if (self_is_ext and self.values.is_sequence_of_dtype(new_values)): | ||
new_values = self._values._from_sequence(new_values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what conditions is self.values.is_sequence_of_dtype(new_values)
false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
combine
takes a func
as an argument. The func
may not necessarily return objects of the ExtensionDtype
. For example, consider a logical operator. So here self.values
is the original ExtensionArray
, but new_values
is the result of applying the func
element by element, and those individual results aren't necessarily of the corresponding ExtensionDtype
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
How important is it to allow coercion of output type? The previous code certainly considered dtype-preserving functions to be the expected case, since the pre-allocated new_dtype
had the same dtype as self. Only if necessary was it coerced.
Without having studied the uses of combine
, can we say "extension arrays are not allowed to upcast in combine?". i.e. Series[ExtensionArray].combine(Series[Any]) -> Series[ExtensionArray]
. That doesn't sound quiet right...
Anyway, my aversion is to having to perform a full scan of the data just to determine the dtype. That's what types are for :) Can we ask the user to provide an output_dtype
argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider the following (using the implementation in this PR):
In [1]: import pandas as pd
...: from pandas.tests.extension.decimal.array import DecimalArray, make_dat
...: a
...:
In [2]: da1= make_data()
...: da2= make_data()
...: da1[:5], da2[:5]
...:
Out[2]:
([Decimal('0.9100374922132099531069115982973016798496246337890625'),
Decimal('0.559003605365540945371094494475983083248138427734375'),
Decimal('0.31951366993722529752375294265220873057842254638671875'),
Decimal('0.238154945978455767630066475248895585536956787109375'),
Decimal('0.5851317119327676952167394119896925985813140869140625')],
[Decimal('0.12525543165059660477567149428068660199642181396484375'),
Decimal('0.68213474143447905273518472313298843801021575927734375'),
Decimal('0.29982507800002611286771525556105189025402069091796875'),
Decimal('0.297029189226840184545608281041495501995086669921875'),
Decimal('0.5969224093736389402664599401759915053844451904296875')])
In [3]: s1 = pd.Series(DecimalArray(da1))
...: s2 = pd.Series(DecimalArray(da2))
...: pd.DataFrame({'s1':s1, 's2':s2}).head()
...:
Out[3]:
s1
s2
0 0.91003749221320995310691159829730167984962463... 0.12525543165059660477567149428068660199642181...
1 0.55900360536554094537109449447598308324813842... 0.68213474143447905273518472313298843801021575...
2 0.31951366993722529752375294265220873057842254... 0.29982507800002611286771525556105189025402069...
3 0.23815494597845576763006647524889558553695678... 0.29702918922684018454560828104149550199508666...
4 0.58513171193276769521673941198969259858131408... 0.59692240937363894026645994017599150538444519...
In [4]: s1.dtype, s2.dtype
Out[4]:
(<pandas.tests.extension.decimal.array.DecimalDtype at 0x15bc5ef1048>,
<pandas.tests.extension.decimal.array.DecimalDtype at 0x15bc5ef1048>)
In [5]: cores = s1.combine(s2, lambda x1, x2: x1 if x1 < x2 else x2)
...: cores.head()
...:
Out[5]:
0 0.12525543165059660477567149428068660199642181...
1 0.55900360536554094537109449447598308324813842...
2 0.29982507800002611286771525556105189025402069...
3 0.23815494597845576763006647524889558553695678...
4 0.58513171193276769521673941198969259858131408...
dtype: decimal
In [6]: clres = s1.combine(s2, lambda x1, x2: (x1 < x2))
...: clres.head()
...:
Out[6]:
0 False
1 True
2 False
3 True
4 True
dtype: bool
Note that with the implementation as in this PR, we get a Series of dtype decimal
since the results are all Decimal
objects, and we get a boolean
Series when the results are logical.
The previous behavior would product object
, which is not what you want. Given a sequence s
, if you do pd.Series(s)
, the internals of pandas look at the sequence s
to determine the dtype
. But those internals won't pick things up correctly for things that are ExtensionDtype
. So somehow, we need to know to call ExtensionArray._from_sequence()
to get the result of combine
into the right type.
The implementation is already doing this element-by-element, so we are doing a full scan of both the left and right arrays. This is an extra scan on the result.
We can add an output_dtype
argument to combine
, but then we'd be changing the API of combine
and would need to decide what the default value of that parameter is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not do this here at all, prefering instead to dispatch to the EA itself for a .combine()
method, this will be much simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of introducing the is_sequence_of_dtype
check, an alternative might be to try: .. except
_from_sequence()
?
I would prefer not to add combine
logic to the extension array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche My most recent commit has this change you suggested and removes is_sequence_of_dtype
. It required an extra check to see if it is categorical. I also added a test which revealed a couple of other bugs.
In terms of whether combine
should be in the EA class or in series.py
, you guys will have to decide on that.
pandas/tests/test_window.py
Outdated
@@ -41,7 +41,7 @@ def win_types(request): | |||
return request.param | |||
|
|||
|
|||
@pytest.fixture(params=['kaiser', 'gaussian', 'general_gaussian', 'slepian']) | |||
@pytest.fixture(params=['kaiser', 'gaussian', 'general_gaussian']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge conflict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, something went wrong with git. I didn't touch that code, but did a rebase, and I think it pulled that in.
pandas/util/testing.py
Outdated
@@ -1118,10 +1118,12 @@ def assert_extension_array_equal(left, right): | |||
right_na = right.isna() | |||
assert_numpy_array_equal(left_na, right_na) | |||
|
|||
left_valid = left[~left_na].astype(object) | |||
right_valid = right[~right_na].astype(object) | |||
if len(left_na) > 0 and len(right_na) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the changes here? left_na
is an array of booleans, so it'll always be the same length as left
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If left
has length zero, then the statement that follows, i.e.
left_valid = left[~left_na].astype(object)
fails. There was some test case that was failing without this change (but I don't remember which one, and it may have been in the ops stuff). I will leave this change out and see if this fix for combine
still works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this works
In [9]: pd.Series()[np.array([], dtype='bool')].astype(object)
Out[9]: Series([], dtype: object)
then we should ensure that Series[extension_array] works too. If you find a reproducible example, could you make a new issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took this code out of this PR. It's not needed here. When we get to the ops stuff, I'll have to reintroduce it, and figure out the boundary case that required this particular case.
pandas/util/testing.py
Outdated
@@ -1224,6 +1226,9 @@ def assert_series_equal(left, right, check_dtype=True, | |||
left = pd.IntervalIndex(left) | |||
right = pd.IntervalIndex(right) | |||
assert_index_equal(left, right, obj='{obj}.index'.format(obj=obj)) | |||
elif (is_extension_array_dtype(left) and not is_categorical_dtype(left) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the not is_categorical_dtype
needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was left over from the ops stuff, so I have removed it.
Hello @Dr-Irv! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on June 07, 2018 at 15:45 Hours UTC |
doc/source/whatsnew/v0.23.1.txt
Outdated
@@ -47,6 +47,7 @@ Bug Fixes | |||
~~~~~~~~~ | |||
|
|||
- tab completion on :class:`Index` in IPython no longer outputs deprecation warnings (:issue:`21125`) | |||
- :meth:`Series.combine()` no longer fails with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to 0.24
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved it to 0.24
pandas/core/series.py
Outdated
new_name = self.name | ||
|
||
if (self_is_ext and self.values.is_sequence_of_dtype(new_values)): | ||
new_values = self._values._from_sequence(new_values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not do this here at all, prefering instead to dispatch to the EA itself for a .combine()
method, this will be much simpler.
@jreback You write:
Are you suggesting that there is a default implementation of |
no I would suggest the default impl is NotImplemented, this is getting too messy here with all of these if/thens |
pandas/core/series.py
Outdated
name = ops.get_op_result_name(self, other) | ||
|
||
if (is_extension_array_dtype(this_vals) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stepping back a bit here, why does ExtensionArray get a fallback to [func(a, b) for a, b in zip(...)]
, when ndarray doesn't? This confuses the expected signature for func
: does it that two arrays a
and b
, or two scalars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed.
And I also don't really understand how this is related to combine
, as _binop
does not seem to be used there. I think _binop
is for implementing the flex arithmetic ops like .add()
etc. But for that I think we should do this together with the actual ops change, and probably end up with the same conclusion (no explicit element-wise fallback but rely on the ExtensionArray implementation to do this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree here. I am puzzled why this change at all. If the extension array supports ops this would work as is, if not it would raise a TypeError. So not sure what you are trying to do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake. When I separated out this PR from #20889, I should have put back the old _binop
. Fixed in my next commit.
pandas/core/series.py
Outdated
@@ -2227,20 +2245,36 @@ def combine(self, other, func, fill_value=np.nan): | |||
Series.combine_first : Combine Series values, choosing the calling | |||
Series's values first | |||
""" | |||
self_is_ext = is_extension_array_dtype(self.values) | |||
if fill_value is None: | |||
if self_is_ext: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thikn na_value_for_dtype
is appropriate here, probably with compat=False
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made that change.
pandas/core/series.py
Outdated
new_values = func(self._values, other) | ||
if not self_is_ext: | ||
with np.errstate(all='ignore'): | ||
new_values = func(self._values, other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't really understand (but this is related with the current implementation, not your changes) why we don't do it element-wise here (no loop over the values as is the case if other
is Series).
For me, this seems like a bug in the current implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche You're correct. I created a new issue #21248 . I will fix that here.
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -174,6 +174,7 @@ Reshaping | |||
Other | |||
^^^^^ | |||
|
|||
- :meth:`Series.combine()` no longer fails with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
be positive. say it works!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pandas/core/series.py
Outdated
name = ops.get_op_result_name(self, other) | ||
|
||
if (is_extension_array_dtype(this_vals) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree here. I am puzzled why this change at all. If the extension array supports ops this would work as is, if not it would raise a TypeError. So not sure what you are trying to do here.
pandas/core/series.py
Outdated
new_name = self.name | ||
|
||
if self_is_ext and not is_categorical_dtype(self.values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am puzzled by all of these element-by-element operations. Is there a reason you don't simply define .combine
on an EA array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback combine
is by definition an element-by-element operation. We do exactly the same for other dtypes, see 10 lines above.
This check here only tries to convert the scalars back to an ExtensionArray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks much cleaner!
pandas/core/series.py
Outdated
new_name = self.name | ||
|
||
if self_is_ext and not is_categorical_dtype(self.values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The not is_categorical_dtype(self.values)
signals again that we need a way to pass more information on the dtype to _from_sequence
. xref #20747, but not for this PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls don't use self_is_ext
simply use is_extension_type
and write this out so its readable
if is_categorical_dtype(...):
pass
elif is_extension_type(...):
....
why is the try/except here? that is very odd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the if test as you suggested.
@jreback wrote:
why is the try/except here? that is very odd
The idea is to first try to coerce to the same type as the extension dtype, and if that doesn't work, just call the regular constructor for the Series.
@@ -138,6 +138,22 @@ def test_value_counts(self, all_data, dropna): | |||
|
|||
tm.assert_series_equal(result, expected) | |||
|
|||
def test_combine(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put this in the base class? (and then if needed skip it for the json tests)
And make use of the fixtures? (so passing all_data
as an argument to the test function, instead of calling DecimalArray(make_data())
in the test function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That took a bit of work to do, since I need two vectors that are different. Found discussion at pytest-dev/pytest#2703 about how to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's clever .. :-)
pandas/tests/extension/conftest.py
Outdated
"""Return different versions of data for count times""" | ||
def gen(count): | ||
for _ in range(count): | ||
yield NotImplemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NotImplemented -> NotImplementedError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
pandas/core/series.py
Outdated
@@ -2227,20 +2229,31 @@ def combine(self, other, func, fill_value=np.nan): | |||
Series.combine_first : Combine Series values, choosing the calling | |||
Series's values first | |||
""" | |||
self_is_ext = is_extension_array_dtype(self.values) | |||
if fill_value is None: | |||
fill_value = na_value_for_dtype(self.dtype, False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the keyword for False so it’s clearer what’s being specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that is done.
Lgtm otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly minor comments. looks pretty good. this is a nice target changed.
pandas/core/series.py
Outdated
new_name = self.name | ||
|
||
if self_is_ext and not is_categorical_dtype(self.values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls don't use self_is_ext
simply use is_extension_type
and write this out so its readable
if is_categorical_dtype(...):
pass
elif is_extension_type(...):
....
why is the try/except here? that is very odd
@@ -154,6 +163,10 @@ class TestMethods(base.BaseMethodsTests): | |||
def test_value_counts(self, all_data, dropna): | |||
pass | |||
|
|||
@pytest.mark.skip(reason="combine add for categorical not supported") | |||
def test_combine_add(self, data_repeated): | |||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it raise? if so pls tests that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting. Since combine
by definition in the docs does an element by element operation, if you do something like:
In [1]: import pandas as pd
In [2]: cat1 = pd.Categorical(values=["one","two","three","three","two","one"],
...: categories=["one","two","three"], ordered=True)
In [3]: s1 = pd.Series(cat1)
In [4]: s1
Out[4]:
0 one
1 two
2 three
3 three
4 two
5 one
dtype: category
Categories (3, object): [one < two < three]
In [5]: s1.combine("abc", lambda x,y : x+y)
In pandas 0.23.0, this fails. But the binary operation is defined on an element by element basis, so the new implementation will return:
Out[4]:
0 oneabc
1 twoabc
2 threeabc
3 threeabc
4 twoabc
5 oneabc
dtype: object
IMHO, this is correct, because of what combine
is supposed to do. So I will test that.
@@ -103,3 +103,36 @@ def test_factorize_equivalence(self, data_for_grouping, na_sentinel): | |||
|
|||
tm.assert_numpy_array_equal(l1, l2) | |||
self.assert_extension_array_equal(u1, u2) | |||
|
|||
def test_combine_le(self, data_repeated): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you give a 1-liner explaining what this is testing. the name of the test is uninformative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -2208,6 +2208,8 @@ def combine(self, other, func, fill_value=np.nan): | |||
func : function | |||
Function that takes two scalars as inputs and return a scalar | |||
fill_value : scalar value | |||
The default specifies to use the appropriate NaN value for | |||
the underlying dtype of the Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need a versionchanged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no change in behaviour for normal Series I think, as the na_value_for_dtype
will give NaN/NaT (which was the default before). It's only for extension arrays that it might give another value, depending on what the extension array defined its missing value to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jorisvandenbossche
new_name = self.name | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put a comment on what is going on here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@Dr-Irv also can you make sure that the cases in your OP are covered here (I think so, but pls verify) |
@jreback wrote:
Here you go: In [1]: import pandas as pd
In [2]: from pandas.tests.extension.decimal.array import DecimalArray, make_dat
...: a
In [3]: da1= make_data()
In [4]: da2 = make_data()
In [5]: s1 = pd.Series(DecimalArray(da1))
In [6]: s2 = pd.Series(DecimalArray(da2))
In [7]: s1.combine(s2, lambda x1, x2: x1 if x1 < x2 else x2)
Out[7]:
0 0.08098492909619514623642544393078424036502838...
1 0.38231358539972493115755014514434151351451873...
2 0.28683621751796128940270591556327417492866516...
3 0.68315691040978210324396968644578009843826293...
4 0.37188279135446988821200875463546253740787506...
5 0.29391070580371847498213355720508843660354614...
6 0.00281662212378319676275850724778138101100921...
7 0.68424426414243522120983698187046684324741363...
8 0.82298917655718306640721948497230187058448791...
9 0.104520052984668154749670065939426422119140625
10 0.36541026835682677287309161329176276922225952...
11 0.15450295421108961591016850434243679046630859375
12 0.28227045216414337058807859648368321359157562...
13 0.07428041682799846334717130957869812846183776...
14 0.76112360264534761888910452398704364895820617...
15 0.31991420667129122357152937183855101466178894...
16 0.43808559346706621440148410329129546880722045...
17 0.81185756018222499097447553140227682888507843...
18 0.18755848298291555309447176114190369844436645...
19 0.45064358633799639353156862853211350739002227...
20 0.01908941180337764276231382609694264829158782...
21 0.05994588216042351369594598509138450026512145...
22 0.45670931996552022180679841767414472997188568...
23 0.21696297680411980035586338999564759433269500...
24 0.77792921984939356061516946283518336713314056...
25 0.10732191741882657343154505724669434130191802...
26 0.26112839619732719498301776184234768152236938...
27 0.78917786400767853116633432364324107766151428...
28 0.11041422426646729793020540455472655594348907...
29 0.48501200782261966182318246865179389715194702...
...
70 0.01573854854098000188855621672701090574264526...
71 0.44412972763803593156950455522746779024600982...
72 0.25386583135624463114510263039846904575824737...
73 0.18052436704533048050791421701433137059211730...
74 0.01768961191491325024571779067628085613250732...
75 0.19783353177328266703227654943475499749183654...
76 0.64275305832443418996291484290850348770618438...
77 0.56556945132078839666434078026213683187961578...
78 0.45692610540309075428666574225644581019878387...
79 0.54208092770195082099604633185663260519504547...
80 0.04475845682326173857745743589475750923156738...
81 0.02586933176340200368770183558808639645576477...
82 0.38495809393040258949980625402531586587429046...
83 0.09473826636927540345567422264139167964458465...
84 0.09800918702868577359055279885069467127323150...
85 0.06037114142095312274705065647140145301818847...
86 0.54181647584734216049895394462510012090206146...
87 0.39820409063224215806542360951425507664680480...
88 0.08573080156333134915769278450170531868934631...
89 0.31841090690861806322686788917053490877151489...
90 0.05991365875060672419039065061951987445354461...
91 0.20527392489879703330046822884469293057918548...
92 0.68151285114935533648861110123107209801673889...
93 0.33294634130492617440921776506002061069011688...
94 0.50598008087709389624109235228388570249080657...
95 0.31112635294260559959411693853326141834259033...
96 0.15173983856659867264227159466827288269996643...
97 0.12529127428765418628131556033622473478317260...
98 0.15472130908820258543556747099501080811023712...
99 0.36042221705147114985123835140257142484188079...
Length: 100, dtype: decimal
In [8]: cat1 = pd.Categorical(values=["one","two","three","three","two","one"],
...:
...: ...: categories=["one","two","three"], ordered=True)
...: ...: cat2 = pd.Categorical(values=["three","two","one","one","two","
...: three"],
...: ...: categories=["one","two","three"], ordered=True)
...: ...: s1 = pd.Series(cat1)
...: ...: s2 = pd.Series(cat2)
...: ...: s1, s2
...:
Out[8]:
(0 one
1 two
2 three
3 three
4 two
5 one
dtype: category
Categories (3, object): [one < two < three], 0 three
1 two
2 one
3 one
4 two
5 three
dtype: category
Categories (3, object): [one < two < three])
In [9]: s1.combine(s2, lambda x1, x2: x1 <= x2)
Out[9]:
0 True
1 True
2 False
3 False
4 True
5 True
dtype: bool
In [10]: s = pd.Series([i*10 for i in range(5)])
In [11]: s.combine(22, lambda x,y: min(x,y))
Out[11]:
0 0
1 10
2 20
3 22
4 22
dtype: int64
In [12]: pd.__version__
Out[12]: '0.24.0.dev0+76.g2a21117c7.dirty' |
i mean can u verify that u have test coverage for those cases |
@jreback wrote:
The new tests subsume those cases. Should I explicitly include those cases? |
As long the case is part of the tests you added, that is fine. |
Regarding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. a question.
pandas/core/series.py
Outdated
if is_categorical_dtype(self.values): | ||
pass | ||
elif is_extension_array_dtype(self.values): | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so what kind of op hits the type error here? I find this a bit disconcerting that you need to catch a TypeError? is this a case that the combine op returns a result which is not an extension type (e.g. say its an int or someting), is that the reason? if so pls indicate via a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Here is an example:
In [1]: import pandas as pd
In [2]: from pandas.tests.extension.decimal.array import DecimalArray, make_dat
...: a
In [3]: s = pd.Series(DecimalArray(make_data())).head()
In [4]: s.combine(0.5, lambda x, y: x < y)
Out[4]:
0 False
1 False
2 False
3 False
4 False
dtype: bool
Since the function passed to combine is an arbitrary function, it could return a result of any type, which may not be the type that will fit in the EA.
I'll add a comment.
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -181,6 +181,7 @@ Reshaping | |||
Other | |||
^^^^^ | |||
|
|||
- | |||
- :meth:`Series.combine()` works correctly with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move the ExtensionArray to a sub-section (create a new one) as lots of changes in EA (you can make a new bug fix section).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just made a bug fix section
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -181,6 +181,7 @@ Reshaping | |||
Other | |||
^^^^^ | |||
|
|||
- | |||
- :meth:`Series.combine()` works correctly with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`) | |||
- :meth:`Series.combine()` with scalar argument now works for any function type (:issue:`21248`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one can go in reshaping bug fixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or into EA bug fixes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In EA bug fixes
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -181,6 +181,7 @@ Reshaping | |||
Other | |||
^^^^^ | |||
|
|||
- | |||
- :meth:`Series.combine()` works correctly with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`) | |||
- :meth:`Series.combine()` with scalar argument now works for any function type (:issue:`21248`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or into EA bug fixes
thanks @Dr-Irv nice patch! thanks for the patience. non-trivial things take more time :-D |
closes Series.combine() with a scalar only works if function is compatible with (vec, scalar) operation #21248
git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew.v0.24.0.txt
This is a split of #20889 to just fix the
combine
issue.