Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: support casting from bool array to EA Integer dtype #25265

Merged
merged 22 commits into from
May 14, 2019

Conversation

vladserkoff
Copy link
Contributor

@vladserkoff vladserkoff commented Feb 11, 2019

Cast boolean array to int before casting to EA Integer dtype.

@codecov
Copy link

codecov bot commented Feb 12, 2019

Codecov Report

Merging #25265 into master will decrease coverage by 49.79%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25265      +/-   ##
==========================================
- Coverage   91.93%   42.13%   -49.8%     
==========================================
  Files         166      166              
  Lines       52187    52189       +2     
==========================================
- Hits        47978    21991   -25987     
- Misses       4209    30198   +25989
Flag Coverage Δ
#multiple ?
#single 42.13% <50%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/integer.py 41.19% <50%> (-55.13%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f73594...b3e1d76. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 12, 2019

Codecov Report

Merging #25265 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25265      +/-   ##
==========================================
- Coverage   91.68%   91.67%   -0.01%     
==========================================
  Files         174      174              
  Lines       50704    50706       +2     
==========================================
- Hits        46488    46486       -2     
- Misses       4216     4220       +4
Flag Coverage Δ
#multiple 90.18% <100%> (ø) ⬆️
#single 41.18% <50%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/series.py 93.67% <ø> (ø) ⬆️
pandas/core/arrays/integer.py 96.35% <100%> (+0.02%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.01% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 304d8d4...3bce8a1. Read the comment docs.

@gfyoung gfyoung requested a review from jreback February 12, 2019 10:08
@gfyoung
Copy link
Member

gfyoung commented Feb 12, 2019

#25211 (comment)

@vladserkoff : Don't worry about the Codecov. Windows failure looks weird though.

@gfyoung gfyoung added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate ExtensionArray Extending pandas with custom dtypes or arrays. labels Feb 16, 2019
@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

this needs a test, pls merge master

@vladserkoff
Copy link
Contributor Author

@jreback, I've added a test. Casting boolean to integer as pandas.core.arrays.integer.integer_array(array, dtype=None)(note that dtype is set to None) will still fail, but it seems it's usually called with a dtype specified.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. can you add a whatsnew note, bug fixes, numeric is ok

@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

pandas.core.arrays.integer.integer_array(array, dtype=None) should work as well, can you show what array is when it doesn't work? the dtype is optional (though maybe an explicit passed None is not well handled)

@vladserkoff
Copy link
Contributor Author

Regarding pandas.core.arrays.integer.integer_array(array, dtype=None) fails for boolean array without nans because this function just calls coerce_to_array. I don't handle cases when dtype is None.

elif is_bool_dtype(values) and is_integer_dtype(dtype):
values = np.array(values, dtype=int, copy=copy)

If I change the condition to elif is_bool_dtype(values) and (is_integer_dtype(dtype) or dtype is None) then all the test in TestComparisonOps.test_compare_[array|scalar] fail with following errors:

self = <pandas.tests.extension.test_integer.TestComparisonOps object at 0x1c1c3687b8>
s = 0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8     NaN
9      10
10     11
11     1...91
91     92
92     93
93     94
94     95
95     96
96     97
97    NaN
98     99
99    100
Length: 100, dtype: UInt64
op = <built-in function gt>
other = 0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    ...    1
89    1
90    1
91    1
92    1
93    1
94    1
95    1
96    1
97    1
98    1
99    1
Length: 100, dtype: int64
op_name = '__gt__', exc = None

    def _check_op(self, s, op, other, op_name, exc=NotImplementedError):
        if exc is None:
            result = op(s, other)
            expected = s.combine(other, op)
>           self.assert_series_equal(result, expected)
E           AssertionError: Attributes are different
E
E           Attribute "dtype" are different
E           [left]:  bool
E           [right]: Int64

@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

yeah i think passing an explict dtype=None for just do inference, which I think we have a path for. If you want to take a stab here great, otherwise can do a followup (but create an issue in that case).

@vladserkoff
Copy link
Contributor Author

@jreback, not sure what you mean by taking a stab here 🙂

@jreback
Copy link
Contributor

jreback commented Mar 22, 2019

@vladserkoff in fixing passing dtype=None

@vladserkoff
Copy link
Contributor Author

@jreback, I've tried to fix this but with no success. The issue lies here and I'm not sure how to handle this:
The tests that fail when trying to handle dtype=None is comparisonOps:

def test_compare_array(self, data, all_compare_operators):
op_name = all_compare_operators
other = pd.Series([0] * len(data))
self._compare_other(data, op_name, other)

_compare_other in turn calls
def _check_op(self, s, op, other, op_name, exc=NotImplementedError):
if exc is None:
result = op(s, other)
expected = s.combine(other, op)
self.assert_series_equal(result, expected)

while s.combine is a method in which

pandas/pandas/core/series.py

Lines 2644 to 2656 in 7c24ea8

if is_categorical_dtype(self.values):
pass
elif is_extension_array_dtype(self.values):
# The function can return something of any type, so check
# if the type is compatible with the calling EA.
try:
new_values = self._values._from_sequence(new_values)
except Exception:
# https://github.com/pandas-dev/pandas/issues/22850
# pandas has no control over what 3rd-party ExtensionArrays
# do in _values_from_sequence. We still want ops to work
# though, so we catch any regular Exception.
pass

method _from_sequence internally calls integer_array without specifying dtype. When passed boolean array it just fails to convert it to integer, and everything is okay as expected.
Cant understand why call integer_array on an array with no intention to actually cast it to integer.

@jreback
Copy link
Contributor

jreback commented Mar 22, 2019

@vladserkoff ok, can you create an issue for this and we can deal with it separately.

@jreback jreback added this to the 0.25.0 milestone Mar 22, 2019
@jreback
Copy link
Contributor

jreback commented Mar 22, 2019

see my comment

@pep8speaks
Copy link

pep8speaks commented Mar 25, 2019

Hello @vladserkoff! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-14 10:40:34 UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks good

@@ -187,6 +187,9 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
raise TypeError("{} cannot be converted to an IntegerDtype".format(
values.dtype))

elif is_bool_dtype(values) and (is_integer_dtype(dtype) or dtype is None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a 1-line comment here on acceptable dtypes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
elif is_bool_dtype(values) and (is_integer_dtype(dtype) or dtype is None):
# convert boolean to int if target dtype is integer or unspecified
elif is_bool_dtype(values) and (is_integer_dtype(dtype) or dtype is None):

Ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking more for the reason we need to check here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more about the reason you need to check, and when/how dtype is None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this? Couldn't fit it into one line, though.

Suggested change
elif is_bool_dtype(values) and (is_integer_dtype(dtype) or dtype is None):
if the array is boolean, we first want to convert it to int to continue execution,
if the target dtype is unspecified Int64 would be used later.
elif is_bool_dtype(values) and (is_integer_dtype(dtype) or dtype is None):

@@ -187,6 +187,9 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
raise TypeError("{} cannot be converted to an IntegerDtype".format(
values.dtype))

elif is_bool_dtype(values) and (is_integer_dtype(dtype) or dtype is None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking more for the reason we need to check here

@jreback
Copy link
Contributor

jreback commented Apr 5, 2019

ping when updated

@vladserkoff
Copy link
Contributor Author

@jreback, pls check out my latest commit, the main idea here in pandas.core.series is to pass the resulting dtype to the EA constrictor. If you think that the logic here is ok, then I'll try to find a better way to get the expected dtype.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this solution looks good, see my comments though

@jreback
Copy link
Contributor

jreback commented Apr 20, 2019

can you merge master

@@ -187,6 +187,9 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
raise TypeError("{} cannot be converted to an IntegerDtype".format(
values.dtype))

elif is_bool_dtype(values) and (is_integer_dtype(dtype) or dtype is None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more about the reason you need to check, and when/how dtype is None

@@ -187,6 +187,9 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
raise TypeError("{} cannot be converted to an IntegerDtype".format(
values.dtype))

elif is_bool_dtype(values) and (is_integer_dtype(dtype) or dtype is None):
values = np.array(values, dtype=int, copy=copy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.asarray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's like in other places in the code, e.g.

values = np.array(values, copy=copy)

@jreback
Copy link
Contributor

jreback commented May 7, 2019

can you merge master and i'll have a look

@vladserkoff
Copy link
Contributor Author

vladserkoff commented May 7, 2019

Hey, @jreback. Before I proceed, let me clear one thing a bit. The main difficulty here is handling the case when integer_array(...) is called with dtype=None. It is difficult because for some cases it should fail when dtype=None like in this place when op is some comparison operation:

result = op(s, other)
expected = s.combine(other, op)
self.assert_series_equal(result, expected)

If s is an Integer EA, this call to s.combine will subsequently call integer_array and it should not convert boolean array to ints.
What I propose is that we don't convert to int if it is not explicitly stated by specifying dtype, as I first suggested. This way we address this specific issue I was trying to resolve, meaning handling the cases like this one pd.Series([True, False, ...], dtype=pd.Int64()), AND not breaking any other cases.
Does it sound okay to you?
Otherwise I don't seem to be qualified enough to fix this in a general way :)

@jreback
Copy link
Contributor

jreback commented May 12, 2019

@vladserkoff ok let's try what you suggest above and see

@jreback
Copy link
Contributor

jreback commented May 12, 2019

also pls merge master

@vladserkoff
Copy link
Contributor Author

@jreback, done. Checks are ok.

@jreback jreback merged commit 612c244 into pandas-dev:master May 14, 2019
@jreback
Copy link
Contributor

jreback commented May 14, 2019

thanks @vladserkoff

happy to have an issue about integer_array(..., dtype=None) (and PR too)!

@vladserkoff vladserkoff deleted the integer-array-from-bool branch May 14, 2019 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

maybe_cast_to_integer_array fails when the input is all booleans
4 participants