Unexpected behavior in cut() with nullable Int64 dtype #30787

Closed
Tracked by #3
sdmccabe opened this issue Jan 7, 2020 · 6 comments · Fixed by #51384
Labels: cut, ExtensionArray, good first issue, Missing-data, NA - MaskedArrays, Needs Tests

Comments

@sdmccabe

sdmccabe commented Jan 7, 2020

Code Sample

import numpy as np
import pandas as pd
series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')
breaks = [0, 2, 4, 6, 8]

breaks_cut = pd.cut(series, breaks)
breaks_cut
0           NaN
1    (0.0, 2.0]
2    (0.0, 2.0]
3    (2.0, 4.0]
4    (2.0, 4.0]
5           NaN
6    (0.0, 2.0]
7    (6.0, 8.0]
dtype: category
Categories (4, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8]]

Problem Description

When using the nullable integer dtype Int64, pd.cut() unexpectedly bins the first non-missing value after a missing value (np.nan) into the lowest interval. In the above example, the number 6 is binned into (0.0, 2.0] instead of (4.0, 6.0].

Expected Output

0           NaN
1    (0.0, 2.0]
2    (0.0, 2.0]
3    (2.0, 4.0]
4    (2.0, 4.0]
5           NaN
6    (4.0, 6.0]
7    (6.0, 8.0]
dtype: category
Categories (4, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8]]

Note that using an IntervalIndex produces the expected output.

import numpy as np
import pandas as pd
series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')
breaks = [0, 2, 4, 6, 8]
intervals = [pd.Interval(x, y) for x, y in zip(breaks[:-1], breaks[1:])]
interval_index = pd.IntervalIndex(intervals)

interval_cut = pd.cut(series, interval_index)
interval_cut
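
For anyone using the IntervalIndex route as a workaround, the same index can probably be built more compactly with `pd.IntervalIndex.from_breaks`. This is only a sketch, assuming a recent pandas where `pd.cut` accepts an Int64 series with IntervalIndex bins; the name `manual_index` is just for illustration.

```python
import numpy as np
import pandas as pd

breaks = [0, 2, 4, 6, 8]

# Intervals built by hand, as in the snippet above (right-closed by default).
intervals = [pd.Interval(x, y) for x, y in zip(breaks[:-1], breaks[1:])]
manual_index = pd.IntervalIndex(intervals)

# Same right-closed intervals, built directly from the break points.
interval_index = pd.IntervalIndex.from_breaks(breaks)

series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype="Int64")
interval_cut = pd.cut(series, interval_index)
print(interval_cut)
```

Both constructions should describe the same four bins, so either can be passed to `pd.cut`.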

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.0.0-37-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.3
numpy            : 1.17.3
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.3.1
setuptools       : 44.0.0.post20200102
Cython           : None
pytest           : 5.3.2
hypothesis       : None
sphinx           : 2.3.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.4.2
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.11.1
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.4.2
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
@jschendel
Member

Looks like there's an additional complication on master with the introduction of pd.NA and its use as the default NA for nullable integer dtypes:

In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+1674.gbe1556c46'

In [2]: series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')

In [3]: breaks = [0, 2, 4, 6, 8]

In [4]: pd.cut(series, breaks)
---------------------------------------------------------------------------
TypeError: boolean value of NA is ambiguous
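
For context, here is a minimal sketch of how this error message arises in general. It does not show where inside `cut()` the coercion happened (the thread does not say); it only illustrates that a comparison involving `pd.NA` yields `pd.NA`, and that forcing the result into a bool raises.

```python
import pandas as pd

# A comparison with pd.NA propagates the missing value instead of returning True/False.
comparison = pd.NA > 2
print(comparison)

# Coercing that result to a plain bool is what raises the TypeError.
try:
    bool(comparison)
except TypeError as exc:
    message = str(exc)
    print(message)
```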

@jschendel added the Bug, ExtensionArray, Missing-data, and Reshaping labels Jan 7, 2020
@jschendel added this to the Contributions Welcome milestone Jan 7, 2020
@parkerdgabel

take

@parkerdgabel

What should the expected behavior be?

@jschendel
Member

Probably need to more generically fix #30944 before attempting to resolve the issue of the output being incorrect. It's possible that fixing #30944 may also fix this, in which case we'd just want to add some tests here.

@jorisvandenbossche added the NA - MaskedArrays label Jan 30, 2020
@mroeschke added the cut label and removed the Reshaping label Apr 28, 2020
@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@phofl
Member

phofl commented Jan 18, 2023

This works on main, may need tests

0           NaN
1    (0.0, 2.0]
2    (0.0, 2.0]
3    (2.0, 4.0]
4    (2.0, 4.0]
5           NaN
6    (4.0, 6.0]
7    (6.0, 8.0]
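
Since the remaining work is a test, here is a rough sketch of what a regression check could look like. It is not the test that was merged; the function name `check_cut_nullable_int64` is hypothetical, and it assumes a pandas version where the behavior above holds. It compares category codes rather than the categories themselves, so it does not depend on how the interval endpoints are typed.

```python
import numpy as np
import pandas as pd


def check_cut_nullable_int64():
    """Hypothetical regression check for cut() on a nullable Int64 series."""
    breaks = [0, 2, 4, 6, 8]
    nullable = pd.Series([0, 1, 2, 3, 4, pd.NA, 6, 7], dtype="Int64")
    floats = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype="float64")

    nullable_codes = pd.cut(nullable, breaks).cat.codes.tolist()
    float_codes = pd.cut(floats, breaks).cat.codes.tolist()

    # -1 marks values outside every bin: 0 (left edge is open) and the missing value.
    expected_codes = [-1, 0, 0, 1, 1, -1, 2, 3]
    assert nullable_codes == expected_codes
    # The nullable dtype should be binned exactly like the float64 equivalent.
    assert nullable_codes == float_codes
    return nullable_codes


codes = check_cut_nullable_int64()
print(codes)
```

The key assertion is that 6 lands in the third bin (code 2) rather than the first (code 0), which was the original report.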

@phofl added the good first issue and Needs Tests labels and removed the Bug label Jan 18, 2023
@kkangs0226
Contributor

take
