Unexpected behavior in cut() with nullable Int64 dtype #30787

sdmccabe · 2020-01-07T16:45:16Z

Code Sample

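import numpy as np
import pandas as pd
# np.nan is used directly; the pd.np alias was deprecated in pandas 1.0 and later removed
series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')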
breaks = [0, 2, 4, 6, 8]

breaks_cut = pd.cut(series, breaks)
breaks_cut

0           NaN
1    (0.0, 2.0]
2    (0.0, 2.0]
3    (2.0, 4.0]
4    (2.0, 4.0]
5           NaN
6    (0.0, 2.0]
7    (6.0, 8.0]
dtype: category
Categories (4, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8]]

Problem Description

When using the nullable integer dtype ('Int64'), pd.cut() unexpectedly bins the first non-missing value after a missing value into the lowest interval. In the example above, the value 6 (index 6) is binned into (0.0, 2.0] instead of (4.0, 6.0].

Expected Output

0           NaN
1    (0.0, 2.0]
2    (0.0, 2.0]
3    (2.0, 4.0]
4    (2.0, 4.0]
5           NaN
6    (4.0, 6.0]
7    (6.0, 8.0]
dtype: category
Categories (4, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8]]

Note that using an IntervalIndex produces the expected output.

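import numpy as np
import pandas as pd
# np.nan is used directly; the pd.np alias was deprecated in pandas 1.0 and later removed
series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')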
breaks = [0, 2, 4, 6, 8]
intervals = [pd.Interval(x, y) for x, y in zip(breaks[:-1], breaks[1:])]
interval_index = pd.IntervalIndex(intervals)

interval_cut = pd.cut(series, interval_index)
interval_cut
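
Another possible workaround, offered only as a sketch and not something proposed in this thread: cast the nullable column to float64 before calling pd.cut(), so the missing value becomes a plain np.nan. This assumes a pandas version where astype("float64") is supported on an Int64 series containing missing values; the names as_float and float_cut are illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype="Int64")
breaks = [0, 2, 4, 6, 8]

# Convert the nullable integers to float64 so the missing entry becomes np.nan
# before binning.
as_float = series.astype("float64")
float_cut = pd.cut(as_float, breaks)

# Inspect which bin each value landed in via the category codes.
# A code of -1 means "no bin" (missing, or outside every interval such as 0 in (0, 2]).
codes = float_cut.cat.codes.tolist()
print(codes)  # [-1, 0, 0, 1, 1, -1, 2, 3]
```

Comparing codes rather than printed labels sidesteps any difference in how the interval endpoints are displayed.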

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.0.0-37-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.3
numpy            : 1.17.3
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.3.1
setuptools       : 44.0.0.post20200102
Cython           : None
pytest           : 5.3.2
hypothesis       : None
sphinx           : 2.3.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.4.2
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.11.1
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.4.2
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None


jschendel · 2020-01-07T20:21:05Z

Looks like there's an additional complication on master with the introduction of pd.NA and its use as the default NA for nullable integer dtypes:

In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+1674.gbe1556c46'

In [2]: series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')

In [3]: breaks = [0, 2, 4, 6, 8]

In [4]: pd.cut(series, breaks)
---------------------------------------------------------------------------
TypeError: boolean value of NA is ambiguous
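
For context, the error message itself can be reproduced without cut(). The sketch below only shows the general pd.NA behavior behind the message; it does not claim to be the exact code path that pd.cut() hits on master.

```python
import pandas as pd

# Comparisons involving pd.NA propagate NA instead of returning True/False.
comparison = pd.NA > 2
print(comparison)  # <NA>

# pd.NA has no truth value, so forcing it into a boolean context raises.
try:
    bool(pd.NA)
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

Any internal check of the form `if value > edge:` would therefore fail once `value` is pd.NA rather than np.nan.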

parkerdgabel · 2020-01-07T20:48:14Z

take

parkerdgabel · 2020-01-09T01:04:59Z

What should the expected behavior be?

jschendel · 2020-01-12T18:20:22Z

Probably need to more generically fix #30944 before attempting to resolve the issue of the output being incorrect. It's possible that fixing #30944 may also fix this, in which case we'd just want to add some tests here.

phofl · 2023-01-18T21:36:30Z

This works on main, may need tests

0           NaN
1    (0.0, 2.0]
2    (0.0, 2.0]
3    (2.0, 4.0]
4    (2.0, 4.0]
5           NaN
6    (4.0, 6.0]
7    (6.0, 8.0]
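
A regression test for this could look roughly like the sketch below. It is an illustration only, not the test that was eventually merged; the function name is hypothetical, and it checks category codes and interval membership rather than the exact categorical dtype, which is assumed to possibly differ between pandas versions.

```python
import pandas as pd


def test_cut_nullable_int64_with_missing():
    # Hypothetical test name; data taken from the original report (GH 30787).
    series = pd.Series([0, 1, 2, 3, 4, pd.NA, 6, 7], dtype="Int64")
    breaks = [0, 2, 4, 6, 8]

    result = pd.cut(series, breaks)

    # -1 marks entries without a bin: the value 0 (excluded by the open left
    # edge of (0, 2]) and the missing value at index 5.
    assert result.cat.codes.tolist() == [-1, 0, 0, 1, 1, -1, 2, 3]

    # The value after the missing entry must land in (4, 6], not (0, 2].
    interval = result.iloc[6]
    assert interval.left == 4
    assert interval.right == 6


test_cut_nullable_int64_with_missing()
```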

kkangs0226 · 2023-02-12T04:16:53Z

take

jschendel added the Bug, ExtensionArray, Missing-data, and Reshaping labels Jan 7, 2020

jschendel added this to the Contributions Welcome milestone Jan 7, 2020

github-actions bot assigned parkerdgabel Jan 7, 2020

alagappan97 mentioned this issue Jan 11, 2020

BUG30787 fixed unexpected behaviour by removing nullable values #30920

Closed

jorisvandenbossche added the NA - MaskedArrays label Jan 30, 2020

mroeschke added the cut label and removed the Reshaping label Apr 28, 2020

jbrockmendel mentioned this issue Dec 18, 2021

ROADMAP: Consistent missing value handling with new NA scalar #28095

Open

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

phofl added the good first issue and Needs Tests labels and removed the Bug label Jan 18, 2023

phofl mentioned this issue Jan 18, 2023

TST: Fixed issues that need tests noatamir/pyladies-berlin-sprints#3

Open


github-actions bot assigned kkangs0226 Feb 12, 2023

kkangs0226 mentioned this issue Feb 14, 2023

TEST: cut() with nullable Int64 dtype #51384

Merged


phofl closed this as completed in #51384 Feb 16, 2023
