Proper support of nullable dtypes as the Categorical dtype #50711

Dr-Irv · 2023-01-12T17:30:07Z

Now that Categorical depends on ExtensionArray, it makes more sense to return and output pd.NA as a missing value instead of np.nan.

I propose that we announce in the 2.0 release that this will change in a future release. It is not clear whether or how we would create a deprecation message here.

Current behavior:

>>> import pandas as pd
>>> c = pd.Categorical(["a", "a", "b", "c", "c"], ["a", "b", "c"])
>>> c
['a', 'a', 'b', 'c', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> s = pd.Series(c)
>>> s
0    a
1    a
2    b
3    c
4    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.iloc[2] = pd.NA
>>> s.iloc[2]
nan
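For code that has to work regardless of which scalar comes back, a minimal sketch (assuming a recent pandas release) is to check missingness with pd.isna rather than comparing against a specific sentinel. The variable names here are illustrative only.

```python
import pandas as pd

# Same setup as the session above.
s = pd.Series(pd.Categorical(["a", "a", "b", "c", "c"], ["a", "b", "c"]))
s.iloc[2] = pd.NA  # the value is stored as missing

value = s.iloc[2]

# pd.isna is True for both np.nan and pd.NA, so this check does not
# depend on which missing-value scalar Categorical returns.
print(pd.isna(value))

# Series.isna reports the same position as missing.
print(s.isna().tolist())
```

An identity check such as `value is pd.NA` is exactly what would change if the scalar were switched, so it is best avoided in user code.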

Dr-Irv · 2023-01-12T17:33:00Z

It is not clear how to track this in #50578, since we first need to decide whether we want to do this.

jbrockmendel · 2023-01-12T18:02:41Z

I'd be more inclined to #29962, which would mean getting pd.NA in a targeted subset of cases.

Dr-Irv · 2023-01-12T18:18:32Z

> I'd be more inclined to #29962, which would mean getting pd.NA in a targeted subset of cases.

The idea in #29962 is to make the NA value dependent on the underlying dtype of the categorical. But there is also a point made in a comment there (#29962 (comment)) that we should only use pd.NA independent of the underlying dtype. (I agree with this)

In either case, I think we need to make a decision and figure out how to do a deprecation notice. Or, as you (@jbrockmendel) suggested in another comment in that issue (#29962 (comment)), just bite the bullet and make the change now for 2.0.

jbrockmendel · 2023-01-12T18:42:30Z

> that we should only use pd.NA independent of the underlying dtype

-1. As long as we distinguish between pd.NA and nan etc (xref #32265), this is a semantic change. Besides which getting pd.NA is a PITA.

Another alternative would be #37930 which would let users specify. That would likely be the biggest breaking change.

Dr-Irv · 2023-01-12T20:44:30Z

> Besides which getting pd.NA is a PITA.

Can you explain why getting pd.NA is a PITA in this context?

And why would we need to distinguish between np.nan and pd.NA for a categorical?

jorisvandenbossche · 2023-02-08T16:36:08Z

> The idea in #29962 is to make the NA value dependent on the underlying dtype of the categorical. But there is also a point made in a comment there (#29962 (comment)) that we should only use pd.NA independent of the underlying dtype. (I agree with this)

I also agree with that (that we should move to only use pd.NA for all dtypes), but as long as we still have dtypes for now that don't use pd.NA (which are actually the default), I think the most logical thing to do is to let Categorical follow the dtype of its categories. If it represents numpy-dtype based categories, use np.nan (and maybe pd.NaT for datetime), and if it represents nullable data (a "nullable categorical dtype"), then we should use pd.NA as the missing value scalar.

That ensures that people who start using the nullable dtypes keep nullable dtypes (and matching missing value scalars), while people who didn't opt in to nullable dtypes just keep the current behaviour.
(In theory, a user who just uses the defaults should currently never see pd.NA, as that is still fully opt-in.)
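To make the "follow the dtype of its categories" rule concrete, here is a rough sketch. `proposed_na_scalar` is a hypothetical helper written for illustration, not a pandas API, and the fallback rules for numpy dtypes (pd.NaT for datetime-like, np.nan otherwise) are an assumption based on the description above rather than a settled design.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def proposed_na_scalar(cat_dtype: pd.CategoricalDtype):
    """Hypothetical helper (not part of pandas): the missing-value scalar a
    Categorical would use if it followed the dtype of its categories."""
    categories = cat_dtype.categories
    if categories is None:
        # No categories specified yet: assume the current default.
        return np.nan
    dtype = categories.dtype
    if isinstance(dtype, ExtensionDtype):
        # Extension dtypes declare their own scalar (pd.NA for Int64).
        return dtype.na_value
    if dtype.kind in "mM":
        # numpy datetime64 / timedelta64 categories.
        return pd.NaT
    # Remaining numpy dtypes (object, int64, float64, ...).
    return np.nan


nullable = pd.CategoricalDtype(pd.Index([1, 2], dtype="Int64"))
default = pd.CategoricalDtype(["a", "b"])
print(proposed_na_scalar(nullable))
print(proposed_na_scalar(default))
```

Under this rule, users who never opt in to nullable dtypes would not see pd.NA coming out of a Categorical.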

(For a long time, a first step was actually to support having nullable categories, but I suppose that was "fixed" now that we can support EAs in the Index?)

Dr-Irv · 2023-02-08T18:36:43Z

I changed the title of the issue, and summarize here the discussion on 2/8/2023:
Joris: The way forward is to first properly support “nullable categorical dtypes” (categorical dtype with nullable categories): comparison operations, missing value scalar, convert_dtypes (and everywhere use_nullable_dtypes param exists)

jorisvandenbossche · 2023-02-08T19:04:57Z

So while a Categorical can now store categories using a nullable dtype, there are still a variety of aspects that don't follow the expected behaviour for "nullable dtypes" (see comment above). Just as a quick illustration of the comparison case:

In [25]: s = pd.Series([1, 2, pd.NA], dtype="Int64")

In [26]: s_cat = s.astype("category")

In [27]: s == 1
Out[27]: 
0     True
1    False
2     <NA>
dtype: boolean

In [28]: s_cat == 1
Out[28]: 
0     True
1    False
2    False
dtype: bool

I expect a "nullable" categorical column to give the same result as for the non-categorical s == 1.
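As a rough illustration of that expected result, the comparison can be emulated today from the categorical codes. `nullable_categorical_eq` is a hypothetical helper, not a pandas API, and it assumes a pandas version where the categories can hold a nullable dtype.

```python
import numpy as np
import pandas as pd


def nullable_categorical_eq(s_cat: pd.Series, value) -> pd.Series:
    """Hypothetical helper (not part of pandas): `s_cat == value`, with
    missing entries propagated as pd.NA instead of False."""
    codes = s_cat.cat.codes.to_numpy()
    missing = codes == -1  # Categorical stores missing values as code -1
    try:
        target_code = s_cat.cat.categories.get_loc(value)
    except KeyError:
        target_code = None  # value is not one of the categories
    if target_code is None:
        data = np.zeros(len(codes), dtype=bool)
    else:
        data = codes == target_code
    # BooleanArray takes the boolean values plus a mask of missing positions.
    result = pd.arrays.BooleanArray(data, missing)
    return pd.Series(result, index=s_cat.index, name=s_cat.name)


s = pd.Series([1, 2, pd.NA], dtype="Int64")
print(nullable_categorical_eq(s.astype("category"), 1))
```

The result has the nullable boolean dtype, so the missing row stays missing rather than silently becoming False.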

WillAyd · 2024-07-05T17:54:30Z

This is another interesting one for #58988

> I also agree with that (that we should move to only use pd.NA for all dtypes), but as long as we still have dtypes for now that don't use pd.NA (which are actually the default), I think the most logical thing to do is to let Categorical follow the dtype of its categories

I think the API here is unfortunate. Given that most of the pd.* types support NA:

>>> pd.Series(["foo", "bar", pd.NA], dtype=pd.StringDtype())
0     foo
1     bar
2    <NA>
dtype: string

it is rather surprising that the same pattern is not followed with the categorical type:

>>> pd.Series(["foo", "bar", pd.NA], dtype=pd.CategoricalDtype())
0    foo
1    bar
2    NaN
dtype: category
Categories (2, object): ['bar', 'foo']

Dr-Irv changed the title ~~DEP: For v2.0, Deprecate using np.nan as the missing value type in pd.Categorical and use pd.NA instead~~ DEPR: For v2.0, Deprecate using np.nan as the missing value type in pd.Categorical and use pd.NA instead Jan 12, 2023

phofl mentioned this issue Jan 15, 2023

ENH: Should pandas.CategoricalDtype use pandas.NA as the sentinel value instead of float('nan') ? #43836

Closed

lithomas1 added the Missing-data, Categorical, Deprecate, and Needs Discussion labels Feb 8, 2023

Dr-Irv changed the title ~~DEPR: For v2.0, Deprecate using np.nan as the missing value type in pd.Categorical and use pd.NA instead~~ Proper support of nullable dtypes as the Categorical dtype Feb 8, 2023

Dr-Irv removed the Deprecate and Needs Discussion labels Feb 8, 2023

Dr-Irv mentioned this issue Feb 13, 2023

QST: Categorical NaN moving to strings. Unexpected behavior? #51282

Closed

rhshadrach mentioned this issue Apr 16, 2023

API: use na_value from CategoricalDtype.categories in Categorical #52687

Closed

jbrockmendel added the PDEP missing values label Oct 26, 2023
