Proper support of nullable dtypes as the Categorical dtype #50711
Comments
Not clear how to track this in #50578 since we should decide that we want to do this. |
I'd be more inclined to go with #29962, which would mean getting pd.NA in a targeted subset of cases. |
The idea in #29962 is to make the NA value dependent on the underlying dtype of the categorical. But there is also a point made in a comment there (#29962 (comment)) that we should only use In either case, I think we need to make a decision and figure out how to do a deprecation notice. Or as you (@jbrockmendel ) suggested in another comment in that issue (#29962 (comment)), just bite the bullet and make the change now for 2.0. |
-1. As long as we distinguish between pd.NA and nan etc (xref #32265), this is a semantic change. Besides which getting pd.NA is a PITA. Another alternative would be #37930 which would let users specify. That would likely be the biggest breaking change. |
Can you explain why getting And why would we need to distinguish between |
I also agree with that (that we should move to only use That ensures that people who start using the nullable dtypes keep nullable dtypes (and matching missing value scalars), while people who didn't opt in to nullable dtypes just keep the current behaviour. (For a long time, a first step was actually to support having nullable categories, but I suppose that was "fixed" now that we can support EAs in the Index?) |
I changed the title of the issue, and summarize here the discussion on 2/8/2023: |
So while a Categorical can now store categories using a nullable dtype, there are still a variety of aspects that don't follow the expected behaviour for "nullable dtypes" (see comment above). Just as a quick illustration of the comparison case:
I expect a "nullable" categorical column to give the same result as for the non-categorical |
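The comparison case mentioned above can be sketched roughly as follows. This is a minimal illustration, not the snippet from the original comment: the sample values and variable names are assumptions, and the categorical result is printed rather than asserted because it may differ between pandas versions.

```python
import pandas as pd

# Nullable integer column: comparisons propagate the missing value as <NA>
# and return the nullable "boolean" dtype.
nullable = pd.Series([1, 2, None], dtype="Int64")
nullable_eq = nullable == 1
print(nullable_eq)

# The same data as a categorical column (illustrative; the categories are
# derived from the nullable column above).
categorical = nullable.astype("category")
categorical_eq = categorical == 1

# Inspect the categories dtype and the comparison result. Whether the
# missing entry shows up as <NA> or as False is the point under discussion.
print(categorical.cat.categories.dtype)
print(categorical_eq)
```

Comparing the two printed results side by side shows whether the categorical column follows the nullable semantics of the non-categorical one in the installed pandas version.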
This is another interesting one for #58988
I think the API here is unfortunate; given most of the pd.* types support NA

>>> pd.Series(["foo", "bar", pd.NA], dtype=pd.StringDtype())
0     foo
1     bar
2    <NA>
dtype: string

it is rather surprising that that same pattern does not get followed with the categorical type:

>>> pd.Series(["foo", "bar", pd.NA], dtype=pd.CategoricalDtype())
0    foo
1    bar
2    NaN
dtype: category
Categories (2, object): ['bar', 'foo'] |
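To check the difference programmatically rather than by reading the repr, one could pull out the missing scalar from each Series. This is a small sketch using the same input values as the comment above; the variable names are made up for illustration.

```python
import pandas as pd

values = ["foo", "bar", pd.NA]

# Same input, two different extension dtypes.
as_string = pd.Series(values, dtype=pd.StringDtype())
as_category = pd.Series(values, dtype=pd.CategoricalDtype())

# Extract the scalar stored at the missing position.
string_missing = as_string.iloc[2]
category_missing = as_category.iloc[2]

# Both are "missing" according to pd.isna, but the scalar types can differ.
print(type(string_missing), type(category_missing))
print(pd.isna(string_missing), pd.isna(category_missing))
```

Internally the categorical stores the missing entry as code -1, which can be seen with `as_category.cat.codes`; the question in this issue is only which scalar is returned to the user for that position.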
Now that Categorical depends on ExtensionArray, it makes more sense to return and output pd.NA as a missing value instead of np.nan.
Propose that we announce in 2.0 release that this will change in a future release. Not clear if/how we create a deprecation message here.
Current behavior: