Missing Data #9
Comments
Do others see value in different kinds of missing? E.g. `NaN` vs. `NA`?
@markusweimer that's what this is about, indeed; it would be good for the final description to be more explicit about "not-a-number" (`NaN`).
Yes, we should make the distinction between `NA` and `NaN` clear. There might be reason to support both within a single column. For example:

```python
>>> a = DataFrame({"A": [0, 1, NA, np.nan]})
>>> b = DataFrame({"A": [0, 0, 0, 0]})
>>> a / b
DataFrame({"A": [nan, inf, NA, nan]})  # float dtype
```

`0 / 0` is defined to be `nan`, and we would be saying that `NA / 0` is `NA`. This has implications for other parts of the API: do methods like `isna` cover both?

Discussion on the NA vs. NaN distinction is at pandas-dev/pandas#32265. In particular, cudf developers have reported that some users appreciate being able to store both NaN and NA values within a single column: pandas-dev/pandas#32265 (comment).
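For illustration, here is a minimal sketch (plain NumPy, not any particular dataframe library; the variable names are made up) of how a separate validity mask keeps `NA` distinct from `NaN` through a division like the one above:

```python
import numpy as np

# Model a column as (values, mask): mask=True marks NA; NaN lives in the values.
values = np.array([0.0, 1.0, 0.0, np.nan])    # the NA slot holds a dummy 0.0
mask = np.array([False, False, True, False])  # True = missing (NA)

divisor = np.zeros(4)
with np.errstate(divide="ignore", invalid="ignore"):
    result = values / divisor  # 0/0 -> nan, 1/0 -> inf, nan/0 -> nan

print(result)  # [nan inf nan nan]; the third entry is still NA via the mask
print(mask)    # [False False  True False]: NA-ness is carried unchanged
```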
What's the practical difference? I agree that having both might be useful, though I think I'm not entirely decided on whether it's necessary. They do have different semantics, but the cases where the different semantics change the outcome are pretty rare, right? Having both in a single column will certainly make life harder for downstream packages, because now we might need to deal with two special cases everywhere. Unless they are both mapped to the same value at the numeric level and only differ at the dataframe level?
What do you mean by "change the outcome", or rather, how does that differ from "different semantics"? To me those sound the same :)

Indeed, handling both might be difficult, or at least requires some thought. Scikit-Learn is choosing to treat `NaN` as its missing value indicator.
I feel like 'null' is a bit strange in the context of numbers, since it reminds me of pointers. I think missing is more fitting, though less neutral. But that is maybe what we want (give it an explicit meaning).

In Vaex we defined `isna(x)` as `isnan(x) | ismissing(x)`, where `ismissing(x)` means missing values (implemented as masked arrays or Arrow arrays, which naturally have null bitmasks) and `isnan(x)` just follows the IEEE standard. So `isna` is short for 'get rid of anything messy'.

I strongly dislike using sentinels/special values for missing values in a library, since for integers there is basically no solution. This means you need to support a byte or bit mask anyway to keep track of them. Mixing sentinels and missing values just makes life more complex.

I see NaN (just a float number) as orthogonal to missing values; the only connection they have in Vaex is through the convenience methods `isna`/`countna`/`fillna`, which follow the definition above. I also think having both NaN and missing values in a column can indicate different things, and a user should be able to distinguish between them.
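A minimal sketch of those definitions, using a NumPy masked array as a stand-in for a column with a validity mask (the function names mirror the Vaex ones described above, but this is illustrative code, not Vaex itself):

```python
import numpy as np
import numpy.ma as ma

# A masked array keeps the float values and a separate mask of missing entries.
x = ma.array([1.0, np.nan, 3.0, 4.0], mask=[False, False, True, False])

def ismissing(arr):
    # True where the value is masked out (a "real" missing value).
    return ma.getmaskarray(arr)

def isnan(arr):
    # IEEE NaN check on the underlying float data, ignoring the mask.
    return np.isnan(ma.getdata(arr))

def isna(arr):
    # "Anything messy": missing OR NaN.
    return ismissing(arr) | isnan(arr)

print(ismissing(x))  # [False False  True False]
print(isnan(x))      # [False  True False False]
print(isna(x))       # [False  True  True False]
```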
On the name, I think my preference is for `NA`, primarily since that's what pandas is using :) `missing` would be fine as well, I think.

I like the idea of having dedicated methods, or options to a single method, to handle `NA`, `NaN`, or both.
Is there an agreement on what 'NA' means? Does it mean 'Not available'? I would say the meaning of 'missing' is the least ambiguous (which has its pros and cons); NaN also has a very explicit meaning, while the meanings of 'null' and 'NA' are less clear to me.
Yes, I think "Not available".
FWIW, when I hear NA I think "Not Applicable", but maybe I am just not used to the domain-specific usage here.
I'm not sure I follow this part; could you please elaborate?
I see. I guess for projects like numpy that can be important, but I don't think it is worth the trouble for dataframes. My opinion is that it'd make life easier for the users if NaN and missing values were simply treated as the same thing.
Since integers don't have a special value like NaN, you cannot 'abuse' NaN as a missing value. You could use a special value, but that would cause trouble: if you happen to have data which includes that special value, you suddenly have an accidental missing value. A user might be able to get away with that, but having that solution as a building block for an ecosystem does not sound like a good plan. I think that means you have to keep track of a mask (a bit or byte mask). And I guess that's also what databases do: they will not forbid you from using a particular integer value because it is reserved as a 'missing value sentinel' (correct me if I'm wrong, but I'd be surprised).

On top of that, NaN and missing values can have different meanings. A missing value is, as it indicates, simply missing; a NaN could mean a measurement gone wrong, a math 'error', etc. NaN and missing values are fundamentally different things, although one could group them (say, call them NA).

I think I fully agree with Apache Arrow's idea: each array can have missing values, and in that case it has a bitmask attached to the array, but the bitmask is optional. If you compute on this, I think the plan is to just brute-force compute over all the data (ignoring that the array has missing values, since it's all vectorized/SIMD down in the loops).

Note that using a bitmask is not that memory consuming: a column of 1 billion float64 values takes 1e9 * 8 bytes ≈ 8 GB, and a full mask (1 bit per element) would require only 1e9 / 8 bytes ≈ 125 MB extra, i.e. 1/64 ≈ 1.6% overhead.
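For reference, a small example of the Arrow-style approach described above, using pyarrow's validity bitmask for an integer column (assuming pyarrow is installed; the data is made up):

```python
import pyarrow as pa

# Integers have no NaN, so Arrow marks missing values with a separate
# validity bitmask (1 bit per element) rather than a sentinel value.
arr = pa.array([1, None, 3, None], type=pa.int64())

print(arr.null_count)  # 2
print(arr.is_null())   # [false, true, false, true]

# The overhead matches the estimate above: one bit per 8-byte value,
# i.e. roughly 1/64 (about 1.6%) of the column's data size.
```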
I disagree and agree here: I think you should be able to distinguish between them, but also have the "I don't care, so throw away any data that's NA/null/missing/NaN, whatever" option.
Following up on a question on the call by @apaszke, this demonstrates why a typed `NA` scalar can be useful.

We have that `datetime - datetime` gives `timedelta`. If we have an untyped, scalar `NA`, then you have to arbitrarily choose what `datetime - NA` returns:

```
In [17]: a  # datetime
Out[17]:
           A
0 2000-01-01
1 2000-01-01

In [18]: b  # datetime
Out[18]:
           A
0         NA
1 2000-01-01

In [19]: (a - b.iloc[0, 0]).dtypes
Out[19]:
A    timedelta64[ns]
dtype: object
```

But that loses the (I think desirable) property of knowing the result dtype of an operation from the dtypes of its operands. Having a typed NA scalar like `NaT` preserves it.
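To make the contrast concrete, here is a hypothetical sketch (not pandas' actual implementation; all names here are invented) of a typed NA scalar that carries enough dtype information to keep result dtypes predictable:

```python
from dataclasses import dataclass

# Hypothetical: a typed NA that remembers which dtype it belongs to.
@dataclass(frozen=True)
class TypedNA:
    dtype: str  # e.g. "datetime64[ns]" or "timedelta64[ns]"

def subtraction_result_dtype(left_dtype: str, right) -> str:
    """Result dtype of `left - right`, where `right` may be a TypedNA scalar."""
    right_dtype = right.dtype if isinstance(right, TypedNA) else right
    if left_dtype == "datetime64[ns]" and right_dtype == "datetime64[ns]":
        return "timedelta64[ns]"  # datetime - datetime -> timedelta
    if left_dtype == "datetime64[ns]" and right_dtype == "timedelta64[ns]":
        return "datetime64[ns]"   # datetime - timedelta -> datetime
    raise TypeError(f"unsupported operands: {left_dtype!r} - {right_dtype!r}")

# An untyped NA could not tell these two cases apart; a typed NA can:
print(subtraction_result_dtype("datetime64[ns]", TypedNA("datetime64[ns]")))   # timedelta64[ns]
print(subtraction_result_dtype("datetime64[ns]", TypedNA("timedelta64[ns]")))  # datetime64[ns]
```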
Thanks @TomAugspurger, that's really useful to keep in mind. @teoliphant said it should be possible for ndarrays to have extra data added to them, like a mask (bit or byte). If normal ndarrays were more like numpy masked arrays and kept track of their masks, and numpy scalars also held this information (a single bit or byte), we could have masked scalar values and wouldn't need a sentinel value. I think masked arrays in numpy have to happen someday as a built-in (not added on), instead of solving it in the dataframe layer (recognizing that's probably an order of magnitude more difficult to get off the ground).
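NumPy's existing `numpy.ma` module already approximates the idea: indexing a masked element yields a masked scalar constant rather than a magic integer. A small illustration (not a proposal for the standard):

```python
import numpy as np
import numpy.ma as ma

# A masked array pairs values with a mask; no integer sentinel is needed.
x = ma.array([1, 2, 3], mask=[False, True, False], dtype=np.int64)

print(x)                  # [1 -- 3]
print(x[1])               # the masked scalar constant, not a special integer
print(x[1] is ma.masked)  # True
print(x.sum())            # 4: masked elements are skipped in reductions
```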
There were many long discussions about missing data that pre-dated pandas, from 1998 to 2011, while NumPy was being written and growing in popularity. There were several NEPs and mailing-list discussions that didn't result in broad agreement, and no one was funded to work on this. I remember these debates well but could not facilitate them effectively because I was either running a consulting company or leaving that company to start Anaconda. I do remember getting two particularly active participants in the discussion together to further the conversation. The output of those efforts was a document written by the two participants, Mark and Nathaniel, published here: https://numpy.org/neps/nep-0026-missing-data-summary.html. It goes into a lot of detail about the opportunities and challenges from the NumPy perspective.

I think it's very important that we understand that much of the challenge they faced in coming to agreement about NumPy is that changing an existing library, and working out all the details of what must be changed in the code, is much harder than proposing an API based on existing work. Of course, for any reference to be relevant it has to be used, and so it's not completely orthogonal. However, now there are many, many more array libraries and dataframe libraries. Our effort here is to do our best to express the best API we can confidently describe, and then work with projects to consume or produce it.

My personal conclusion about the missing data APIs: the problem actually rests in the fact that NumPy only created an approximate type system (dtypes) and did not build well on Python's type system. A type system is what connects the bytes contained in a data structure to how those bytes should be interpreted by code. Certainly the sentinel concept is clearly a new kind of type (in ndtypes we called it an optional type). Even the masked concept could be considered a kind of type (if you consider the mask bits part of the element data, even though the mask bits are stored elsewhere). It is probably better, though, to consider a masked array as a separate container type that could be used for a dataframe with native support for missing data.

NumPy has a nascent type system, but it is not easily extended (though you can do it in C with some effort). The type extension system is very different from the built-in types, which means NumPy's types are somewhat like Python 1.0 classes. If NumPy had a more easily extended type system, then we could have had many more experiments with missing data and would be farther along. So, in my mind, the missing data problem is deeply connected to the "type" problem, which does not currently have a great solution in Python. I have ideas and designs about how to fix this fundamentally (anyone want to fund me to fix it?), and there is even quite a bit of code in the xnd, ndtypes, and mtypes repositories (some of which may be useful).

For the purposes of this consortium, however, I think we will have to effectively follow what Vaex is doing here (and what it sounds like pandas is heading towards): have both NaN and NA, and leave it to libraries to comply with the standard.
FYI, PySpark follows the NULL semantics defined in ANSI SQL. We documented our behavior at http://spark.apache.org/docs/latest/sql-ref-null-semantics.html.
This issue is dedicated to discussing the large topic of "missing" data.
First, a bit on names. I think we can reasonably choose between `NA`, `null`, or `missing` as a general name for "missing" values. We'd use that to inform decisions on method names like `DataFrame.isna()` vs. `DataFrame.isnull()` vs. ... Pandas favors `NA`, databases might favor `null`, Julia uses `missing`. I don't have a strong opinion here.

Some topics of discussion:
I think we'd like that the introduction of missing data should not fundamentally change the dtype of a column. This is not the case with pandas: for int-dtype data, `NaN` is used as the missing value indicator, and since `NaN` is a float, the column is cast to float64 dtype (see the sketch below). Ideally the result would preserve the int dtype for `B` and `C`. At this moment, I don't have a strong opinion on whether the dtype for `B` should be a plain `int64`, or something like a `Union[int64, NA]`.
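A small illustration of the behavior described above, and of pandas' nullable `Int64` extension dtype, which keeps the integer dtype alongside missing values (a stand-in example with made-up columns, assuming a reasonably recent pandas):

```python
import numpy as np
import pandas as pd

# Default pandas behavior: a column containing NaN cannot stay int64,
# so B and C end up as float64 even though their non-missing values are ints.
df = pd.DataFrame({"A": [1, 2], "B": [1, np.nan], "C": [1, np.nan]})
print(df.dtypes)   # A: int64, B: float64, C: float64

# The nullable extension dtype keeps the integer dtype and uses NA instead.
s = pd.Series([1, None], dtype="Int64")
print(s.dtype)     # Int64
print(s.tolist())  # [1, <NA>]
```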
In general, missing values should propagate in arithmetic and comparison operations (using `<NA>` as a marker for a missing value). There might be a few exceptions. For example, `0 ** NA` might be 1 rather than `NA`, since it doesn't matter exactly what value `NA` takes on.
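For what it's worth, pandas' nullable integer dtype already behaves this way; a small illustration of the propagation rule (assuming a recent pandas):

```python
import pandas as pd

# NA propagates through arithmetic and comparisons with the nullable Int64 dtype.
a = pd.array([1, 2, pd.NA], dtype="Int64")

print(a + 1)  # values: 2, 3, <NA>
print(a > 1)  # values: False, True, <NA>
```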
For boolean logical operations (and, or, xor), libraries should implement three-valued (Kleene) logic; the pandas docs have a table. The short version is that the result should be `NA` if it depends on whether the `NA` operand is True or False. For example, `True | NA` is `True`, since it doesn't matter whether that `NA` is "really" True or False.
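pandas' nullable boolean dtype implements exactly this Kleene logic; a brief illustration (assuming a recent pandas):

```python
import pandas as pd

a = pd.array([True, False, pd.NA], dtype="boolean")
b = pd.array([pd.NA, pd.NA, pd.NA], dtype="boolean")

print(a | b)  # values: True, <NA>, <NA>   (True | NA is True)
print(a & b)  # values: <NA>, False, <NA>  (False & NA is False)
```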
Libraries might need to implement a scalar `NA` value, but I'm not sure. As a user, you would get this from indexing to get a scalar, or in an operation that produces an NA result.

What semantics should this scalar NA have? In particular, should it be typed? This is something we've struggled with in recent versions of pandas. There's a desire to preserve a property along the lines of the following:
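(A stand-in sketch of the kind of property meant here, assuming pandas-style semantics; the arrays and values are made up:)

```python
import pandas as pd

a = pd.Series(pd.to_datetime(["2000-01-02", "2000-01-03"]))
b = pd.Series(pd.to_datetime(["2000-01-01", "2000-01-01"]))

b_na = b.copy()
b_na.iloc[0] = pd.NaT  # the first value of the second array is missing

# The result dtype should not depend on whether an NA is present.
print((a - b).dtype)     # timedelta64[ns]
print((a - b_na).dtype)  # timedelta64[ns]
```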
Where the first value in the second array is `NA`. If you have a single `NA` without any dtype, you can't implement that property. There's a long thread on this at pandas-dev/pandas#28095.