API: Add NDFrame property to disallow duplicates #27108

Closed
TomAugspurger opened this issue Jun 28, 2019 · 4 comments
Labels: API Design, Enhancement, Index (related to the Index class or subclasses)

Comments

@TomAugspurger
Contributor

TomAugspurger commented Jun 28, 2019

edit: see #27108 (comment)


I'd like to be able to have an index, and ensure that no operation introduces duplicates.

idx = pd.Index(..., allow_duplicates=False)
s = pd.Series(..., index=idx)

From here, any pandas operation that introduces duplicates (e.g. s.loc[['a', 'a']]) would raise, rather than return an Index with two values.

TomAugspurger added the API Design and Index labels on Jun 28, 2019
@jorisvandenbossche
Member

Discussion we had about this: should this "property" live on the Index, or on the DataFrame/Series object?

@TomAugspurger
Contributor Author

Yeah, my example kinda exposes that difficulty, as the new index in s.loc[['a', 'a']] isn't the original index. So maybe it would be better on the dataframe?

@TomAugspurger
Contributor Author

TomAugspurger commented Sep 3, 2019

I played with this a bit today.

For a user-facing API, I think it makes the most sense to have an allow_duplicate_labels keyword that can be passed to Series / DataFrame. IOW,

>>> pd.Series(data, index, allow_duplicate_labels=False)

rather than

>>> pd.Series(data, pd.Index(..., allow_duplicate_labels=False))

While the duplicate detection is done in the Index, it seems more ergonomic to have it on NDFrame.

A potential downside is that you can't disallow duplicates in the columns while allowing duplicates in the index. If we really wanted that, we can support setting allow_duplicate_labels=True/False/rows/columns. But let's leave that aside for now.


For an implementation, it seems like _metadata offers a non-invasive solution. Without having to touch rename, the following is possible:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(index=['a', 'A'], allow_duplicate_labels=False)

In [3]: df
Out[3]:
Empty DataFrame
Columns: []
Index: [a, A]

In [4]: df.rename(str.upper)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
<ipython-input-4-17c8fb0b7c7f> in <module>
----> 1 df.rename(str.upper)

~/sandbox/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    233         @wraps(func)
    234         def wrapper(*args, **kwargs) -> Callable[..., Any]:
--> 235             return func(*args, **kwargs)
    236
    237         kind = inspect.Parameter.POSITIONAL_OR_KEYWORD

...

~/sandbox/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
    562             # TODO: position, value, not too large.
    563             msg = "Index has duplicates."
--> 564             raise DuplicateLabelError(msg)
    565
    566     # --------------------------------------------------------------------

DuplicateLabelError: Index has duplicates.

Likewise for concat, __getitem__, .loc, etc. These all call __finalize__. Since I think we want propagation by default, it does exactly what we want.
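For reference, the uniqueness check at the bottom of the traceback above might be sketched roughly as follows. This is a standalone illustration, not the code on the branch: the function name `maybe_check_unique`, its signature, and the local `DuplicateLabelError` class are hypothetical stand-ins, and reporting positions per label is just one way to address the "position, value" TODO.

```python
from collections import defaultdict


class DuplicateLabelError(ValueError):
    """Hypothetical stand-in for the exception shown in the traceback."""


def maybe_check_unique(labels, allows_duplicate_labels):
    """Raise if duplicates are disallowed and `labels` repeats a value."""
    if allows_duplicate_labels:
        return

    # Record every position at which each label occurs.
    positions = defaultdict(list)
    for pos, label in enumerate(labels):
        positions[label].append(pos)

    # Keep only the labels that occur more than once.
    duplicates = {label: where for label, where in positions.items() if len(where) > 1}
    if duplicates:
        raise DuplicateLabelError(f"Index has duplicates: {duplicates}")
```

Calling `maybe_check_unique(["A", "A"], allows_duplicate_labels=False)` raises, while the same labels with `allows_duplicate_labels=True` pass silently.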

The changes are at https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:unique-index?expand=1, but all I've needed so far is

  1. Adding allow_duplicate_labels to the NDFrame constructors
  2. Adding a getter / setter NDFrame.allows_duplicate_labels
  3. Adding allows_duplicate_labels to the NDFrame._metadata
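The three steps above can be illustrated with a toy class. This is only a sketch of the idea under stated assumptions: `ToyFrame` is a hypothetical stand-in for NDFrame, its `__finalize__` uses a simplified signature, and the duplicate check is done with a plain `set` rather than in an Index.

```python
class DuplicateLabelError(ValueError):
    """Hypothetical stand-in for the exception shown in the traceback."""


class ToyFrame:
    # Step 3: the attribute is listed in _metadata so __finalize__ copies it.
    _metadata = ["allows_duplicate_labels"]

    # Step 1: the constructor accepts the keyword.
    def __init__(self, index, allow_duplicate_labels=True):
        self.index = list(index)
        self._allows_duplicate_labels = True
        self.allows_duplicate_labels = allow_duplicate_labels

    # Step 2: a getter / setter; the setter validates the current labels.
    @property
    def allows_duplicate_labels(self):
        return self._allows_duplicate_labels

    @allows_duplicate_labels.setter
    def allows_duplicate_labels(self, value):
        value = bool(value)
        if not value and len(set(self.index)) != len(self.index):
            raise DuplicateLabelError("Index has duplicates.")
        self._allows_duplicate_labels = value

    def __finalize__(self, other):
        # Copy each metadata attribute from the source object. Going through
        # the setter is what triggers the duplicate check on the new labels.
        for name in self._metadata:
            setattr(self, name, getattr(other, name))
        return self

    def rename(self, func):
        # rename itself knows nothing about duplicates; it only finalizes.
        result = ToyFrame([func(label) for label in self.index])
        return result.__finalize__(self)
```

With this, `ToyFrame(["a", "A"], allow_duplicate_labels=False).rename(str.upper)` raises `DuplicateLabelError`, while a rename that keeps the labels distinct returns a new `ToyFrame` that still disallows duplicates.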

TomAugspurger changed the title from "API: Add property to Index to disallow duplicates" to "API: Add NDFrame property to disallow duplicates" on Sep 7, 2019
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 7, 2019
In preparation for pandas-dev#27108
(disallowing duplicates), we need to enhance our metadata propagation.

*We need a way for a particular attribute to determine how it's
propagated for a particular method*. Our current method of metadata
propagation lacked two features

1. It only copies an attribute from a source NDFrame to a new NDFrame.
   There is no way to propagate metadata from a collection of NDFrames
   (say from `pd.concat`) to a new NDFrame.
2. It only and always copies the attribute. This is not always
   appropriate when dealing with a collection of input NDFrames, as the
   source attributes may differ. The resolution of conflicts will differ
   by attribute: for `Series.name` we might throw away the name, while
   for `Series.allow_duplicates`, any Series disallowing duplicates
   should mean the output disallows duplicates.
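
The per-attribute conflict resolution described in the commit message could be sketched like this. Everything here is illustrative: the `RESOLVERS` table, the function names, and the use of plain dicts in place of NDFrames are assumptions, not the mechanism the commit implements.

```python
def resolve_name(names):
    """Keep the name only if every input agrees on it; otherwise drop it."""
    return names[0] if len(set(names)) == 1 else None


def resolve_allows_duplicate_labels(flags):
    """Any input that disallows duplicates makes the output disallow them."""
    return all(flags)


# Hypothetical registry: each attribute decides how it is combined.
RESOLVERS = {
    "name": resolve_name,
    "allows_duplicate_labels": resolve_allows_duplicate_labels,
}


def finalize_from_many(sources):
    """Combine metadata from a collection of inputs (e.g. for a concat)."""
    return {
        attr: resolver([src[attr] for src in sources])
        for attr, resolver in RESOLVERS.items()
    }
```

For two inputs named `"x"` and `"y"`, where only the second disallows duplicates, `finalize_from_many` returns a `None` name and `allows_duplicate_labels=False`.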
@TomAugspurger
Contributor Author

This was closed by #28394.
