API: Add NDFrame property to disallow duplicates #27108

TomAugspurger · 2019-06-28T21:31:45Z

I'd like to be able to have an index, and ensure that no operation introduces duplicates.

idx = pd.Index(..., allow_duplicates=False)
s = pd.Series(..., index=idx)

From here, any pandas operation that introduces duplicates (e.g. s.loc[['a', 'a']]) would raise, rather than return an Index with two values.


jorisvandenbossche · 2019-06-28T22:56:38Z

Discussion we had about this: should this "property" live on the Index, or on the DataFrame/Series object?

TomAugspurger · 2019-06-29T15:41:57Z

Yeah, my example kinda exposes that difficulty, as the new index in s.loc[['a', 'a']] isn't the original index. So maybe it would be better on the dataframe?

TomAugspurger · 2019-09-03T20:47:36Z

I played with this a bit today.

For a user-facing API, I think it makes the most sense to have an allow_duplicate_labels keyword that can be passed to Series / DataFrame. IOW,

>>> pd.Series(data, index, allow_duplicate_labels=False)

rather than

>>> pd.Series(data, pd.Index(..., allow_duplicate_labels=False))

While the duplicate detection is done in the Index, it seems more ergonomic to have it on NDFrame.

A potential downside is that you can't disallow duplicates in the columns while allowing duplicates in the index. If we really wanted that, we can support setting allow_duplicate_labels=True/False/rows/columns. But let's leave that aside for now.
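As a rough illustration of the detection step mentioned above, here is a minimal sketch using only existing Index API (`Index.is_unique` and `Index.duplicated`). The frame and its labels are made up for the example, and this is not the code from the branch discussed in this thread.

```python
import pandas as pd

# Example frame with a duplicated row label and unique column labels.
df = pd.DataFrame([[1, 2], [3, 4]], index=["a", "a"], columns=["x", "y"])

# Index.is_unique is the kind of check a "no duplicates" setting can rely on.
rows_ok = df.index.is_unique      # "a" appears twice in the index
cols_ok = df.columns.is_unique    # "x" and "y" are distinct

# Index.duplicated marks repeated labels, which is handy for an error message.
dupes = df.index[df.index.duplicated(keep=False)].unique().tolist()

print(rows_ok, cols_ok, dupes)
```

Checking rows and columns separately like this is also what a rows/columns variant of the keyword would need.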

For an implementation, it seems like _metadata offers a non-invasive solution. Without having to touch rename, the following is possible:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(index=['a', 'A'], allow_duplicate_labels=False)

In [3]: df
Out[3]:
Empty DataFrame
Columns: []
Index: [a, A]

In [4]: df.rename(str.upper)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
<ipython-input-4-17c8fb0b7c7f> in <module>
----> 1 df.rename(str.upper)

~/sandbox/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    233         @wraps(func)
    234         def wrapper(*args, **kwargs) -> Callable[..., Any]:
--> 235             return func(*args, **kwargs)
    236
    237         kind = inspect.Parameter.POSITIONAL_OR_KEYWORD

...

~/sandbox/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
    562             # TODO: position, value, not too large.
    563             msg = "Index has duplicates."
--> 564             raise DuplicateLabelError(msg)
    565
    566     # --------------------------------------------------------------------

DuplicateLabelError: Index has duplicates.

Likewise for concat, __getitem__, .loc, etc. These all call __finalize__. Since I think we want propagation by default, it does exactly what we want.

The changes are at https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:unique-index?expand=1, but all I've needed so far is

- Adding allow_duplicate_labels to the NDFrame constructors
- Adding a getter / setter NDFrame.allows_duplicate_labels
- Adding allows_duplicate_labels to the NDFrame._metadata
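To show why `_metadata` gives propagation almost for free, here is a small sketch with a Series subclass. `TaggedSeries` and its `tag` attribute are hypothetical names invented for this example; the only pandas features assumed are the documented subclassing hooks `_metadata` and `_constructor`, and the fact that `rename` calls `__finalize__`.

```python
import pandas as pd


class TaggedSeries(pd.Series):
    # Attributes listed in _metadata are copied to results by __finalize__.
    _metadata = ["tag"]

    @property
    def _constructor(self):
        # Make pandas operations return TaggedSeries instead of plain Series.
        return TaggedSeries


s = TaggedSeries([1, 2], index=["a", "b"])
s.tag = "no-duplicates"  # hypothetical stand-in for allows_duplicate_labels

# rename builds a new object and finalizes it from `s`, so `tag` carries over.
renamed = s.rename(str.upper)
print(type(renamed).__name__, renamed.tag)
```

An attribute that travels this way can be inspected wherever the new index is set, which is roughly the role the getter / setter in the list above would play.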

In preparation for pandas-dev#27108 (disallowing duplicates), we need to enhance our metadata propagation. *We need a way for a particular attribute to determine how it's propagated for a particular method*. Our current method of metadata propagation lacks two features:

1. It only copies an attribute from a source NDFrame to a new NDFrame. There is no way to propagate metadata from a collection of NDFrames (say from `pd.concat`) to a new NDFrame.
2. It only and always copies the attribute. This is not always appropriate when dealing with a collection of input NDFrames, as the source attributes may differ. The resolution of conflicts will differ by attribute: for `Series.name` we might throw away the name; for `Series.allow_duplicates`, any Series disallowing duplicates should mean the output disallows duplicates.
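The per-attribute conflict resolution described above can be sketched in plain Python. Everything here (`resolve_name`, `resolve_allows_duplicates`, `RESOLVERS`, `finalize_many`) is a hypothetical illustration of the idea, not the design of the refactor itself, and inputs are modelled as plain dicts rather than NDFrames.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence


def resolve_name(names: Sequence[Optional[str]]) -> Optional[str]:
    # Keep the name only when every input agrees; otherwise throw it away.
    first = names[0] if names else None
    return first if all(n == first for n in names) else None


def resolve_allows_duplicates(flags: Sequence[bool]) -> bool:
    # Any input that disallows duplicates makes the output disallow them.
    return all(flags)


# Hypothetical registry: each attribute decides how it is combined.
RESOLVERS: Dict[str, Callable[[Sequence[Any]], Any]] = {
    "name": resolve_name,
    "allows_duplicate_labels": resolve_allows_duplicates,
}


def finalize_many(inputs: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Combine metadata from a collection of inputs, e.g. the pieces of a concat.
    return {
        attr: resolve([obj[attr] for obj in inputs])
        for attr, resolve in RESOLVERS.items()
    }


combined = finalize_many(
    [
        {"name": "x", "allows_duplicate_labels": True},
        {"name": "y", "allows_duplicate_labels": False},
    ]
)
print(combined)
```

A registry keyed by attribute keeps the "how do I combine?" decision next to the attribute rather than inside each method.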

TomAugspurger · 2020-09-21T11:34:08Z

This was closed by #28394.
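For readers arriving later: in released pandas (1.2 and newer, to the best of my knowledge) this behavior is reachable through `set_flags` rather than the constructor keyword sketched earlier in the thread. The snippet below is a minimal example assuming that API; the data is made up.

```python
import pandas as pd

# Opt in to "no duplicate labels" on an existing object.
s = pd.Series([0, 1], index=["a", "b"]).set_flags(allows_duplicate_labels=False)
flag = s.flags.allows_duplicate_labels

# Renaming "a" to "b" would create a duplicate label, so pandas raises.
try:
    s.rename({"a": "b"})
    raised = False
except pd.errors.DuplicateLabelError:
    raised = True

print(flag, raised)
```

Not every method is guaranteed to propagate or check the flag, so treat it as a guard for the operations you have verified rather than a blanket guarantee.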

TomAugspurger added the API Design and Index (Related to the Index class or subclasses) labels Jun 28, 2019

TomAugspurger mentioned this issue Sep 3, 2019

Non-silently handle duplicate column names #28262

Closed

This was referenced Sep 4, 2019

Various methods don't call call __finalize__ #28283

Open

DOC/API: document how to use metadata #8572

Open

TomAugspurger changed the title ~~API: Add property to Index to disallow duplicates~~ API: Add NDFrame property to disallow duplicates Sep 7, 2019

TomAugspurger mentioned this issue Sep 7, 2019

REF/ENH: Refactor NDFrame finalization #28334

Closed

3 tasks

mroeschke added the Enhancement label May 5, 2020

TomAugspurger closed this as completed Sep 21, 2020

jorisvandenbossche mentioned this issue Mar 27, 2023

DEPR: flags #52165

Open
