Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check If filter_metadata Function Results in Empty Metadata Table #164

Merged
merged 12 commits into from
Jun 12, 2024

Conversation

michaelmckinsey1
Copy link
Collaborator

This PR:

  • Refactors filter_metadata code and unit tests.
  • Raises an error in filter_metadata if the given function results in an empty metadata table.
  • Adds a unit test for the new check

This change was due to the observation that the current behavior of filter_metadata can be confusing for users. Say if the metadata value was the integer 1, but the user filters for the string "1". The current function will allow this filtering, but return an empty table because "1" does not exist in the table. Then the user will be confused why their table is empty. The new code catches this case and will print "The provided filter function resulted in an empty MetadataTable.", which will help the user catch their error sooner.

Note: Ideally, we would want to say that The value "1" does not exist in the metadata table, but filter_metadata takes a function as an argument, so parsing out the value is not easy.

@michaelmckinsey1 michaelmckinsey1 added area-tests Issues and PRs involving Thicket's automated tests area-thicket Issues and PRs involving Thicket's core Thicket datastructure and associated classes priority-normal Normal priority issues and PRs status-ready-for-review This PR is ready to be reviewed by assigned reviewers type-feature Requests for new features or PRs which implement new features labels May 21, 2024
@michaelmckinsey1 michaelmckinsey1 self-assigned this May 21, 2024
@michaelmckinsey1 michaelmckinsey1 requested a review from ilumsden May 22, 2024 16:39
Copy link
Collaborator

@ilumsden ilumsden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this looks good, but there's a major problem (with a simple fix) in the check for an empty metadata table.

]
# filter metadata table
filtered_rows = new_thicket.metadata.apply(select_function, axis=1)
if all(filtered_rows) is False:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three issues with this check (i.e., the entire all(filtered_rows) is False statement).

The first two issues center around the contents of filtered_rows. Assuming users pass a valid filter to filter_metadata, DataFrame.apply will always return a Series of booleans with 1 boolean per row of the metadata table.

As a result, the first issue with this check is that it will only return False (and thus skip the body of the if statement) if the filter won't remove anything from the metadata table. In other words, this check is not looking for the creation of an empty metadata table; it is looking for a no-op filter. This is because all only returns True if every element is True. In this case, True means that a row in the metadata table should be kept by the filter. So, if all returns True in this case, it means all existing rows in the metadata table will be kept. Conversely, if all returns False, it just means that some number of rows will be removed, not necessarily that all rows will be removed.

Instead, you want to use any. If any returns False, that means all rows will be removed.

The second issue is actually the way you're using all (this will apply to any too). Using the Python built-in all function on a Series is more-or-less undefined behavior. It may work, but there's no guarantee that it will work in the future. Instead, you should be using Series.all()/Series.any().

Finally, the last issue is the use of is False. Using is for a value comparison is undefined behavior. You should use == instead. is should be used exclusively for checking if two objects are the same. Internally, this actually checks if their object IDs (which are essentially integer representations of C pointers) are equal.

However, in this case, you don't even need to do an equivalence check. Just add not before the invocation of all/any.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I see what you mean. The correct version is if not filtered_rows.any(). What I meant to catch was the case when there are no True rows only, but with all I was catching the case where there are some True some False. I updated the conditional.
    image

  2. Done.

  3. Actually, using == in this case is an anti-pattern and Flake8 will complain in regard to E712 (I know because I tried it). So according to Flake8, if x is True is actually better practice than if x == True. But like you said in your last sentence, I agree with if x or if not x.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding 3, Flake8 is actually wrong. It's supposed to be enforcing PEP 8, and according to that PEP (see image below), if x is True is worse than if x == True. However, according to the PEP, both are bad, and if x should be used instead.
image

Regardless, what you have now is correct. Just thought I'd share this.

filter_one_column(th, columns_values)
filter_multiple_and(th, columns_values)
filter_multiple_or(th, columns_values)
check_errors(th)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 4 functions should really be split into 4 different tests (i.e., functions with names starting with test_). If you want to, you can do it in this PR. Or, it can be done in a separate PR.

Copy link
Collaborator Author

@michaelmckinsey1 michaelmckinsey1 May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Done. Actually some of these functions are imported by

thicket/tests/test_concat_thickets.py
thicket/tests/test_groupby.py

So they need to be formatted in this way (they need to take a thicket and arguments instead of a fixture).

However, check_errors should be its own test

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these functions are needed by multiple test modules, they shouldn't be in test_filter_metadata.py. They should be testing utility functions that are available from some other module. And, regardless, the actual test fixtures for each of these functions should still be separate. That way, if one of them fails, we immediately know the specific fixture that's failing, rather than just knowing it's one of N fixtures.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this doesn't have to be done by this PR. I'll spin this discussion off into an issue, so we can track it separately.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracking issue for this has been created (#166)

@michaelmckinsey1 michaelmckinsey1 requested a review from ilumsden May 29, 2024 17:54
@michaelmckinsey1 michaelmckinsey1 force-pushed the feature-improve_filtmeta_check branch from 386d4c1 to ce9d77c Compare May 29, 2024 17:58
Copy link
Collaborator

@ilumsden ilumsden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Approved!

@pearce8 pearce8 merged commit e618242 into LLNL:develop Jun 12, 2024
4 checks passed
@slabasan slabasan added this to the 2024.2.0 milestone Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-tests Issues and PRs involving Thicket's automated tests area-thicket Issues and PRs involving Thicket's core Thicket datastructure and associated classes priority-normal Normal priority issues and PRs status-ready-for-review This PR is ready to be reviewed by assigned reviewers type-feature Requests for new features or PRs which implement new features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants