Check If filter_metadata Function Results in Empty Metadata Table #164

michaelmckinsey1 · 2024-05-21T23:08:03Z

This PR:

- Refactors the filter_metadata code and unit tests.
- Raises an error in filter_metadata if the given function results in an empty metadata table.
- Adds a unit test for the new check.

This change was motivated by the observation that the current behavior of filter_metadata can confuse users. Suppose a metadata value is the integer 1, but the user filters for the string "1". The current function allows this filter but returns an empty table, because "1" does not exist in the table, and the user is left wondering why the table is empty. The new code catches this case and raises an error with the message "The provided filter function resulted in an empty MetadataTable.", which helps the user find the mistake sooner.

Note: Ideally, the message would say that the value "1" does not exist in the metadata table, but filter_metadata takes a function as an argument, so extracting the value is not easy.
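For illustration, the int-versus-string mismatch described above can be reproduced with plain pandas. This is a minimal sketch, not code from this PR: the `metadata` DataFrame, its column names, and its values are hypothetical stand-ins for a Thicket metadata table.

```python
import pandas as pd

# Hypothetical stand-in for a Thicket metadata table: one row per profile.
metadata = pd.DataFrame(
    {"cluster": ["quartz", "lassen"], "problem_size": [1, 2]},
    index=pd.Index([101, 102], name="profile"),
)

# Filter on the integer 1: the first row matches.
keep_int = metadata.apply(lambda row: row["problem_size"] == 1, axis=1)

# Filter on the string "1": no row matches, because 1 != "1".
keep_str = metadata.apply(lambda row: row["problem_size"] == "1", axis=1)

rows_int = metadata[keep_int]
rows_str = metadata[keep_str]
print(len(rows_int))  # 1
print(len(rows_str))  # 0
```

The second filter is accepted without complaint and silently yields an empty table, which is the situation the new check is meant to surface.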

ilumsden

Most of this looks good, but there's a major problem (with a simple fix) in the check for an empty metadata table.

ilumsden · 2024-05-24T18:46:29Z

thicket/thicket.py

-                ]
+        # filter metadata table
+        filtered_rows = new_thicket.metadata.apply(select_function, axis=1)
+        if all(filtered_rows) is False:


There are three issues with this check (i.e., the entire all(filtered_rows) is False statement).

The first two issues center around the contents of filtered_rows. Assuming users pass a valid filter to filter_metadata, DataFrame.apply will always return a Series of booleans with 1 boolean per row of the metadata table.

As a result, the first issue is that this check evaluates to False (and thus skips the body of the if statement) only when the filter removes nothing from the metadata table. In other words, the check does not detect an empty metadata table; it detects any filter that is not a no-op. This is because all returns True only if every element is True, and here True means that the filter keeps the row. So, if all returns True, every existing row in the metadata table is kept. Conversely, if all returns False, it means only that at least one row is removed, not necessarily that all rows are removed.

Instead, you want to use any. If any returns False, that means all rows will be removed.

The second issue is the way you're calling all (the same applies to any). Passing a Series to the Python built-in all works only because the built-in iterates over the Series' values one at a time; it is not the pandas API for this reduction. Instead, you should use Series.all()/Series.any().
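To make the difference between all and any concrete, here is a small sketch using Series.all() and Series.any(). The three boolean masks are made-up examples of what a filter might produce, where True means the row is kept.

```python
import pandas as pd

# Made-up filter results: True means "keep this row".
masks = {
    "keeps every row": pd.Series([True, True, True]),
    "keeps some rows": pd.Series([True, False, True]),
    "keeps no rows": pd.Series([False, False, False]),
}

# Record (all, any) for each mask as plain Python booleans.
summary = {
    label: (bool(mask.all()), bool(mask.any())) for label, mask in masks.items()
}

for label, (all_true, any_true) in summary.items():
    print(f"{label}: all={all_true}, any={any_true}")
```

Only the last mask has any equal to False, so `not mask.any()` singles out the empty result, whereas `not mask.all()` is also True for the partial filter.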

Finally, the last issue is the use of is False. is checks whether two operands are the same object, not whether their values are equal, so using it for a value comparison is fragile. In CPython, it compares object IDs (which are essentially integer representations of C pointers). For value comparisons, use == instead.

However, in this case, you don't even need to do an equivalence check. Just add not before the invocation of all/any.
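As a hedged illustration of why identity checks on booleans are fragile, the sketch below uses a NumPy boolean scalar. The assumption here is that a reduction can hand back a NumPy boolean rather than Python's own False; whether a given pandas call does so may depend on the version.

```python
import numpy as np

# A NumPy boolean with the value False. It is not the Python singleton False.
result = np.bool_(False)

# Identity: are these the same object? No, so this is False.
identity_check = result is False

# Equality: are the values equal? Yes. (Flake8 reports E712 for this form.)
equality_check = bool(result == False)  # noqa: E712

# Truthiness: no comparison needed at all.
truthiness_check = not result

print(identity_check, equality_check, truthiness_check)
```

With `is False`, the branch would never run for this value even though it is falsy; `not result` behaves as intended for both Python and NumPy booleans.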

michaelmckinsey1 · 2024-05-28

I see what you mean. The correct version is if not filtered_rows.any(). I meant to catch only the case where no rows are True, but with all I was also catching the case where some rows are True and some are False. I updated the conditional.

Done.

Actually, using == in this case is an anti-pattern, and Flake8 flags it as E712 (I know because I tried it). So according to Flake8, if x is True is better practice than if x == True. But as you said in your last sentence, I agree that if x or if not x is the way to go.

ilumsden · 2024-06-03

Regarding 3, Flake8 is actually wrong. It's supposed to enforce PEP 8, and according to that PEP, if x is True is worse than if x == True. However, according to the PEP, both are bad, and if x should be used instead.

Regardless, what you have now is correct. Just thought I'd share this.
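Putting the thread together, the agreed check could look roughly like the sketch below. It is not the code merged in this PR: the function name `filter_rows_or_raise` is hypothetical, the table is a plain DataFrame rather than a Thicket, and `ValueError` is an assumed exception type. Only the condition and the message text come from the discussion.

```python
import pandas as pd


def filter_rows_or_raise(metadata, select_function):
    """Apply a row filter and refuse to return an empty table (hypothetical helper)."""
    # One boolean per row: True means the row is kept.
    filtered_rows = metadata.apply(select_function, axis=1)
    # No row kept at all: the filter would produce an empty table.
    if not filtered_rows.any():
        # ValueError is an assumption; the thread does not name the exception type.
        raise ValueError(
            "The provided filter function resulted in an empty MetadataTable."
        )
    return metadata[filtered_rows]


metadata = pd.DataFrame({"problem_size": [1, 2]})

# Matching on the integer keeps one row.
kept = filter_rows_or_raise(metadata, lambda row: row["problem_size"] == 1)

# Matching on the string keeps nothing, so the helper raises.
try:
    filter_rows_or_raise(metadata, lambda row: row["problem_size"] == "1")
    message = None
except ValueError as err:
    message = str(err)

print(len(kept), message)
```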

ilumsden · 2024-05-24T18:48:35Z

thicket/tests/test_filter_metadata.py

    filter_one_column(th, columns_values)
    filter_multiple_and(th, columns_values)
    filter_multiple_or(th, columns_values)
+    check_errors(th)


These 4 functions should really be split into 4 different tests (i.e., functions with names starting with test_). If you want to, you can do it in this PR. Or, it can be done in a separate PR.

michaelmckinsey1 · 2024-05-29 (edited)

~~I agree. Done.~~ Actually, some of these functions are imported by thicket/tests/test_concat_thickets.py and thicket/tests/test_groupby.py, so they need to keep this form (they need to take a thicket and arguments instead of a fixture).

However, check_errors should be its own test.

ilumsden · 2024-06-03

If these functions are needed by multiple test modules, they shouldn't be in test_filter_metadata.py. They should be test utility functions that are available from some other module. And, regardless, the actual test fixtures for each of these functions should still be separate. That way, if one of them fails, we immediately know the specific fixture that's failing, rather than just knowing it's one of N fixtures.
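One possible shape for that layout is sketched below. Everything in it is hypothetical: the helper names, the idea of a shared module such as thicket/tests/utils.py, and the DataFrame that stands in for a thicket. The point is only the structure: reusable helpers that take the object under test plus arguments, and one test_ function per behavior.

```python
import pandas as pd


# Shared helpers. In a real layout these would live in a common module
# (hypothetical name: thicket/tests/utils.py) so other test modules can import them.
def check_filter_one_column(table, column, value):
    """Take the object under test plus arguments, not a pytest fixture."""
    kept = table[table[column] == value]
    assert not kept.empty
    assert (kept[column] == value).all()


def check_filter_multiple_and(table, criteria):
    """Keep rows that match every column/value pair in criteria."""
    mask = pd.Series(True, index=table.index)
    for column, value in criteria.items():
        mask &= table[column] == value
    kept = table[mask]
    assert not kept.empty
    assert all((kept[column] == value).all() for column, value in criteria.items())


def sample_table():
    """Stand-in for the fixture that would normally provide a thicket."""
    return pd.DataFrame(
        {"cluster": ["quartz", "quartz", "lassen"], "problem_size": [1, 2, 1]}
    )


# One test function per behavior, so a failure names the behavior that broke.
def test_filter_one_column():
    check_filter_one_column(sample_table(), "cluster", "quartz")


def test_filter_multiple_and():
    check_filter_multiple_and(sample_table(), {"cluster": "quartz", "problem_size": 1})
```

Under pytest, `sample_table` would typically be replaced by a fixture argument; it is a plain function here so the sketch runs on its own.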

Again, this doesn't have to be done by this PR. I'll spin this discussion off into an issue, so we can track it separately.

Tracking issue for this has been created (#166)

ilumsden

Looks good. Approved!

michaelmckinsey1 added 2 commits May 21, 2024 17:54

Refactor

30745a9

Add check for empty query

e227308

michaelmckinsey1 self-assigned this May 21, 2024

michaelmckinsey1 added 3 commits May 21, 2024 18:12

Black

909a674

Refactor check to comply with E712

ecf3245

Refactor check to comply with E712

fa7b8fb

michaelmckinsey1 requested a review from ilumsden May 22, 2024 16:39

ilumsden requested changes May 24, 2024

michaelmckinsey1 added 6 commits May 28, 2024 12:36

Improve error check

fd41252

Black

a1bd9f9

Change conditional

92da87d

Add unit test for new check

360b4ad

fix conditional

4497ca7

Black

ce9d77c

michaelmckinsey1 requested a review from ilumsden May 29, 2024 17:54

michaelmckinsey1 force-pushed the feature-improve_filtmeta_check branch from 386d4c1 to ce9d77c Compare May 29, 2024 17:58

Refactor test

9c9beb6

ilumsden mentioned this pull request Jun 3, 2024

Split test fixtures in test_filter_metadata.py #166 (Open)

ilumsden approved these changes Jun 3, 2024

pearce8 merged commit e618242 into LLNL:develop Jun 12, 2024
4 checks passed

slabasan added this to the 2024.2.0 milestone Sep 6, 2024
