-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check If filter_metadata Function Results in Empty Metadata Table #164
Check If filter_metadata Function Results in Empty Metadata Table #164
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of this looks good, but there's a major problem (with a simple fix) in the check for an empty metadata table.
thicket/thicket.py
Outdated
] | ||
# filter metadata table | ||
filtered_rows = new_thicket.metadata.apply(select_function, axis=1) | ||
if all(filtered_rows) is False: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are three issues with this check (i.e., the entire all(filtered_rows) is False
statement).
The first two issues center around the contents of filtered_rows
. Assuming users pass a valid filter to filter_metadata
, DataFrame.apply
will always return a Series
of booleans with 1 boolean per row of the metadata table.
As a result, the first issue with this check is that it will only return False
(and thus skip the body of the if statement) if the filter won't remove anything from the metadata table. In other words, this check is not looking for the creation of an empty metadata table; it is looking for a no-op filter. This is because all
only returns True
if every element is True
. In this case, True
means that a row in the metadata table should be kept by the filter. So, if all
returns True
in this case, it means all existing rows in the metadata table will be kept. Conversely, if all
returns False
, it just means that some number of rows will be removed, not necessarily that all rows will be removed.
Instead, you want to use any
. If any
returns False
, that means all rows will be removed.
The second issue is actually the way you're using all
(this will apply to any
too). Using the Python built-in all
function on a Series
is more-or-less undefined behavior. It may work, but there's no guarantee that it will work in the future. Instead, you should be using Series.all()
/Series.any()
.
Finally, the last issue is the use of is False
. Using is
for a value comparison is undefined behavior. You should use ==
instead. is
should be used exclusively for checking if two objects are the same. Internally, this actually checks if their object IDs (which are essentially integer representations of C pointers) are equal.
However, in this case, you don't even need to do an equivalence check. Just add not
before the invocation of all
/any
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I see what you mean. The correct version is
if not filtered_rows.any()
. What I meant to catch was the case when there are noTrue
rows only, but withall
I was catching the case where there are someTrue
someFalse
. I updated the conditional.
-
Done.
-
Actually, using
==
in this case is an anti-pattern and Flake8 will complain in regard to E712 (I know because I tried it). So according to Flake8,if x is True
is actually better practice thanif x == True
. But like you said in your last sentence, I agree withif x
orif not x
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding 3, Flake8 is actually wrong. It's supposed to be enforcing PEP 8, and according to that PEP (see image below), if x is True
is worse than if x == True
. However, according to the PEP, both are bad, and if x
should be used instead.
Regardless, what you have now is correct. Just thought I'd share this.
filter_one_column(th, columns_values) | ||
filter_multiple_and(th, columns_values) | ||
filter_multiple_or(th, columns_values) | ||
check_errors(th) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 4 functions should really be split into 4 different tests (i.e., functions with names starting with test_
). If you want to, you can do it in this PR. Or, it can be done in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Done. Actually some of these functions are imported by
thicket/tests/test_concat_thickets.py
thicket/tests/test_groupby.py
So they need to be formatted in this way (they need to take a thicket and arguments instead of a fixture).
However, check_errors
should be its own test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these functions are needed by multiple test modules, they shouldn't be in test_filter_metadata.py
. They should be testing utility functions that are available from some other module. And, regardless, the actual test fixtures for each of these functions should still be separate. That way, if one of them fails, we immediately know the specific fixture that's failing, rather than just knowing it's one of N fixtures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this doesn't have to be done by this PR. I'll spin this discussion off into an issue, so we can track it separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracking issue for this has been created (#166)
386d4c1
to
ce9d77c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Approved!
This PR:
filter_metadata
code and unit tests.filter_metadata
if the given function results in an empty metadata table.This change was due to the observation that the current behavior of
filter_metadata
can be confusing for users. Say if the metadata value was the integer1
, but the user filters for the string"1"
. The current function will allow this filtering, but return an empty table because"1"
does not exist in the table. Then the user will be confused why their table is empty. The new code catches this case and will print"The provided filter function resulted in an empty MetadataTable."
, which will help the user catch their error sooner.Note: Ideally, we would want to say that
The value "1" does not exist in the metadata table
, butfilter_metadata
takes afunction
as an argument, so parsing out the value is not easy.