Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/ufrm missing lucodes #698

Merged
merged 3 commits into from
Oct 26, 2021

Conversation

davemfish
Copy link
Contributor

This PR adds error-checking & handling to guard against lucodes that are present in the raster but missing from the biophysical table.

The approach taken here is heavily dictated by the fact that the biophysical table data is stored in a 2d sparse matrix. That seems like the right choice for the context of _lu_to_cn_op, but it does obfuscate things a bit when it comes to guarding against bad data. I'm interested to hear more opinions regarding those tradeoffs.

Fixes #686

Checklist

  • Updated HISTORY.rst (if these changes are user-facing)

  • Updated the user's guide (if needed)

@davemfish davemfish self-assigned this Oct 25, 2021
@davemfish davemfish added this to the 3.9.2 milestone Oct 25, 2021
@phargogh phargogh self-requested a review October 25, 2021 17:20
Copy link
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @davemfish ! I think there might be a small bug in the first ValueError check, but I could absolutely be wrong about this. Could you take a look and let me know what you think?

# Find the code that raised the IndexError, and possibly
# any others that also would have.
lucodes = numpy.unique(valid_lucodes)
missing_codes = lucodes[lucodes >= lucode_to_cn_table.shape[0]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be very wrong about this, but to my eye I think this might always retrieve the top n lucodes. When I make the following change to tests/test_ufrm.py (and also PDB'd into the stacktrace):

diff --git a/tests/test_ufrm.py b/tests/test_ufrm.py
index 8f8e2ccc1..a6487fd4b 100644
--- a/tests/test_ufrm.py
+++ b/tests/test_ufrm.py
@@ -182,7 +182,8 @@ class UFRMTests(unittest.TestCase):
         # These are codes that will raise an IndexError on
         # indexing into the CN table sparse matrix. The test
         # LULC raster has values from 0 to 21.
-        bad_cn_table = cn_table[cn_table['lucode'] < 15]
+        bad_cn_table = cn_table[
+            (cn_table['lucode'] > 3) & (cn_table['lucode'] < 15)]
         bad_cn_table.to_csv(bad_cn_table_path, index=False)
         args['curve_number_table_path'] = bad_cn_table_path

I get the top end of the range instead of [0, 2, 3, 16, 17, 18, 21]:

ValueError: The biophysical table is missing a row for lucode(s)[16, 17, 18, 21]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you're right, but also that's how I originally intended it. When it hits the IndexError it's finding only the codes that can raise that IndexError. We could also do the check that appears in the next block, where we check for codes that do not raise the IndexError but are still missing ([0, 2, 3] in your example). What do you think?

The other thing is, in this scope we only ever know the pixel values that appear in the current iterblocks block, right? So in either case (the IndexError or the subsequent check for empty rows) our error message might miss values that would get caught in a subsequent raster block.

Also, there were bugs in the assertRaises blocks of all these tests that were preventing the nested assertions from even being called. I fixed that and made some assertions a bit more explicit to better differentiate the two cases that are being tested in the one new test.

Copy link
Member

@phargogh phargogh Oct 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think?

After poking around this a bit more, I think what you have here is actually a complete solution! I constructed a table (Biophysical_water_SF_bad.csv) to obviously be missing required values but have enough and large-enough values in the matrix to not trigger the IndexError. When the CSR matrix is indexed into (which does not raise IndexError), the second check for all-zero rows catches all of the missing internal values as we would hope:

ValueError: The biophysical table is missing a row for lucode(s)[0, 2, 3, 4, 5, 8, 10, 16, 17, 18, 21]

So I was wrong and this test is good to go!

The other thing is, in this scope we only ever know the pixel values that appear in the current iterblocks block, right? So in either case (the IndexError or the subsequent check for empty rows) our error message might miss values that would get caught in a subsequent raster block.

Yep, you're absolutely right ... the only way we'll know if our error message reflects all the missing values is by checking all of the raster values. In my opinion, the value (and relative ease of maintenance) of failing fast will probably outweigh the benefits of a truly complete error message.

Also, there were bugs in the assertRaises blocks of all these tests that were preventing the nested assertions from even being called. I fixed that and made some assertions a bit more explicit to better differentiate the two cases that are being tested in the one new test.

Oh man, thanks for catching that!

Comment on lines +788 to +790
# Even without an IndexError, still must guard against
# lucodes that can index into the sparse matrix but were
# missing from the biophysical table. They have rows of all 0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very helpful.

Comment on lines +181 to +184
# drop rows with lucodes known to exist in lulc raster
# These are codes that will raise an IndexError on
# indexing into the CN table sparse matrix. The test
# LULC raster has values from 0 to 21.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The summary here of what's in the LULC raster is a very good idea and was helpful in digging into this!

@davemfish davemfish requested a review from phargogh October 25, 2021 19:04
Copy link
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong about my previous assessment of missing values within the bound of the sparse matrix (see #698 (comment)), so this is good to go. The one build failure is pretty clearly a github response error when fetching tags and shouldn't have anything to do with this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UFRM: error-checking when LULC and biophysical table mismatch
2 participants