-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix/ufrm missing lucodes #698
Bugfix/ufrm missing lucodes #698
Conversation
…rom the biophysical table. natcap#686.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @davemfish ! I think there might be a small bug in the first ValueError
check, but I could absolutely be wrong about this. Could you take a look and let me know what you think?
# Find the code that raised the IndexError, and possibly | ||
# any others that also would have. | ||
lucodes = numpy.unique(valid_lucodes) | ||
missing_codes = lucodes[lucodes >= lucode_to_cn_table.shape[0]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be very wrong about this, but to my eye I think this might always retrieve the top n
lucodes. When I make the following change to tests/test_ufrm.py
(and also PDB'd into the stacktrace):
diff --git a/tests/test_ufrm.py b/tests/test_ufrm.py
index 8f8e2ccc1..a6487fd4b 100644
--- a/tests/test_ufrm.py
+++ b/tests/test_ufrm.py
@@ -182,7 +182,8 @@ class UFRMTests(unittest.TestCase):
# These are codes that will raise an IndexError on
# indexing into the CN table sparse matrix. The test
# LULC raster has values from 0 to 21.
- bad_cn_table = cn_table[cn_table['lucode'] < 15]
+ bad_cn_table = cn_table[
+ (cn_table['lucode'] > 3) & (cn_table['lucode'] < 15)]
bad_cn_table.to_csv(bad_cn_table_path, index=False)
args['curve_number_table_path'] = bad_cn_table_path
I get the top end of the range instead of [0, 2, 3, 16, 17, 18, 21]
:
ValueError: The biophysical table is missing a row for lucode(s)[16, 17, 18, 21]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you're right, but also that's how I originally intended it. When it hits the IndexError
it's finding only the codes that can raise that IndexError. We could also do the check that appears in the next block, where we check for codes that do not raise the IndexError but are still missing ([0, 2, 3] in your example). What do you think?
The other thing is, in this scope we only ever know the pixel values that appear in the current iterblocks
block, right? So in either case (the IndexError or the subsequent check for empty rows) our error message might miss values that would get caught in a subsequent raster block.
Also, there were bugs in the assertRaises
blocks of all these tests that were preventing the nested assertions from even being called. I fixed that and made some assertions a bit more explicit to better differentiate the two cases that are being tested in the one new test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think?
After poking around this a bit more, I think what you have here is actually a complete solution! I constructed a table (Biophysical_water_SF_bad.csv) to obviously be missing required values but have enough and large-enough values in the matrix to not trigger the IndexError
. When the CSR matrix is indexed into (which does not raise IndexError
), the second check for all-zero rows catches all of the missing internal values as we would hope:
ValueError: The biophysical table is missing a row for lucode(s)[0, 2, 3, 4, 5, 8, 10, 16, 17, 18, 21]
So I was wrong and this test is good to go!
The other thing is, in this scope we only ever know the pixel values that appear in the current iterblocks block, right? So in either case (the IndexError or the subsequent check for empty rows) our error message might miss values that would get caught in a subsequent raster block.
Yep, you're absolutely right ... the only way we'll know if our error message reflects all the missing values is by checking all of the raster values. In my opinion, the value (and relative ease of maintenance) of failing fast will probably outweigh the benefits of a truly complete error message.
Also, there were bugs in the assertRaises blocks of all these tests that were preventing the nested assertions from even being called. I fixed that and made some assertions a bit more explicit to better differentiate the two cases that are being tested in the one new test.
Oh man, thanks for catching that!
# Even without an IndexError, still must guard against | ||
# lucodes that can index into the sparse matrix but were | ||
# missing from the biophysical table. They have rows of all 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very helpful.
# drop rows with lucodes known to exist in lulc raster | ||
# These are codes that will raise an IndexError on | ||
# indexing into the CN table sparse matrix. The test | ||
# LULC raster has values from 0 to 21. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The summary here of what's in the LULC raster is a very good idea and was helpful in digging into this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong about my previous assessment of missing values within the bound of the sparse matrix (see #698 (comment)), so this is good to go. The one build failure is pretty clearly a github response error when fetching tags and shouldn't have anything to do with this PR.
This PR adds error-checking & handling to guard against lucodes that are present in the raster but missing from the biophysical table.
The approach taken here is heavily dictated by the fact that the biophysical table data is stored in a 2d sparse matrix. That seems like the right choice for the context of
_lu_to_cn_op
, but it does obfuscate things a bit when it comes to guarding against bad data. I'm interested to hear more opinions regarding those tradeoffs.Fixes #686
Checklist
Updated HISTORY.rst (if these changes are user-facing)
Updated the user's guide (if needed)