Bugfix/ufrm missing lucodes #698

Merged
merged 3 commits into from
Oct 26, 2021
4 changes: 4 additions & 0 deletions HISTORY.rst
@@ -58,6 +58,10 @@ Unreleased Changes
 * Crop Production Regression
     * Corrected a misspelled column name. The fertilization rate table column
       must now be named ``phosphorus_rate``, not ``phosphorous_rate``.
+* Urban Flood Risk
+    * Fixed a bug where lucodes present in the LULC raster but missing from
+      the biophysical table would either raise a cryptic IndexError or silently
+      apply invalid curve numbers. Now a helpful ValueError is raised.
 
 3.9.1 (2021-09-22)
 ------------------
24 changes: 23 additions & 1 deletion src/natcap/invest/urban_flood_risk_mitigation.py
@@ -773,8 +773,30 @@ def _lu_to_cn_op(
     # pixel and the rows are the curve number index for the landcover
     # type under that pixel (0..3 are CN_A..CN_D and 4 is "unknown")
     valid_lucodes = lucode_array[valid_mask].astype(int)
+
+    try:
+        cn_matrix = lucode_to_cn_table[valid_lucodes]
+    except IndexError:
+        # Find the code that raised the IndexError, and possibly
+        # any others that also would have.
+        lucodes = numpy.unique(valid_lucodes)
+        missing_codes = lucodes[lucodes >= lucode_to_cn_table.shape[0]]
Member

I could be very wrong about this, but to my eye I think this might always retrieve the top n lucodes. When I make the following change to tests/test_ufrm.py (and also PDB'd into the stacktrace):

diff --git a/tests/test_ufrm.py b/tests/test_ufrm.py
index 8f8e2ccc1..a6487fd4b 100644
--- a/tests/test_ufrm.py
+++ b/tests/test_ufrm.py
@@ -182,7 +182,8 @@ class UFRMTests(unittest.TestCase):
         # These are codes that will raise an IndexError on
         # indexing into the CN table sparse matrix. The test
         # LULC raster has values from 0 to 21.
-        bad_cn_table = cn_table[cn_table['lucode'] < 15]
+        bad_cn_table = cn_table[
+            (cn_table['lucode'] > 3) & (cn_table['lucode'] < 15)]
         bad_cn_table.to_csv(bad_cn_table_path, index=False)
         args['curve_number_table_path'] = bad_cn_table_path

I get the top end of the range instead of [0, 2, 3, 16, 17, 18, 21]:

ValueError: The biophysical table is missing a row for lucode(s)[16, 17, 18, 21]
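To see why only the top end of the range is reported, here is a minimal sketch of the bounds check in the `except IndexError` branch. The pixel values and the row count of 15 are assumptions chosen to mirror the example above; they are not read from the test data.

```python
import numpy

# Hypothetical values for illustration only: pixel codes from one raster
# block, and a lookup matrix assumed to have 15 rows (lucodes 0..14).
valid_lucodes = numpy.array([0, 2, 3, 5, 8, 16, 17, 18, 21, 21, 5])
n_table_rows = 15

lucodes = numpy.unique(valid_lucodes)
# Only codes at or beyond the row count can raise IndexError, so only
# those are selected here. Codes 0, 2 and 3 pass this check whether or
# not the table has a row for them.
out_of_range_codes = lucodes[lucodes >= n_table_rows]
print(out_of_range_codes.tolist())  # [16, 17, 18, 21]
```

Codes below the row count need a separate check, because indexing with them succeeds.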

Contributor Author

Yes, I think you're right, but also that's how I originally intended it. When it hits the IndexError it's finding only the codes that can raise that IndexError. We could also do the check that appears in the next block, where we check for codes that do not raise the IndexError but are still missing ([0, 2, 3] in your example). What do you think?

The other thing is, in this scope we only ever know the pixel values that appear in the current iterblocks block, right? So in either case (the IndexError or the subsequent check for empty rows) our error message might miss values that would get caught in a subsequent raster block.

Also, there were bugs in the assertRaises blocks of all these tests that were preventing the nested assertions from even being called. I fixed that and made some assertions a bit more explicit to better differentiate the two cases that are being tested in the one new test.
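The assertRaises problem described above is a general `unittest` pitfall: once the call under test raises, the rest of the `with` body is skipped, so assertions placed inside it never execute. A minimal standalone sketch follows; the function and test names are hypothetical and not taken from the InVEST test suite.

```python
import unittest


def always_fails():
    """Hypothetical stand-in for a call that raises."""
    raise ValueError('The biophysical table is missing a row')


class AssertRaisesPlacement(unittest.TestCase):
    def test_assertions_inside_block_never_run(self):
        reached = []
        with self.assertRaises(ValueError) as cm:
            always_fails()
            reached.append('inside')  # skipped: the line above raised
        reached.append('outside')
        self.assertEqual(reached, ['outside'])
        # Checks on the message belong here, after the block exits.
        self.assertIn('missing a row', str(cm.exception))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(
    AssertRaisesPlacement)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Dedenting the message assertions so they follow the `with` block is what makes them run.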

Member

@phargogh Oct 26, 2021

> What do you think?

After poking around this a bit more, I think what you have here is actually a complete solution! I constructed a table (Biophysical_water_SF_bad.csv) to obviously be missing required values but have enough and large-enough values in the matrix to not trigger the IndexError. When the CSR matrix is indexed into (which does not raise IndexError), the second check for all-zero rows catches all of the missing internal values as we would hope:

ValueError: The biophysical table is missing a row for lucode(s)[0, 2, 3, 4, 5, 8, 10, 16, 17, 18, 21]

So I was wrong and this test is good to go!
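The second check can be sketched as follows. A dense NumPy array stands in for the CSR matrix so the example has no SciPy dependency; that substitution and all values are assumptions for illustration.

```python
import numpy

# Dense stand-in for the lucode-to-CN lookup: the row index is the
# lucode and the columns are CN_A..CN_D. Values are made up.
lucode_to_cn = numpy.zeros((6, 4))
lucode_to_cn[1] = [30, 55, 70, 77]
lucode_to_cn[4] = [45, 66, 77, 83]
lucode_to_cn[5] = [98, 98, 98, 98]

# Every code is below the row count, so indexing raises no IndexError.
valid_lucodes = numpy.array([0, 1, 1, 3, 5])
cn_rows = lucode_to_cn[valid_lucodes]

missing = []
if not cn_rows.sum(axis=1).all():
    # Rows that sum to zero were never filled from the table.
    empty_rows = numpy.flatnonzero(lucode_to_cn.sum(axis=1) == 0)
    missing = numpy.intersect1d(valid_lucodes, empty_rows).tolist()
print(missing)  # [0, 3]
```

Row 2 is also empty in this stand-in, but it is not reported because no pixel uses that code.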

> The other thing is, in this scope we only ever know the pixel values that appear in the current iterblocks block, right? So in either case (the IndexError or the subsequent check for empty rows) our error message might miss values that would get caught in a subsequent raster block.

Yep, you're absolutely right ... the only way we'll know if our error message reflects all the missing values is by checking all of the raster values. In my opinion, the value (and relative ease of maintenance) of failing fast will probably outweigh the benefits of a truly complete error message.
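For comparison, a complete error message would need a pass over every block before raising. The sketch below shows that alternative; it is not what the PR implements. The helper name `find_missing_lucodes` and the block data are hypothetical.

```python
import numpy


def find_missing_lucodes(blocks, table_lucodes):
    """Return sorted lucodes found in any block but absent from the table."""
    missing = set()
    for block in blocks:
        codes = numpy.unique(block)
        missing.update(numpy.setdiff1d(codes, table_lucodes).tolist())
    return sorted(missing)


# Hypothetical raster blocks and table codes, for illustration only.
blocks = [numpy.array([[1, 2], [2, 7]]), numpy.array([[1, 9], [3, 3]])]
table_lucodes = numpy.array([1, 2, 3])
print(find_missing_lucodes(blocks, table_lucodes))  # [7, 9]
```

A fail-fast check would stop at the first block and report only code 7. The full scan reads the whole raster an extra time in exchange for the complete list.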

> Also, there were bugs in the assertRaises blocks of all these tests that were preventing the nested assertions from even being called. I fixed that and made some assertions a bit more explicit to better differentiate the two cases that are being tested in the one new test.

Oh man, thanks for catching that!

+        raise ValueError(
+            f'The biophysical table is missing a row for lucode(s) '
+            f'{missing_codes.tolist()}')
+
+    # Even without an IndexError, still must guard against
+    # lucodes that can index into the sparse matrix but were
+    # missing from the biophysical table. They have rows of all 0.
Comment on lines +788 to +790
Member

This is very helpful.

+    if not cn_matrix.sum(1).all():
+        empty_rows = numpy.where(lucode_to_cn_table.sum(1) == 0)
+        missing_codes = numpy.intersect1d(valid_lucodes, empty_rows)
+        raise ValueError(
+            f'The biophysical table is missing a row for lucode(s) '
+            f'{missing_codes.tolist()}')
+
     per_pixel_cn_array = (
-        lucode_to_cn_table[valid_lucodes].toarray().reshape(
+        cn_matrix.toarray().reshape(
             (-1, 4))).transpose()
 
     # this is the soil type array with values ranging from 0..4 that will
51 changes: 47 additions & 4 deletions tests/test_ufrm.py
@@ -149,10 +149,53 @@ def test_ufrm_value_error_on_bad_soil(self):
 
         with self.assertRaises(ValueError) as cm:
             urban_flood_risk_mitigation.execute(args)
-            actual_message = str(cm.exception)
-            expected_message = (
-                'Check that the Soil Group raster does not contain')
-            self.assertTrue(expected_message in actual_message)
+
+        actual_message = str(cm.exception)
+        expected_message = (
+            'Check that the Soil Group raster does not contain')
+        self.assertTrue(expected_message in actual_message)
+
+    def test_ufrm_value_error_on_bad_lucode(self):
+        """UFRM: assert exception on missing lucodes."""
+        import pandas
+        from natcap.invest import urban_flood_risk_mitigation
+        args = self._make_args()
+
+        bad_cn_table_path = os.path.join(
+            self.workspace_dir, 'bad_cn_table.csv')
+        cn_table = pandas.read_csv(args['curve_number_table_path'])
+
+        # drop a row with an lucode known to exist in lulc raster
+        # This is a code that will successfully index into the
+        # CN table sparse matrix, but will not return valid data.
+        bad_cn_table = cn_table[cn_table['lucode'] != 0]
+        bad_cn_table.to_csv(bad_cn_table_path, index=False)
+        args['curve_number_table_path'] = bad_cn_table_path
+
+        with self.assertRaises(ValueError) as cm:
+            urban_flood_risk_mitigation.execute(args)
+
+        actual_message = str(cm.exception)
+        expected_message = (
+            f'The biophysical table is missing a row for lucode(s) {[0]}')
+        self.assertEqual(expected_message, actual_message)
+
+        # drop rows with lucodes known to exist in lulc raster
+        # These are codes that will raise an IndexError on
+        # indexing into the CN table sparse matrix. The test
+        # LULC raster has values from 0 to 21.
Comment on lines +183 to +186
Member

The summary here of what's in the LULC raster is a very good idea and was helpful in digging into this!

+        bad_cn_table = cn_table[cn_table['lucode'] < 15]
+        bad_cn_table.to_csv(bad_cn_table_path, index=False)
+        args['curve_number_table_path'] = bad_cn_table_path
+
+        with self.assertRaises(ValueError) as cm:
+            urban_flood_risk_mitigation.execute(args)
+
+        actual_message = str(cm.exception)
+        expected_message = (
+            f'The biophysical table is missing a row for lucode(s) '
+            f'{[16, 17, 18, 21]}')
+        self.assertEqual(expected_message, actual_message)
 
     def test_ufrm_string_damage_to_infrastructure(self):
         """UFRM: handle str(int) structure indices.