Remove build_lookup_from_csv and consolidate into read_csv_to_dataframe #1334
Conversation
tests/test_coastal_blue_carbon.py
Outdated
@@ -358,7 +358,6 @@ def test_read_transition_matrix(self):
transition_csv.write('b,,NCC,accum\n')
transition_csv.write('c,accum,,NCC\n')
transition_csv.write(',,,\n')
transition_csv.write(',legend,,')  # simulate legend
Here's one minor functional change. This was testing that extra stuff in the table would be ignored. I don't think we allow that in other tables in InVEST, so it seemed easier to drop it from the test than figure out how to accomplish that with read_csv_to_dataframe. But let me know if there's a reason to keep it!
Yep, you're right that this table legend is unique among the InVEST models! It's also weird in that it breaks from the normal table structure only to provide some extra metadata that's also described in the UG chapter. Personally I'm OK with removing it, but because this is a notable functional change we should also be sure to
- remove the legend-generating code from the preprocessor
- note the functional change in the model's HISTORY
- remove the legend from the transitions_template.csv in the sample data repo

What do you think?
Ah, I didn't realize that the preprocessor and sample data relied on this. In the interest of self-contained PRs, I'd rather not have to make all those changes here.

How about this for now? I made a small change in coastal_blue_carbon.py so that it will skip over rows that have nothing in the lulc-class column (which should be all the legend rows). That way the functionality is the same, but read_csv_to_dataframe doesn't have to handle it. And we can remove the legend in a separate issue if it's worth it?
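For illustration, a minimal sketch of that row-skipping idea using plain pandas (the table contents here are made up; the actual change lives in coastal_blue_carbon.py):

```python
import io

import pandas

# Hypothetical transition table: a blank lulc-class cell marks a legend row.
csv_text = (
    "lulc-class,a,b,c\n"
    "a,NCC,accum,accum\n"
    "b,,NCC,accum\n"
    ",legend,,\n"
)

df = pandas.read_csv(io.StringIO(csv_text))
# Drop any row whose lulc-class cell is empty (parsed as NaN), which
# removes the legend rows without read_csv_to_dataframe having to know
# anything about them.
df = df.dropna(subset=['lulc-class'])
print(list(df['lulc-class']))  # ['a', 'b']
```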
Sounds like a fine solution to me! I took the liberty of describing this in #1336
Thanks @emlys ! I left a few questions, comments and suggestions. And thanks for having this feature implementation as a nice, atomic PR in the bigger picture of where these table changes are heading :)
src/natcap/invest/utils.py
Outdated
cols_to_lower (bool): if True, convert all column names to lowercase
vals_to_lower (bool): if True, convert all table values to lowercase
I see from the docstring that cols_to_lower and vals_to_lower are supposed to be of type bool, but the names of the parameters make it sound like they are supposed to be lists of columns. Is there a precedent for this sort of argument name in pandas? If not, might there be an alternate parameter name that indicates that this is a boolean?
Renamed these to convert_cols_to_lower and convert_vals_to_lower, hopefully that's clearer.
Yes, much clearer. Thanks!
src/natcap/invest/utils.py
Outdated
dataframe = pandas.read_csv(
    path, sep=sep, engine=engine, encoding=encoding, **kwargs)
    path, index_col=False, sep=sep, engine=engine, encoding=encoding, **kwargs)
except UnicodeDecodeError as error:
    LOGGER.error(
        f'{path} must be encoded as UTF-8 or ASCII')
I know this line isn't part of the PR, but is your impression that this error is about the filepath or about the contents of the table? And would it be worth it to clarify this in the error message? Just thinking about this in relation to the encoding parameter we're passing, where we're defaulting to utf-8-sig but a different encoding could be used.
I believe this is about the contents of the table. I updated this to say "The file {path} must be..."
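A sketch of what that clarified handling might look like (read_table and the message wording here are illustrative, not the exact utils.py code):

```python
import logging

import pandas

LOGGER = logging.getLogger(__name__)


def read_table(path, encoding='utf-8-sig'):
    """Read a CSV, clarifying that a decode failure is about file contents."""
    try:
        return pandas.read_csv(path, index_col=False, encoding=encoding)
    except UnicodeDecodeError:
        # The path itself is fine; the *contents* could not be decoded.
        LOGGER.error(f'The file {path} must be encoded as UTF-8 or ASCII')
        raise
```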
src/natcap/invest/utils.py
Outdated
# convert table values to lowercase
if vals_to_lower:
    dataframe = dataframe.applymap(
        lambda x: x.lower() if isinstance(x, str) else x)

# expand paths
Is there a case where we would want to lowercase values except for the paths that need to be expanded? As written, it seems like we could find ourselves in a case where filepaths are expanded only after the values have been lowercased.
(Sorry, I can't highlight the expand_path_cols block because it's collapsed in the GitHub UI.)
I think there definitely are cases where a table contains a path column that we don't want to lowercase, but other columns that we do want to lowercase. The read_csv_to_dataframe approach has not given us the option to distinguish between columns, so we've had to set to_lower=False on any table that contains paths.

When #1328 is implemented, that will give us the chance to apply different processing to each column. For now I don't think this is an issue, because vals_to_lower should always be False if expand_paths is defined.
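A hypothetical sketch of that per-column direction, where path columns (assumed here to be known from the model spec) are exempt from lowercasing:

```python
import io

import pandas

# Made-up table with one class column and one path column.
csv_text = (
    "LULC_Class,Raster_Path\n"
    "Forest,C:/Data/Forest.TIF\n"
)

df = pandas.read_csv(io.StringIO(csv_text))
df.columns = df.columns.str.lower()

path_cols = {'raster_path'}  # assumption: provided by the model spec
for col in df.columns:
    if col not in path_cols:
        # lowercase string values only in non-path columns, so that
        # case-sensitive filesystem paths survive untouched
        df[col] = df[col].apply(
            lambda x: x.lower() if isinstance(x, str) else x)

print(df.loc[0, 'lulc_class'])   # forest
print(df.loc[0, 'raster_path'])  # C:/Data/Forest.TIF
```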
tests/test_utils.py
Outdated
with open(csv_file, 'w') as file_obj:
    file_obj.write(textwrap.dedent(
        """
I see you have a .strip() there at the end (which is nice!) so this isn't a request to change anything, just an FYI! I recently learned that you can also do this to avoid the leading newline:
"""\
HEADER,
A,
b
"""
That is much nicer, I made this change everywhere in test_utils.py
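For reference, a minimal runnable version of that trick with textwrap.dedent:

```python
import textwrap

# The backslash after the opening quotes suppresses the leading newline,
# so no trailing .strip() is needed.
table_text = textwrap.dedent(
    """\
    HEADER,
    A,
    b
    """)
print(repr(table_text))  # 'HEADER,\nA,\nb\n'
```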
src/natcap/invest/utils.py
Outdated
index_col (str): name of column to use as the dataframe index. If
    ``cols_to_lower``, this column name and the dataframe column names
    will be lowercased before they are compared. If ``usecols``
    is defined, this must be included in ``usecols``.
usecols (list(str)): list of column names to subset from the dataframe.
    If ``cols_to_lower``, these names and the dataframe column names
    will be lowercased before they are compared.
Since index_col is required to be in usecols, would it make sense to simply add it into usecols if it isn't already there? It seems like this might be a minor bit of de-duplication when writing out a function call, but I'm curious what you think!
That certainly could make sense! I was going for consistency with pandas.read_csv, which will raise an error if you don't include the value of index_col in your usecols list. I think it also parallels nicely with the structure of the MODEL_SPEC: when we implement #1328, we might pipe in the spec columns, which already include the index column.
That all makes sense to me!
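To illustrate the pandas behavior being mirrored, here is a small sketch with made-up data and plain pandas.read_csv (the ValueError branch is the requirement discussed above):

```python
import io

import pandas

csv_text = "name,value,notes\na,1,x\nb,2,y\n"

# index_col must be included in usecols, mirroring pandas' own requirement:
df = pandas.read_csv(
    io.StringIO(csv_text), index_col='name', usecols=['name', 'value'])
print(df.loc['a', 'value'])  # 1

# Leaving the index column out of usecols makes pandas raise a ValueError:
try:
    pandas.read_csv(
        io.StringIO(csv_text), index_col='name', usecols=['value'])
except ValueError as err:
    print('pandas rejected it:', err)
```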
Thanks @emlys ! I don't have any further comments or changes to request, so I'm adding my approval.
One thing I am seeing though is that the mac binary build is failing when it was passing at the beginning of the PR. Could you just take a peek at that and make sure it isn't failing because of something in the PR here? If it isn't, please feel free to merge away. Thanks!
Thanks @phargogh! I'm seeing the same error in the latest build on
Description
Fixes #1327
Consolidated functionality into read_csv_to_dataframe and replaced uses of build_lookup_from_csv(...) with read_csv_to_dataframe(...).to_dict(orient='index').

Split the to_lower functionality into two separate args, cols_to_lower and vals_to_lower. Some of this might be obsoleted by #1328, but I wanted to take it one step at a time.

Checklist
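A minimal sketch of the replacement pattern described in this PR, using plain pandas.read_csv to stand in for read_csv_to_dataframe (the column names and data are made up):

```python
import io

import pandas

csv_text = "lucode,name\n1,forest\n2,wetland\n"

# Where build_lookup_from_csv(path, 'lucode') previously returned a nested
# dict keyed by 'lucode', the same structure now comes from to_dict:
dataframe = pandas.read_csv(io.StringIO(csv_text), index_col='lucode')
lookup = dataframe.to_dict(orient='index')
print(lookup[1]['name'])  # forest
```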