Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove build_lookup_from_csv and consolidate into read_csv_to_dataframe #1334

Merged
merged 11 commits into from
Jun 22, 2023

Conversation

emlys
Copy link
Member

@emlys emlys commented Jun 15, 2023

Description

Fixes #1327
Consolidated functionality into read_csv_to_dataframe and replaced uses of build_lookup_from_csv(...) with read_csv_to_dataframe(...).to_dict(orient='index').

Split the to_lower functionality into two separate args, cols_to_lower and vals_to_lower. Some of this might be obsoleted by #1328, but I wanted to take it one step at a time.

Checklist

  • Updated HISTORY.rst and link to any relevant issue (if these changes are user-facing)
  • Updated the user's guide (if needed)
  • Tested the affected models' UIs (if relevant)

@@ -358,7 +358,6 @@ def test_read_transition_matrix(self):
transition_csv.write('b,,NCC,accum\n')
transition_csv.write('c,accum,,NCC\n')
transition_csv.write(',,,\n')
transition_csv.write(',legend,,') # simulate legend
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's one minor functional change. This was testing that extra stuff in the table would be ignored. I don't think we allow that in other tables in invest, so it seemed easier to drop it from the test than figure out how to accomplish that with read_csv_to_dataframe. But let me know if there's a reason to keep it!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, you're right that this table legend is unique among the InVEST models! It's also weird in that it breaks from the normal table structure only to provide some extra metadata that's also described in the UG chapter. Personally I'm OK with removing it, but because this is a notable functional change we should also be sure to

  • remove the legend-generating code from the preprocessor
  • Note the functional change in the model's HISTORY
  • remove the legend from the transitions_template.csv in the sample data repo

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't realize that the preprocessor and sample data relied on this. In the interest of self-contained PRs, I'd rather not have to make all those changes here.

How about this for now? I made a small change in coastal_blue_carbon.py so that it will skip over rows that have nothing in the lulc-class column (which should be all the legend rows). That way the functionality is the same, but read_csv_to_dataframe doesn't have to handle it. And we can remove the legend in a separate issue if it's worth it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a fine solution to me! I took the liberty of describing this in #1336

@emlys emlys requested a review from phargogh June 16, 2023 00:02
@emlys emlys marked this pull request as ready for review June 16, 2023 00:02
@emlys emlys self-assigned this Jun 16, 2023
Copy link
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emlys ! I left a few questions, comments and suggestions. And thanks for having this feature implementation as a nice, atomic PR in the bigger picture of where these table changes are heading :)

Comment on lines 623 to 624
cols_to_lower (bool): if True, convert all column names to lowercase
vals_to_lower (bool): if True, convert all table values to lowercase
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see from the docstring that cols_to_lower and vals_to_lower are supposed to be of type bool, but the name of the parameters makes it sound like they are supposed to be lists of columns. Is there a precedent for this sort of argument name in pandas? If not, might there be an alternate parameter name that indicates that this is a boolean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed these to convert_cols_to_lower and convert_vals_to_lower, hopefully that's clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, much clearer. Thanks!

dataframe = pandas.read_csv(
path, sep=sep, engine=engine, encoding=encoding, **kwargs)
path, index_col=False, sep=sep, engine=engine, encoding=encoding, **kwargs)
except UnicodeDecodeError as error:
LOGGER.error(
f'{path} must be encoded as UTF-8 or ASCII')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this line isn't part of the PR, but is your impression that this error is about the filepath or about the contents of the table? And would it be worth it to clarify this in the error message? Just thinking about this in relation to the encoding parameter we're passing where we're defaulting to utf-8-sig but a different encoding could be used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is about the contents of the table. I updated this to say "The file {path} must be..."

Comment on lines 686 to 691
# convert table values to lowercase
if vals_to_lower:
dataframe = dataframe.applymap(
lambda x: x.lower() if isinstance(x, str) else x)

# expand paths
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a case where we would want to lowercase values except for the paths that need to be expanded? As written, it seems like we could find ourselves in a case where filepaths are expanded only after the values have been lowercased.

(Sorry, I can't highlight the expand_path_cols block because it's collapsed in the github UI)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there definitely are cases where a table contains a path column that we don't want to lowercase, but other columns that we do want to lowercase. The read_csv_to_dataframe approach has not given us the option to distinguish between columns, so we've had to set to_lower=False on any table that contains paths.

When #1328 is implemented, that will give us the chance to apply different processing to each column. For now I don't think this is an issue, because vals_to_lower should always be False if expand_paths are defined.

@@ -358,7 +358,6 @@ def test_read_transition_matrix(self):
transition_csv.write('b,,NCC,accum\n')
transition_csv.write('c,accum,,NCC\n')
transition_csv.write(',,,\n')
transition_csv.write(',legend,,') # simulate legend
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, you're right that this table legend is unique among the InVEST models! It's also weird in that it breaks from the normal table structure only to provide some extra metadata that's also described in the UG chapter. Personally I'm OK with removing it, but because this is a notable functional change we should also be sure to

  • remove the legend-generating code from the preprocessor
  • Note the functional change in the model's HISTORY
  • remove the legend from the transitions_template.csv in the sample data repo

What do you think?


with open(csv_file, 'w') as file_obj:
file_obj.write(textwrap.dedent(
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you have a .strip() there at the end (which is nice!) so this isn't a request to change anything, just an FYI! I recently learned that you can also do this to avoid the leading newline:

"""\
HEADER,
A,
b
"""

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is much nicer, I made this change everywhere in test_utils.py

Comment on lines 616 to 622
index_col (str): name of column to use as the dataframe index. If
``cols_to_lower``, this column name and the dataframe column names
will be lowercased before they are compared. If ``usecols``
is defined, this must be included in ``usecols``.
usecols (list(str)): list of column names to subset from the dataframe.
If ``cols_to_lower``, these names and the dataframe column names
will be lowercased before they are compared.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since index_col is required to be in usecols, would it make sense to simply add it into usecols if it isn't already there? It seems like this might be a minor bit of de-duplication when writing out a function call, but I'm curious what you think!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That certainly could make sense! I was going for consistency with pandas.read_csv, which will raise an error if you don't include the value of index_col in your usecols list. I think it also parallels nicely with the structure of the MODEL_SPEC - when we implement #1328, we might pipe in the spec columns, which already include the index column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all makes sense to me!

@emlys emlys requested a review from phargogh June 20, 2023 20:47
@phargogh phargogh mentioned this pull request Jun 22, 2023
4 tasks
Copy link
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emlys ! I don't have any further comments or changes to request, so I'm adding my approval.

One thing I am seeing though is that the mac binary build is failing when it was passing at the beginning of the PR. Could you just take a peek at that and make sure it isn't failing because of something in the PR here? If it isn't, please feel free to merge away. Thanks!

@emlys
Copy link
Member Author

emlys commented Jun 22, 2023

Thanks @phargogh! I'm seeing the same error in the latest build on main (https://github.com/natcap/invest/actions/runs/5317455763/jobs/9628004338), so I think it's unrelated.

@emlys emlys merged commit 1a6e314 into natcap:main Jun 22, 2023
@emlys emlys deleted the feature/1327 branch October 3, 2024 23:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants