Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Improve division of functionality between build_lookup_from_csv, read_csv_to_dataframe, pandas.read_csv #1327

Closed
emlys opened this issue Jun 14, 2023 · 1 comment · Fixed by #1334
Assignees
Labels
in progress This issue is actively being worked on proposal Internal software team proposal

Comments

@emlys
Copy link
Member

emlys commented Jun 14, 2023

read_csv_to_dataframe is a wrapper around pandas.read_csv that adds this functionality:

  • Strip whitespace from column names
  • Convert column names to lowercase
  • Strip whitespace from table values
  • Expand relative paths

build_lookup_from_csv reformats the table as a dictionary indexed by a particular column, and also adds this functionality:

  • Convert table values to lowercase
  • Only use certain columns
  • Drop empty rows
  • Replace NA values with an empty string

It would make more sense if the additional functionality of build_lookup_from_csv was moved to read_csv_to_dataframe. Particularly the lowercase conversion options are confusing: the to_lower arg of build_lookup_from_csv affects column names AND values. the to_lower arg of read_csv_to_dataframe affects column names only.

Pseudocode outlining the proposed division of functionality:

def build_lookup_from_csv(path, index_col, **kwargs):
    df = read_csv_to_dataframe(path, **kwargs)
    return df reformatted as a dictionary indexed by index_col

def read_csv_to_dataframe(path, cols_to_lower=True, vals_to_lower=True, 
        expand_path_cols=[], column_list=None, sep=None, engine='python', encoding='utf-8-sig', **kwargs):
    df = pandas.read_csv(path, **kwargs)
    use only the columns in column_list
    drop empty rows
    strip whitespace from column names and values
    expand relative paths in expand_path_cols
    replace NA values with an empty string
    if cols_to_lower:
        convert column names to lowercase
    if vals_to_lower:
        convert values to lowercase
    return df

The data processing that can be done with arguments to pandas.read_csv (such as selecting columns, dropping empty rows, and applying converters to column values) should be moved there to avoid duplication.

@emlys emlys added the proposal Internal software team proposal label Jun 14, 2023
@emlys
Copy link
Member Author

emlys commented Jun 15, 2023

From 6/15 coffee call: Once this is implemented, we might not need build_lookup_from_csv at all. If it's reduced to basically just a call to df.to_dict, we can replace build_lookup_from_csv with read_csv_to_dataframe().to_dict() everywhere.

@emlys emlys self-assigned this Jun 16, 2023
@emlys emlys added the in progress This issue is actively being worked on label Jun 16, 2023
emlys added a commit to emlys/invest that referenced this issue Jun 20, 2023
emlys added a commit to emlys/invest that referenced this issue Jun 20, 2023
emlys added a commit to emlys/invest that referenced this issue Jun 20, 2023
emlys added a commit to emlys/invest that referenced this issue Jun 20, 2023
emlys added a commit that referenced this issue Jun 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in progress This issue is actively being worked on proposal Internal software team proposal
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant