Proposal: Improve division of functionality between `build_lookup_from_csv`, `read_csv_to_dataframe`, `pandas.read_csv` #1327

emlys · 2023-06-14T00:51:17Z

read_csv_to_dataframe is a wrapper around pandas.read_csv that adds this functionality:

Strip whitespace from column names
Convert column names to lowercase
Strip whitespace from table values
Expand relative paths

build_lookup_from_csv reformats the table as a dictionary indexed by a particular column, and also adds this functionality:

Convert table values to lowercase
Only use certain columns
Drop empty rows
Replace NA values with an empty string

It would make more sense to move the additional functionality of build_lookup_from_csv into read_csv_to_dataframe. The lowercase conversion options are particularly confusing: the to_lower arg of build_lookup_from_csv affects column names AND values, while the to_lower arg of read_csv_to_dataframe affects column names only.

Pseudocode outlining the proposed division of functionality:

def build_lookup_from_csv(path, index_col, **kwargs):
    df = read_csv_to_dataframe(path, **kwargs)
    return df reformatted as a dictionary indexed by index_col

def read_csv_to_dataframe(path, cols_to_lower=True, vals_to_lower=True,
        expand_path_cols=None, column_list=None, sep=None, engine='python',
        encoding='utf-8-sig', **kwargs):
    df = pandas.read_csv(path, sep=sep, engine=engine, encoding=encoding, **kwargs)
    use only the columns in column_list
    drop empty rows
    strip whitespace from column names and values
    expand relative paths in expand_path_cols
    replace NA values with an empty string
    if cols_to_lower:
        convert column names to lowercase
    if vals_to_lower:
        convert values to lowercase
    return df
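
For discussion, here is one way the pseudocode above could look as runnable Python. It is only a sketch under a few assumptions of mine, not the final implementation: column names are normalized before `column_list` and `expand_path_cols` are applied (so both refer to cleaned names), relative paths are resolved against the directory containing the CSV, and `_clean_value` is a hypothetical helper.

```python
import os

import pandas


def _clean_value(value, lower):
    """Strip (and optionally lowercase) strings; pass other values through."""
    if isinstance(value, str):
        value = value.strip()
        return value.lower() if lower else value
    return value


def read_csv_to_dataframe(path, cols_to_lower=True, vals_to_lower=True,
                          expand_path_cols=None, column_list=None,
                          sep=None, engine='python', encoding='utf-8-sig',
                          **kwargs):
    df = pandas.read_csv(
        path, sep=sep, engine=engine, encoding=encoding, **kwargs)

    # Assumption: clean column names first so later steps use clean names.
    df.columns = df.columns.str.strip()
    if cols_to_lower:
        df.columns = df.columns.str.lower()

    # Use only the requested columns, then drop rows that are entirely NA.
    if column_list is not None:
        df = df[column_list]
    df = df.dropna(how='all')

    # Strip whitespace from string values; non-strings are left alone.
    df = df.apply(lambda col: col.map(lambda v: _clean_value(v, False)))

    # Assumption: relative paths are relative to the CSV's own directory.
    base_dir = os.path.dirname(os.path.abspath(path))
    for col in expand_path_cols or []:
        df[col] = df[col].map(
            lambda p: os.path.abspath(os.path.join(base_dir, p))
            if isinstance(p, str) and p else p)

    df = df.fillna('')
    if vals_to_lower:
        # Follows the pseudocode order, so expanded paths are lowercased too,
        # which may not be desirable on case-sensitive filesystems.
        df = df.apply(lambda col: col.map(lambda v: _clean_value(v, True)))
    return df
```

One thing this makes visible: with `vals_to_lower=True`, path columns get lowercased after expansion, so callers that expand paths would probably want `vals_to_lower=False` or a per-column option.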

The data processing that can be done with arguments to pandas.read_csv (such as selecting columns, dropping empty rows, and applying converters to column values) should be done through those arguments rather than in our own code, to avoid duplication.
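
As a rough illustration of what pandas.read_csv can already handle through its own arguments, the snippet below uses `usecols`, `skip_blank_lines`, and `converters`. The table contents and column names are made up for the example.

```python
import io

import pandas

# Made-up table: one unused column, one blank line, padded mixed-case values.
csv_text = "lucode,Name,extra\n1, Forest ,x\n\n2, Urban ,y\n"

df = pandas.read_csv(
    io.StringIO(csv_text),
    usecols=['lucode', 'Name'],   # select columns at read time
    skip_blank_lines=True,        # the default; skips fully blank lines
    # Converters receive the raw string for each cell in that column.
    converters={'Name': lambda v: v.strip().lower()},
)

print(df)
```

Note that `skip_blank_lines` only skips lines that are completely empty. A row of empty fields such as `,,` still comes through as a row of NA values, so a `dropna(how='all')` step would still be needed for that case.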


emlys · 2023-06-15T16:51:08Z

From 6/15 coffee call: Once this is implemented, we might not need build_lookup_from_csv at all. If it's reduced to basically just a call to df.to_dict, we can replace build_lookup_from_csv with read_csv_to_dataframe().to_dict() everywhere.
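
To make the `to_dict` idea concrete, here is a small sketch with a made-up dataframe standing in for the output of read_csv_to_dataframe. It assumes the lookup should be keyed by the values of one column, which needs `set_index` plus `orient='index'`; a bare `.to_dict()` uses the default `orient='dict'` and returns a column-keyed dictionary instead.

```python
import pandas

# Stand-in for the dataframe returned by read_csv_to_dataframe.
df = pandas.DataFrame({
    'lucode': [1, 2],
    'name': ['forest', 'urban'],
    'usle_c': [0.1, 0.9],
})

# verify_integrity=True raises ValueError if the key column has duplicates.
lookup = df.set_index('lucode', verify_integrity=True).to_dict(orient='index')

print(lookup[1]['name'])
```

Each key of `lookup` is a `lucode` value and each value is a dictionary of the remaining columns for that row.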


emlys added the proposal label Jun 14, 2023

emlys mentioned this issue Jun 15, 2023

Remove build_lookup_from_csv and consolidate into read_csv_to_dataframe #1334

Merged

3 tasks

emlys self-assigned this Jun 16, 2023

emlys added the in progress label Jun 16, 2023

emlys added a commit to emlys/invest that referenced this issue Jun 20, 2023

escape first newline rather than calling .strip() natcap#1327

246e94d

emlys added a commit to emlys/invest that referenced this issue Jun 20, 2023

clarify encoding error natcap#1327

58a2109

emlys added a commit to emlys/invest that referenced this issue Jun 20, 2023

rename to_lower arguments to clarify that they are boolean natcap#1327

5d39781

emlys added a commit to emlys/invest that referenced this issue Jun 20, 2023

rename to_lower args natcap#1327

f545a6e

emlys added a commit to emlys/invest that referenced this issue Jun 20, 2023

skip transition table rows that are part of the legend natcap#1327

b0e7548

emlys added a commit to emlys/invest that referenced this issue Jun 20, 2023

missed some to_lower args natcap#1327

8c45efc

emlys closed this as completed in #1334 Jun 22, 2023

emlys added a commit that referenced this issue Jun 22, 2023

move data cleaning functionality of build_lookup_from_csv into read_csv_to_dataframe #1327

278bb13

emlys added a commit that referenced this issue Jun 22, 2023

update usages of read_csv_to_dataframe to new api #1327

9e07853

emlys added a commit that referenced this issue Jun 22, 2023

remove build_lookup_from_csv and consolidate into read_csv_to_dataframe #1327

2f78045

emlys added a commit that referenced this issue Jun 22, 2023

clean up and add notes to history #1327

8d61c18

emlys added a commit that referenced this issue Jun 22, 2023

verify unique keys when setting index #1327

bbef555
