Trb analysis #78

Closed
wants to merge 35 commits into from

zackAemmer
Contributor

Includes the code used to test and generate plots of replacement mode model accuracy and F1 for choosing replacement mode features:
e-mission/e-mission-server#890

zackAemmer and others added 30 commits July 26, 2022 07:37
- split out uuids into all, stage and non stage
- fixed the uuid check to split the confirmed trips also into stage and non stage
- found missing IDs and confirmed that they had no data
- created confirmed and expanded confirmed trips dataframes separately for stage and non stage as well
Don't think we can use inferred labels given that they are also the output of
an algorithm with its own error.

+ also ignore "Prefer not to say".

```
data = data[~data['available_modes'].isin(['None', 'Prefer not to say'])]
```

Without the change, I get the error:

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-28-ba060be6a019> in <module>
    129
    130 # Add availability variables to data
--> 131 data = add_mode_availability(data, availability_codes, 'available_modes', 'Mode_confirm', 'Replaced_mode')

<ipython-input-28-ba060be6a019> in add_mode_availability(data, availability_codes, availability_col, choice_col, replaced_col)
    115                 i+=1
    116                 continue
--> 117             options = [availability_codes[x] for x in available.split(';')]
    118             # Chosen mode must be in the available modes list, if mode was chosen it is assumed available
    119             # SWAP THIS LINE TO INCLUDE REPLACED MODE IN THE CHOICE SET (FOR VISUALS AT END)

<ipython-input-28-ba060be6a019> in <listcomp>(.0)
    115                 i+=1
    116                 continue
--> 117             options = [availability_codes[x] for x in available.split(';')]
    118             # Chosen mode must be in the available modes list, if mode was chosen it is assumed available
    119             # SWAP THIS LINE TO INCLUDE REPLACED MODE IN THE CHOICE SET (FOR VISUALS AT END)

KeyError: 'Prefer not to say'
```
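The failure and the filter above can be reproduced in isolation. This is a minimal sketch: the contents of `availability_codes` and the sample answers are hypothetical stand-ins, since the real table lives in the notebook, and `decode` only mirrors the lookup pattern shown in the traceback.

```python
import pandas as pd

# Hypothetical code table and survey answers, for illustration only.
availability_codes = {"Bicycle": "bike", "Get a ride": "ridehail", "Walk": "walk"}
data = pd.DataFrame({
    "available_modes": ["Bicycle;Walk", "Prefer not to say", "None", "Get a ride"],
})


def decode(available):
    # Same lookup pattern as the failing list comprehension in the traceback.
    return [availability_codes[x] for x in available.split(";")]


# Without filtering, the non-answer rows have no entry in the code table.
try:
    data["available_modes"].apply(decode)
    raised = False
except KeyError:
    raised = True

# Drop rows whose whole answer is a non-answer, then decode the rest.
filtered = data[~data["available_modes"].isin(["None", "Prefer not to say"])]
options = filtered["available_modes"].apply(decode).tolist()
```

Note that `isin` only matches rows whose entire value is a non-answer; a non-answer embedded in a `;`-separated list would still raise the `KeyError`.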
Fix data read by splitting into all, stage and non-stage
Flip no_inferred_label and inferred_label + start investigating infer…
+ put the switch to flip at the top and flip based on it in the model creation
+ create a function to pull out the sensed primary mode, but don't use it yet
…input

The three options are:
- ONLY_LABELED: use only the labeled subset. For the sensitivity analysis, this corresponds to a user who labeled all their trips.
- ONLY_SENSED: use only the sensed modes (walk, bike, car, bus, train). For the sensitivity analysis, this simulates a user with no labels.
- BEST_AVAILABLE: use the user label, falling back to the best label-assist label, then to the best sensed label. For the sensitivity analysis, this simulates a user with partial labels.
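
The three options can be sketched as a single selection function. This is an illustration only, not the notebook's implementation: the column names `user_label`, `inferred_label`, and `sensed_mode` and the helper `select_mode` are hypothetical.

```python
import pandas as pd


def select_mode(df, input_dataset):
    """Return the mode series for the chosen input dataset (hypothetical helper)."""
    if input_dataset == "ONLY_LABELED":
        # Keep only trips that the user actually labeled.
        return df["user_label"].dropna()
    if input_dataset == "ONLY_SENSED":
        return df["sensed_mode"]
    if input_dataset == "BEST_AVAILABLE":
        # User label first, then label assist, then the sensed mode.
        return (
            df["user_label"]
            .combine_first(df["inferred_label"])
            .combine_first(df["sensed_mode"])
        )
    raise ValueError("unknown input_dataset: %s" % input_dataset)


# Hypothetical trips: one labeled, one with only a label-assist guess, one sensed only.
trips = pd.DataFrame({
    "user_label": ["bike", None, None],
    "inferred_label": [None, "shared_ride", None],
    "sensed_mode": ["bike", "car", "walk"],
})
best = select_mode(trips, "BEST_AVAILABLE").tolist()
```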

+ change the mapping of the primary sensed mode to correspond with label assist
so that the downstream code (e.g. mode_map from mode_confirm to Mode_confirm)
works properly

Note that many parts of the code will NOT work for sensed labels, since some kinds of labels are missing (e.g. ridehail)

+ minor code fixes around refactoring the display of numbers
+ @zackAemmer, you might want to add more of these as you work through the analyses as yet another sanity check

Testing done:
- Ran with all three options
- `ONLY_LABELED` and `BEST_AVAILABLE` run through
- `ONLY_SENSED` takes a long time, so I ran it on the first 100 trips

```
elif input_dataset == "ONLY_SENSED":
    expanded_ct = expanded_ct.head(100)
    expanded_ct.mode_confirm = expanded_ct.apply(lambda row: get_primary_sensed_mode(row), axis=1)
```

and it fails with

```
KeyError: "['tt_ridehail', 'tt_transit', 'tt_walk', 'tt_s_micro'] not in index"
```

while training the random forest model

```
X = df_train[feature_list].values
```

Note also that I had a preliminary fix for this; sharing it here in case it helps

```
if input_dataset == "ONLY_SENSED":
    # Remove features that don't exist
    remove_list = []
    for fn in feature_list:
        if fn not in df_train.columns:
            print("NO DATA FOR FEATURE %s" % fn)
            remove_list.append(fn)
    for rn in remove_list:
        feature_list.remove(rn)
```
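
A possible variant of the same idea that leaves `feature_list` untouched, so that later runs with a different `input_dataset` still see the full list. This is a sketch under assumptions: `split_features`, the columns `tt_car` and `tt_bike`, and the sample values are hypothetical; `tt_ridehail` and `tt_transit` come from the error message above.

```python
import pandas as pd


def split_features(feature_list, columns):
    """Split features into those present in `columns` and those missing."""
    present = [fn for fn in feature_list if fn in columns]
    missing = [fn for fn in feature_list if fn not in columns]
    return present, missing


# Hypothetical training frame that lacks some travel-time features.
df_train = pd.DataFrame({"tt_car": [10.0, 12.5], "tt_bike": [30.0, 41.0]})
feature_list = ["tt_car", "tt_bike", "tt_ridehail", "tt_transit"]

present, missing = split_features(feature_list, df_train.columns)
for fn in missing:
    print("NO DATA FOR FEATURE %s" % fn)

# Index only with the features that exist, avoiding the KeyError.
X = df_train[present].values
```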
Make the runs more configurable