-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Id function improvised #234
Conversation
Hi @Priyansdesai, Thanks for the PR! Could you please pull in the latest code from master so that code changes reflect your most recent changes? For the dataset, we generally put the datasets to lux-datasets and use a link for the tests. Many of the datasets that you've used for testing is already in that folder (absenteeism, census, spotify), so it would be great if we could just reuse the same dataset. |
… changed to lux-datasets links
Hi @dorisjlee! I have added all the relevant datasets to lux-datasets and raised a PR in that repo. I have also changed the links in the test_type.py file to github links as well as pulled new changes. |
Hi @Priyansdesai , I'm still seeing some files that were changed that shouldn't be part of the PR. Could you use the Files Changed view to check that the PR only contains the intended changes? Thanks! |
Hi @dorisjlee! The files car.csv might have changed because I took the dataset from lux-datasets, so that extra record might have been added. Other than that, I haven't changed files other than test_type.py and utils.py. Those changes might have been the result due to merging the new changes in the main repo. |
Hi @Priyansdesai, It would be great if you can make sure that the new changes (even the merged ones) are up-to-date with the latest (i.e., does not show up as color in "Files Changed"). Like you mentioned the PR should only contain changes for |
@dorisjlee For the datasets, I re-downloaded the master branch and copied the same versions of the dataset - car.csv and college.csv |
lux/utils/utils.py
Outdated
return high_cardinality and (attribute_contain_id or almost_all_vals_unique) | ||
if len(df) >= 2: | ||
diff = df[attribute].diff() | ||
evenly_spaced = all(diff.loc[1:] == diff.loc[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @dorisjlee! I have fixed the Index issue now. But, one test test_check_datetime_numeric_value is failing in test_type.py. The code miscategorizes the "Year"column to be Nominal instead of Temporal. I do not know where this is coming from, but I think it is not related to the check_id_like function. I have not yet pushed the code. I just wanted to post an update. |
@dorisjlee, I have added the support for Index columns for ID function. The tests that are failing, these are also failing in the Master branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove the one line; please do so before merging as it simplifies control flow logic.
lux/utils/utils.py
Outdated
evenly_spaced = True | ||
if attribute_contain_id: | ||
almost_all_vals_unique = df.cardinality[attribute] >= 0.75 * len(df) | ||
return high_cardinality and (almost_all_vals_unique or evenly_spaced) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line can be deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to delete all lines from 100 to 103?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jerry - Done the required changes and pushed as well.
@jerrysong1324 - Changes added. The tests that fail, fail in master 2. A test fails in test_type.py, but that does not have anything to do with my code for id detection. |
Thank you for merging this! |
Added more regex checking within the ID checking to cover more cases. Also, added to check if the difference between consecutive ID values is the same - even intervals.
Had a query regarding falsely identifying columns that are NOT IDs but are still recognized as IDs.
Added more datasets for testing.
Update 12:23AM PST 16th Jan: Complete