[ENH] Make sampling parameter a global config #183

Closed
bilalmussaukfd opened this issue Dec 21, 2020 · 4 comments
Assignees
Labels
enhancement New feature or request priority high priority tasks (for dev)

Comments

@bilalmussaukfd

Hiya
Is it possible to remove the 30,000-row random sampling so that I can see the entirety of my data, even though it will be a bit slower to process?

image

@dorisjlee
Member

Hi @bilalmussaukfd,

Yeah, we can definitely add this in as a configuration parameter that you can turn on and off.
We can also expose a parameter to let you adjust the number of samples and the number of rows that we choose to sample.
As you mentioned, the main concern is that turning it off might make things run really slow, especially when lots of things are plotted on the frontend. We can work on this and include this as an optional configuration in the next release.
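To make the idea of an on/off switch plus an adjustable row count concrete, here is a minimal, self-contained sketch. It is not Lux's implementation: the names `SamplingConfig`, `sampling_start`, `sampling_cap`, and `maybe_sample` are hypothetical, and the 30,000 defaults are only assumed from the figure quoted in this thread.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class SamplingConfig:
    """Hypothetical global settings; not Lux's real config object."""
    sampling: bool = True        # switch sampling on or off
    sampling_start: int = 30000  # sample only when there are more rows than this
    sampling_cap: int = 30000    # number of rows kept when sampling happens


# One module-level instance acts as the "global" configuration.
config = SamplingConfig()


def maybe_sample(rows: Sequence[T], seed: int = 0) -> List[T]:
    """Return all rows, or a random subset if sampling is on and the data is large."""
    if not config.sampling or len(rows) <= config.sampling_start:
        return list(rows)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(rows, min(config.sampling_cap, len(rows)))
```

With this sketch, setting `config.sampling = False` makes `maybe_sample` return every row, which is the behavior requested above, at the cost of more work downstream.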

As a temporary patch, if you want to try turning off the sampling right now, you can simply replace this one line inside PandasExecutor.py with ldf._sampled = ldf, as follows:

# PandasExecutor.execute_sampling(ldf)  # original line, commented out
ldf._sampled = ldf  # use the full dataframe instead of a sample

If you are not sure where Lux is installed, you can print the source location with:

import lux
lux.__file__

and then open lux/executor/PandasExecutor.py. The changes should take effect after you save the source and restart your notebook to import lux again. Let us know if you have any questions in the meantime. Thanks!
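If it helps, the lookup can also be scripted. The following is a rough sketch that assumes the file layout mentioned above (lux/executor/PandasExecutor.py); the helper names `executor_source_path` and `find_lux_executor` are made up for illustration, and the script only reports a path without editing anything.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def executor_source_path(package_init: str) -> Path:
    """Given the path of lux/__init__.py, build the expected path of PandasExecutor.py."""
    return Path(package_init).parent / "executor" / "PandasExecutor.py"


def find_lux_executor() -> Optional[Path]:
    """Locate the installed lux package without importing it; None if it is not found."""
    spec = importlib.util.find_spec("lux")
    if spec is None or spec.origin is None:
        return None
    return executor_source_path(spec.origin)


path = find_lux_executor()
print(path if path is not None else "lux is not installed in this environment")
```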

dorisjlee changed the title from "Remove the capping of sample at 30000 rows." to "[ENH] Make sampling parameter a global config" on Dec 21, 2020
dorisjlee added the "enhancement" (New feature or request) and "priority" (high priority tasks (for dev)) labels on Dec 21, 2020
@bilalmussaukfd
Author

bilalmussaukfd commented Dec 22, 2020 via email

@dorisjlee
Member

Hi @bilalmussaukfd,

Thanks for your question. Scalability in Lux is something that we've been working on, but the performance is still far from perfect for big datasets. We have a performance test on the census dataset, which has 32k rows and 14 columns and is around 4 MB. For this dataset, Lux takes about 4 seconds to compute the whole widget visualization. The performance for various datasets really depends on a host of factors, including the number of rows, the number of columns, the data types of the columns, and the input intent, so the dataset that I mentioned above just serves as one point of comparison. I'm hoping that we will have a faster version of Lux in the coming months, and I will definitely keep you updated when we release an improved version!

@dorisjlee
Member

Hi @bilalmussaukfd,
Based on your concern, we've added a new parameter lux.config.sampling to make turning off sampling easier. Check out this page for more details!
You can access these updated changes by upgrading to the latest version of Lux:

pip install --upgrade lux-api
jupyter nbextension install --py luxwidget
jupyter nbextension enable --py luxwidget

As I mentioned above, we'll continue to work on the scalability of Lux to support larger datasets. Stay tuned!
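For reference, a minimal usage sketch after upgrading might look like the following. The parameter name `lux.config.sampling` comes from the comment above; treating it as a boolean that is set to `False` to disable sampling is an assumption here, so please confirm against the linked documentation page. The import guard is only there so the snippet does not fail in an environment without Lux.

```python
try:
    import lux
except ImportError:
    # Lux is not installed here; nothing to configure.
    lux = None

if lux is not None:
    # Assumption: assigning False turns off the row sampling.
    lux.config.sampling = False
```

Expect visualizations on large dataframes to take noticeably longer with sampling disabled.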


3 participants