[ENH] Make sampling parameter a global config #183
Comments
Hi @bilalmussaukfd, Yeah, we can definitely add this in as a configuration parameter that you can turn on and off. As a temporary patch, if you want to try turning off the sampling right now, you can simply replace this one line inside PandasExecutor.py with ldf._sampled = ldf, as follows:
#PandasExecutor.execute_sampling(ldf)
ldf._sampled = ldf
If you are not sure where your lux is installed, you can print out the source location by:
import lux
lux.__file__
and then access lux/executor/PandasExecutor.py.
Hi Doris,
I have made that change, but it has slowed things down drastically. My dataset is only 34k rows (I trimmed it down from the initial 587k rows), so I'm not sure whether I am testing its limits.
Is Lux better for smaller datasets in this instance?
Regards
Bilal
From: Doris Lee <notifications@github.com>
Sent: 21 December 2020 15:37
To: lux-org/lux <lux@noreply.github.com>
Cc: Bilal Mussa <bilalmussa@ukflooringdirect.co.uk>; Mention <mention@noreply.github.com>
Subject: Re: [lux-org/lux] Remove the capping of sample at 30000 rows. (#183)
Hi @bilalmussaukfd<https://github.com/bilalmussaukfd>,
Yeah, we can definitely add this in as a configuration parameter that you can turn on and off.
We can also expose a parameter to let you adjust the number of samples and the number of rows that we chose to sample.
As you mentioned, the main concern is that turning it off might make things run really slow, especially when lots of things are plotted on the frontend. We can work on this and include this as an optional configuration in the next release.
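To make the on/off switch and the adjustable sample size concrete, here is a minimal sketch of how a capped random sample might behave. It is an illustration only, not Lux's implementation: `SamplingConfig`, `enabled`, `cap`, and `maybe_sample` are hypothetical names, and plain Python sequences stand in for dataframe rows.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class SamplingConfig:
    # Hypothetical settings for illustration; not Lux's actual configuration.
    enabled: bool = True  # turn sampling on or off
    cap: int = 30000      # maximum number of rows kept when sampling is on


def maybe_sample(rows: Sequence[T], config: SamplingConfig, seed: int = 1) -> List[T]:
    """Return every row, or a random subset of at most config.cap rows."""
    # Skip sampling when it is disabled or the data already fits under the cap.
    if not config.enabled or len(rows) <= config.cap:
        return list(rows)
    # A fixed seed keeps the chosen subset stable between runs.
    rng = random.Random(seed)
    return rng.sample(list(rows), config.cap)
```

With `enabled=False` every row is passed through, which is the behaviour the temporary patch below approximates, at the cost of slower processing on large inputs.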
As a temporary patch, if you want to try turning off the sampling right now, you can simply replace this one line inside PandasExecutor.py<https://github.com/lux-org/lux/blob/master/lux/executor/PandasExecutor.py#L83> with ldf._sampled = ldf, as follows:
#PandasExecutor.execute_sampling(ldf)
ldf._sampled = ldf
If you are not sure where your lux is installed, you can print out the source location by:
import lux
lux.__file__
and then access lux/executor/PandasExecutor.py. The changes should be reflected after you save the source and restart your notebook to import lux again. Let us know if you have any questions in the meanwhile. Thanks!
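If it helps, the lookup above can be sketched as a small helper that builds the path to the file from the package's `__file__`. The helper name `executor_path` is made up for this example, and it only assumes the `executor/PandasExecutor.py` layout mentioned above; it does not check that the file exists.

```python
import importlib
import os


def executor_path(package_name: str = "lux") -> str:
    """Build the expected path to executor/PandasExecutor.py in an installed package."""
    package = importlib.import_module(package_name)
    # For a package, __file__ points at its __init__.py, so take the containing folder.
    package_dir = os.path.dirname(package.__file__)
    return os.path.join(package_dir, "executor", "PandasExecutor.py")
```

Calling `print(executor_path())` in the notebook where lux is installed should then show the file to open for the edit.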
Hi @bilalmussaukfd, Thanks for your question. Scalability in Lux is something that we've been working on, but the performance is still far from perfect for big datasets. We have a performance test on the census dataset, which is 32k rows, 14 columns, around 4MB. For this dataset, Lux takes about 4 seconds to compute the whole widget visualization. The performance for various datasets really depends on a host of factors, including the number of rows, columns, data type of columns, and the input intent, so the dataset that I mentioned above just serves as one point of comparison. I'm hoping that we will have a faster version of Lux in the coming months, and I will definitely keep you updated when we release an improved version!
Hi @bilalmussaukfd,
pip install --upgrade lux-api
jupyter nbextension install --py luxwidget
jupyter nbextension enable --py luxwidget
As I mentioned above, we'll continue to work on the scalability of Lux to support larger datasets. Stay tuned!
Hiya
Is it possible to remove the 30000 rows random sampling so that I can see the entirety of my data albeit it will be a bit slow to process?