[ENH] Make sampling parameter a global config #183

bilalmussaukfd · 2020-12-21T14:39:09Z

Hiya
Is it possible to remove the 30,000-row random sampling so that I can see the entirety of my data, even though it will be a bit slower to process?

dorisjlee · 2020-12-21T15:36:14Z

Hi @bilalmussaukfd,

Yeah, we can definitely add this in as a configuration parameter that you can turn on and off.
We can also expose a parameter that lets you adjust the number of rows that we sample.
As you mentioned, the main concern is that turning it off might make things run really slow, especially when lots of things are plotted on the frontend. We can work on this and include this as an optional configuration in the next release.
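
To make the trade-off concrete, here is a minimal sketch of what capping a dataset with random sampling amounts to. It is not Lux's actual implementation: the function name `sample_row_positions` and the `cap` and `seed` parameters are made up for illustration, and the 30000 default is simply the cap mentioned in this thread.

```python
import random


def sample_row_positions(n_rows, cap=30000, seed=None):
    """Return sorted row positions to keep, at most `cap` of them.

    Hypothetical helper for illustration only, not part of Lux.
    """
    if cap is None or n_rows <= cap:
        # Small enough (or sampling disabled): keep every row.
        return list(range(n_rows))
    rng = random.Random(seed)
    # Draw `cap` distinct positions uniformly at random, without replacement.
    return sorted(rng.sample(range(n_rows), cap))


positions = sample_row_positions(587_000, cap=30_000, seed=0)
print(len(positions))  # 30000
```

With pandas, positions like these could be passed to `df.iloc[positions]`. Passing `cap=None` corresponds to the "sampling off" case, where everything downstream has to process the full data.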

As a temporary patch, if you want to try turning off the sampling right now, you can replace this one line inside PandasExecutor.py (https://github.com/lux-org/lux/blob/master/lux/executor/PandasExecutor.py#L83) with ldf._sampled = ldf, as follows:

#PandasExecutor.execute_sampling(ldf)
ldf._sampled = ldf

If you are not sure where your lux is installed, you can print out the source location with:

import lux
lux.__file__

and then open lux/executor/PandasExecutor.py. The changes should take effect after you save the source and restart your notebook to import lux again. Let us know if you have any questions in the meantime. Thanks!
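
If it helps, the lookup described above can be scripted. This is a rough sketch using only the standard library; `locate_package_file` is a hypothetical helper name, and the relative path is the one given in this comment. It does not import the package, so it also works when the import itself is slow or broken.

```python
import importlib.util
from pathlib import Path


def locate_package_file(package, relative_path):
    """Return the path of `relative_path` inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None  # package is not installed (or has no file on disk)
    # For a regular package, spec.origin points at its __init__.py.
    return Path(spec.origin).parent / relative_path


# Prints the full path to the executor source, or None if lux is not installed.
print(locate_package_file("lux", "executor/PandasExecutor.py"))
```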

bilalmussaukfd · 2020-12-22T08:59:43Z

Hi Doris,

I have made that change, but it has slowed things down drastically. My dataset is only 34k rows (I trimmed it down from the initial 587k rows), so I'm not sure if I am testing its limits? Is Lux better suited to smaller datasets in this instance?

Regards,
Bilal

dorisjlee · 2020-12-23T11:58:24Z

Hi @bilalmussaukfd,

Thanks for your question. Scalability in Lux is something that we've been working on, but the performance is still far from perfect for big datasets. We have a performance test on the census dataset, which is 32k rows, 14 columns, around 4MB. For this dataset, Lux takes about 4 seconds to compute the whole widget visualization. The performance for various datasets really depends on a host of factors, including the number of rows, the number of columns, the data types of the columns, and the input intent, so the dataset that I mentioned above just serves as one point of comparison. I'm hoping that we will have a faster version of Lux in the coming months, and will definitely keep you updated when we release an improved version!

dorisjlee · 2021-01-09T11:40:51Z

Hi @bilalmussaukfd,
Based on your concern, we've added a new parameter lux.config.sampling to make turning off sampling easier. Check out this page for more details!
You can access these updated changes by upgrading to the latest version of Lux:

pip install --upgrade lux-api
jupyter nbextension install --py luxwidget
jupyter nbextension enable --py luxwidget

As I mentioned above, we'll continue to work on the scalability of Lux to support larger datasets. Stay tuned!
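
For reference, using the new setting could look like the sketch below. Only `lux.config.sampling` comes from this thread; the `try`/`except` guard is just there so the snippet is harmless in an environment where Lux is not installed.

```python
try:
    import lux
except ImportError:
    lux = None  # Lux is not installed in this environment

if lux is not None:
    # Turn off sampling so visualizations are computed on the full dataframe.
    # Expect this to be slower on large datasets, as discussed above.
    lux.config.sampling = False
```

Setting the value back to `True` should restore the sampled behavior if the full dataset turns out to be too slow.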

dorisjlee changed the title ~~Remove the capping of sample at 30000 rows.~~ [ENH] Make sampling parameter a global config Dec 21, 2020

dorisjlee added the enhancement (New feature or request) and priority (high priority tasks, for dev) labels Dec 21, 2020

dorisjlee assigned westernguy2 Dec 21, 2020

westernguy2 mentioned this issue Dec 29, 2020

Add sampling parameters as a global config #192

Merged

dorisjlee closed this as completed Jan 9, 2021
