-
-
Notifications
You must be signed in to change notification settings - Fork 260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset improvements #70
Comments
Hi, I would love to help with this, but I can't say that I understand completely what you have in mind. Could you explain it a little more? Is the following what you had in mind?
|
both points are right, we would just follow the ndarray model more closely. Because of the coming holidays, my plan was to schedule that for beginning of next year. You can give it a try, though it's rather diligence work as all algorithms have to be updated :D |
Well I'll try then and I'll let you know how it goes 👍🏻 |
Hi, is there some specific reason why some datasets in the |
no this should default to |
I have one doubt about
But at that point, the content of the various items would overlap, and calling collect on that iterator would yield a mess right? I mean if one wanted to use two items from the iterator at the same time then there would be a conflict on the contents of the slices composing the two items right? Wouldn't such implementation need something like this proposal? I'm asking because as a rust noob I just can't wrap my head around it |
We can't ensure that somebody is holding a reference while modifying the content. One solution would involve RefCell, which basically moves the borrow check to the runtime and panics when the content is atm borrowed for mutation. |
And what would a function like the second one you mention return? The |
It should return |
Ok so I thought I had it because now it complies but there's one tiny problem. I tried to implement |
mh this was introduced to associate the correct lifetime for the kernel matrix here. We could copy the kernel matrix and avoid the lifetime there |
I've been a little busy lately and I'm proceeding slowly so I thought I'd give a little "status-report" here, also to get a second opinion on what I've been doing. First part:
Second part: Here things get a little less straightforward. I didn't want to force people using SVMs to clone their kernel every time, so I am trying to change SVMs to accept either an owned kernel or a "view" of one. To that end I changed
This way instead of taking a reference to the kernel one can call At this point the problem is that I would need to be able to differentiate between an SVM that has a view of data that lives as long as itself and one that owns the data in itself. I don't know if that can be done with just one type, I'm trying not to complicate the code too much. Now I still need to modify |
I'm currently re-locating to a different city and this skipped completely my radar, sorry for the late response.
👍
this seems to be the straightforward way to have a view on a kernel matrix.
you are here in a similar spot as the std::borrow::Cow implementation because you cannot suppress the lifetime in the type signature
no you cannot avoid this without introducing I think that it would be reasonable to copy only those sample points which are support vectors. The whole point of SVM is to find a few SV's which can describe the separating hyperplane and we can easily copy those. (btw. this is only the case for the non-linear case, otherwise we do not need to cache the samples and have memory complexity O(n)) The following changes would be reasonable:
hope this is plausible, I think we should not make the API more complex as it already is |
One way I've found is to define a trait "Inner" with all methods concerning only inner matrices and implement it for pub enum KernelInnerBase<K1: Inner, K2:Inner> {
Dense(K1),
Sparse(k2),
} Now the definition of kernel becomes: pub struct Kernel<R: Records, K1: Inner, K2: Inner> {
....
inner: KernelInner<K1,K2>,
....
} So I can define With this kind of implementation, all methods like I agree that the API should stay as simple as possible. Right now I'm deep into studying/taking my finals so I'll slowly get to the points you made about SVM in the next weeks. 👍🏻 |
What if in SVMs the kernel was specified as a parameter? I mean something like this: Svm::params().with_kernel(Kernel::params()).fit(&dataset) That way let kernel_matrix = self.kernel.transform(dataset);
...
svm.kernel = kernel_matrix;
svm.dataset = dataset;
... without the need to store |
this should be called
sounds like a good solution 👍 |
in Elements of Statistical Learning the training vectors are sometimes called |
To me, calling |
Good point! ESL uses predictors as a synonym for training data in the |
I agree, even if it may not be the best possible name I think that it is general enough to at least avoid conflicts 👍🏻 |
I've run into a problem when trying to implement the proposed changes to the regression metrics traits. If I use associated types it gets even worse, because I can only provide one implementation (and so only one option for the associated type) for each I think that the problem can't be overcome in rust since there is no concept of function overloading. This is what I've tried so far: Case 1: no associated typespub trait SingleTargetRegression<F,T> : AsTargets<F> {
fn max_error(&self, &T) -> ... {...}
}
pub trait MultiTargetRegression<F,T> : AsTargets<F> {
fn max_error(&self, &T) -> ... {...}
}
impl<F,type1> SingleTargetRegression<F,type1> for type {}
impl<F,type2> MultiTargetRegression<F,type2> for type {}
// error: multiple methods with the same name
type.max_error(..) This means that I can't have Case 2: Associated typesDefining type in trait templatepub trait SingleTargetRegression<F,T> : AsTargets<F> {
// error: associated type defaults are unstable
type Tar = T;
fn max_error(&self, &T) -> ... {...}
} Defining type in impl templatepub trait SingleTargetRegression<F> : AsTargets<F> {
type Tar : AsTargets<F>;
fn max_error(&self, &T) -> ... {...}
}
// error: unconstrained type parameter
impl<type2> SingleTargetRegression<F> for type {
type Tar = type2;
} implementing twicepub trait SingleTargetRegression<F> : AsTargets<F> {
type Tar : AsTargets<F>;
fn max_error(&self, &T) -> ... {...}
}
// error: conflicting implementation
impl SingleTargetRegression<F> for type {
type Tar = base_type;
}
impl SingleTargetRegression<F> for type {
type Tar = base_type2;
} |
mh I wasn't aware of the Array2/Array1 implementation, but yeah I think it is not possible with the current trait system (though I'm absolutely not an expert there) the only workaround I see is to implement
and hope that most people are using |
Yes it is implemented like that, just allowing the two arrays to have different data, the only issue is that (In this comment and in the ones before i am using |
a comment under the reddit post mentioned polars and I'm now thinking about adding a dataframe type: pub type Dataframe<T> = DatasetBase<polars::DataFrame, polars::ChunkedArray<T>> observe the missing type on |
Wow, I didn't know linfa was being posted on Reddit 🚀 |
the latter for example it makes no sense for ICA to support categorical features, but we could extend the scope of some algorithms with categorical types |
I guess that dataframes would also be useful for count vectorizers right? |
you can write a transformer |
I think that |
Most of the time the dataset is implemented for
ArrayBase<D, Ix2>
for records andArrayBase<D, Ix1>
for targets. We could simplify a lot ofimpl
s by:Dataset
toDatasetBase
Dataset
andDatasetView
Dataframe
Implement for new types
iter_fold
Improve ergonomics:
The text was updated successfully, but these errors were encountered: