-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PooledDataVecs should have a user-specifiable type parameter allowing 1, 2, 4, or 8-byte levels #58
Comments
Hm, my original intention was for PooledDataVec to be an implementation of AbstractDataVec, with different performance characteristics but (mostly or maybe entirely) identical interfaces to DataVec. Thus the UInt16 decision and the proposal for auto-conversion. Note that a PooledDataVec was NOT intended to be either necessary or sufficient for Factor-types, just the typical underlying representation. I'd prefer that we build appropriate meta-data and methods for either type that allows it to be used as a categorical/factor, with arbitrary element types, which would address Chris' UserID use case, I think. If you've got unique values in each row, the additional indirection doesn't help performance any. (cc @tshort , @StefanKarpinski for comments...) |
Aside from the typical number of levels, what is your view on the difference between PooledDataVecs and Factors/CategoricalVec? Do you see one as a subset of the other? For me, the differences are currently mostly just semantics. On the other hand, we may want Factor to contain some additional metadata over what PooledDataVecs currently contain (e.g. perhaps a default contrast?). To help this conversation along, would you mind elaborating on some of your motivating cases you have in mind for PooledDataVecs? |
For the record, I think that automatic conversion on overflow is a bad way to go. It adds checking overhead left and right and it will make all sorts of type inference things go completely haywire. It's much better, imo, to just let programmer pick how big a type they want, and change it easily. If and when you encounter the case that you need more levels, change the type in the user code. I think Harlan's got the right idea about abstracting categorical data as an abstract type and having pooled vectors be a specific implementation. It's a perfectly valid use case to have unique categorical values and want to treat them categorically, but not pool their storage (since there's only one of each value). I'm not sure, but I'm guessing most cases with 65K+ levels would be in that case, although maybe not. Adding all those additional pointers (8 bytes each!) is not going to help when dealing with big data. That was why it seemed likely to me that having pooled vectors use |
Now that I realize that we're planning a separate type for categorical data, I agree that automatic conversion is a bad direction. Maybe I should close this? Is there an existing issue for developing/discussing this new type? |
OK, yes, let's call #6 the issue for better supporting categorical values. I'll make a note there about the issues that arose here. |
And this issue should be to allow PDV's to have parametric sizes, but still defaulting to 2 bytes. |
This shouldn't too hard if we make PDA's a parametric type whose parameter is the type of the refs. I already pulled all of the hardcoded references to |
Closing here. At least partially resolved upstream. |
Support AltRep SXP and related fixes
The motivation: in some instances PooledDataVecs might have many levels (i.e. .more than 65000). The plan is to automatically convert the refs to Uint32 if we are about to overflow. One option is to make PooledDataVec a parametric type depending both on the type in the vector but also on the type of the reference, e.g. PooledDataVec{U, T}.
See #55 for a discussion.
The text was updated successfully, but these errors were encountered: