PooledDataVecs should have a user-specifiable type parameter allowing 1, 2, 4, or 8-byte levels #58

doobwa · 2012-09-13T21:02:29Z

The motivation: in some instances PooledDataVecs might have many levels (i.e. .more than 65000). The plan is to automatically convert the refs to Uint32 if we are about to overflow. One option is to make PooledDataVec a parametric type depending both on the type in the vector but also on the type of the reference, e.g. PooledDataVec{U, T}.

See #55 for a discussion.

HarlanH · 2012-09-16T23:00:54Z

Hm, my original intention was for PooledDataVec to be an implementation of AbstractDataVec, with different performance characteristics but (mostly or maybe entirely) identical interfaces to DataVec. Thus the UInt16 decision and the proposal for auto-conversion. Note that a PooledDataVec was NOT intended to be either necessary or sufficient for Factor-types, just the typical underlying representation. I'd prefer that we build appropriate meta-data and methods for either type that allows it to be used as a categorical/factor, with arbitrary element types, which would address Chris' UserID use case, I think. If you've got unique values in each row, the additional indirection doesn't help performance any. (cc @tshort , @StefanKarpinski for comments...)

doobwa · 2012-09-17T03:57:46Z

Aside from the typical number of levels, what is your view on the difference between PooledDataVecs and Factors/CategoricalVec? Do you see one as a subset of the other? For me, the differences are currently mostly just semantics. On the other hand, we may want Factor to contain some additional metadata over what PooledDataVecs currently contain (e.g. perhaps a default contrast?). To help this conversation along, would you mind elaborating on some of your motivating cases you have in mind for PooledDataVecs?

StefanKarpinski · 2012-09-17T16:40:59Z

For the record, I think that automatic conversion on overflow is a bad way to go. It adds checking overhead left and right and it will make all sorts of type inference things go completely haywire. It's much better, imo, to just let programmer pick how big a type they want, and change it easily. If and when you encounter the case that you need more levels, change the type in the user code.

I think Harlan's got the right idea about abstracting categorical data as an abstract type and having pooled vectors be a specific implementation. It's a perfectly valid use case to have unique categorical values and want to treat them categorically, but not pool their storage (since there's only one of each value). I'm not sure, but I'm guessing most cases with 65K+ levels would be in that case, although maybe not. Adding all those additional pointers (8 bytes each!) is not going to help when dealing with big data.

That was why it seemed likely to me that having pooled vectors use Uint16 would be sufficient. But you can always have pooled vector types with a type parameter (using only one byte when there are ≤ 256 levels is a nice win) and have separate unpooled categorical vectors so that you can pick the best implementation for each use case.

doobwa · 2012-09-17T17:21:52Z

Now that I realize that we're planning a separate type for categorical data, I agree that automatic conversion is a bad direction. Maybe I should close this? Is there an existing issue for developing/discussing this new type?

HarlanH · 2012-09-23T14:29:25Z

OK, yes, let's call #6 the issue for better supporting categorical values. I'll make a note there about the issues that arose here.

HarlanH · 2012-09-23T14:32:28Z

And this issue should be to allow PDV's to have parametric sizes, but still defaulting to 2 bytes.

johnmyleswhite · 2013-01-25T17:02:56Z

This shouldn't too hard if we make PDA's a parametric type whose parameter is the type of the refs. I already pulled all of the hardcoded references to Uint16 out of the code, which will hopefully make this change easier.

garborg · 2015-01-12T23:54:02Z

Closing here. At least partially resolved upstream.

Support AltRep SXP and related fixes

HarlanH closed this as completed Sep 23, 2012

HarlanH mentioned this issue Sep 23, 2012

optionally make AbstractDataVec be like an R factor #6

Closed

HarlanH reopened this Sep 23, 2012

HarlanH mentioned this issue Jan 25, 2013

PooledDataArray's with more than 2^16 levels #172

Closed

garborg closed this as completed Jan 12, 2015

nalimilan pushed a commit that referenced this issue May 26, 2022

Merge pull request #58 from alyst/ast/altrep_sxp

95173c6

Support AltRep SXP and related fixes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PooledDataVecs should have a user-specifiable type parameter allowing 1, 2, 4, or 8-byte levels #58

PooledDataVecs should have a user-specifiable type parameter allowing 1, 2, 4, or 8-byte levels #58

doobwa commented Sep 13, 2012

HarlanH commented Sep 16, 2012

doobwa commented Sep 17, 2012

StefanKarpinski commented Sep 17, 2012

doobwa commented Sep 17, 2012

HarlanH commented Sep 23, 2012

HarlanH commented Sep 23, 2012

johnmyleswhite commented Jan 25, 2013

garborg commented Jan 12, 2015

PooledDataVecs should have a user-specifiable type parameter allowing 1, 2, 4, or 8-byte levels #58

PooledDataVecs should have a user-specifiable type parameter allowing 1, 2, 4, or 8-byte levels #58

Comments

doobwa commented Sep 13, 2012

HarlanH commented Sep 16, 2012

doobwa commented Sep 17, 2012

StefanKarpinski commented Sep 17, 2012

doobwa commented Sep 17, 2012

HarlanH commented Sep 23, 2012

HarlanH commented Sep 23, 2012

johnmyleswhite commented Jan 25, 2013

garborg commented Jan 12, 2015