-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add series describe function #414
Conversation
Can't win the message |
Why not just put all the statistics into a frame, with row keys "min", "max",..? This has the additional advantage that you can use the same procedure also for frames, not just for series. Edit To get rid of the warning, change the |
I want to keep |
Well at the moment most of the |
Actually I think that the result of |
Ok, sure if you like that better. But why not simply stick to strings as keys? |
I think the most appropriate design would be either all dynamic (following Pandas) or all static (using what F# offers). I would be slightly in favour of having a nice F# type as the result, because that works well with other F# code and it can be turned into a data frame easily using
That said, returning a data frame would also be okay - I can see that if one is working with data frames, then it makes sense to keep the structure. That way, I'd prefer to emulate the pandas patterns and just use string as the key. The idea of having a DU as a key is a neat trick and I can see how that might be a nice general pattern to use, but if it is a pattern that we are using in just one place in the entire library, then I think it might cause some confusion. |
Pandas supports two kinds of describe For numerical series, it returns count, mean, std, min, 25%, 50%, 75%, max. For categorical series, it returns count, unique, top, freq. Deedle series could have either numerical values or other values such as string or other objects. It would be good to separate describe for two cases. I would opt for using string as keys as users use describe mostly at exploring stage. It's just used interactively to see a quick output. The result structure is not that important. Keeping key as string would be more flexible while we add more keys in describe output. |
Another benefit of returning data frame instead of FSharp record is to display result in good format in Jupyter notebook. Data frame has a good formatter already. That’s key for Deedle to be adopted by more data scientists. |
Actually I wanted to have some blance between pandas and f# features. That's why the the result of String as a key also could be possible, but I think that if we could say more strictly what we have in series - is good. Also as @zyzhu mentioned, having Series as result is typically for pandas and in Deedle having result as Series would have a good formating. Need to check how @zyzhu about two kinds of describe. As I understand it could be one type, but different part filled. For example for the panda use This arguments made me to have exactly such changes what in this PR. Looking forward for interesting discussion :) |
I completely agree with @zyzhu about |
Sorry, maybe I don't understand correctly.
@tpetricek suggestions
Table of simple comparison.
I also kept in mind solutions with string and record when I coding the current approach. But I made a such comaprison and found that the current decision has more advantages. |
@alexpantyukhin Your table looks good to me. I think the only disadvantage of the current PR version is that it introduces a new pattern that we are not using anywhere else in the library. It's not necessarily a bad pattern, but if we only use it in one place, it will appear strange and people will have to figure out what it actually means (and how to work with the DU values). Both records and frame/series with string keys are more straightforward to understand. |
@tpetricek Oh, I see, then the result table would be something like that :)
Then I vote for having string keys, and have it in the docs. |
@alexpantyukhin That's roughly what I was thinking! Although there is the usual caveat that "clear for user" is very vaguely defined property, so it really is "tomas thinks it'll be clear for user". But I'm afraid that's all we can get without running a controlled experiment on our users or something like that :-). |
@alexpantyukhin I really like your table analysis. We shall apply similar logic when we add new functionality or refactor current methods. |
@alexpantyukhin I've just merged my pull which make stats functions inline to support inputs such as integer or decimals. Could you follow what I did and make describe function inline and output series/frame with string keys? |
@zyzhu yes, sure! I will. |
b0d7bc3
to
4ee9729
Compare
@zyzhu Done! rebased and fixed. |
Looks good. I'll see what I can do about min and max to make it not optional and then we can refactor this. |
thanks! what if to make it nan? |
/// | ||
/// [category:Series statistics] | ||
static member inline describe (series:Series<'K, 'V>) = | ||
match series with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting a warning here saying that this will make 'V
a float
:
warning FS0064: This construct causes code to be less generic than indicated by the type annotations. The type variable 'V has been constrained to be type 'float'.
You can fix that by adding box
, but then you also need to specify the type argument for the series key in the pattern matching:
match box series with
| :? Series<'K, float> as floatSeries ->
Also, I think the returned series should probably not be Series<string, float option>
but just Series<string, float>
because series has missing values built-in. To do that, you can replace Series.ofObservations
with Series.ofOptionalObservations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tpetricek I already refactored it in 6ba06fd. There's no longer warning. It will output Series<string, float>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Sorry, I missed that! Sounds good.
#398