-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BREAKING] Add PrettyTables.jl backend for printing DataFrames #2429
Conversation
This is a major issue. Can you check if disabling loading of CategoricalArrays.jl in DataFrames.jl helps? (if you do not know how to do it I will check as we anyway plan to remove CategoricalArrays.jl dependency) |
In general my objective is to have DataFrames.jl loading time under 1 second and first |
I think this problem is related to loading and precompiling PrettyTables.jl, because I do not see the problem with the traditional backend. Take a look: julia> @time setdisplay_prettytables()
4.767420 seconds (14.88 M allocations: 767.076 MiB, 5.23% gc time) This is the time necessary to load PrettyTables.jl and precompile the functions. I do not see how to improve this without adding precompile statements. |
Just for the record: if I print a DataFrame using PrettyTables.jl itself, it takes 2s in my computer to print it the first time. |
Also as a general recommendation from my experience - better use a branch from a DataFrames.jl fork to create a PR rather than master (as this way you will be able later to easier synchronize the PR with upstream master if some PRs get merged there) |
I have checked. Removing CategoricalArrays.jl does not solve all problems. But it seems that having PrettyTables.jl as a dependency always (non-conditional as now) is a bit better. Could you check it also please (then we do not need |
Ok! Sorry, I will do this in the next PR.
Yes! If PrettyTables.jl is loaded at the beginning and without |
I could decrease the time to first print in PrettyTables by 50% if I use |
I added a lot of annotations to private functions of PrettyTables.jl and could save more time when running EDIT: One important information to understand those timings. In my computer, this code: julia> df = DataFrame(a = 1) in a fresh environment takes 4.8s to print the first table, whereas this code: julia> df = DataFrame(a = 1); DataFrames._pretty_table(stdout,df) takes about 5.1s. Hence, PrettyTables.jl is adding a delay of 0.3s. |
Thank you for the investigations. I think that:
|
Regarding making the PrettyTables.jl a default backend we also should consider its dependencies:
|
Thanks! IMO if we add a PrettyTable backend, we should make it the default, and remove the old code. It doesn't sound too useful to have a backend that you need to enable manually. That would make the code simpler, and maybe simplify benchmarking the "time to first print". Can you also measure the time required to do |
The only problem is that we would not support
In the long run (as now a lot of this code will still be used in HTML and LaTeX output till we switch this also) |
I don't have a strong opinion on |
I am in agreement with @nalimilan I am also not keep on introducing global state of which backend currently configured to used. And if doing that, then what is the point of integrating it? PrettyTables should just be made to work nice with DataFrames via Tables.jl. Also i would be happy to drop |
This is what e.g. Plots.jl currently does, but I get the point. So in general in my opinion the direction is we will switch to PrettyTables.jl I think when we are sure we can safely do so. |
Yes, this was addressed. I constantly test the package against 1.0 and 1.5, everything seems fine!
Since the beginning I developed PrettyTables.jl depending on some functionalities of Parameters.jl. For example, to define a text table format, you need 13 parameters. Hence, let's say you want the default configuration, but without vertical lines. Then, because of Parameters.jl, you can create a new format using: julia> new_format = TextFormat(unicode, vlines = :all) Then it will copy all the fields in
The loading time of
Well, if you accept some changes in how HTML is printed (classes names, etc.), then I think PrettyTables.jl is pretty much (pun intended) ready. The problem I mentioned if that the current printing system of PrettyTables.jl is not configurable enough to make it looks exactly like the output of DataFrames. Check those two outputs: DataFrames.jl <table class="data-frame">
<thead>
<tr>
<th></th>
<th>a</th>
<th>b</th>
</tr>
<tr>
<th></th>
<th>Int64</th>
<th>Int64</th>
</tr>
</thead>
<tbody>
<p>1 rows × 2 columns</p>
<tr>
<th>1</th>
<td>1</td>
<td>2</td>
</tr>
</tbody>
</table> PrettyTables.jl <table>
<tr class = header>
<th style = "text-align: right; ">a</th>
<th style = "text-align: right; ">b</th>
</tr>
<tr class = "subheader headerLastRow">
<th style = "text-align: right; ">Int64</th>
<th style = "text-align: right; ">Int64</th>
</tr>
<tr>
<td style = "text-align: right; ">1</td>
<td style = "text-align: right; ">2</td>
</tr>
</table> (It will be very easy to use The same can be said to LaTeX. DataFrames.jl \begin{tabular}{r|cc}
& a & b\\
\hline
& Int64 & Int64\\
\hline
1 & 1 & 2 \\
\end{tabular} PrettyTables.jl \begin{table}
\begin{tabular}{rr}
\hline
\textbf{a} & \textbf{b} \\
\texttt{Int64} & \texttt{Int64} \\\hline
1 & 2 \\\hline
\end{tabular}
\end{table} The only problem here is that those two backends need tests. Currently, only the Text backend has a very complete test set.
Good! So, what actions should I do now? |
I think not - and it is the use case for Parameters.jl 😄. Simply it is not worth to use it in simple scenarios (but I guess your is complex)
This is OK, but the question is what would be the timing if you loaded PrettyTables.jl immediately.
We need to do a bit of testing I think and sleep with the decision to prune the legacy printing in text/plain (as it is really a big change). |
Sorry! I was not clear enough. This is the timing when I load PrettyTables.jl immediately. If I load on-demand inside the function
OK, but what about a global configuration state? A good point of using PrettyTables.jl is that the user can do a lot of tweaking when printing the DataFrame. Like:
Good! So I will wait here and continue testing to see if I find something wrong with this initial version until you reach a decision. |
Great. Thank you! Once CI passes I will recheck everything and approve. @nalimilan - now I need #2464 to get approval/comments and merge (there will be merge conflicts with this PR, but they should be easy so it does not matter which one si merged first). After these two are merged we do feature freeze for 0.22 release. We only need to add two things then:
|
@ronisbr - tests fail on : 32-bit platforms due to |
Yes! I just saw that. I will fix this in a minute. |
I think the tests are passing now! |
Thank you. Let us get rolling with this PR. While we wait for @nalimilan to do an approval of this PR (but I am OK if you add it later):
Thank you! |
I have resolved merge conflicts. So we are waiting for @nalimilan to approve this PR and we can merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I was going to approve and then I spotted a small glitch with the handling of the always-problematic -0.0
. :-)
# These vectors contain the number of the row and the padding that must be | ||
# applied so that the float number is aligned with on the decimal point | ||
indices = Vector{Int}[] | ||
padding = Vector{Int}[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No hurry to change this now, but I wonder whether it wouldn't be faster to avoid storing this information. Instead, the padding can be computed very cheaply in _pretty_tables_float_formatter
once you have obtained the string representation of the number using show
: it's just the difference between the string length and the maximum size (since numbers are pure ASCII it's even the number of codeunits, which is super fast to compute).
(Actually I must say I don't completely get why this algorithm is slow. Yes, it requires going over the data twice, but usually printing is much slower than this step which only involves quite fast operations. Isn't the slowness due to the fact that _pretty_tables_float_formatter
is quite inefficient as it checks the row and column index for each row?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No hurry to change this now, but I wonder whether it wouldn't be faster to avoid storing this information. Instead, the padding can be computed very cheaply in
_pretty_tables_float_formatter
once you have obtained the string representation of the number usingshow
: it's just the difference between the string length and the maximum size (since numbers are pure ASCII it's even the number of codeunits, which is super fast to compute).(Actually I must say I don't completely get why this algorithm is slow. Yes, it requires going over the data twice, but usually printing is much slower than this step which only involves quite fast operations. Isn't the slowness due to the fact that
_pretty_tables_float_formatter
is quite inefficient as it checks the row and column index for each row?)
I am not entirely certain that I understood it correctly. Let's say you have just two rows: 100.0 and 0.001. The maximum size would be 5 in both. Hence, the padding in this case obtained by the algorithm you proposed would be 0. However, the padding in the first case must be 0 and in the second case must be 2. Am I missing something?
(Actually I must say I don't completely get why this algorithm is slow. Yes, it requires going over the data twice, but usually printing is much slower than this step which only involves quite fast operations. Isn't the slowness due to the fact that
_pretty_tables_float_formatter
is quite inefficient as it checks the row and column index for each row?)
Yes, it can be! Do you have any suggestions to improve this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not entirely certain that I understood it correctly. Let's say you have just two rows: 100.0 and 0.001. The maximum size would be 5 in both. Hence, the padding in this case obtained by the algorithm you proposed would be 0. However, the padding in the first case must be 0 and in the second case must be 2. Am I missing something?
Woops, indeed that doesn't work. You'd need to recompute the order. Tough that should be quite fast so maybe it's not worse than allocating a full array.
Yes, it can be! Do you have any suggestions to improve this?
Without changing the current design based on passing a function to PrettyTables, it seems to me that the main way to make things faster would be to avoid having to do a search of the row and column index for each cell. It should be much faster to store Boolean vectors with one entry for each row/column and index into these with i
and j
. Though for rows that means storing a quite large vector; another approach would be to pass the ranges of rows as two separate arguments (checking for inclusion in a range is optimized to simple comparisons).
It could also be faster to call show(io, ...)
directly on the output stream rather than using sprint
to get a string. But I'm not sure whether it's possible with PrettyTables.
@nalimilan - thank you for the comments. It is better to wait 1 or 2 days, but be sure all is clean. |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Done! I think I modified the code considering all the suggestions and also added a test for -0.0. |
Bam! Let us have the ball rolling. Thank you for working on this. |
Thanks @ronisbr! I guess we'll have to update the manual to use the new printing? Ideally we would be able to turn everything into doctests so that it can be automated, but last time we tried it didn't work. |
I opened #2522 for it and will work on it. I will not turn on doctests yet, as it will probably be too problematic now. It can be done post 0.22 release. |
Yes, I think we need to convert the manual examples to the new layout and add a section showing how it can be extended. |
Yes - as I have commented earlier, when I finish #2522 and it is merged it would be great if you add two things:
Thank you! |
Congratulations @ronisbr on getting this merged! It was a huge effort with a lengthy review and it's been super impressive how you've handled it all every step of the way! |
Hi!
This is the first version of the PrettyTables.jl backend for DataFrames, which implements the new printing mechanism only for text.
The first thing needed to change the backend is to run this function:
setdisplay_prettytables()
You can pass any option supported by PrettyTables.jl so that the global configuration will be changed automatically. The only options that are not supported are
crop
,maximum_columns_width
,row_number_column_title
,nosubheader
, andtitle
because they are selected using the other native options in DataFrames'show
. Thus,will always print DataFrames without the row number and with all vertical lines.
After that, every DataFrame in
text/plain
will be printed using PrettyTables. Furthermore, ifshow
is used, then any configuration option of PrettyTables can be used.I think we can start with this code to begin the tuning process for this new backend.
The only major difference from how the traditional system works is how Markdown is handled. In case of PrettyTables.jl, we actually render Markdown:
Problems
I found three problem that I have no idea how to handle:
master
of PrettyTables.jl. So, to test it, you need to manually addmaster
of PrettyTables.jl usingPkg
.setdisplay_prettytables()
, then the first table would only be printed after 6s. I think this is too much and I do not know how we can improve this.splitcols
because I do not receive frompretty_table
how many columns were printed.EDIT: I did not add any documentation or tests, because the way I am doing things will likely change a lot.