[BREAKING] Add PrettyTables.jl backend for printing DataFrames #2429

ronisbr · 2020-09-13T19:33:57Z

Hi!

This is the first version of the PrettyTables.jl backend for DataFrames, which implements the new printing mechanism only for text.

The first thing needed to change the backend is to run this function:

setdisplay_prettytables()

You can pass any option supported by PrettyTables.jl so that the global configuration will be changed automatically. The only options that are not supported are crop, maximum_columns_width, row_number_column_title, nosubheader, and title because they are selected using the other native options in DataFrames' show. Thus,

setdisplay_prettytables(vlines = :all, show_row_number = false)

will always print DataFrames without the row number and with all vertical lines.

After that, every DataFrame in text/plain will be printed using PrettyTables. Furthermore, if show is used, then any configuration option of PrettyTables can be used.

julia> using DataFrames

julia> setdisplay_prettytables()

julia> df = DataFrame(A=Int64.(1:10), B = Vector{Any}(undef, 10));

julia> df.B[1:8] = [df, # DataFrame
                        df[1,:], # DataFrameRow
                        view(df,1:1, :), # SubDataFrame
                        eachrow(df), # DataFrameColumns
                        eachcol(df), # DataFrameRows
                        groupby(df, :A),missing,nothing] # GroupedDataFrame;

julia> df

julia> show(df, hlines = :all)

I think we can start with this code to begin the tuning process for this new backend.

The only major difference from how the traditional system works is how Markdown is handled. In case of PrettyTables.jl, we actually render Markdown:

julia> using Markdown

julia> df = DataFrame(a = md"""
       # Title

           test

       **bold** *italic*""", b = 1, c = 2)

julia> show(df, linebreaks = true)

Problems

I found three problem that I have no idea how to handle:

I had absolutely no idea how to make this commit depends on master of PrettyTables.jl. So, to test it, you need to manually add master of PrettyTables.jl using Pkg.
It takes a little bit to precompile all the functions. If I did not add a precompilation to setdisplay_prettytables(), then the first table would only be printed after 6s. I think this is too much and I do not know how we can improve this.
Using the current code, I was not able to support the option splitcols because I do not receive from pretty_table how many columns were printed.

EDIT: I did not add any documentation or tests, because the way I am doing things will likely change a lot.

ronisbr · 2020-09-13T19:53:21Z

I just found a minor bug in PrettyTables.jl in which we were printing a blank vertical line when a vertical line should not be drawn. I fixed in master and now the output is more compact:

Project.toml

src/abstractdataframe/prettytables.jl

src/abstractdataframe/show.jl

src/abstractdataframe/prettytables.jl

bkamins · 2020-09-13T20:29:00Z

then the first table would only be printed after 6s

This is a major issue. Can you check if disabling loading of CategoricalArrays.jl in DataFrames.jl helps? (if you do not know how to do it I will check as we anyway plan to remove CategoricalArrays.jl dependency)

bkamins · 2020-09-13T20:33:01Z

In general my objective is to have DataFrames.jl loading time under 1 second and first show also under 1 second hopefully.

ronisbr · 2020-09-13T20:36:14Z

This is a major issue. Can you check if disabling loading of CategoricalArrays.jl in DataFrames.jl helps? (if you do not know how to do it I will check as we anyway plan to remove CategoricalArrays.jl dependency)

I think this problem is related to loading and precompiling PrettyTables.jl, because I do not see the problem with the traditional backend. Take a look:

julia> @time setdisplay_prettytables()
  4.767420 seconds (14.88 M allocations: 767.076 MiB, 5.23% gc time)

This is the time necessary to load PrettyTables.jl and precompile the functions. I do not see how to improve this without adding precompile statements.

ronisbr · 2020-09-13T20:45:41Z

Just for the record: if I print a DataFrame using PrettyTables.jl itself, it takes 2s in my computer to print it the first time.

bkamins · 2020-09-13T21:08:42Z

Also as a general recommendation from my experience - better use a branch from a DataFrames.jl fork to create a PR rather than master (as this way you will be able later to easier synchronize the PR with upstream master if some PRs get merged there)

bkamins · 2020-09-13T21:31:23Z

I have checked. Removing CategoricalArrays.jl does not solve all problems. But it seems that having PrettyTables.jl as a dependency always (non-conditional as now) is a bit better. Could you check it also please (then we do not need invokelatest).

ronisbr · 2020-09-14T00:18:53Z

Also as a general recommendation from my experience - better use a branch from a DataFrames.jl fork to create a PR rather than master (as this way you will be able later to easier synchronize the PR with upstream master if some PRs get merged there)

Ok! Sorry, I will do this in the next PR.

I have checked. Removing CategoricalArrays.jl does not solve all problems. But it seems that having PrettyTables.jl as a dependency always (non-conditional as now) is a bit better. Could you check it also please (then we do not need invokelatest).

Yes! If PrettyTables.jl is loaded at the beginning and without invokelast, the time it took here in setdisplay_prettytables() decrease 50% (to 2s). I will update the PR.

ronisbr · 2020-09-14T01:00:33Z

I could decrease the time to first print in PrettyTables by 50% if I use @optlevel 1 and SnoopCompile.jl to precompile everything. The former is simple and I am really thinking to make this default in PrettyTables since we do not have any gain using those optimizations. The latter is more problematic. I need to make scripts for each version and OS.

ronisbr · 2020-09-14T02:30:11Z

I added a lot of annotations to private functions of PrettyTables.jl and could save more time when running setdisplay_prettytables(). It went from almost 4s in the original commit to 1.7s using all the modifications (without invokelast, with @optlevel 1, and with those new annotations).

EDIT: One important information to understand those timings. In my computer, this code:

julia> df = DataFrame(a = 1)

in a fresh environment takes 4.8s to print the first table, whereas this code:

julia> df = DataFrame(a = 1); DataFrames._pretty_table(stdout,df)

takes about 5.1s. Hence, PrettyTables.jl is adding a delay of 0.3s.

bkamins · 2020-09-14T07:11:05Z

Thank you for the investigations.

I think that:

PrettyTables.jl in general can signal Julia not to optimize code (as you have said - this is a package that does not do anything performance critical); we just need to make sure that the code is written in a way that it can be still run in Julia 1.0
The key decision is if we switch to PrettyTables.jl as the only printing backend for text/plain or we have this dual-backend (as it is proposed now - which is great for testing on its own right); @nalimilan, @oxinabox - do you have an opinion here; if we have this "general direction of the decision we can optimize against this direction (in particular if we switch to PrettyTables.jl fully then we can probably have a bit faster loading times)

bkamins · 2020-09-14T07:17:59Z

Regarding making the PrettyTables.jl a default backend we also should consider its dependencies:

Crayons.jl: this is safe to be used and is needed I think
Formatting.jl: this is safe to be used and is needed I think
Parameters.jl: I am not sure how much you need it (what do you think?), but probably it is OK to keep it as it is a relatively small package and not likely to change much in the future
Reexport.jl: this is OK to have I think
Tables.jl: this is clearly needed

nalimilan · 2020-09-14T08:45:34Z

Thanks! IMO if we add a PrettyTable backend, we should make it the default, and remove the old code. It doesn't sound too useful to have a backend that you need to enable manually. That would make the code simpler, and maybe simplify benchmarking the "time to first print".

Can you also measure the time required to do using DataFrames? Adding more dependencies can increase that.

src/abstractdataframe/prettytables.jl

bkamins · 2020-09-14T08:47:56Z

The only problem is that we would not support splitcols kwarg that we currently do, but maybe it is not that useful and we can drop it.

That would make the code simpler

In the long run (as now a lot of this code will still be used in HTML and LaTeX output till we switch this also)

nalimilan · 2020-09-14T08:52:18Z

I don't have a strong opinion on splitcols, but my position is that we should only add the PrettyTables backend when we are OK to make it the default. If that implies dropping existing arguments that we think aren't so useful, so be it.

oxinabox · 2020-09-14T11:08:07Z

I am in agreement with @nalimilan
Not worth adding it if we are going to still have to maintain the old code as well.

I am also not keep on introducing global state of which backend currently configured to used.
I think it should be specified each time it is used what you want to do.
Else it is spooky action at a distance.
I would not at all be surprised if someone some where calls sprint(show, df) and parses the results or otherwise intracts programatically with the result.
Not a wise thing to do, but I can image it being abused to do slight tweaks to make a report; or they have their own type with a dataframe field that they call show on to display something.
And having that output change depending on global state is just going to lead to confusion.
The output format should be specified when you want the output

And if doing that, then what is the point of integrating it? PrettyTables should just be made to work nice with DataFrames via Tables.jl.
And/or we should change to just have it as the only way to make output.

Also i would be happy to drop splitcols, its not something i ever use.
Or even was aware of until i looked it up now.
(I always set my terminal as superwide)

bkamins · 2020-09-14T12:20:44Z

introducing global state of which backend currently configured to used

This is what e.g. Plots.jl currently does, but I get the point.

So in general in my opinion the direction is we will switch to PrettyTables.jl I think when we are sure we can safely do so.

ronisbr · 2020-09-14T13:02:27Z

PrettyTables.jl in general can signal Julia not to optimize code (as you have said - this is a package that does not do anything performance critical); we just need to make sure that the code is written in a way that it can be still run in Julia 1.0

Yes, this was addressed. I constantly test the package against 1.0 and 1.5, everything seems fine!

Regarding making the PrettyTables.jl a default backend we also should consider its dependencies:

Parameters.jl: I am not sure how much you need it (what do you think?), but probably it is OK to keep it as it is a relatively small package and not likely to change much in the future

Since the beginning I developed PrettyTables.jl depending on some functionalities of Parameters.jl. For example, to define a text table format, you need 13 parameters. Hence, let's say you want the default configuration, but without vertical lines. Then, because of Parameters.jl, you can create a new format using:

julia> new_format = TextFormat(unicode, vlines = :all)

Then it will copy all the fields in unicode except for vlines. Is there an easy way to do this in Base?

Can you also measure the time required to do using DataFrames? Adding more dependencies can increase that.

The loading time of master here is 0.70s and the loading time with this PR is 0.84s. Is this an issue?

In the long run (as now a lot of this code will still be used in HTML and LaTeX output till we switch this also)

Well, if you accept some changes in how HTML is printed (classes names, etc.), then I think PrettyTables.jl is pretty much (pun intended) ready. The problem I mentioned if that the current printing system of PrettyTables.jl is not configurable enough to make it looks exactly like the output of DataFrames. Check those two outputs:

DataFrames.jl

<table class="data-frame">
<thead>
<tr>
<th></th>
<th>a</th>
<th>b</th>
</tr>

<tr>
<th></th>
<th>Int64</th>
<th>Int64</th>
</tr>
</thead>
<tbody>
<p>1 rows × 2 columns</p>
<tr>
<th>1</th>
<td>1</td>
<td>2</td>
</tr>
</tbody>
</table>

PrettyTables.jl

<table>
  <tr class = header>
    <th style = "text-align: right; ">a</th>
    <th style = "text-align: right; ">b</th>
  </tr>
  <tr class = "subheader headerLastRow">
    <th style = "text-align: right; ">Int64</th>
    <th style = "text-align: right; ">Int64</th>
  </tr>
  <tr>
    <td style = "text-align: right; ">1</td>
    <td style = "text-align: right; ">2</td>
  </tr>
</table>

(It will be very easy to use thead and tbody, however I have no idea what these do)

The same can be said to LaTeX.

DataFrames.jl

\begin{tabular}{r|cc}
	& a & b\\
	\hline
	& Int64 & Int64\\
	\hline
	1 & 1 & 2 \\
\end{tabular}

PrettyTables.jl

\begin{table}
  \begin{tabular}{rr}
    \hline
    \textbf{a} & \textbf{b} \\
    \texttt{Int64} & \texttt{Int64} \\\hline
    1 & 2 \\\hline
  \end{tabular}
\end{table}

The only problem here is that those two backends need tests. Currently, only the Text backend has a very complete test set.

So in general in my opinion the direction is we will switch to PrettyTables.jl I think when we are sure we can safely do so.

Good! So, what actions should I do now?

bkamins · 2020-09-14T13:12:51Z

Is there an easy way to do this in Base?

I think not - and it is the use case for Parameters.jl 😄. Simply it is not worth to use it in simple scenarios (but I guess your is complex)

The loading time of master here is 0.70s and the loading time with this PR is 0.84s. Is this an issue?

This is OK, but the question is what would be the timing if you loaded PrettyTables.jl immediately.

So, what actions should I do now?

We need to do a bit of testing I think and sleep with the decision to prune the legacy printing in text/plain (as it is really a big change).

ronisbr · 2020-09-14T13:22:08Z

This is OK, but the question is what would be the timing if you loaded PrettyTables.jl immediately.

Sorry! I was not clear enough. This is the timing when I load PrettyTables.jl immediately. If I load on-demand inside the function setdisplay_prettytables() then I see no measurable change.

I am also not keep on introducing global state of which backend currently configured to used.

OK, but what about a global configuration state? A good point of using PrettyTables.jl is that the user can do a lot of tweaking when printing the DataFrame. Like:

We need to do a bit of testing I think and sleep with the decision to prune the legacy printing in text/plain (as it is really a big change).

Good! So I will wait here and continue testing to see if I find something wrong with this initial version until you reach a decision.

bkamins · 2020-11-05T21:34:53Z

Great. Thank you! Once CI passes I will recheck everything and approve.

@nalimilan - now I need #2464 to get approval/comments and merge (there will be merge conflicts with this PR, but they should be easy so it does not matter which one si merged first).

After these two are merged we do feature freeze for 0.22 release. We only need to add two things then:

redo docstrings and manual to use the new printout format (I will do it to make sure all works well in the whole package)
decide if we wait for CategoricalArrays.jl release improving the latency problems (I leave it up to you to decide, we will drop the dependency in 1.0 release anyway)

bkamins · 2020-11-05T21:37:25Z

@ronisbr - tests fail on : 32-bit platforms due to Int32 and on Julia 1.0 (showing of Irrational). Both should be easy to fix.

ronisbr · 2020-11-05T21:38:46Z

@ronisbr - tests fail on : 32-bit platforms due to Int32 and on Julia 1.0 (showing of Irrational). Both should be easy to fix.

Yes! I just saw that. I will fix this in a minute.

ronisbr · 2020-11-06T01:03:04Z

I think the tests are passing now!

bkamins · 2020-11-06T07:56:33Z

Thank you. Let us get rolling with this PR.

While we wait for @nalimilan to do an approval of this PR (but I am OK if you add it later):

one more set of tests could be added for each type that supports printing using PrettyTables.jl can you please check if kwargs are correctly propagated - i.e. if you change some defaults how PrettyTables.jl should print things is this properly reflected in the output (such a test will make sure that we do not break something in the future).
A new section in the manual showing how to tweak the ouptut using PrettyTables.jl (so that users can learn what power they now have at their disposal)

Thank you!

bkamins · 2020-11-06T21:59:07Z

I have resolved merge conflicts. So we are waiting for @nalimilan to approve this PR and we can merge it.

nalimilan

Sorry I was going to approve and then I spotted a small glitch with the handling of the always-problematic -0.0. :-)

src/abstractdataframe/show.jl

nalimilan · 2020-11-07T11:20:27Z

src/abstractdataframe/show.jl

+    # These vectors contain the number of the row and the padding that must be
+    # applied so that the float number is aligned with on the decimal point
+    indices = Vector{Int}[]
+    padding = Vector{Int}[]


No hurry to change this now, but I wonder whether it wouldn't be faster to avoid storing this information. Instead, the padding can be computed very cheaply in _pretty_tables_float_formatter once you have obtained the string representation of the number using show: it's just the difference between the string length and the maximum size (since numbers are pure ASCII it's even the number of codeunits, which is super fast to compute).

(Actually I must say I don't completely get why this algorithm is slow. Yes, it requires going over the data twice, but usually printing is much slower than this step which only involves quite fast operations. Isn't the slowness due to the fact that _pretty_tables_float_formatter is quite inefficient as it checks the row and column index for each row?)

No hurry to change this now, but I wonder whether it wouldn't be faster to avoid storing this information. Instead, the padding can be computed very cheaply in _pretty_tables_float_formatter once you have obtained the string representation of the number using show: it's just the difference between the string length and the maximum size (since numbers are pure ASCII it's even the number of codeunits, which is super fast to compute).

(Actually I must say I don't completely get why this algorithm is slow. Yes, it requires going over the data twice, but usually printing is much slower than this step which only involves quite fast operations. Isn't the slowness due to the fact that _pretty_tables_float_formatter is quite inefficient as it checks the row and column index for each row?)

I am not entirely certain that I understood it correctly. Let's say you have just two rows: 100.0 and 0.001. The maximum size would be 5 in both. Hence, the padding in this case obtained by the algorithm you proposed would be 0. However, the padding in the first case must be 0 and in the second case must be 2. Am I missing something?

(Actually I must say I don't completely get why this algorithm is slow. Yes, it requires going over the data twice, but usually printing is much slower than this step which only involves quite fast operations. Isn't the slowness due to the fact that _pretty_tables_float_formatter is quite inefficient as it checks the row and column index for each row?)

Yes, it can be! Do you have any suggestions to improve this?

I am not entirely certain that I understood it correctly. Let's say you have just two rows: 100.0 and 0.001. The maximum size would be 5 in both. Hence, the padding in this case obtained by the algorithm you proposed would be 0. However, the padding in the first case must be 0 and in the second case must be 2. Am I missing something?

Woops, indeed that doesn't work. You'd need to recompute the order. Tough that should be quite fast so maybe it's not worse than allocating a full array.

Yes, it can be! Do you have any suggestions to improve this?

Without changing the current design based on passing a function to PrettyTables, it seems to me that the main way to make things faster would be to avoid having to do a search of the row and column index for each cell. It should be much faster to store Boolean vectors with one entry for each row/column and index into these with i and j. Though for rows that means storing a quite large vector; another approach would be to pass the ranges of rows as two separate arguments (checking for inclusion in a range is optimized to simple comparisons).

It could also be faster to call show(io, ...) directly on the output stream rather than using sprint to get a string. But I'm not sure whether it's possible with PrettyTables.

src/abstractdataframe/show.jl

bkamins · 2020-11-07T12:03:05Z

@nalimilan - thank you for the comments. It is better to wait 1 or 2 days, but be sure all is clean.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

ronisbr · 2020-11-07T15:07:45Z

Done! I think I modified the code considering all the suggestions and also added a test for -0.0.

bkamins · 2020-11-07T21:20:08Z

Bam! Let us have the ball rolling. Thank you for working on this.

nalimilan · 2020-11-07T22:22:42Z

Thanks @ronisbr!

I guess we'll have to update the manual to use the new printing? Ideally we would be able to turn everything into doctests so that it can be automated, but last time we tried it didn't work.

bkamins · 2020-11-07T22:31:19Z

I opened #2522 for it and will work on it. I will not turn on doctests yet, as it will probably be too problematic now. It can be done post 0.22 release.

ronisbr · 2020-11-07T22:31:30Z

Yes, I think we need to convert the manual examples to the new layout and add a section showing how it can be extended.

bkamins · 2020-11-08T07:26:38Z

add a section showing how it can be extended.

Yes - as I have commented earlier, when I finish #2522 and it is merged it would be great if you add two things:

a section in the manual on printing (short term goal)
more tests that check that kwargs from PrettyTables.jl are working (short term goal)
porting of LaTeX and HTML output to PrettyTables.jl (long term goal)

Thank you!

quinnj · 2020-11-10T00:04:40Z

Congratulations @ronisbr on getting this merged! It was a huge effort with a lengthy review and it's been super impressive how you've handled it all every step of the way!

bkamins added display priority labels Sep 13, 2020

bkamins added this to the 1.0 milestone Sep 13, 2020