Add section headings and row-by-row construction example #1416

Merged: 10 commits, Jul 11, 2018

Contributor

@oxinabox oxinabox commented Jun 4, 2018

I think it is easier to navigate with section headings.

And the row-by-row construction is something I always forget how to do.
(and I think more useful than column by column)

@nalimilan
Member

I'm not sure we should recommend creating data frames row by row, as it's really inefficient. At least we should explain how to create it by columns first.

Can you also fix the typography (code in backticks, spaces in "data frame", consistency in case, etc.)?

@oxinabox
Contributor Author

oxinabox commented Jun 5, 2018

> I'm not sure we should recommend creating data frames row by row, as it's really inefficient. At least we should explain how to create it by columns first.

Here is what I am thinking:
3 ways to construct a DataFrame.

  • All at once
  • Row By Row
  • Column By Column

Now let me detail (what I think) the use case for each is:

All at once

This is basically only for when you are loading data generated elsewhere.
The actual constructor (all at once) just turns data in one form into a dataframe.
It is basically intended to be supplied its args by loading packages like CSV.jl etc.
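
As a minimal sketch of the all-at-once case (not taken from the PR; the column names and values are made up, and the keyword-argument constructor is assumed to be available in the reader's DataFrames.jl version):

```julia
using DataFrames

# Hypothetical data that already exists as whole columns,
# e.g. handed over by a loading package.
name_col = ["Ann", "Bob", "Cyd"]
score_col = [91.0, 78.5, 84.0]

# The constructor takes one vector per column; all columns must have equal length.
students = DataFrame(name = name_col, score = score_col)
```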

Row by Row

To me, row by row is the only way anyone ever would create a dataframe live.
One row represents the results of one simulation.

While appending a row is slow, its cost is vanishingly small compared to the simulation time.
If a simulation takes 10 minutes to run, and appending a row takes n_rows^2 microseconds,
it just doesn't matter, because by the time I have enough rows for the append cost to become comparable to the simulation time...

The possible exception to this being the only way,
is "Predeclared Row by Row", where one initially constructs the whole dataframe but with all the values missing,
then fills in the blanks by running simulations.
But since there is no example of that I'll leave it out of here.
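
The two row-wise patterns described above could look roughly like this. It is a sketch only: `run_simulation` is a hypothetical stand-in for the expensive computation, the column names are invented, and pushing a named tuple assumes a DataFrames.jl version that supports it.

```julia
using DataFrames

# Hypothetical stand-in for an expensive simulation; returns one row as a named tuple.
run_simulation(seed) = (seed = seed, result = seed * 0.5)

# Row by row: start from an empty data frame with typed columns,
# then append one row per finished simulation.
results = DataFrame(seed = Int[], result = Float64[])
for seed in 1:3
    push!(results, run_simulation(seed))
end

# "Predeclared row by row": allocate every row up front with missing results,
# then fill in the blanks as simulations complete.
pre = DataFrame(seed = 1:3, result = Vector{Union{Missing, Float64}}(missing, 3))
for i in 1:nrow(pre)
    pre.result[i] = run_simulation(pre.seed[i]).result
end
```

The predeclared variant avoids growing the columns on every append, at the price of having to know the number of rows in advance.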

Column by Column

This isn't actually used to construct a dataframe.
It is used to enrich a dataframe, based on the information already in it.
For example:

students[:zscore] = (students[:score] .- mean(students[:score])) ./ std(students[:score])
students[:honorslist] = students[:zscore] .> 2

Maybe a possible case where it is used to actually construct a dataframe is when you are pulling columns out of a database, based on an existing df column (or Vector) of keys.
Though that is not too different from my example of standardizing scores.
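
A self-contained version of the score-standardizing example, as a sketch: the `students` data is invented, and the `df.col` property syntax is used here instead of the `df[:col]` indexing shown above, on the assumption that the reader has a DataFrames.jl version that supports it.

```julia
using DataFrames
using Statistics  # provides mean and std

# Hypothetical starting data.
students = DataFrame(name = ["Ann", "Bob", "Cyd", "Dee"],
                     score = [55.0, 60.0, 65.0, 100.0])

# Enrich the data frame with columns derived from the ones it already has.
students.zscore = (students.score .- mean(students.score)) ./ std(students.score)
students.honorslist = students.zscore .> 2
```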

@nalimilan
Member

I still disagree. The presentation logically starts by presenting the main DataFrame constructor, which takes a series of columns. And assembling a few vectors into a DataFrame does happen in real life. On the contrary, constructing a data frame row by row requires using push!, so it should come after, with a mention of the poor performance (which can matter, since the process generating rows does not always take 10 minutes).

@oxinabox
Contributor Author

oxinabox commented Jun 5, 2018

Alright, I've said my piece and failed to convince you, so fair enough.
Changed.

@nalimilan
Member

Thanks. Can you also fix the spacing and syntax?

@oxinabox
Contributor Author

I think this is all good now?

@nalimilan
Member

Sorry, there are still lots of typos and inconsistencies in casing and in the way DataFrame is written.

@oxinabox
Contributor Author

I'll give it another check over.
Writing without typos is something I am really bad at.
(There are actual reasons for that but not relevant here)

@oxinabox
Contributor Author

bump

@oxinabox
Contributor Author

oxinabox commented Jul 2, 2018

What do I need to do?

@nalimilan
Member

There are still typos, weird uses of semicolons, inconsistent casing in headings and missing blank lines after headings. Plus lines should be under 92 characters.

@oxinabox
Contributor Author

oxinabox commented Jul 2, 2018

> There are still typos,

Hopefully I've got them all now.

> weird uses of semicolons,

I have not added any semicolons at all AFAICT. Maybe you mean colons? I do tend to overuse those. (idk how I missed them) Fixed.

> inconsistent casing in headings

Ok, I've made every word in every heading start with an upper-case letter.

> and missing blank lines after headings.

Fixed.

> Plus lines should be under 92 characters.

I was under the impression that that convention is not being followed for this file.
There are 24 lines on master that are 93+ characters long.
In this PR there are 27 lines.

I feel like line-breaking the whole file can be its own PR.

@oxinabox
Contributor Author

oxinabox commented Jul 2, 2018

Thanks.
I think this is one of those "It is easier to fix than to explain how to fix" situations.

@nalimilan nalimilan merged commit e731982 into JuliaData:master Jul 11, 2018
pdeffebach pushed a commit to pdeffebach/DataFrames.jl that referenced this pull request Jul 14, 2018
2 participants