In docs, note subsets are copies (unless of columns)? #2224

Closed
nickeubank opened this issue May 4, 2020 · 8 comments · Fixed by #2226

@nickeubank
Contributor

Loving updated docs, and the (clear!) discussion up front about views v. copies (my greatest pet peeve about pandas).

Are people open to noting in Indexing that subsets of columns with ! are non-copies, while subsets of rows always generate copies (I assume?). Can PR if supported.

@bkamins bkamins added the doc label May 4, 2020
@bkamins bkamins added this to the 1.x milestone May 4, 2020
@bkamins
Member

bkamins commented May 4, 2020

Sure - go ahead with a PR please. And yes - getindex always creates a copy unless the row index is ! or a single row is selected using an integer (in the latter case the result is either a single value, if a single column is selected, or a DataFrameRow, if multiple columns are selected; note that a DataFrameRow is a view). Feel free to decide how much of this complexity should go to "getting started" and what can be left as "advanced stuff" 😄.
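
A minimal sketch of the row-indexing cases described above, assuming a recent DataFrames.jl release; the data frame and the variable names (sub, row, val) are made up for illustration:

```julia
using DataFrames

df = DataFrame(a = [42, 47], b = ["y", "z"])

# Selecting a collection of rows allocates fresh column vectors (a copy).
sub = df[1:2, :]
sub.a[1] = 0              # mutates only the copy; df.a[1] is still 42

# A single integer row with several columns gives a DataFrameRow, which is a view.
row = df[1, :]
row.a = 100               # writes through to the parent: df.a[1] becomes 100

# A single integer row and a single column returns the stored value itself.
val = df[1, :b]
```

The point of the sketch is only that writing through sub leaves df untouched, while writing through row does not.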

@nickeubank
Contributor Author

Thanks @bkamins! I definitely don't want to overcomplicate -- for me though, knowing when you have a view and when you have a copy has profound implications for that most critical of things -- data integrity -- so I think it's worth pushing up front.

@nickeubank
Contributor Author

But df[!, [firstcolumn, secondcolumn]] isn't a view, right? That's a copy?

julia> df = DataFrame("a" => [42, 47], "b" => ["y", "z"])
2×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 42    │ y      │
│ 2   │ 47    │ z      │


julia> df[!,["a", "b"]] === df
false

@bkamins
Member

bkamins commented May 4, 2020

df[!, [firstcolumn, secondcolumn]] is a copy of a wrapper but not a copy of the contents.

As a data frame is a container we have THREE types of selection:

  • df[:, [:a, :b]] - copy of columns in a freshly allocated DataFrame
  • df[!, [:a, :b]] - alias (no copy, no view) of columns in a freshly allocated DataFrame (old and new data frames share columns :a and :b, and the columns pass the === test if compared)
  • df2 = @view df[:, [:a, :b]] or df2 = @view df[!, [:a, :b]] are the same and create a SubDataFrame (containing all rows and two columns of df); for columns :a and :b you then have, e.g., parent(df2.a) === df.a.

The details are described here. In general we have indexing and broadcasting design that is fully compatible with Base (+ an extension of !) while at the same time it allows you to have the ability to express any kind of indexing behaviour that can be needed (copy - safe, alias - fast and unsafe, view - fast and explicit that it is unsafe).
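
The three kinds of selection can be compared side by side. This is only a sketch, assuming a recent DataFrames.jl release; the names copied, aliased and viewed are hypothetical:

```julia
using DataFrames

df = DataFrame(a = [42, 47], b = ["y", "z"])

copied  = df[:, [:a, :b]]         # copy: new DataFrame, new column vectors
aliased = df[!, [:a, :b]]         # alias: new DataFrame, same column vectors
viewed  = @view df[:, [:a, :b]]   # view: SubDataFrame pointing into df

copied.a === df.a                 # false, the column was copied
aliased.a === df.a                # true, the column is shared
aliased === df                    # false, the wrapper is a new object
parent(viewed.a) === df.a         # true, the view wraps the original column

# Mutating through the alias changes df, while the copy is unaffected.
aliased.a[1] = 0
```

After the last line, df.a[1] is 0 while copied.a[1] is still 42, which is why the alias is described as fast but unsafe.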

nickeubank added a commit to nickeubank/DataFrames.jl that referenced this issue May 5, 2020
@nickeubank
Contributor Author

@bkamins PR with a "Note" on this topic. Thoughts?

@bkamins bkamins linked a pull request May 5, 2020 that will close this issue
@nickeubank
Contributor Author

Thanks. Tweaked a few places where you are definitely right; left the language ambiguities for others to review.

@matthieugomez
Contributor

matthieugomez commented May 5, 2020

Do we really need the "df[!, cols]" syntax in the end? It sounds like the behavior can be obtained either by df."$x" (to get a vector) or select (to get data frames).

@bkamins
Member

bkamins commented May 5, 2020

We need it for setindex! and broadcasting.
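
To illustrate where ! matters on the assignment side, here is a hedged sketch, assuming the assignment and broadcasting rules documented for DataFrames.jl 1.x; old_a and new_a are illustrative names:

```julia
using DataFrames

df = DataFrame(a = [42, 47], b = ["y", "z"])
old_a = df.a

# setindex! with `:` writes into the existing column vector in place.
df[:, :a] = [1, 2]
df.a === old_a            # true, same vector with new contents

# setindex! with `!` replaces the column, storing the new vector without copying.
new_a = [10, 20]
df[!, :a] = new_a
df.a === new_a            # true, and old_a is no longer part of df

# Broadcasting with `!` allocates a new column, so the element type may change.
df[!, :a] .= 1.5
eltype(df.a)              # Float64
```

Neither df."$x" nor select covers these cases, since both only read from the data frame.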

nickeubank added a commit to nickeubank/DataFrames.jl that referenced this issue May 7, 2020