mapcols! should modify the parent of a SubDataFrame #3421

eoteroe · 2024-01-22T18:31:09Z

using DataFrames
df = DataFrame(a=["a", "b", "c", "d"], x=[2:2:8...], x2=[3:3:12...])
mapcols!(z -> z .- 2, df[!, 2:3])

df  # the two columns aren't transformed because df[!, 2:3] is a new data frame

# What about a view of df?
# mapcols! doesn't support SubDataFrames.
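
The behaviour described above can be checked with a small sketch. It assumes only what the report states: `df[!, 2:3]` builds a new data frame, and `mapcols!` has no method for a `SubDataFrame` (on the DataFrames.jl versions discussed here). The names `tmp`, `shares_before`, `shares_after` and `err` are made up for the illustration.

```julia
using DataFrames

df = DataFrame(a=["a", "b", "c", "d"], x=collect(2:2:8), x2=collect(3:3:12))

# df[!, 2:3] is a new DataFrame object that initially holds the same column vectors
tmp = df[!, 2:3]
shares_before = tmp.x === df.x

# mapcols! stores the freshly computed vectors in tmp, so df keeps its old columns
mapcols!(z -> z .- 2, tmp)
shares_after = tmp.x === df.x

# Passing a view instead is rejected; capture the error rather than crash
err = try
    mapcols!(z -> z .- 2, view(df, :, 2:3))
    nothing
catch e
    e
end
```

After running this, `df.x` and `df.x2` still hold their original values, and `err` holds the exception raised for the view.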

bkamins · 2024-01-22T20:00:25Z

I am not sure what you propose here. It seems all you describe above works as intended.

eoteroe · 2024-01-23T00:19:57Z

Thank you for the question. It seems to me that mapcols! was conceived to work on the whole data frame and apply the modification in place, but what about when you want to apply a function to a bunch of columns with specific criteria? I tried to use a view to propagate the modification to the parent, but the function doesn't accept SubDataFrames. I am aware that transform! should work just fine, but mapcols! is a great function too and easy to recall. The goal in the example is to have df modified directly by mapcols! (with the bang), even though we can do df[!,2:3] .= mapcols!(z->z.-2, df[!,2:3])

bkamins · 2024-01-23T06:45:55Z

what about when you want to apply a function to a bunch of columns with specific criteria

Then mapcols! cannot work in such a case. The reason is that mapcols! can change the number of rows in a data frame, which in the "subsetting" case would leave the result a corrupted data frame. As you commented, use transform! in these more complex cases.

Example of mapcols! resizing a data frame:

julia> df = DataFrame(a=1:4, b=11:14)
4×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14

julia> mapcols!(x -> x[2:end-1], df)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     2     12
   2 │     3     13
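
For the column-subset case from the original report, the transform! route mentioned above could look like the following. This is a minimal sketch rather than the only way to write it; the column names come from the first example in this thread, and `renamecols=false` is used so the results overwrite the source columns instead of creating new ones.

```julia
using DataFrames

df = DataFrame(a=["a", "b", "c", "d"], x=collect(2:2:8), x2=collect(3:3:12))

# Apply the same function to the selected columns, in place.
# Broadcasting `.=>` builds one `column => function` pair per column.
transform!(df, [:x, :x2] .=> (z -> z .- 2), renamecols=false)
```

Column `a` is left untouched, and the number of rows cannot drift out of sync because transform! requires the results to fit the existing data frame.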

eoteroe · 2024-01-24T14:56:51Z

Thanks for the explanation. In such a situation, I'd expect the reduction of the data frame to be part of the modification, but I guess that would be dangerous for the minilanguage and could break things elsewhere. Thanks!

eoteroe · 2024-01-24T14:59:53Z

"Not feasible for integrity"

bkamins · 2024-01-24T20:37:42Z

I'd expect the reduction of the data frame to be part of the modification

Yes - assume that the sub data frame has columns "a" and "b" but the parent also has columns "c" and "d". If you shorten columns "a" and "b" it is not clear what to do with columns "c" and "d".

eoteroe · 2024-01-25T04:58:11Z

I'd say that whatever dimensional reduction you apply to a part of the data frame should be propagated to the rest. I mean, whatever index reduction "a" and "b" undergo, "c" and "d" would follow whatever index is remaining after the transformation.

bkamins · 2024-01-25T06:50:32Z

whatever index is remaining after the transformation.

But this is impossible to determine. Again - assume the original data frame has 10 rows and 4 columns. Assume that I mapcols! columns 1 and 2 and resize them to 2 rows. Which of the 10 rows in columns 3 and 4 should be kept is unclear: it could be the first 2, the last 2, or any other pair.
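
The ambiguity can be made concrete with a short sketch. The data frame and the three functions below are invented for illustration; the point is only that each call returns two rows and none of the results records which parent rows (if any) they correspond to.

```julia
using DataFrames

df = DataFrame(a=1:10, b=11:20, c=21:30, d=31:40)
sub = df[:, [:a, :b]]   # a copy of the two columns that would be in the view

first2   = mapcols(x -> x[1:2], sub)                    # keeps the first two rows
last2    = mapcols(x -> x[end-1:end], sub)              # keeps the last two rows
summary2 = mapcols(x -> [minimum(x), maximum(x)], sub)  # not a row selection at all

row_counts = (nrow(first2), nrow(last2), nrow(summary2))
```

All three results have the same shape, so a hypothetical mapcols! on a view would have no way to tell which rows of `c` and `d` to keep.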

eoteroe · 2024-01-26T01:28:50Z

Thanks for the follow-up. When you create a view you are pointing to particular indices in the parent. And if you do another slicing on the view with mapcols, the result should be the combination of the two reductions in the parent. Let's see:

julia> df = DataFrame([1:6;; 7:12;;13:18;; 'a':'f';; 'f':'k';; ones(6)],string.('a':'f'))
6×6 DataFrame
 Row │ a    b    c    d    e    f   
     │ Any  Any  Any  Any  Any  Any 
─────┼──────────────────────────────
   1 │ 1    7    13   a    f    1.0
   2 │ 2    8    14   b    g    1.0
   3 │ 3    9    15   c    h    1.0
   4 │ 4    10   16   d    i    1.0
   5 │ 5    11   17   e    j    1.0
   6 │ 6    12   18   f    k    1.0

julia> vdf = @view df[3:end, 2:4]
4×3 SubDataFrame
 Row │ b    c    d   
     │ Any  Any  Any 
─────┼───────────────
   1 │ 9    15   c
   2 │ 10   16   d
   3 │ 11   17   e
   4 │ 12   18   f

# let's emulate the view with a data frame
julia> dft = df[3:end, 2:4]
4×3 DataFrame
 Row │ b    c    d   
     │ Any  Any  Any 
─────┼───────────────
   1 │ 9    15   c
   2 │ 10   16   d
   3 │ 11   17   e
   4 │ 12   18   f

## the modification of the view (on a real view this would damage it: indices out of bounds)
julia> mapcols( x->x[3:4],dft)
2×3 DataFrame
 Row │ b    c    d   
     │ Any  Any  Any 
─────┼───────────────
   1 │ 11   17   e
   2 │ 12   18   f


# from here on, this should be done programmatically (by a method), but I am doing it manually to illustrate the output
ndf = DataFrame(a=[], b=[], c=[], d=[], e=[], f=[])

## expected modification of the original DataFrame if mapcols! accepted a SubDataFrame
julia> append!(ndf, [df[5:6, [:a]]  mapcols(x -> x[3:4], dft)  df[5:6, [:e, :f]]])  ## note the original indices that remain after slicing
2×6 DataFrame
 Row │ a    b    c    d    e    f   
     │ Any  Any  Any  Any  Any  Any 
─────┼──────────────────────────────
   1 │ 5    11   17   e    j    1.0
   2 │ 6    12   18   f    k    1.0

Now, to do this I'd guess the modified indices must be tracked. What troubles me now is the case of an aggregating function; the only thing I came up with was that mapcols would apply a group-by to the columns it is not changing. I know that all this slicing can damage the view, but the analyst should be aware of this. I hope these thoughts help in some way.

bkamins · 2024-01-26T07:08:51Z

There are two issues with what you propose:

1. Why in the output do you assume that the filtered-out rows in the sub-data frame should be dropped? (This is a design issue and could be discussed - I would assume they should not be dropped.)
2. Why did you keep rows 5 and 6 from the original data frame (in the columns not in the view) - how could Julia know which rows to keep? To maybe better show the issue: assume that mapcols! increases the number of rows, e.g. the function modifying the data were x -> ones(100). What should then happen with the filtered-out columns?
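
The growing case is easy to reproduce with the non-mutating mapcols. This is a small sketch; `dft` here is a made-up two-column stand-in for the 4-row selection used earlier in the thread.

```julia
using DataFrames

dft = DataFrame(b=9:12, c=15:18)        # stand-in for the 4-row selection
grown = mapcols(x -> ones(100), dft)    # every column becomes a 100-element vector

rows_before = nrow(dft)
rows_after = nrow(grown)
```

The result has 100 rows, far more than the parent data frame in that example had, so the columns outside the view would have nothing to pair with.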

eoteroe · 2024-01-27T04:00:59Z

For the first point, you are right, I am assuming it, but that could be an option (something like materialize = true). Otherwise, we would filter only the positions selected by the slicing in mapcols, relative to the parent. Let's see the output:

4×6 DataFrame
 Row │ a    b    c    d    e    f   
     │ Any  Any  Any  Any  Any  Any 
─────┼──────────────────────────────
   1 │ 1    7    13   a    f    1.0
   2 │ 2    8    14   b    g    1.0
   3 │ 5    11   17   e    j    1.0
   4 │ 6    12   18   f    k    1.0

Now for the expansion: with that function, you would be substituting the values and expanding, and once you are out of bounds, you should get missing in the remaining columns that are not in the view:

 Row │ a        b    c    d    e        f       
     │ Any      Any  Any  Any  Any      Any     
─────┼──────────────────────────────────────────
   1 │ 1        7    13   a    f        1.0
   2 │ 2        8    14   b    g        1.0
   3 │ 3        1    1    1    h        1.0
   4 │ 4        1    1    1    i        1.0
   5 │ 5        1    1    1    j        1.0
   6 │ 6        1    1    1    k        1.0
   7 │ missing  1    1    1    missing  missing 
   8 │ missing  1    1    1    missing  missing

You could have a keyword for the fill value.

eoteroe changed the title ~~mapcols! should modify the parent of a SubDataFrames~~ mapcols! should modify the parent of a SubDataFrame Jan 22, 2024

bkamins added the question label Jan 22, 2024

eoteroe closed this as completed Jan 24, 2024
