
[pir+auto parallel] add reshard op for input when needed #63072

Merged (4 commits) on Mar 29, 2024

Conversation

@zhiqiu (Contributor) commented Mar 28, 2024

PR Category

Auto Parallel

PR Types

New features

Description

[pir+auto parallel] add reshard op for input when needed

This PR adds a pass named apply_partition_pass, which inserts a reshard op for an input when the value's dist_attr does not match the dist_attr of the consuming op's operand.

Pcard-76459

The program before,

{
    (%0) = "pd_op.data" () {dtype:(pd_op.DataType)float32,name:"learning_rate_1",op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[]}},persistable:[true],place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[],stop_gradient:[true]} : () -> pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>
    (%1) = "builtin.parameter" () {is_distributed:[false],is_parameter:[true],need_clip:[true],op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[0,-1]}},parameter_name:"parameter_1",persistable:[true],stop_gradient:[false],trainable:[true]} : () -> pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>
    (%2) = "builtin.parameter" () {is_distributed:[false],is_parameter:[true],need_clip:[true],op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[-1,0]}},parameter_name:"parameter_0",persistable:[true],stop_gradient:[false],trainable:[true]} : () -> pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>
    (%3) = "pd_op.data" () {dtype:(pd_op.DataType)float32,name:"input0",op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[-1,-1]}},place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[4,16],stop_gradient:[true]} : () -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%4) = "pd_op.data" () {dtype:(pd_op.DataType)float32,name:"label0",op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[-1,-1]}},place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[4,8],stop_gradient:[true]} : () -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%5) = "pd_op.relu" (%3) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%6) = "pd_op.matmul" (%5, %2) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,0]},result(0):{dims_maping:[-1,0]}},stop_gradient:[false],transpose_x:false,transpose_y:false} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>
    (%7) = "pd_op.relu" (%6) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,0]},result(0):{dims_maping:[-1,0]}},stop_gradient:[false]} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>
    (%8) = "pd_op.matmul" (%7, %1) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,0]},operand(1):{dims_maping:[0,-1]},result(0):{dims_maping:[-1,-1],partial(0,SUM)}},stop_gradient:[false],transpose_x:false,transpose_y:false} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1], partial(0,SUM)>
    (%9) = "pd_op.relu" (%8) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1], partial(0,SUM)>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%10) = "pd_op.subtract" (%9, %4) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%11) = "pd_op.square" (%10) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%12) = "pd_op.mean" (%11) {axis:(pd_op.IntArray)[],keepdim:false,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},result(0):{dims_maping:[]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>
    () = "builtin.shadow_output" (%12) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[]}},output_name:"loss_0"} : (pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>) -> 
    (%13) = "pd_op.full" () {dtype:(pd_op.DataType)float32,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[-1]}},place:(pd_op.Place)Place(cpu),shape:(pd_op.IntArray)[1],stop_gradient:[true],value:(Float)1} : () -> pd_dist.tensor<1xf32, mesh_shape:[2],dims_mappings:[-1]>
    (%14) = "pd_op.full_like" (%12, %13) {dtype:(pd_op.DataType)float32,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[]},result(0):{dims_maping:[]}},place:(pd_op.Place)Place(undefined:0),stop_gradient:[false]} : (pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>, pd_dist.tensor<1xf32, mesh_shape:[2],dims_mappings:[-1]>) -> pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>
    (%15) = "pd_op.mean_grad" (%11, %14) {axis:(pd_op.IntArray)[],keepdim:false,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[]},result(0):{dims_maping:[-1,-1]}},reduce_all:false,stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%16) = "pd_op.square_grad" (%10, %15) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%17, %18) = "pd_op.subtract_grad" (%9, %4, %16) {axis:(Int32)-1,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,-1]},operand(2):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]},result(1):{dims_maping:[-1,-1]}},stop_gradient:[false,false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, <<NULL TYPE>>
    (%19) = "pd_op.relu_grad" (%9, %17) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%20, %21) = "pd_op.matmul_grad" (%7, %1, %19) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,0]},operand(1):{dims_maping:[0,-1]},operand(2):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,0]},result(1):{dims_maping:[0,-1]}},stop_gradient:[false,false],transpose_x:false,transpose_y:false} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>
    (%22) = "pd_op.relu_grad" (%7, %20) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,0]},operand(1):{dims_maping:[-1,0]},result(0):{dims_maping:[-1,0]}},stop_gradient:[false]} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>
    (%23, %24) = "pd_op.matmul_grad" (%5, %2, %22) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,0]},operand(2):{dims_maping:[-1,0]},result(0):{dims_maping:[-1,-1],partial(0,SUM)},result(1):{dims_maping:[-1,0]}},stop_gradient:[false,false],transpose_x:false,transpose_y:false} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1], partial(0,SUM)>, pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>
    (%25) = "pd_op.relu_grad" (%5, %23) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1], partial(0,SUM)>) -> <<NULL TYPE>>
    (%26, %27) = "pd_op.sgd_" (%1, %0, %21, <<NULL VALUE>>) {multi_precision:false,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[0,-1]},operand(1):{dims_maping:[]},operand(2):{dims_maping:[0,-1]},operand(3):{null},result(0):{dims_maping:[0,-1]},result(1):{null}},stop_gradient:[false,false]} : (pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>, pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>, pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>, <<NULL TYPE>>) -> pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>, <<NULL TYPE>>
    (%28, %29) = "pd_op.sgd_" (%2, %0, %24, <<NULL VALUE>>) {multi_precision:false,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,0]},operand(1):{dims_maping:[]},operand(2):{dims_maping:[-1,0]},operand(3):{null},result(0):{dims_maping:[-1,0]},result(1):{null}},stop_gradient:[false,false]} : (pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>, pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, <<NULL TYPE>>) -> pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, <<NULL TYPE>>
}

The program after,

{
    (%0) = "pd_op.data" () {dtype:(pd_op.DataType)float32,name:"learning_rate_1",op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[]}},persistable:[true],place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[],stop_gradient:[true]} : () -> pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>
    (%1) = "builtin.parameter" () {is_distributed:[false],is_parameter:[true],need_clip:[true],op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[0,-1]}},parameter_name:"parameter_1",persistable:[true],stop_gradient:[false],trainable:[true]} : () -> pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>
    (%2) = "builtin.parameter" () {is_distributed:[false],is_parameter:[true],need_clip:[true],op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[-1,0]}},parameter_name:"parameter_0",persistable:[true],stop_gradient:[false],trainable:[true]} : () -> pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>
    (%3) = "pd_op.data" () {dtype:(pd_op.DataType)float32,name:"input0",op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[-1,-1]}},place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[4,16],stop_gradient:[true]} : () -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%4) = "pd_op.data" () {dtype:(pd_op.DataType)float32,name:"label0",op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[-1,-1]}},place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[4,8],stop_gradient:[true]} : () -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%5) = "pd_op.relu" (%3) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%6) = "pd_op.matmul" (%5, %2) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,0]},result(0):{dims_maping:[-1,0]}},stop_gradient:[false],transpose_x:false,transpose_y:false} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>
    (%7) = "pd_op.relu" (%6) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,0]},result(0):{dims_maping:[-1,0]}},stop_gradient:[false]} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>
    (%8) = "pd_op.matmul" (%7, %1) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,0]},operand(1):{dims_maping:[0,-1]},result(0):{dims_maping:[-1,-1],partial(0,SUM)}},stop_gradient:[false],transpose_x:false,transpose_y:false} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1], partial(0,SUM)>
    (%9) = "dist_op.reshard" (%8) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1],partial(0,SUM)},result(0):{dims_maping:[-1,-1]}}} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1], partial(0,SUM)>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%10) = "pd_op.relu" (%9) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%11) = "pd_op.subtract" (%10, %4) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%12) = "pd_op.square" (%11) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%13) = "pd_op.mean" (%12) {axis:(pd_op.IntArray)[],keepdim:false,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},result(0):{dims_maping:[]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>
    () = "builtin.shadow_output" (%13) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[]}},output_name:"loss_0"} : (pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>) -> 
    (%14) = "pd_op.full" () {dtype:(pd_op.DataType)float32,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},result(0):{dims_maping:[-1]}},place:(pd_op.Place)Place(cpu),shape:(pd_op.IntArray)[1],stop_gradient:[true],value:(Float)1} : () -> pd_dist.tensor<1xf32, mesh_shape:[2],dims_mappings:[-1]>
    (%15) = "pd_op.full_like" (%13, %14) {dtype:(pd_op.DataType)float32,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[]},result(0):{dims_maping:[]}},place:(pd_op.Place)Place(undefined:0),stop_gradient:[false]} : (pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>, pd_dist.tensor<1xf32, mesh_shape:[2],dims_mappings:[-1]>) -> pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>
    (%16) = "pd_op.mean_grad" (%12, %15) {axis:(pd_op.IntArray)[],keepdim:false,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[]},result(0):{dims_maping:[-1,-1]}},reduce_all:false,stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%17) = "pd_op.square_grad" (%11, %16) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%18, %19) = "pd_op.subtract_grad" (%10, %4, %17) {axis:(Int32)-1,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,-1]},operand(2):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]},result(1):{dims_maping:[-1,-1]}},stop_gradient:[false,false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, <<NULL TYPE>>
    (%20) = "pd_op.relu_grad" (%10, %18) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%21, %22) = "pd_op.matmul_grad" (%7, %1, %20) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,0]},operand(1):{dims_maping:[0,-1]},operand(2):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,0]},result(1):{dims_maping:[0,-1]}},stop_gradient:[false,false],transpose_x:false,transpose_y:false} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>, pd_dist.tensor<4x8xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>
    (%23) = "pd_op.relu_grad" (%7, %21) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,0]},operand(1):{dims_maping:[-1,0]},result(0):{dims_maping:[-1,0]}},stop_gradient:[false]} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>
    (%24, %25) = "pd_op.matmul_grad" (%5, %2, %23) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,0]},operand(2):{dims_maping:[-1,0]},result(0):{dims_maping:[-1,-1],partial(0,SUM)},result(1):{dims_maping:[-1,0]}},stop_gradient:[false,false],transpose_x:false,transpose_y:false} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1], partial(0,SUM)>, pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>
    (%26) = "dist_op.reshard" (%24) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1],partial(0,SUM)},result(0):{dims_maping:[-1,-1]}}} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1], partial(0,SUM)>) -> pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>
    (%27) = "pd_op.relu_grad" (%5, %26) {op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,-1]},operand(1):{dims_maping:[-1,-1]},result(0):{dims_maping:[-1,-1]}},stop_gradient:[false]} : (pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>, pd_dist.tensor<4x16xf32, mesh_shape:[2],dims_mappings:[-1,-1]>) -> <<NULL TYPE>>
    (%28, %29) = "pd_op.sgd_" (%1, %0, %22, <<NULL VALUE>>) {multi_precision:false,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[0,-1]},operand(1):{dims_maping:[]},operand(2):{dims_maping:[0,-1]},operand(3):{null},result(0):{dims_maping:[0,-1]},result(1):{null}},stop_gradient:[false,false]} : (pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>, pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>, pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>, <<NULL TYPE>>) -> pd_dist.tensor<16x8xf32, mesh_shape:[2],dims_mappings:[0,-1]>, <<NULL TYPE>>
    (%30, %31) = "pd_op.sgd_" (%2, %0, %25, <<NULL VALUE>>) {multi_precision:false,op_dist_attr:{mesh:{shape:[2],process_ids:[0,1]},operand(0):{dims_maping:[-1,0]},operand(1):{dims_maping:[]},operand(2):{dims_maping:[-1,0]},operand(3):{null},result(0):{dims_maping:[-1,0]},result(1):{null}},stop_gradient:[false,false]} : (pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, pd_dist.tensor<f32, mesh_shape:[2],dims_mappings:[]>, pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, <<NULL TYPE>>) -> pd_dist.tensor<16x16xf32, mesh_shape:[2],dims_mappings:[-1,0]>, <<NULL TYPE>>
}
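The insertion logic described above (compare each operand's expected dist_attr with the dist_attr the incoming value was produced with, and insert a `dist_op.reshard` on mismatch) can be sketched with a toy IR model. This is an illustrative sketch only: the dict-based op representation and the `apply_partition_pass` signature below are hypothetical, not Paddle's actual pir API.

```python
def apply_partition_pass(ops):
    """Toy model of the pass: `ops` is a list of dicts with keys
    'name', 'inputs', 'outputs', 'operand_dist_attrs', and
    'result_dist_attrs'. Returns a new op list with dist_op.reshard
    ops inserted wherever a value's dist_attr differs from what the
    consuming op expects for that operand."""
    value_dist_attr = {}  # value id -> dist_attr it was produced with
    new_ops = []
    n_reshard = 0

    for op in ops:
        new_inputs = []
        for value, want in zip(op["inputs"], op["operand_dist_attrs"]):
            have = value_dist_attr[value]
            if have != want:
                # Upstream produced e.g. partial(0,SUM) but the consumer
                # expects replicated: insert a reshard converting have -> want.
                reshard_out = f"{value}_reshard_{n_reshard}"
                n_reshard += 1
                new_ops.append({
                    "name": "dist_op.reshard",
                    "inputs": [value],
                    "outputs": [reshard_out],
                    "operand_dist_attrs": [have],
                    "result_dist_attrs": [want],
                })
                value_dist_attr[reshard_out] = want
                new_inputs.append(reshard_out)
            else:
                new_inputs.append(value)
        # Record what this op's results are produced as.
        for out, attr in zip(op["outputs"], op["result_dist_attrs"]):
            value_dist_attr[out] = attr
        new_ops.append({**op, "inputs": new_inputs})
    return new_ops
```

Run on a two-op program mirroring `%8`/`%9` above (a matmul producing a `partial(0,SUM)` tensor, consumed by a relu expecting a replicated one), the pass inserts exactly one reshard between them, as in the "program after" dump.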


paddle-bot bot commented Mar 28, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first; see the Paddle CI Manual for details.

@@ -66,6 +66,7 @@ def __init__(self, mesh):
)

def forward(self, x):
x.stop_gradient = False
Contributor

No need to make x require a gradient; the relu_grad in the backward pass will trigger the partial-to-replicated allreduce.

Contributor Author

It is needed; otherwise, relu_grad is not executed.

op.operands(), op.dist_attr().operand_dist_attrs()
):
if (
var.source().is_dist_dense_tensor_type()
Contributor

In scenarios where src_dist_attr and dst_dist_attr have different meshes (e.g. pipeline parallelism), it would be better to insert two reshard ops:
one reshard op whose mesh = src_dist_attr's mesh,
the other whose mesh = dst_dist_attr's mesh.

Then, in the subsequent (pipeline-stage) pruning pass, each stage keeps the reshard op whose mesh it needs and removes the other.
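The suggestion above can be sketched as follows. This is a minimal illustration of the idea, assuming a hypothetical dict-based op representation; it is not Paddle's actual pir API.

```python
def build_cross_mesh_reshards(value, src_dist_attr, dst_dist_attr):
    """Return the reshard op(s) to insert for `value` when converting
    from src_dist_attr to dst_dist_attr. On the same mesh, a single
    reshard suffices; across meshes, emit one op per mesh so a later
    stage-pruning pass can keep exactly the one matching its stage."""
    if src_dist_attr["mesh"] == dst_dist_attr["mesh"]:
        return [{"name": "dist_op.reshard", "input": value,
                 "mesh": src_dist_attr["mesh"],
                 "src": src_dist_attr, "dst": dst_dist_attr}]
    # Different meshes: one reshard on the source stage's mesh and one on
    # the destination stage's mesh. Each pipeline stage later keeps only
    # the op whose mesh matches its own and prunes the other.
    send = {"name": "dist_op.reshard", "input": value,
            "mesh": src_dist_attr["mesh"],
            "src": src_dist_attr, "dst": dst_dist_attr}
    recv = {"name": "dist_op.reshard", "input": value,
            "mesh": dst_dist_attr["mesh"],
            "src": src_dist_attr, "dst": dst_dist_attr}
    return [send, recv]
```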

Contributor Author

It could be refined in the next PR.

@pkuzyc (Contributor) left a comment

LGTM for spmd rule

@jeff41404 (Contributor) left a comment

LGTM for API

@sunzhongkai588 (Contributor) left a comment

LGTM

@zhiqiu zhiqiu merged commit 70cc347 into PaddlePaddle:develop Mar 29, 2024
29 of 30 checks passed