Supplement ND SBP signatures for reshape op #9858

leaves-zwx · 2023-02-12T10:24:59Z

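For the vast majority of ops, we only need to enumerate the 1-D SBP signatures they may support. For 2-D (and, by extension, N-D), taking the cross product of the 1-D SBP signature list yields the 2-D SBP signature list.

The reshape op is an exception in some cases, as the following example shows: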

(8, 4) reshape to (2, 4, 4)
with 2x2 ranks, the 1-D SBP signature list is:
  S(0) -> S(0)
  S(1) -> S(2)
taking the cross product gives the 2-D SBP signature list:
  [S(0), S(1)] -> [S(0), S(2)]
  [S(1), S(0)] -> [S(2), S(0)]  (incurs a huge communication cost; ignored in the discussion below)
but the following 2-D SBP signature is also supported:
  [S(0), S(0)] -> [S(0), S(1)]

The example shows a general pattern: higher-dimensional SBP signatures cannot always be composed entirely from lower-dimensional ones.
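To make the example concrete, the sketch below checks which row-major element offsets each rank of the 2x2 mesh holds before and after the reshape. It is only an illustration: `OwnedOffsets` and `SameLayout` are hypothetical helpers (not OneFlow functions), the mesh is hard-coded to 2x2, and every split is assumed to divide evenly. A signature needs no data movement when every rank holds the same offsets on both sides.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Row-major flat offsets held by rank (r0, r1) of a 2x2 mesh when `shape` is
// split along axis0 by mesh dim 0 and then along axis1 by mesh dim 1.
// Hypothetical helper; assumes every split divides evenly.
std::vector<int64_t> OwnedOffsets(const std::vector<int64_t>& shape, int axis0, int axis1,
                                  int r0, int r1) {
  const int ndim = static_cast<int>(shape.size());
  std::vector<int64_t> begin(ndim, 0);
  std::vector<int64_t> size = shape;
  const int axes[2] = {axis0, axis1};
  const int ranks[2] = {r0, r1};
  for (int k = 0; k < 2; ++k) {
    size[axes[k]] /= 2;                          // each mesh dim has 2 ranks
    begin[axes[k]] += ranks[k] * size[axes[k]];  // nested splits on one axis accumulate
  }
  int64_t total = 1;
  for (int64_t s : shape) { total *= s; }
  std::vector<int64_t> offsets;
  for (int64_t flat = 0; flat < total; ++flat) {
    int64_t rem = flat;
    bool inside = true;
    for (int d = ndim - 1; d >= 0; --d) {  // decode the row-major index
      const int64_t idx = rem % shape[d];
      rem /= shape[d];
      if (idx < begin[d] || idx >= begin[d] + size[d]) { inside = false; }
    }
    if (inside) { offsets.push_back(flat); }
  }
  return offsets;
}

// True when every rank holds the same elements under both (shape, nd_sbp) pairs.
bool SameLayout(const std::vector<int64_t>& in_shape, int in0, int in1,
                const std::vector<int64_t>& out_shape, int out0, int out1) {
  for (int r0 = 0; r0 < 2; ++r0) {
    for (int r1 = 0; r1 < 2; ++r1) {
      if (OwnedOffsets(in_shape, in0, in1, r0, r1)
          != OwnedOffsets(out_shape, out0, out1, r0, r1)) {
        return false;
      }
    }
  }
  return true;
}
```

With these helpers, `SameLayout({8, 4}, 0, 0, {2, 4, 4}, 0, 1)` is true, which matches the extra signature [S(0), S(0)] -> [S(0), S(1)], while `SameLayout({8, 4}, 0, 0, {2, 4, 4}, 0, 2)` is false.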

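For this reason, ops get a new overridable function, EnumerateNdSbpSignatures. It is called after the 1-D SBP signatures have been enumerated and the 2-D SBP signature list has been generated from their cross product. When the 1-D cross product cannot produce all 2-D SBP signatures, it provides a way to supplement the extra ones.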

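EnumerateNdSbpSignatures is implemented for reshape. In short, the algorithm finds the dimensions that are reshaped and splits them consecutively by rank num, from the highest dimension to the lowest, until a split fails or the tensor is evenly split across every rank (going from high to low keeps the split contiguous).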
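A simplified sketch of that idea follows. It is not the PR's implementation: `GreedySplitAxes` is a hypothetical name, it only handles the case where the whole shape is split greedily from axis 0, and it ignores the step of locating which dimensions the reshape actually touches. For each mesh dimension (outermost first) it splits the current axis by that dimension's rank count, moving to the next axis once the current one is fully consumed, and reports failure when a split is not even.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// For each mesh dimension (outermost first), choose the axis of `shape` to
// split. Returns false when some split does not divide evenly.
// Hypothetical sketch, not the function added by the PR.
bool GreedySplitAxes(const std::vector<int64_t>& shape, const std::vector<int64_t>& mesh,
                     std::vector<int>* axes) {
  axes->clear();
  std::size_t axis = 0;
  int64_t remaining = shape.empty() ? 1 : shape[0];  // unsplit extent of the current axis
  for (int64_t rank_num : mesh) {
    // Move on once the current axis has been split down to extent 1.
    while (remaining == 1 && axis + 1 < shape.size()) {
      ++axis;
      remaining = shape[axis];
    }
    if (remaining % rank_num != 0) { return false; }  // cannot split evenly
    remaining /= rank_num;
    axes->push_back(static_cast<int>(axis));
  }
  return true;
}
```

Applied to the example above with mesh (2, 2), the input shape (8, 4) yields axes {0, 0} and the output shape (2, 4, 4) yields axes {0, 1}, which corresponds to [S(0), S(0)] -> [S(0), S(1)].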

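EnumerateNdSbpSignatures differs from the existing GetNdSbpSignatureList override as follows: EnumerateNdSbpSignatures supplements the list after the 1-D SBP signature cross product, whereas GetNdSbpSignatureList completely overrides the 2-D SBP signature enumeration logic and does not include the 1-D cross product. The latter mainly serves source ops, where the user can set the output SBP directly through an attr and no inference is needed.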
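The difference between the two hooks can be sketched as control flow. Everything here is an assumption made for illustration: `HypotheticalOp`, `Sig1D`, `SigND`, `CrossProduct2D` and `Enumerate2D` are invented names, signatures are plain strings, and the real OneFlow interfaces differ. The sketch produces the raw cross product and does not model any later shape-based filtering.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Sig1D = std::pair<std::string, std::string>;  // in -> out for one mesh dim
using SigND = std::pair<std::vector<std::string>, std::vector<std::string>>;

// Hypothetical op description; both hooks may be left empty.
struct HypotheticalOp {
  std::vector<Sig1D> one_d_signatures;
  std::function<void(std::vector<SigND>*)> enumerate_nd;  // supplements the list
  std::function<std::vector<SigND>()> get_nd_list;        // replaces the list
};

// Raw 2-D cross product of the 1-D signatures.
std::vector<SigND> CrossProduct2D(const std::vector<Sig1D>& sigs) {
  std::vector<SigND> out;
  for (const Sig1D& a : sigs) {
    for (const Sig1D& b : sigs) {
      out.emplace_back(std::vector<std::string>{a.first, b.first},
                       std::vector<std::string>{a.second, b.second});
    }
  }
  return out;
}

std::vector<SigND> Enumerate2D(const HypotheticalOp& op) {
  // Full override: the cross product is never generated.
  if (op.get_nd_list) { return op.get_nd_list(); }
  std::vector<SigND> list = CrossProduct2D(op.one_d_signatures);
  // Supplement: runs after the cross product and may append to it.
  if (op.enumerate_nd) { op.enumerate_nd(&list); }
  return list;
}
```

In this sketch a reshape-like op would set only `enumerate_nd` and append [S(0), S(0)] -> [S(0), S(1)], while a source-like op would set only `get_nd_list`.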

oneflow/user/ops/reshape_user_op_util.cpp

oneflow/core/framework/sbp_infer_util.cpp

oneflow/user/ops/reshape_user_op_util.cpp

Co-authored-by: Yipeng Li <jamesonli1313@gmail.com>

github-actions · 2023-02-21T20:32:20Z

Speed stats:

GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 140.9ms (= 14088.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 144.5ms (= 14454.5ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.03 (= 144.5ms / 140.9ms)

OneFlow resnet50 time: 80.6ms (= 8055.0ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.3ms (= 8531.5ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.06 (= 85.3ms / 80.6ms)

OneFlow resnet50 time: 50.0ms (= 10009.0ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 59.4ms (= 11870.4ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.19 (= 59.4ms / 50.0ms)

OneFlow resnet50 time: 33.6ms (= 6729.8ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 43.0ms (= 8600.2ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.28 (= 43.0ms / 33.6ms)

OneFlow resnet50 time: 25.0ms (= 5006.1ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.9ms (= 7985.9ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.60 (= 39.9ms / 25.0ms)

OneFlow swin dataloader time: 0.239s (= 47.848s / 200, num_workers=1)
PyTorch swin dataloader time: 0.152s (= 30.479s / 200, num_workers=1)
Relative speed: 0.637 (= 0.152s / 0.239s)

OneFlow swin dataloader time: 0.068s (= 13.643s / 200, num_workers=4)
PyTorch swin dataloader time: 0.043s (= 8.603s / 200, num_workers=4)
Relative speed: 0.631 (= 0.043s / 0.068s)

OneFlow swin dataloader time: 0.041s (= 8.251s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.424s / 200, num_workers=8)
Relative speed: 0.536 (= 0.022s / 0.041s)

❌ OneFlow resnet50 time: 152.7ms (= 15268.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 164.0ms (= 16403.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.07 (= 164.0ms / 152.7ms)

OneFlow resnet50 time: 92.1ms (= 9214.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.3ms (= 10231.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 102.3ms / 92.1ms)

OneFlow resnet50 time: 59.5ms (= 11892.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.8ms (= 15751.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 78.8ms / 59.5ms)

OneFlow resnet50 time: 42.1ms (= 8414.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.7ms (= 14331.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.70 (= 71.7ms / 42.1ms)

OneFlow resnet50 time: 36.0ms (= 7193.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.9ms (= 13578.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.89 (= 67.9ms / 36.0ms)

github-actions · 2023-02-21T20:43:34Z

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9858/

Yipeng1994 · 2023-02-22T08:32:47Z

oneflow/core/framework/sbp_infer_util.cpp

+  const size_t kMaxSplitAxis = 8;
+  const size_t kCarryDigit = kMaxSplitAxis + 2;
+  auto Mesure = [](const NdSbp& nd_sbp) -> size_t {
+    size_t value = 0;
+    for (int i = 0; i < nd_sbp.sbp_parallel_size(); ++i) {
+      size_t cur_dim_value = 0;
+      const auto& sbp = nd_sbp.sbp_parallel(i);
+      if (sbp.has_split_parallel()) {
+        CHECK_LT(sbp.split_parallel().axis(), kMaxSplitAxis);
+        cur_dim_value = sbp.split_parallel().axis();
+      } else if (sbp.has_broadcast_parallel()) {
+        cur_dim_value = kMaxSplitAxis;
+      } else if (sbp.has_partial_sum_parallel()) {
+        cur_dim_value = kMaxSplitAxis + 1;
+      } else {
+        UNIMPLEMENTED();
+      }
+      value += cur_dim_value * std::pow(kCarryDigit, (nd_sbp.sbp_parallel_size() - i - 1));
+    }
+    return value;
+  };


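This is fine as long as the rank mesh never changes, but sometimes the SBP may be collapsed to 1-D.
In that case S0 is indistinguishable from an empty entry. For example:
S1 has the value 1
(S0, S1) also has the value 1
To guarantee that different SBP sequences always get different values, I suggest S0 -> 1, S1 -> 2, ..., Si -> i+1
Then kCarryDigit = kMaxSplitAxis + 3;

Also, kMaxSplitAxis already has a concrete definition in another file; you can reuse it, so that if the number of dimensions is ever extended, only one place needs to change.

Also, don't use power;

value = value * kCarryDigit + cur_dim_value;

is enough.

Finally, I suggest using a map<size_t, NdSbp> for sorting.
The key returned by Mesure is a size_t, which is already ordered; using a map avoids recomputing the key on every compare.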
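The three suggestions can be sketched together. This is an illustration under assumptions, not the code that was merged: `Sbp`, `NdSbpSketch`, `Measure` and `SortByMeasure` are hypothetical stand-ins for the protobuf-based NdSbp and the lambda in the diff, and `kMaxSplitAxis` is redefined locally instead of reusing the shared constant. `split_offset = 0` reproduces the digits in the quoted diff, and `split_offset = 1` gives the suggested S(i) -> i+1 encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for one SBP entry.
struct Sbp {
  enum Kind { kSplit, kBroadcast, kPartialSum } kind;
  int axis;  // only meaningful for kSplit
};
using NdSbpSketch = std::vector<Sbp>;

Sbp S(int axis) { return Sbp{Sbp::kSplit, axis}; }
Sbp B() { return Sbp{Sbp::kBroadcast, 0}; }
Sbp P() { return Sbp{Sbp::kPartialSum, 0}; }

const std::size_t kMaxSplitAxis = 8;  // assumed to match the constant in the diff

// Encodes nd_sbp as a base-kCarryDigit number, first entry most significant.
std::size_t Measure(const NdSbpSketch& nd_sbp, std::size_t split_offset) {
  const std::size_t kCarryDigit = kMaxSplitAxis + 2 + split_offset;
  std::size_t value = 0;
  for (const Sbp& sbp : nd_sbp) {
    std::size_t cur_dim_value = 0;
    if (sbp.kind == Sbp::kSplit) {
      cur_dim_value = static_cast<std::size_t>(sbp.axis) + split_offset;
    } else if (sbp.kind == Sbp::kBroadcast) {
      cur_dim_value = kMaxSplitAxis + split_offset;
    } else {
      cur_dim_value = kMaxSplitAxis + split_offset + 1;
    }
    value = value * kCarryDigit + cur_dim_value;  // Horner form, no std::pow
  }
  return value;
}

// Sorts by the suggested encoding; each key is computed exactly once.
std::vector<NdSbpSketch> SortByMeasure(const std::vector<NdSbpSketch>& list) {
  std::map<std::size_t, NdSbpSketch> ordered;
  for (const NdSbpSketch& nd_sbp : list) { ordered.emplace(Measure(nd_sbp, 1), nd_sbp); }
  std::vector<NdSbpSketch> result;
  for (const auto& kv : ordered) { result.push_back(kv.second); }
  return result;
}
```

With `split_offset = 0`, `{S(1)}` and `{S(0), S(1)}` both encode to 1; with `split_offset = 1` they encode to 2 and 13. Because the keys are then unique per sequence, using them as map keys does not drop distinct entries.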

github-actions · 2023-02-25T20:30:46Z

Speed stats:

GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 140.9ms (= 14087.0ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 143.1ms (= 14310.2ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.02 (= 143.1ms / 140.9ms)

OneFlow resnet50 time: 80.6ms (= 8058.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.5ms (= 8554.9ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.06 (= 85.5ms / 80.6ms)

OneFlow resnet50 time: 49.5ms (= 9898.6ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.2ms (= 11639.1ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.18 (= 58.2ms / 49.5ms)

OneFlow resnet50 time: 32.5ms (= 6509.5ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 41.0ms (= 8206.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.26 (= 41.0ms / 32.5ms)

OneFlow resnet50 time: 25.0ms (= 4997.5ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 38.2ms (= 7643.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.53 (= 38.2ms / 25.0ms)

OneFlow swin dataloader time: 0.243s (= 48.645s / 200, num_workers=1)
PyTorch swin dataloader time: 0.150s (= 29.974s / 200, num_workers=1)
Relative speed: 0.616 (= 0.150s / 0.243s)

OneFlow swin dataloader time: 0.072s (= 14.377s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.161s / 200, num_workers=4)
Relative speed: 0.568 (= 0.041s / 0.072s)

OneFlow swin dataloader time: 0.040s (= 7.920s / 200, num_workers=8)
PyTorch swin dataloader time: 0.021s (= 4.251s / 200, num_workers=8)
Relative speed: 0.537 (= 0.021s / 0.040s)

❌ OneFlow resnet50 time: 153.0ms (= 15298.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 163.1ms (= 16313.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.07 (= 163.1ms / 153.0ms)

OneFlow resnet50 time: 91.8ms (= 9184.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 104.4ms (= 10442.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.14 (= 104.4ms / 91.8ms)

OneFlow resnet50 time: 60.1ms (= 12024.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.8ms (= 15753.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 78.8ms / 60.1ms)

OneFlow resnet50 time: 41.8ms (= 8360.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.3ms (= 14054.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.68 (= 70.3ms / 41.8ms)

OneFlow resnet50 time: 37.6ms (= 7518.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.6ms (= 14116.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.88 (= 70.6ms / 37.6ms)

github-actions · 2023-02-25T20:36:18Z

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9858/

Yipeng1994 · 2023-02-27T09:59:45Z

oneflow/core/framework/sbp_infer_util.cpp

+    if (sbp.has_split_parallel()) {
+      CHECK_LT(sbp.split_parallel().axis(), kMaxSplitAxis);
+      // from 1 to 8
+      cur_dim_value = sbp.split_parallel().axis() + 1;
+    } else if (sbp.has_broadcast_parallel()) {
+      // 9
+      cur_dim_value = kMaxSplitAxis + 1;
+    } else if (sbp.has_partial_sum_parallel()) {
+      // 10
+      cur_dim_value = kMaxSplitAxis + 2;
+    } else {
+      UNIMPLEMENTED();
+    }


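Everything else looks fine.

I thought about this part. In our last meeting we discussed using 97 as the carry between different blobs, right? As long as the NdSbp value does not exceed 96, there is no risk at all.
Now B maps to 9, P maps to 10, and the carry is 11, so (B, B) becomes 9*11+9 = 108, which exceeds 97.
B is a very frequent SBP, so the risk would show up more often.
But if B maps to 1, P maps to 2, and Si -> i+3, then exceeding 96 takes at least 88+9, that is, only (S5, S6), (S5, S7), or (S6, any SBP) can be at risk.

In practice S5 is rare, let alone (S5, S6); even (S5, S5) is fine. For a problem to occur there must be at least one S6 or S7, which means the tensor needs at least 7 dimensions.

So moving the numbers for B and P to the front effectively avoids most of the risk. (Of course, a risk does not necessarily lead to a problem, since a prime carry sidesteps some collisions, but if most of the risk can be avoided, it is better to avoid it.)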
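The arithmetic in this comment can be reproduced with a small sketch. The names are hypothetical (`diff_mapping`, `proposed_mapping`, `Encode2D`, `Fits`), and the value 97 is taken from the comment itself rather than from the code base.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kMaxSplitAxis = 8;
const std::size_t kCarryDigit = kMaxSplitAxis + 3;  // 11
const std::size_t kBlobCarry = 97;  // per-blob carry mentioned in the review comment

// Digits used in the quoted diff: S(i) -> i+1, B -> 9, P -> 10.
namespace diff_mapping {
std::size_t S(std::size_t axis) { return axis + 1; }
std::size_t B() { return kMaxSplitAxis + 1; }
std::size_t P() { return kMaxSplitAxis + 2; }
}  // namespace diff_mapping

// Digits proposed in the comment: B -> 1, P -> 2, S(i) -> i+3.
namespace proposed_mapping {
std::size_t S(std::size_t axis) { return axis + 3; }
std::size_t B() { return 1; }
std::size_t P() { return 2; }
}  // namespace proposed_mapping

// Value of a 2-D NdSbp given the digits of its two entries.
std::size_t Encode2D(std::size_t first, std::size_t second) {
  return first * kCarryDigit + second;
}

// True when the value stays below the per-blob carry.
bool Fits(std::size_t value) { return value < kBlobCarry; }
```

Under the diff's digits, (B, B) encodes to 108 and does not fit; under the proposed digits it encodes to 12. With the proposed digits (S5, S5) encodes to 96 and still fits, while (S5, S6) encodes to 97 and is the smallest 2-D value that no longer fits.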

oneflow/user/ops/reshape_user_op_util.h

oneflow/core/framework/sbp_infer_util_test.cpp

oneflow/user/ops/reshape_op.cpp

github-actions · 2023-02-28T18:27:21Z

Speed stats:

GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.0ms (= 14102.7ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 143.0ms (= 14301.9ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.01 (= 143.0ms / 141.0ms)

OneFlow resnet50 time: 80.7ms (= 8067.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.8ms (= 8379.6ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.04 (= 83.8ms / 80.7ms)

OneFlow resnet50 time: 50.1ms (= 10021.4ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 56.8ms (= 11355.4ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.13 (= 56.8ms / 50.1ms)

OneFlow resnet50 time: 33.0ms (= 6590.1ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 49.4ms (= 9875.6ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.50 (= 49.4ms / 33.0ms)

OneFlow resnet50 time: 26.1ms (= 5226.3ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.9ms (= 7981.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.53 (= 39.9ms / 26.1ms)

OneFlow swin dataloader time: 0.242s (= 48.483s / 200, num_workers=1)
PyTorch swin dataloader time: 0.150s (= 29.910s / 200, num_workers=1)
Relative speed: 0.617 (= 0.150s / 0.242s)

OneFlow swin dataloader time: 0.068s (= 13.633s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.310s / 200, num_workers=4)
Relative speed: 0.610 (= 0.042s / 0.068s)

OneFlow swin dataloader time: 0.044s (= 8.818s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.435s / 200, num_workers=8)
Relative speed: 0.503 (= 0.022s / 0.044s)

❌ OneFlow resnet50 time: 152.6ms (= 15257.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 163.7ms (= 16365.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.07 (= 163.7ms / 152.6ms)

OneFlow resnet50 time: 92.2ms (= 9221.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.9ms (= 10291.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.12 (= 102.9ms / 92.2ms)

OneFlow resnet50 time: 59.7ms (= 11937.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.2ms (= 15833.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 79.2ms / 59.7ms)

OneFlow resnet50 time: 41.8ms (= 8351.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.2ms (= 14233.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.70 (= 71.2ms / 41.8ms)

OneFlow resnet50 time: 35.4ms (= 7084.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 66.5ms (= 13295.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.88 (= 66.5ms / 35.4ms)

github-actions · 2023-02-28T18:31:17Z

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9858/

leaves-zwx added 6 commits February 9, 2023 05:27

EnumerateNdSbpSignatures

663cf30

ReshapeOp::EnumerateNdSbpSignatures

dccb762

fix algo

7cd2dc1

fix

1eed7aa

cpp test

3dc723b

fix

cbf7261

leaves-zwx requested review from jackalcooper, hjchen2, BBuf, chengtbf and strint as code owners February 12, 2023 10:25

leaves-zwx requested review from Yipeng1994 and liujuncheng February 12, 2023 10:25

leaves-zwx self-assigned this Feb 12, 2023

leaves-zwx added enhancement op labels Feb 12, 2023

leaves-zwx added 4 commits February 12, 2023 10:30

rename

a440e8d

refine UserOp::GetNdSbpSignatureList

6998d52

ReshapeOp::EnumerateNdSbpSignatures

94870a6

py test

bff5183

leaves-zwx requested a review from daquexian as a code owner February 12, 2023 16:04

Yipeng1994 reviewed Feb 13, 2023

oneflow/user/ops/reshape_user_op_util.cpp (outdated)

leaves-zwx added 7 commits February 13, 2023 16:24

revert changes

2db180a

new ReshapeOp::EnumerateNdSbpSignatures

84b3acf

add test case

ad4a227

refine algorithm

7c22e1a

update test and fix bug

c51364a

rm comment

db658eb

rm redundant condition

278f577

Yipeng1994 reviewed Feb 17, 2023

oneflow/user/ops/reshape_user_op_util.cpp (outdated)

Yipeng1994 reviewed Feb 21, 2023

oneflow/core/framework/sbp_infer_util.cpp (outdated)

Yipeng1994 reviewed Feb 21, 2023

oneflow/user/ops/reshape_user_op_util.cpp (outdated)

Yipeng1994 reviewed Feb 21, 2023

oneflow/user/ops/reshape_user_op_util.cpp (outdated)

Yipeng1994 reviewed Feb 21, 2023

oneflow/user/ops/reshape_user_op_util.cpp

leaves-zwx and others added 5 commits February 21, 2023 18:26

refine sort algorithm and test

351083e

rm calling FilterNdSbpIn2OutSignatures in EnumerateNdSbpSignatures

e6b579b

change example

ca9eb24

Update oneflow/user/ops/reshape_user_op_util.cpp

abe42db

Co-authored-by: Yipeng Li <jamesonli1313@gmail.com>

Merge branch 'master' into enlarge_reshape_sbp

4836822

Yipeng1994 reviewed Feb 22, 2023

Yipeng1994 approved these changes Feb 22, 2023

leaves-zwx added 2 commits February 25, 2023 19:24

refine sort algorithm

7bacbcf

Merge branch 'master' into enlarge_reshape_sbp

f578fcf

Yipeng1994 reviewed Feb 27, 2023

jackalcooper reviewed Feb 28, 2023

oneflow/user/ops/reshape_user_op_util.h

jackalcooper reviewed Feb 28, 2023

oneflow/core/framework/sbp_infer_util_test.cpp

jackalcooper reviewed Feb 28, 2023

oneflow/user/ops/reshape_op.cpp (outdated)

jackalcooper approved these changes Feb 28, 2023

leaves-zwx added 3 commits February 28, 2023 17:13

change sort order

d3cf09d

use FilterNdSbpByLogicalShape

13dd1be

Merge branch 'master' into enlarge_reshape_sbp

70ae77e

leaves-zwx added the automerge label Feb 28, 2023

mergify bot merged commit 9f47744 into master Feb 28, 2023

mergify bot deleted the enlarge_reshape_sbp branch February 28, 2023 19:08
