-*- mode: outline -*-
* GEM/Stub
* Geometry
** Streamout
* Tessellation
** Support for output topologies and domains other than tris.
* Compute
Push constant support.
* Fixed function
** Depth buffers
HiZ.
** Stencil
** Blending
* Sampler
BC1 degenerate block support (see the palette sketch below).
Finish BC4 support.
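A minimal sketch of what the BC1 "degenerate" block mode means: when c0 <= c1
the palette has three colors plus transparent black instead of four
interpolated colors. Names and layout are illustrative, not the simulator's
actual sampler code.
#include <stdint.h>

struct rgba { uint8_t r, g, b, a; };

static struct rgba
unpack565(uint16_t c)
{
    struct rgba p = {
        .r = ((c >> 11) & 0x1f) * 255 / 31,
        .g = ((c >> 5) & 0x3f) * 255 / 63,
        .b = (c & 0x1f) * 255 / 31,
        .a = 255,
    };
    return p;
}

static void
bc1_palette(uint16_t c0, uint16_t c1, struct rgba pal[4])
{
    struct rgba a = unpack565(c0), b = unpack565(c1);

    pal[0] = a;
    pal[1] = b;
    if (c0 > c1) {
        /* Normal four-color block: two interpolated colors. */
        pal[2] = (struct rgba) { (2 * a.r + b.r) / 3, (2 * a.g + b.g) / 3,
                                 (2 * a.b + b.b) / 3, 255 };
        pal[3] = (struct rgba) { (a.r + 2 * b.r) / 3, (a.g + 2 * b.g) / 3,
                                 (a.b + 2 * b.b) / 3, 255 };
    } else {
        /* Degenerate block: one midpoint color, index 3 decodes to
         * transparent black. */
        pal[2] = (struct rgba) { (a.r + b.r) / 2, (a.g + b.g) / 2,
                                 (a.b + b.b) / 2, 255 };
        pal[3] = (struct rgba) { 0, 0, 0, 0 };
    }
}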
* Thread pool
Group primitives per tile (maybe a fixed-size per-tile queue), then run all
shaders for one tile on one CPU core. This should reduce contention and let
one CPU keep the entire RT tile (page) in cache the whole time.
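Rough sketch of the per-tile queue idea; the names (tile_queue,
MAX_PRIMS_PER_TILE, shade_prim) are made up, not existing code.
#include <stdint.h>

#define MAX_PRIMS_PER_TILE 256   /* fixed-size queue, an assumption */

struct tile_queue {
    uint32_t prim_index[MAX_PRIMS_PER_TILE];
    uint32_t count;
};

/* Binning pass: append each primitive to the queue of every tile its
 * bounding box touches (x0..x1, y0..y1 are tile coordinates). */
static void
bin_primitive(struct tile_queue *tiles, uint32_t tiles_per_row, uint32_t prim,
              uint32_t x0, uint32_t y0, uint32_t x1, uint32_t y1)
{
    for (uint32_t ty = y0; ty <= y1; ty++)
        for (uint32_t tx = x0; tx <= x1; tx++) {
            struct tile_queue *q = &tiles[ty * tiles_per_row + tx];
            if (q->count < MAX_PRIMS_PER_TILE)
                q->prim_index[q->count++] = prim;
        }
}

/* Shading pass: one worker owns one tile, so the RT tile stays in that
 * core's cache while all of its primitives are shaded. */
static void
shade_tile(struct tile_queue *q, void (*shade_prim)(uint32_t prim))
{
    for (uint32_t i = 0; i < q->count; i++)
        shade_prim(q->prim_index[i]);
}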
* JIT
** Cache programs if JITer gets too slow
** JIT in ps thread setup code
Maybe compile in the entire tile-level loop to avoid function pointer
dispatch per SIMD8 group. Also compile in RT write, blending and sRGB
conversion.
** Track constants per sub reg (use case: msg headers)
** Detect constant offset UBO loads
Since we JIT at 3DPRIMITIVE time, we know which buffers are bound when
we JIT, so if we see a constant offset load through the sampler or constant
cache (both read-only caches) we can just look up the constant and compile
it into the shader as a regular load.
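Hedged sketch of that lookup; kir_load, bound_buffer and the field names
are stand-ins, not the real KIR types.
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

struct kir_load {
    bool offset_is_imm;
    uint32_t buffer;        /* binding table index */
    uint32_t imm_offset;    /* byte offset, valid if offset_is_imm */
};

struct bound_buffer {
    const uint8_t *data;
    uint32_t size;
};

/* At 3DPRIMITIVE time the buffers are already bound, so a constant-offset
 * load through a read-only cache can be resolved now and the caller can
 * compile the value straight into the shader instead of emitting the
 * message. */
static bool
try_fold_constant_load(const struct kir_load *load,
                       const struct bound_buffer *buffers,
                       uint32_t *value_out)
{
    if (!load->offset_is_imm)
        return false;

    const struct bound_buffer *b = &buffers[load->buffer];
    if (load->imm_offset + 4 > b->size)
        return false;

    memcpy(value_out, b->data + load->imm_offset, 4);
    return true;
}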
** JIT to GL/Vulkan compute shaders
Lets us offload to GPU for better performance.
** Use ebx for the thread pointer to avoid push/pop of rdi
** Constant propagation to eliminate redundant imm loads
Each immediate src results in an imm kir instruction. Look for a
pre-existing reg that already holds the imm value.
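Small sketch of the imm-reuse idea, with invented names: a value-to-register
table the emitter checks before emitting another imm instruction.
#include <stdint.h>

#define NO_REG (-1)

struct imm_cache {
    uint32_t value[32];
    int reg[32];
    int count;
};

static int
imm_cache_lookup(const struct imm_cache *c, uint32_t value)
{
    for (int i = 0; i < c->count; i++)
        if (c->value[i] == value)
            return c->reg[i];
    return NO_REG;
}

static void
imm_cache_add(struct imm_cache *c, uint32_t value, int reg)
{
    if (c->count < 32) {
        c->value[c->count] = value;
        c->reg[c->count] = reg;
        c->count++;
    }
}

/* In the emitter: before emitting "rN = imm X", check the cache; if some
 * register already holds X, use it as the source instead. */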
* WM
** SIMD16 dispatch
** Perspective correct barycentric
** Multi-sample
** Lines, points
** Make tile iterator evaluate min w for 8 4x2 blocks at a time.
We don't need to compute exact barycentrics for each pixel, just
determine the min bary and see if we need to dispatch for the block. The
shader will compute per-pixel barycentrics by adding a per-edge vector
that's the delta for each pixel between the min bary and the pixel's bary.
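Sketch of the block-level test, scalar and with made-up names; the real
iterator would do this for 8 blocks at a time.
#include <stdbool.h>

/* Edge function w = a*x + b*y + c; float here for brevity. */
struct edge {
    float a, b, c;
};

/* For a w x h pixel block anchored at (x, y), the largest value the edge
 * can take inside the block is at the corner selected by the signs of a
 * and b. If even that corner is negative, no pixel of the block is inside
 * this edge and we can skip the block without per-pixel barycentrics. */
static bool
block_may_be_covered(const struct edge *e, int x, int y, int w, int h)
{
    float cx = e->a >= 0.0f ? x + w - 1 : x;
    float cy = e->b >= 0.0f ? y + h - 1 : y;

    return e->a * cx + e->b * cy + e->c >= 0.0f;
}

/* If the block is dispatched, the shader reconstructs each pixel's value
 * by adding the per-pixel delta (i*a + j*b) to the block corner value
 * instead of evaluating the full edge per pixel. */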
** Instrument rasterizer to get stats
How many empty 4x2 groups per tile, for example.
* EU
** Use immediate AVX2 shift when EU operand is an immediate
** All the instructions
** Indirect addressing
** Control flow
** Write masks
We can ignore write masks outside control flow; inside control flow we can
use an AVX2 blend to implement write masking.
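Minimal sketch of the blend: the execution mask selects, per channel,
between the old destination value and the newly computed one. Assumes the
mask is all-ones in enabled channels.
#include <immintrin.h>

static __m256
masked_write(__m256 old_value, __m256 new_value, __m256 exec_mask)
{
    /* blendv picks new_value where the mask's sign bit is set,
     * old_value elsewhere. */
    return _mm256_blendv_ps(old_value, new_value, exec_mask);
}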
* Misc
** Hybrid HW passthrough mode.
Run on HW for a while, then switch to the simulator when something
triggers. Switch back afterwards.
* KIR
** Detect same vb strides
Only compute each vid * stride once, so that buffers with the same
stride don't each repeat the same computation.
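Sketch with invented names: at JIT time, keep a small map from vertex
buffer stride to the KIR register that already holds vid * stride, so
buffers sharing a stride reuse one multiply.
#include <stdint.h>

struct stride_reg_map {
    uint32_t stride[8];
    int reg[8];
    int count;
};

static int
reg_for_stride(struct stride_reg_map *map, uint32_t stride,
               int (*emit_vid_mul)(uint32_t stride))
{
    for (int i = 0; i < map->count; i++)
        if (map->stride[i] == stride)
            return map->reg[i];

    int reg = emit_vid_mul(stride);   /* emit the multiply only once */
    if (map->count < 8) {
        map->stride[map->count] = stride;
        map->reg[map->count] = reg;
        map->count++;
    }
    return reg;
}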
** Spill registers that hold regions or immediates
When choosing a register to spill, try to find one that holds an
immediate and skip the spill store; make unspill just reload
(rematerialize) the value. Can be done for regions too, but that needs
analysis to determine that the region is unchanged. Maybe rewrite grf
access to be SSA?
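Sketch of the spill-candidate choice, with made-up types: prefer a register
whose value is a known immediate, since then no spill store is needed and
"unspill" is just re-emitting the immediate load.
#include <stdbool.h>
#include <stdint.h>

struct ra_reg {
    bool live;
    bool holds_imm;
    uint32_t imm;
};

static int
pick_spill(const struct ra_reg *regs, int num_regs, int fallback)
{
    for (int r = 0; r < num_regs; r++)
        if (regs[r].live && regs[r].holds_imm)
            return r;   /* rematerialize on unspill, no store emitted */

    return fallback;    /* usual furthest-next-use choice */
}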
** Better region xfer copy prop
When we compile in code to copy constants into the shader regs we get
code like:
# --- code emit
# load attribute deltas
r0 = load_region g236.0<8,8,1>4 vmovdqa 0x1d80(%rdi),%ymm0
store_region r0, g2.0<8,8,1>4 vmovdqa %ymm0,0x40(%rdi)
r1 = load_region g237.0<8,8,1>4 vmovdqa 0x1da0(%rdi),%ymm1
store_region r1, g3.0<8,8,1>4 vmovdqa %ymm1,0x60(%rdi)
# eu ps
r2 = load_region g2.12<0,1,0>4 vpbroadcastd 0x4c(%rdi),%ymm2
store_region r2, g124.0<8,8,1>4 vmovdqa %ymm2,0xf80(%rdi)
r3 = load_region g2.28<0,1,0>4 vpbroadcastd 0x5c(%rdi),%ymm3
store_region r3, g125.0<8,8,1>4 vmovdqa %ymm3,0xfa0(%rdi)
r4 = load_region g3.12<0,1,0>4 vpbroadcastd 0x6c(%rdi),%ymm4
store_region r4, g126.0<8,8,1>4 vmovdqa %ymm4,0xfc0(%rdi)
r5 = load_region g3.28<0,1,0>4 vpbroadcastd 0x7c(%rdi),%ymm5
store_region r5, g127.0<8,8,1>4 vmovdqa %ymm5,0xfe0(%rdi)
send src g124-g127 lea -0x144b(%rip),%rsi # 0x0000000000000060
push %rdi
callq 0xffffffffff741749
pop %rdi
eot retq
This inlines and unrolls the copy, but since we load uniforms from the
constants, the regions don't match and we don't propagate into the
later loads. It would be even better if we could rewrite the loads
from, e.g., g2.12<0,1,0>4 to load directly from g236.12<0,1,0>4 and avoid
the intermediate stores.
** Pre-populate registers with common loads
For example, we have a bit of masking and shifting to generate the frag
coord header in grf1, and then an awkward region load to load it in a
different order. We can pre-load a register with the fragcoord region
in a more efficient way (construct it from x and y) and alias the
fragcoord region to it. The same goes for uniform loads from attribute
deltas.
** Create fragcoord header in PS
This allows us to optimize it out for the shaders (most of them) that
don't use fragcoord.
** Maintain pointer to current pixel for rt0
Or maybe just an offset from the surface base. Updating the offset
incrementally is a lot less code than computing a y tile offset from
scratch for every pixel.
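Hedged sketch of the two alternatives, assuming standard Y-tiling
(128 byte x 32 row tiles made of 16 byte columns) and illustrative names.
#include <stdint.h>

static uint32_t
ytile_offset(uint32_t x_bytes, uint32_t y, uint32_t stride_bytes)
{
    uint32_t tile_x = x_bytes / 128, tile_y = y / 32;
    uint32_t tile_base = (tile_y * (stride_bytes / 128) + tile_x) * 4096;
    uint32_t cx = x_bytes & 127, cy = y & 31;

    return tile_base + (cx / 16) * 512 + cy * 16 + (cx & 15);
}

/* The point of the note: instead of calling ytile_offset() per pixel, keep
 * a running offset from the surface base and nudge it as the iterator
 * steps: +16 for the next row inside a 16 byte column, +512 for the next
 * column, and only recompute from scratch when crossing into a new 4KB
 * tile. */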
** Move instructions to shorten live ranges
** Combine AVX2 store or load with alu