-*- mode: outline -*-
* GEM/Stub
* Geometry
** Streamout
* Tessellation
** Support for output topologies and domains other than tris.
* Compute
Push constant support.
* Fixed function
** Depth buffers
HiZ.
** Stencil
** Blending
* Sampler
BC1 degenerate block support (see the palette sketch below).
Finish BC4 support.
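A minimal sketch of what the BC1 "degenerate" block mode means: when c0 <= c1
the palette has three colors plus transparent black instead of four
interpolated colors. Names and layout are illustrative, not the simulator's
actual sampler code.
#include <stdint.h>

struct rgba { uint8_t r, g, b, a; };

static struct rgba
unpack565(uint16_t c)
{
    struct rgba p = {
        .r = ((c >> 11) & 0x1f) * 255 / 31,
        .g = ((c >> 5) & 0x3f) * 255 / 63,
        .b = (c & 0x1f) * 255 / 31,
        .a = 255,
    };
    return p;
}

static void
bc1_palette(uint16_t c0, uint16_t c1, struct rgba pal[4])
{
    struct rgba a = unpack565(c0), b = unpack565(c1);

    pal[0] = a;
    pal[1] = b;
    if (c0 > c1) {
        /* Normal four-color block: two interpolated colors. */
        pal[2] = (struct rgba) { (2 * a.r + b.r) / 3, (2 * a.g + b.g) / 3,
                                 (2 * a.b + b.b) / 3, 255 };
        pal[3] = (struct rgba) { (a.r + 2 * b.r) / 3, (a.g + 2 * b.g) / 3,
                                 (a.b + 2 * b.b) / 3, 255 };
    } else {
        /* Degenerate block: one midpoint color, index 3 decodes to
         * transparent black. */
        pal[2] = (struct rgba) { (a.r + b.r) / 2, (a.g + b.g) / 2,
                                 (a.b + b.b) / 2, 255 };
        pal[3] = (struct rgba) { 0, 0, 0, 0 };
    }
}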
* Thread pool
Group primitives per tile (maybe a fixed-size per-tile queue), then run all
shaders for one tile on one CPU core. This should reduce contention and let
one CPU keep the entire RT tile (page) in cache the whole time.
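Rough sketch of the per-tile queue idea; the names (tile_queue,
MAX_PRIMS_PER_TILE, shade_prim) are made up, not existing code.
#include <stdint.h>

#define MAX_PRIMS_PER_TILE 256   /* fixed-size queue, an assumption */

struct tile_queue {
    uint32_t prim_index[MAX_PRIMS_PER_TILE];
    uint32_t count;
};

/* Binning pass: append each primitive to the queue of every tile its
 * bounding box touches (x0..x1, y0..y1 are tile coordinates). */
static void
bin_primitive(struct tile_queue *tiles, uint32_t tiles_per_row, uint32_t prim,
              uint32_t x0, uint32_t y0, uint32_t x1, uint32_t y1)
{
    for (uint32_t ty = y0; ty <= y1; ty++)
        for (uint32_t tx = x0; tx <= x1; tx++) {
            struct tile_queue *q = &tiles[ty * tiles_per_row + tx];
            if (q->count < MAX_PRIMS_PER_TILE)
                q->prim_index[q->count++] = prim;
        }
}

/* Shading pass: one worker owns one tile, so the RT tile stays in that
 * core's cache while all of its primitives are shaded. */
static void
shade_tile(struct tile_queue *q, void (*shade_prim)(uint32_t prim))
{
    for (uint32_t i = 0; i < q->count; i++)
        shade_prim(q->prim_index[i]);
}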
* JIT
** Cache programs if JITer gets too slow
** JIT in ps thread setup code
Maybe compile in the entire tile-level loop to avoid function pointer
dispatch per SIMD8 group. Also compile in RT write, blending and sRGB
conversion.
** Track constants per sub reg (use case: msg headers)
** Detect constant offset UBO loads
Since we JIT at 3DPRIMITIVE time, we know which buffers are bound when
we JIT, so if we see a constant offset load through the sampler or constant
cache (both read-only caches) we can just look up the constant and compile
it into the shader as a regular load.
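Hedged sketch of that lookup; kir_load, bound_buffer and the field names
are stand-ins, not the real KIR types.
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

struct kir_load {
    bool offset_is_imm;
    uint32_t buffer;        /* binding table index */
    uint32_t imm_offset;    /* byte offset, valid if offset_is_imm */
};

struct bound_buffer {
    const uint8_t *data;
    uint32_t size;
};

/* At 3DPRIMITIVE time the buffers are already bound, so a constant-offset
 * load through a read-only cache can be resolved now and the caller can
 * compile the value straight into the shader instead of emitting the
 * message. */
static bool
try_fold_constant_load(const struct kir_load *load,
                       const struct bound_buffer *buffers,
                       uint32_t *value_out)
{
    if (!load->offset_is_imm)
        return false;

    const struct bound_buffer *b = &buffers[load->buffer];
    if (load->imm_offset + 4 > b->size)
        return false;

    memcpy(value_out, b->data + load->imm_offset, 4);
    return true;
}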
** JIT to GL/Vulkan compute shaders
Lets us offload to GPU for better performance.
** Use ebx for the thread pointer to avoid push/pop of rdi
** Constant propagation to eliminate redundant imm loads
Each immediate src results in an imm kir instruction. Look for a
pre-existing reg that already holds the imm value.
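Small sketch of the imm-reuse idea, with invented names: a value-to-register
table the emitter checks before emitting another imm instruction.
#include <stdint.h>

#define NO_REG (-1)

struct imm_cache {
    uint32_t value[32];
    int reg[32];
    int count;
};

static int
imm_cache_lookup(const struct imm_cache *c, uint32_t value)
{
    for (int i = 0; i < c->count; i++)
        if (c->value[i] == value)
            return c->reg[i];
    return NO_REG;
}

static void
imm_cache_add(struct imm_cache *c, uint32_t value, int reg)
{
    if (c->count < 32) {
        c->value[c->count] = value;
        c->reg[c->count] = reg;
        c->count++;
    }
}

/* In the emitter: before emitting "rN = imm X", check the cache; if some
 * register already holds X, use it as the source instead. */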
* WM
** SIMD16 dispatch
** Perspective correct barycentric
** Multi-sample
** Lines, points
** Make tile iterator evaluate min w for 8 4x2 blocks at a time.
We don't need to compute exact barycentrics for each pixel, just
determine the min bary and see if we need to dispatch for the block. The
shader will compute per-pixel barycentrics by adding a per-edge vector
that's the delta for each pixel between the min bary and the pixel's bary.
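Sketch of the block-level test, scalar and with made-up names; the real
iterator would do this for 8 blocks at a time.
#include <stdbool.h>

/* Edge function w = a*x + b*y + c; float here for brevity. */
struct edge {
    float a, b, c;
};

/* For a w x h pixel block anchored at (x, y), the largest value the edge
 * can take inside the block is at the corner selected by the signs of a
 * and b. If even that corner is negative, no pixel of the block is inside
 * this edge and we can skip the block without per-pixel barycentrics. */
static bool
block_may_be_covered(const struct edge *e, int x, int y, int w, int h)
{
    float cx = e->a >= 0.0f ? x + w - 1 : x;
    float cy = e->b >= 0.0f ? y + h - 1 : y;

    return e->a * cx + e->b * cy + e->c >= 0.0f;
}

/* If the block is dispatched, the shader reconstructs each pixel's value
 * by adding the per-pixel delta (i*a + j*b) to the block corner value
 * instead of evaluating the full edge per pixel. */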
** Instrument rasterizer to get stats
How many empty 4x2 groups per tile, for example.
* EU
** Use immediate AVX2 shift when EU operand is an immediate
** All the instructions
** Indirect addressing
** Control flow
** Write masks
We can ignore write masks outside control flow; inside control flow we can
use an AVX2 blend to implement write masking.
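Minimal sketch of the blend: the execution mask selects, per channel,
between the old destination value and the newly computed one. Assumes the
mask is all-ones in enabled channels.
#include <immintrin.h>

static __m256
masked_write(__m256 old_value, __m256 new_value, __m256 exec_mask)
{
    /* blendv picks new_value where the mask's sign bit is set,
     * old_value elsewhere. */
    return _mm256_blendv_ps(old_value, new_value, exec_mask);
}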
* Misc
** Hybrid HW passthrough mode.
Run on HW for a while, then switch to the simulator when something
triggers. Switch back afterwards.
* KIR
** Detect same vb strides
Only compute each vid * stride once, so that buffers with the same
stride don't each repeat the same computation.
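Sketch with invented names: at JIT time, keep a small map from vertex
buffer stride to the KIR register that already holds vid * stride, so
buffers sharing a stride reuse one multiply.
#include <stdint.h>

struct stride_reg_map {
    uint32_t stride[8];
    int reg[8];
    int count;
};

static int
reg_for_stride(struct stride_reg_map *map, uint32_t stride,
               int (*emit_vid_mul)(uint32_t stride))
{
    for (int i = 0; i < map->count; i++)
        if (map->stride[i] == stride)
            return map->reg[i];

    int reg = emit_vid_mul(stride);   /* emit the multiply only once */
    if (map->count < 8) {
        map->stride[map->count] = stride;
        map->reg[map->count] = reg;
        map->count++;
    }
    return reg;
}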
** Spill registers that hold regions or immediates
When choosing a register to spill, try to find one that holds an
immediate and skip the spill store; make unspill just reload
(rematerialize) the value. Can be done for regions too, but that needs
analysis to determine that the region is unchanged. Maybe rewrite grf
access to be SSA?
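Sketch of the spill-candidate choice, with made-up types: prefer a register
whose value is a known immediate, since then no spill store is needed and
"unspill" is just re-emitting the immediate load.
#include <stdbool.h>
#include <stdint.h>

struct ra_reg {
    bool live;
    bool holds_imm;
    uint32_t imm;
};

static int
pick_spill(const struct ra_reg *regs, int num_regs, int fallback)
{
    for (int r = 0; r < num_regs; r++)
        if (regs[r].live && regs[r].holds_imm)
            return r;   /* rematerialize on unspill, no store emitted */

    return fallback;    /* usual furthest-next-use choice */
}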
** Better region xfer copy prop
When we compile in code to copy constants into the shader regs we get
code like:
# --- code emit
# load attribute deltas
r0 = load_region g236.0<8,8,1>4 vmovdqa 0x1d80(%rdi),%ymm0
store_region r0, g2.0<8,8,1>4 vmovdqa %ymm0,0x40(%rdi)
r1 = load_region g237.0<8,8,1>4 vmovdqa 0x1da0(%rdi),%ymm1
store_region r1, g3.0<8,8,1>4 vmovdqa %ymm1,0x60(%rdi)
# eu ps
r2 = load_region g2.12<0,1,0>4 vpbroadcastd 0x4c(%rdi),%ymm2
store_region r2, g124.0<8,8,1>4 vmovdqa %ymm2,0xf80(%rdi)
r3 = load_region g2.28<0,1,0>4 vpbroadcastd 0x5c(%rdi),%ymm3
store_region r3, g125.0<8,8,1>4 vmovdqa %ymm3,0xfa0(%rdi)
r4 = load_region g3.12<0,1,0>4 vpbroadcastd 0x6c(%rdi),%ymm4
store_region r4, g126.0<8,8,1>4 vmovdqa %ymm4,0xfc0(%rdi)
r5 = load_region g3.28<0,1,0>4 vpbroadcastd 0x7c(%rdi),%ymm5
store_region r5, g127.0<8,8,1>4 vmovdqa %ymm5,0xfe0(%rdi)
send src g124-g127 lea -0x144b(%rip),%rsi # 0x0000000000000060
push %rdi
callq 0xffffffffff741749
pop %rdi
eot retq
This inlines and unrolls the copy, but since we load uniforms from the
constants, the regions don't match and we don't propagate into the
later loads. It would be even better if we could rewrite the loads
from, e.g., g2.12<0,1,0>4 to load directly from g236.12<0,1,0>4 and avoid
the intermediate stores.
** Pre-populate registers with common loads
For example, we have a bit of masking and shifting to generate the frag
coord header in grf1, and then an awkward region load to load it in a
different order. We can pre-load a register with the fragcoord region
in a more efficient way (construct it from x and y) and alias the
fragcoord region to it. The same goes for uniform loads from attribute
deltas.
** Create fragcoord header in PS
This allows us to optimize it out for the shaders (most of them) that
don't use fragcoord.
** Maintain pointer to current pixel for rt0
Or maybe just an offset from the surface base. Updating the offset
incrementally is a lot less code than computing a y tile offset from
scratch for every pixel.
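Hedged sketch of the two alternatives, assuming standard Y-tiling
(128 byte x 32 row tiles made of 16 byte columns) and illustrative names.
#include <stdint.h>

static uint32_t
ytile_offset(uint32_t x_bytes, uint32_t y, uint32_t stride_bytes)
{
    uint32_t tile_x = x_bytes / 128, tile_y = y / 32;
    uint32_t tile_base = (tile_y * (stride_bytes / 128) + tile_x) * 4096;
    uint32_t cx = x_bytes & 127, cy = y & 31;

    return tile_base + (cx / 16) * 512 + cy * 16 + (cx & 15);
}

/* The point of the note: instead of calling ytile_offset() per pixel, keep
 * a running offset from the surface base and nudge it as the iterator
 * steps: +16 for the next row inside a 16 byte column, +512 for the next
 * column, and only recompute from scratch when crossing into a new 4KB
 * tile. */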
** Move instructions to shorten live ranges
** Combine AVX2 store or load with alu