
Optimize compute quotient polys #198

Open
eigmax opened this issue Dec 24, 2024 · 2 comments
Comments


eigmax commented Dec 24, 2024

Description:

Compute the quotient polynomials $\left(\sum_i \alpha^i C_i(x)\right) / Z_H(x)$ for each $\alpha$ in `alphas`, where the $C_i$ are the STARK constraints. The current implementation is at compute_quotient_polys.

Solution: inspired by Plonky3.

    A quick rundown of the optimizations in this function:
    We are trying to compute sum_i alpha^i * (p(X) - y)/(X - z),
    for each z an opening point, y = p(z). Each p(X) is given as evaluations in
    bit-reversed order in the columns of the matrices. y is computed by
    barycentric interpolation. X and p(X) are in the base field; alpha, y and z
    are in the extension. The primary goal is to minimize extension multiplications.

    - Instead of computing all alpha^i, we just compute alpha^i for i up to the
      largest width of a matrix, then multiply by an "alpha offset" when accumulating.
          a^0 x0 + a^1 x1 + a^2 x2 + a^3 x3 + ...
        = a^0 ( a^0 x0 + a^1 x1 ) + a^2 ( a^0 x2 + a^1 x3 ) + ...
      (see `alpha_pows`, `alpha_pow_offset`, `num_reduced`)

    - For each unique point z, we precompute 1/(X-z) for the largest subgroup
      opened at this point. Since we compute it in bit-reversed order, smaller
      subgroups can simply truncate the vector.
      (see `inv_denoms`)

    - Then, for each matrix (with columns p_i) and opening point z, we want:
        for each row (corresponding to subgroup element X):
            reduced[X] += alpha_offset * sum_i [ alpha^i * inv_denom[X] * (p_i[X] - y[i]) ]

      We can factor out inv_denom, and expand what's left:
          reduced[X] += alpha_offset * inv_denom[X] * sum_i [ alpha^i * p_i[X] - alpha^i * y[i] ]

      And separate the sum:
          reduced[X] += alpha_offset * inv_denom[X] * [ sum_i [ alpha^i * p_i[X] ] - sum_i [ alpha^i * y[i] ] ]

      And now the last sum doesn't depend on X, so we can precompute that for the
      matrix, too. So the hot loop (that depends on both X and i) is just:
          sum_i [ alpha^i * p_i[X] ]

      with alpha^i an extension, p_i[X] a base.
    

eigmax commented Dec 24, 2024

The current time-cost distribution on GPU is as follows; note that `compute quotient polys` dominates the per-STARK proving time:

[2024-12-20T09:48:12Z INFO  plonky2::util::timing] 26.5788s to prove
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | 4.7007s to generate all traces
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | 0.3288s to simulate CPU
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | 1.9611s to convert trace data to tables
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.1851s to generate arithmetic trace
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.1237s to generate Poseidon trace
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | 0.0549s to generate trace rows
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | 0.0688s to convert to PolynomialValues
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.0898s to generate Poseidon sponge trace
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | 0.0516s to generate trace rows
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | 0.0382s to convert to PolynomialValues
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.0675s to generate logic trace
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | 0.0039s to generate trace rows
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | 0.0636s to convert to PolynomialValues
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.5607s to generate memory trace
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | 0.2716s to generate trace rows
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | 1.8853s to compute all trace commitments
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | 0.0883s to compute trace commitment for Arithmetic
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | 0.9873s to compute trace commitment for Cpu
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | 0.1016s to compute trace commitment for Poseidon
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | 0.0437s to compute trace commitment for PoseidonSponge
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | 0.0532s to compute trace commitment for Logic
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | 0.6112s to compute trace commitment for Memory
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | 1.3649s to compute CTL data
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | 18.6209s to compute all proofs given commitments
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | 7.1779s to prove Arithmetic STARK
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.2036s to compute lookup helper columns
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.0574s to compute auxiliary polynomials commitment
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 6.5307s to compute quotient polys
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.0038s to split quotient polys
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.0343s to compute quotient commitment
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.2834s to compute openings proof
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | 0.0840s to final_poly
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | 0.1011s to FFT
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | 0.0982s to fri
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | | 0.0903s to fold codewords in the commitment phase
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | | | 0.0072s to find proof-of-work witness
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | 3.0891s to prove CPU STARK
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.0000s to compute lookup helper columns
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.0925s to compute auxiliary polynomials commitment
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 2.0115s to compute quotient polys
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.0027s to split quotient polys
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.0423s to compute quotient commitment
[2024-12-20T09:48:12Z INFO  plonky2::util::timing] | | | 0.6961s to compute openings


VanhGer commented Jan 2, 2025

Inspired by Plonky3, we aim to optimize the way we compute $\sum \alpha^i C_i(x)$.
In our current implementation, we combine multiple constraint evaluations at a point in the domain as follows:

/// Add one constraint on all rows.
pub fn constraint(&mut self, constraint: P) {
    // Horner-style accumulation, one accumulator per challenge alpha:
    // acc <- acc * alpha + constraint.
    for (&alpha, acc) in self.alphas.iter().zip(&mut self.constraint_accs) {
        *acc *= alpha;
        *acc += constraint;
    }
}

We define $c_0, c_1, \dots, c_{n-1}$ as the $n$ constraint evaluations that need to be combined into a single value. For each challenge $\alpha$, the implementation above combines them as
$c_0 \cdot \alpha^{n-1} + c_1 \cdot \alpha^{n-2} + \dots + c_{n-1} \cdot \alpha^0$.
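Unrolling the loop for a single challenge $\alpha$ (a small derivation, using the $c_k$ defined above) shows why the accumulator ends at that closed form:

$$\begin{aligned}
\mathrm{acc}_0 &= c_0,\\
\mathrm{acc}_k &= \alpha \cdot \mathrm{acc}_{k-1} + c_k \quad (1 \le k \le n-1),\\
\mathrm{acc}_{n-1} &= c_0\,\alpha^{n-1} + c_1\,\alpha^{n-2} + \dots + c_{n-1}\,\alpha^0.
\end{aligned}$$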

The optimization idea is as follows:

[Screenshot from 2025-01-02 14-05-29: a diagram of the proposed precomputation, expressing each product $A \cdot a$ as a sum of shifted, precomputed multiples of $A$ indexed by the digits of $a$.]

Here, $A$ stands for $\alpha^i$ and $a$ represents a constraint evaluation; a rough sketch of the scheme follows the list below. However, this approach runs into several problems:

  • It is hard to express $a$ as a $k$-bit tuple efficiently; we cannot shift it directly, especially when working in an extension field.
  • We currently use the 64-bit Goldilocks prime. This optimization turns one 64-bit multiplication into 8 64-bit additions, but that is likely a loss on a modern CPU: 64-bit multiplication is already fast, and performing 8 additions while reading from memory 8 times is probably slower.
  • Values are stored in Montgomery form, which would require extra transformations to implement the precomputation.
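For concreteness, here is a rough sketch of the scheme we evaluated, assuming $a$ is split into 8 bytes and, for a fixed $A = \alpha^i$, tables $t_j[b] = A \cdot b \cdot 2^{8j} \bmod p$ are precomputed. All names here are hypothetical illustrations, and Montgomery form is deliberately ignored:

```rust
// Sketch: replace one 64-bit modular multiplication A * a with
// 8 table lookups + 8 modular additions, via byte decomposition of a.

const P: u64 = 0xFFFF_FFFF_0000_0001; // Goldilocks prime

fn add_mod(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % (P as u128)) as u64
}

fn mul_mod(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % (P as u128)) as u64
}

/// Precompute t[j][b] = A * b * 2^(8j) mod P for every byte value b.
fn build_tables(a_big: u64) -> Vec<[u64; 256]> {
    let mut tables = Vec::with_capacity(8);
    for j in 0..8 {
        let base = mul_mod(a_big, 1u64 << (8 * j)); // A * 2^(8j)
        let mut t = [0u64; 256];
        for b in 1..256 {
            t[b] = add_mod(t[b - 1], base); // t[b] = A * b * 2^(8j)
        }
        tables.push(t);
    }
    tables
}

/// A * a via 8 lookups + 8 additions instead of one multiplication.
fn mul_via_tables(tables: &[[u64; 256]], a: u64) -> u64 {
    let mut acc = 0u64;
    for j in 0..8 {
        let byte = ((a >> (8 * j)) & 0xFF) as usize;
        acc = add_mod(acc, tables[j][byte]);
    }
    acc
}

fn main() {
    let (a_big, a) = (123_456_789_u64, 987_654_321_u64);
    let tables = build_tables(a_big);
    assert_eq!(mul_via_tables(&tables, a), mul_mod(a_big, a));
}
```

Each fixed $A$ needs $8 \times 256 \times 8 = 16$ KiB of tables, and a distinct table set per $\alpha^i$, so the scheme trades cheap multiplications for memory traffic, which is exactly the second objection above.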

Given all of this, the idea appears complex and unlikely to bring any benefit.
