Switch from gauge to tasty-bench #100

Merged 1 commit into haskell:master on Apr 15, 2021
Conversation

@Bodigrim (Contributor)

gauge still cannot be compiled with GHC 9.0 because of basement, and it will be broken once again by sized primitives in GHC 9.2. Switching to tasty-bench allows running the benchmarks against GHC 9.0 and 9.2, which reveals a pretty gruesome picture.

Baseline (GHC 8.10.4):
cabal bench -w ghc-8.10.4 --ghc-options '-fproc-alignment=64' --benchmark-options '--csv 8.10.4.csv --hide-successes' random:bench

1. GHC 8.10.4 vs. GHC 9.0.1:
cabal bench -w ghc-9.0.1  --ghc-options '-fproc-alignment=64' --benchmark-options '--baseline 8.10.4.csv --csv 9.0.1.csv --hide-successes --fail-if-slower 50' random:bench
All
  pure
    uniformR
      full
        Word:                       FAIL (0.26s)
          488 μs ±  22 μs, 879% slower than baseline
        Int:                        FAIL (0.13s)
          485 μs ±  43 μs, 885% slower than baseline
        Char:                       FAIL (0.18s)
           25 ms ± 1.7 ms, 8733% slower than baseline
      excludeMax
        Char:                       FAIL (0.39s)
           25 ms ± 2.2 ms, 6210% slower than baseline
      includeHalf
        Char:                       FAIL (0.18s)
           26 ms ± 1.6 ms, 7748% slower than baseline
      floating
        St
          uniformFloat01M:          FAIL (0.16s)
            623 μs ±  45 μs, 1169% slower than baseline
          uniformFloatPositive01M:  FAIL (0.33s)
            642 μs ±  22 μs, 1206% slower than baseline
          uniformDouble01M:         FAIL (0.17s)
            626 μs ±  44 μs, 1157% slower than baseline
          uniformDoublePositive01M: FAIL (0.17s)
            631 μs ±  46 μs, 1177% slower than baseline

9 out of 40 tests failed (9.74s)
2. GHC 8.10.4 vs. GHC 9.2.0 alpha:
cabal bench -w ghc-9.2.0.20210331 --allow-newer='split:base,splitmix:base,tagged:template-haskell' --ghc-options '-fproc-alignment=64' --benchmark-options '--baseline 8.10.4.csv --csv 9.2.0.csv --hide-successes --fail-if-slower 50' random:bench
All
  pure
    uniformR
      full
        Word:                       FAIL (0.24s)
          488 μs ±  38 μs, 879% slower than baseline
        Int:                        FAIL (0.11s)
          485 μs ±  47 μs, 886% slower than baseline
        Char:                       FAIL (0.26s)
           37 ms ± 2.1 ms, 13077% slower than baseline
      excludeMax
        Char:                       FAIL (0.25s)
           37 ms ± 1.7 ms, 9097% slower than baseline
      includeHalf
        Char:                       FAIL (0.27s)
           38 ms ± 3.1 ms, 11484% slower than baseline
      floating
        IO
          uniformFloatPositive01M:  FAIL (0.19s)
             27 ms ± 1.5 ms, 55090% slower than baseline
          uniformDoublePositive01M: FAIL (0.18s)
             26 ms ± 1.3 ms, 52708% slower than baseline
        St
          uniformFloat01M:          FAIL (0.17s)
            645 μs ±  52 μs, 1216% slower than baseline
          uniformFloatPositive01M:  FAIL (0.14s)
             19 ms ± 1.7 ms, 38263% slower than baseline
          uniformDouble01M:         FAIL (0.17s)
            646 μs ±  58 μs, 1197% slower than baseline
          uniformDoublePositive01M: FAIL (0.27s)
             17 ms ± 882 μs, 35324% slower than baseline
        pure
          uniformFloatPositive01M:  FAIL (0.14s)
             19 ms ± 1.8 ms, 38854% slower than baseline
          uniformDoublePositive01M: FAIL (0.12s)
             17 ms ± 1.5 ms, 34514% slower than baseline

13 out of 40 tests failed (10.43s)

It seems that inlining has changed significantly in GHC 9.0 (e.g., adding {-# INLINE unbiasedWordMult32RM #-} fixes a couple of regressions). I intend to relay this data to the GHC team once the branch is merged.
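For illustration, this is the shape of that kind of change: an INLINE pragma attached to a small helper so GHC keeps inlining it at call sites in other modules. The helper below is a made-up stand-in, not the actual unbiasedWordMult32RM from the library.

{-# INLINE clampedStep #-}
-- Hypothetical helper: without the pragma, GHC 9.0 may leave it
-- out-of-line at cross-module call sites, losing the specialisation
-- that made the 8.10 numbers fast.
clampedStep :: Word -> Word -> Word
clampedStep bound x =
  (x * 6364136223846793005 + 1442695040888963407) `rem` max 1 bound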

@lehins self-requested a review on Apr 15, 2021, 00:09
@lehins (Contributor) left a comment

PR looks good. The benchmark results, however, look very concerning.

@Bodigrim I'll merge it tomorrow, just in case you think of last-minute changes.

@idontgetoutmuch (Member)

@Bodigrim great work finding this before it's too late (I hope it's not too late).

@Shimuuar (Contributor)

@Bodigrim great (and scary) find. It also means that the optimizations aren't very robust.

@lehins merged commit ac3fbbb into haskell:master on Apr 15, 2021
@Bodigrim (Contributor, Author)

There is something weird going on. I'll be off for several days, so I'll just dump my observations here.

If I run the pure/uniformR/full/CUShort benchmark on my machine with GHC 8.10 (both with gauge and with tasty-bench), I see that 100000 random numbers are generated in around 30 microseconds. That means 3 random numbers per nanosecond, which is way too fast, right?

If I scrap everything else except

main :: IO ()
main = do
  let !sz = 100000
  defaultMain
    [ bgroup "pure"
      [ bgroup "uniformR"
        [ bgroup "full"
          [ pureUniformRFullBench (Proxy :: Proxy CUShort) sz
          ]
        ]
      ]
    ]

and look at the generated Core, there is no random number generation at all. The main routine looks like this:

main_$s$wgo
  :: State# RealWorld -> Int# -> Int -> (# State# RealWorld, () #)
main_$s$wgo
  = \ (sc_s7ND :: State# RealWorld)
      (sc1_s7NC :: Int#)
      (sc2_s7NB :: Int) ->
      case <=# sc1_s7NC 0# of {
        __DEFAULT ->
          case seq#
                 (case sc2_s7NB of { I# ww1_s7FD ->
                  joinrec {
                    $wgo_s7Fz :: Int# -> ()
                    $wgo_s7Fz (ww2_s7Fx :: Int#)
                      = case <# ww2_s7Fx ww1_s7FD of {
                          __DEFAULT -> ();
                          1# -> jump $wgo_s7Fz (+# ww2_s7Fx 1#)
                        }; } in
                  jump $wgo_s7Fz 0#
                  })
                 sc_s7ND
          of
          { (# ipv_a7cy, ipv1_a7cz #) ->
          main_$s$wgo ipv_a7cy (-# sc1_s7NC 1#) sc2_s7NB
          };
        1# -> (# sc_s7ND, () #)
      }

which is just an empty loop.

@Bodigrim deleted the tasty-bench branch on Apr 15, 2021, 19:34
@lehins (Contributor) commented Apr 15, 2021

@Bodigrim Yep, really good catch. All those benchmarks turned out to be bogus. GHC was "smart enough" to get rid of the "unneeded" computation, so all the benchmarks were actually measuring was the performance of the loop itself.

I'll have a fix for the suite later on today.
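As a sketch of the kind of fix this implies (not the actual change from #101; the names and the summing trick below are purely illustrative), the benchmarked expression can be made to depend on every generated value and forced with nf, so GHC cannot discard the generation loop:

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Random (mkStdGen, uniformR)
import Test.Tasty.Bench (bench, bgroup, defaultMain, nf)

-- Fold all generated values into one Word: the result now depends on
-- every call to uniformR, so the generator cannot be optimised away.
sumUniformR :: Int -> Word
sumUniformR n = go n 0 (mkStdGen 2021)
  where
    go 0 !acc _ = acc
    go k !acc g =
      let (w, g') = uniformR (minBound, maxBound) g
       in go (k - 1) (acc + w) g'

main :: IO ()
main = defaultMain
  [ bgroup "pure/uniformR/full"
      [ bench "Word" $ nf sumUniformR 100000 ]
  ]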

@lehins (Contributor) commented Apr 16, 2021

Fix for benchmarks and the major regression: #101

@Bodigrim mentioned this pull request on Apr 28, 2021