perf

Build
Status

Low-level performance measurement for haskell using the rdtsc register on x86.

libraries

perf

Core functionality.

Hackage lts nightly

perf-analysis

Analysis using perf. Code for the benchmark runs can be found in perf-analysis/examples/examples.hs. To create this readme locally run:

stack build --test --exec "$(stack path --local-install-root)/bin/perf-examples" --exec "$(stack path --local-bin)/pandoc -f markdown -i other/readme_.md -t markdown -o readme.md --filter pandoc-include --mathjax"

benchmarks

Benchmarks are based on:

number of runs:         1.0e3
accumulate to:          1.0e3
function:               foldl' (+) 0

1 cycle = 0.38 𝛈s (Based on my 2.6GHz machine, by definition).

tick_

one tick_: 36 cycles
next 10: [18,16,16,16,16,16,16,16,16,16]
average over 1m: 17.89 cycles
99.999% perc: 11,913
99.9% perc: 50.13
99th perc:  34.58
40th perc:  16.26
[min, 10th, 20th, .. 90th, max]:
 1.2000e1 1.4979e1 1.5417e1 1.5838e1 1.6260e1 1.6681e1 1.7205e1 1.8049e1 1.8892e1 2.2513e1 3.5790e4

tick

sum to 1000
first measure: 886 cycles
second measure: 1040 cycles

ticks

sum to 1000 n = 1000 prime run: 8.860e2
run                       first     2nd     3rd     4th     5th  40th %
ticks                    1.87e3  1.36e3  1.41e3  1.38e3  1.32e3 1.40e3 cycles
ticks (lambda)           1.32e3  1.39e3  1.45e3  1.41e3  1.44e3 1.40e3 cycles
ticks (poly)             1.45e3  1.33e3  1.43e3  1.38e3  1.41e3 1.40e3 cycles
ticksIO                  1.83e3  1.49e3  1.41e3  1.33e3  1.42e3 1.34e3 cycles
ticksIO (lambda)         1.49e3  1.38e3  1.39e3  1.40e3  1.33e3 1.33e3 cycles
ticksIO (poly)           1.61e3  1.38e3  1.36e3  1.36e3  1.39e3 1.33e3 cycles

ticks cost

Looking for hidden computation costs:

n = 1.000e0 outside: 6.181e4 inside: 2.333e4 gap: 3.848e4
n = 1.000e1 outside: 4.014e5 inside: 3.721e4 gap: 3.642e5
n = 1.000e2 outside: 2.130e5 inside: 1.753e5 gap: 3.774e4
n = 1.000e3 outside: 1.436e6 inside: 1.397e6 gap: 3.951e4

tickns

Multiple runs summing to a series of numbers.

sum to's [1,10,100,1000]
ns (ticks n fMono) as:  2.169e1 3.820e1 1.686e2 1.325e3
(replicateM n . tick fMono) <$> as:  1.581e1 1.579e1 9.148e1 6.702e2

vector

sum to 1000
ticks list               2.39e4  1.62e4  1.57e4  1.62e4  1.62e4 1.44e4 cycles
ticks boxed              6.61e3  5.97e3  5.83e3  5.80e3  5.83e3 5.98e3 cycles
ticks storable           1.75e3  1.52e3  1.33e3  1.43e3  1.45e3 1.41e3 cycles
ticks unboxed            2.95e3  2.63e3  2.60e3  2.67e3  2.59e3 2.60e3 cycles

whnf

sum to 1000
tick                      7.32e2 cycles
tickWHNF                  9.60e2 cycles
ticks                    2.35e3  1.44e3  1.37e3  1.34e3  1.40e3 1.40e3 cycles
ticksWHNF                5.60e1  1.60e1  1.80e1  1.80e1  2.00e1 1.84e1 cycles
tickIO                    9.49e3 cycles
tickWHNFIO                1.36e2 cycles
ticksIO                  2.10e3  1.36e3  1.36e3  1.40e3  1.34e3 1.41e3 cycles
ticksWHNFIO              4.48e2  6.60e1  3.00e1  2.80e1  2.60e1 2.28e1 cycles

R&D, To Do

metal speed

The average cycles per (+) operation can get down to 0.7, and there are about 4 cache registers per cycle, so 2.8 low level instructions per (+). Is this close to the metal speed?

unboxed versus boxed

ghc user guide

strictness

All about strictness

sharing

comparative results

ticks non-memo 1.28e5
ticks inline memo
ticks noinline no-memo
ticks inlinable non-memo
ticksIO non-memo 6.48e4

no-full-laziness in Perf.cycle

ticks non-memo 6.51e4
ticks inline memo
ticksIO non-memo 1.28e5

no-cse

ticks memo 7.66e4
ticks inline memo
ticksIO nomemo 6.47e4

polymorphism

Can slow a computation down by 2x

lambda expressions

Can really slow things down

solo experiment recipe:

stack exec "ghc" -- -O2 -rtsopts examples/summing.lhs
./examples/summing +RTS -s -RTS --runs 10000 --sumTo 1000 --chart --chartName other/sum1e3.svg --truncAt 4

references

time

space

memoization

http://okmij.org/ftp/Haskell/#memo-off

cache cycle estimates

Cache Cycles ——————- —————- register 4 per cycle L1 Cache access 3-4 cycles L2 Cache access 11-12 cycles L3 unified access 30 - 40 DRAM hit 195 cycles L1 miss 40 cycles L2 miss >600 cycles

A performance checklist

  1. compile with rtsopts flag
stack ghc -- --make examples/examples.hs -rtsopts -fforce-recomp
  1. check GC examples/examples +RTS -s

  2. enabling profiling

  1. create an examples.prof on execution: time examples/examples +RTS -p

  2. space

examples/examples +RTS -p -hc
hp2ps -e8in -c examples/examples.hp
hp2ps -e8in -c examples/examples.hy # types
hp2ps -e8in -c examples/examples.hp # constructors
  1. check strictness pragmas

  2. space leaks

examples/examples +RTS -s - additional memory
examples/examples +RTS -xt -hy
  1. read core
    stack exec ghc-core -- --no-cast --no-asm --no-syntax examples/simplest.hs >> other/simplest.core

    alias ghci-core="stack ghci -- -ddump-simpl -dsuppress-idinfo -dsuppress-coercions -dsuppress-type-applications -dsuppress-uniques -dsuppress-module-prefixes"

simplest core dump (cleaned up)

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 22, types: 31, coercions: 9}

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2 :: TrName
$trModule2 = TrNameS "main"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1 :: TrName
$trModule1 = TrNameS "Main"#

-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule :: Module
$trModule = Module $trModule2 $trModule1

-- RHS size: {terms: 4, types: 7, coercions: 0}
main1
  :: State# RealWorld -> (# State# RealWorld, () #)
main1 =
  \ (s_a1o2 [OS=OneShot] :: State# RealWorld) ->
    (# s_a1o2, () #)

-- RHS size: {terms: 1, types: 0, coercions: 3}
main :: IO ()
main = main1 `cast` ...

-- RHS size: {terms: 2, types: 1, coercions: 3}
main2
  :: State# RealWorld -> (# State# RealWorld, () #)
main2 = runMainIO1 @ () (main1 `cast` ...)

-- RHS size: {terms: 1, types: 0, coercions: 3}
:main :: IO ()
:main = main2 `cast` ...

simplest core dump (full)

[1 of 1] Compiling Main             ( examples/simplest.hs, examples/simplest.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 22, types: 31, coercions: 9}

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2 :: TrName
[GblId,

 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 30 20}]
$trModule2 = TrNameS "main"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1 :: TrName
[GblId,

 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 30 20}]
$trModule1 = TrNameS "Main"#

-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule :: Module
[GblId,

 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 30}]
$trModule = Module $trModule2 $trModule1

-- RHS size: {terms: 4, types: 7, coercions: 0}
main1
  :: State# RealWorld -> (# State# RealWorld, () #)
[GblId,
 Arity=1,

 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=ALWAYS_IF(arity=1,unsat_ok=True,boring_ok=False)
         Tmpl= \ (s_a1o2 [Occ=Once, OS=OneShot]
                    :: State# RealWorld) ->
                 (# s_a1o2, () #)}]
main1 =
  \ (s_a1o2 [OS=OneShot] :: State# RealWorld) ->
    (# s_a1o2, () #)

-- RHS size: {terms: 1, types: 0, coercions: 3}
main :: IO ()
[GblId,
 Arity=1,

 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=ALWAYS_IF(arity=0,unsat_ok=True,boring_ok=True)
         Tmpl= main1 `cast` ...}]
main = main1 `cast` ...

-- RHS size: {terms: 2, types: 1, coercions: 3}
main2
  :: State# RealWorld -> (# State# RealWorld, () #)
[GblId,
 Arity=1,

 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 20 60}]
main2 = runMainIO1 @ () (main1 `cast` ...)

-- RHS size: {terms: 1, types: 0, coercions: 3}
:main :: IO ()
[GblId,
 Arity=1,

 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=ALWAYS_IF(arity=0,unsat_ok=True,boring_ok=True)
         Tmpl= main2 `cast` ...}]
:main = main2 `cast` ...

Linking examples/simplest ...