perf

Build Status Hackage lts nightly

repo

If you want to make stuff very fast in haskell, you need to dig down below the criterion abstraction-level and start counting cycles using the rdtsc register on x86.

These examples are experiments in measuring cycles (or ticks), a development of intuition about what is going on at the very fast level.

The interface is subject to change as intuition develops and rabbit holes explored.

{-# OPTIONS_GHC -Wall #-}
{-# OPTIONS_GHC -fno-warn-type-defaults #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DataKinds #-}
import Data.Text (pack, intercalate)
import Data.Text.IO (writeFile)
import Formatting
import Protolude hiding ((%), intercalate)
import Data.List ((!!))
import Data.TDigest
import System.Random.MWC.Probability
import Options.Generic
import qualified Control.Foldl as L
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Storable as S
import Chart

The examples below mostly use Perf.Cycles. There is also a monad layer in Perf which has been used in other/summing.lhs.

import Perf

All the imports that are needed for charts

command line

data Opts = Opts
  { runs :: Maybe Int    -- <?> "number of runs"
  , sumTo :: [Double] -- <?> "sum to this number"
  , chartNum :: Maybe Int
  , truncAt :: Maybe Double
  }
  deriving (Generic, Show)

instance ParseRecord Opts

main

main :: IO ()
main = do
  o :: Opts <- getRecord "a random bit of text"
  let n = fromMaybe 10000 (runs o)
  let as = case sumTo o of
        [] -> [1,10,100,1000,10000]
        x -> x
  let trunc = fromMaybe 5 (truncAt o)
  let numChart = min (length as) $ fromMaybe 3 (chartNum o)

For reference, based on a 2.6G machine one cycle is = 0.38 𝛈s

tick_ taps the register twice to get a sense of the cost.

  onetick <- tick_
  ticks' <- replicateM 10 tick_
  avtick <- replicateM 1000000 tick_
  let average cs = L.fold ((/) <$> L.sum <*> L.genericLength) cs
  writeFile "other/onetick.md" $ code
    [ "one tick_: " <> pack (show onetick) <> " cycles"
    , "next 10: " <> pack (show ticks')
    , "average over 1m: " <>
      pack (show $ average (fromIntegral <$> avtick)) <> " cycles"
    ]
one tick_: 44 cycles
next 10: [28,24,28,24,28,24,28,24,28,24]
average over 1m: 20.985862 cycles

Before we actually measure something, lets take a closer look at tick_.

A pattern I see on my machine are shifts by multiples of 4, which correspond to roughly the L1 cache latency.

It pays to look at the whole distribution, and a compact way of doing that is to calculate quantiles:

  -- warmup 100
  xs <- replicateM 10000 tick_
  writeFile "other/tick_.md" $ code $
        (["[min, 10th, 20th, .. 90th, max]:"] :: [Text]) <>
        [mconcat (sformat (" " % prec 3) <$> deciles 5 (fromIntegral <$> xs))]
[min, 10th, 20th, .. 90th, max]:
 18.0 19.5 20.5 21.6 28.1 193

The important cycle count for most work is around the 30th to 50th percentile, where you get a clean measure, hopefully free of GC activity and cache miss-fires.

The quantile print of tick_ sometimes shows a 12 to 14 point jump around the 90th percential, and this is in the zone of an L2 access. Without a warmup, one or more larger values occur at the start, and often are in the zone of an L2 miss. Sometimes there's also a few large hiccoughs at around 2k cycles.

summing

Let's measure something. The simplest something I could think of was summing.

The helper function reportQuantiles utilises spins which takes n measurements of a function application over a range of values to apply.

  let f :: Double -> Double
      f x = foldl' (+) 0 [1..x]
  _ <- warmup 100
  (cs,_) <- spins n tick f as
  reportQuantiles cs as "other/spin.md"
 1.00e0:  406 424 425 427 456 2.88e5
 1.00e1:  151 157 157 160 165 4.25e4
 1.00e2:  124 125 128 146 173 3.83e3
 1.00e3:  121 124 141 168 187 640
 1.00e4:  121 132 142 168 186 278

Each row represents summing to a certain point: 1 up to 10000, and each column is a decile: min, 10th .. 90th, max. The values in each cell are the number of cycles divided by the number of runs and the number of sumTo.

The first row (summing to 1) represents the cost of setting up the sum, so we're looking at about (692 - 128) = 560 cycles every run.

Overall, the computation looks like it's running in O(n) where n is the number of runs * nuber of sumTo. Specifically, I would write down the order at about:

123 * o(n) + 560 * o(1)

The exception is the 70th to 90th zone, where the number of cycles creeps up to 194 for 1k sumTo at the 90th percentile.

Charting the 1k sumTo:

  let xs_0 = fromIntegral <$> take 10000 (cs!!numChart) -- summing to 1000
  let xs1 = (\x -> min x (trunc*deciles 5 xs_0 !! 5)) <$> xs_0
  fileSvg "other/spin1k.svg" (750,250) $ pad 1.1 $ histLine xs1

Switching to a solo experiment gives:

stack exec "ghc" -- -O2 -rtsopts examples/summing.lhs
./examples/summing +RTS -s -RTS --runs 10000 --sumTo 1000 --chart --chartName other/sum1e3.svg --truncAt 4

Executable complexity has changed the profile of the overall computation. Both measurements have the same value, however, up to around the 30th percentile.

generic vector

Using vector to sum:

  let asv :: [V.Vector Double] = (\x -> V.generate (floor x) fromIntegral) <$> as
  let sumv :: V.Vector Double -> Double
      sumv x = V.foldl (+) 0 x
  (csv,_) <- spins n tick sumv asv
  reportQuantiles csv as "other/vectorGen.md"
 1.00e0:  54.0 61.1 63.0 69.3 70.8 1.82e5
 1.00e1:  18.4 19.9 20.2 20.6 21.3 1.14e5
 1.00e2:  14.8 16.1 16.5 16.9 17.4 1.05e4
 1.00e3:  15.2 15.5 15.9 17.9 21.4 1.85e3
 1.00e4:  16.0 16.9 19.6 23.4 36.5 191

randomizing data

To avoid ghc (and fusion) trickiness, it's often a good idea to use random numbers instead of simple progressions. One day, a compiler will discover x(x+1)/2 and then we'll be in real trouble.

  mwc <- create
  !asr <- (fmap V.fromList <$>) <$> sequence $ (\x -> samples x uniform mwc) . floor <$> as
  void $ warmup 100
  (csr,_) <- spins n tick sumv asr
  reportQuantiles csr as "other/vectorr.md"
 1.00e0:  21.0 32.2 33.5 38.0 44.3 1.39e6
 1.00e1:  7.50 9.61 9.78 9.93 10.1 4.88e5
 1.00e2:  6.66 7.01 7.25 7.47 7.98 1.83e4
 1.00e3:  6.58 6.84 6.90 7.12 9.20 1.47e3
 1.00e4:  8.88 9.16 9.38 9.91 12.4 240

Charting the 1k sumTo:

  let csr0 = fromIntegral <$> take 10000 (csr!!numChart) -- summing to 1000
  let csr1 = (\x -> min x (trunc*deciles 5 csr0 !! 5)) <$> csr0
  fileSvg "other/vector1k.svg" (750,250) $ pad 1.1 $ histLine csr1

unboxed

  !asu <- (fmap U.fromList <$>) <$> sequence $ (\x -> samples x uniform mwc) . floor <$> as
  let sumu :: U.Vector Double -> Double
      sumu x = U.foldl (+) 0 x
  void $ warmup 100
  (csu,_) <- spins n tick sumu asu
  reportQuantiles csu as "other/vectoru.md"
 1.00e0:  16.0 28.6 29.6 30.4 35.8 1.02e6
 1.00e1:  3.40 5.28 5.51 7.53 8.16 1.78e5
 1.00e2:  2.46 4.13 4.16 4.20 4.21 1.62e4
 1.00e3:  2.19 2.55 2.56 3.05 3.06 1.36e3
 1.00e4:  2.17 2.17 2.52 3.01 3.32 162

Charting the 1k sumTo:

  let csu0 = fromIntegral <$> take 10000 (csu!!numChart) -- summing to 1000
  let csu1 = (\x -> min x (trunc*deciles 5 csu0 !! 5)) <$> csu0
  fileSvg "other/vectoru1k.svg" (750,250) $ pad 1.1 $ histLine csu1

storable

  !ass <- (fmap S.fromList <$>) <$> sequence $ (\x -> samples x uniform mwc) . floor <$> as
  let sums :: S.Vector Double -> Double
      sums x = S.foldl (+) 0 x
  void $ warmup 100
  (css,_) <- spins n tick sums ass
  reportQuantiles css as "other/vectors.md"
 1.00e0:  12.0 19.5 20.3 22.8 53.7 7.52e5
 1.00e1:  2.20 3.40 4.66 4.87 6.53 1.01e5
 1.00e2:  2.34 2.46 2.53 2.54 2.55 9.88e3
 1.00e3:  2.19 2.20 2.24 2.27 2.27 1.04e3
 1.00e4:  2.17 2.17 2.52 3.01 3.01 98.7

Charting the 1k sumTo:

  let css0 = fromIntegral <$> take 10000 (css!!numChart) -- summing to 1000
  let css1 = (\x -> min x (trunc*deciles 5 css0 !! 5)) <$> css0
  fileSvg "other/vectors1k.svg" (750,250) $ pad 1.1 $ histLine css1

tickf, ticka, tickfa

These functions attempt to discriminate between cycles used to compute f a (ie to apply the function f), and cycles used to force a. In experiments so far, this act of observation tends to change the number of cycles.

Separation of the f and a effects in f a

  void $ warmup 100
  (csr2,_) <- spins n tick sumv asr
  (csvf,_) <- spins n tickf sumv asr
  (csva,_) <- spins n ticka sumv asr
  (csvfa,_) <- spins n tickfa sumv asr
  reportQuantiles csr2 as "other/vectorr2.md"
  reportQuantiles csvf as "other/vectorf.md"
  reportQuantiles csva as "other/vectora.md"
  reportQuantiles (fmap fst <$> csvfa) as "other/vectorfaf.md"
  reportQuantiles (fmap snd <$> csvfa) as "other/vectorfaa.md"

a full tick

 1.00e0:  24.0 32.1 33.2 34.0 34.6 7.21e5
 1.00e1:  7.40 8.65 8.86 8.97 9.22 1.36e5
 1.00e2:  6.74 7.14 7.25 7.27 7.31 1.48e4
 1.00e3:  6.69 6.70 6.73 6.74 6.92 1.64e3
 1.00e4:  6.52 6.54 6.91 7.94 9.04 161

just function application

 1.00e0:  14.0 29.0 29.7 30.4 35.8 1.59e6
 1.00e1:  5.20 8.13 8.23 8.88 9.21 1.44e5
 1.00e2:  7.09 7.23 7.25 7.26 7.29 1.42e4
 1.00e3:  6.53 7.81 9.05 9.05 9.14 1.92e3
 1.00e4:  6.52 6.53 6.71 7.59 9.04 197

just forcing a:

 1.00e0:  24.0 25.7 26.9 28.0 29.1 256
 1.00e1:  1.80 2.01 2.15 2.64 3.13 22.6
 1.00e2:  0.180 0.196 0.206 0.216 0.283 2.32
 1.00e3:  0.0120 0.0198 0.0220 0.0229 0.0279 0.128
 1.00e4:  0.00120 0.00459 0.00524 0.00612 0.00725 0.107

the f splice of f a

 1.00e0:  16.0 29.3 30.3 33.1 34.5 1.45e6
 1.00e1:  6.80 7.98 8.13 8.23 8.49 1.44e5
 1.00e2:  6.74 7.92 8.05 9.48 9.51 6.09e5
 1.00e3:  6.53 6.71 7.60 8.76 9.06 2.58e3
 1.00e4:  6.50 6.52 6.91 9.01 9.03 277

the a slice of fa

 1.00e0:  12.0 18.7 21.0 21.8 22.5 136
 1.00e1:  1.20 1.48 2.11 2.18 2.25 12.6
 1.00e2:  0.120 0.210 0.234 0.278 0.283 3.92
 1.00e3:  0.0120 0.0207 0.0222 0.0277 0.0284 0.276
 1.00e4:  0.00120 0.00540 0.00625 0.00718 0.00806 0.102
  pure ()

helpers

showxs :: [Double] -> Double -> Text
showxs qs m =
          sformat (" " % Formatting.expt 2) m <> ": " <>
          mconcat (sformat (" " % prec 3) <$> ((/m) <$> qs))

deciles :: (Foldable f) => Int -> f Double -> [Double]
deciles n xs =
  (\x -> fromMaybe 0 $
      quantile x (tdigest xs :: TDigest 25)) <$>
      ((/fromIntegral n) . fromIntegral <$> [0..n]) :: [Double]

reportQuantiles ::  [[Cycles]] -> [Double] -> FilePath -> IO ()
reportQuantiles css as name =
  writeFile name $ code $ zipWith showxs (deciles 5 . fmap fromIntegral <$> css) as

code :: [Text] -> Text
code cs = "\n~~~\n" <> intercalate "\n" cs <> "\n~~~\n"

histLine :: [Double] -> Chart' a
histLine xs =
    lineChart (repeat (LineConfig 0.002 (Color 0 0 1 0.1))) widescreen
     (zipWith (\x y -> [V2 x 0,V2 x y]) [0..] xs) <>
     axes
     ( chartAspect .~ widescreen
     $ chartRange .~ Just
       (Rect $ V2
         (Range (0.0,fromIntegral $ length xs))
         (Range (0,L.fold (L.Fold max 0 identity) xs)))
     $ def)