If you want to make stuff very fast in haskell, you need to dig down below the criterion abstraction-level and start counting cycles using the rdtsc register on x86.

These examples are experiments in measuring cycles (or ticks), a development of intuition about what is going on at the very fast level.

The interface is subject to change as intuition develops and rabbit holes explored.

```
{-# OPTIONS_GHC -Wall #-}
{-# OPTIONS_GHC -fno-warn-type-defaults #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DataKinds #-}
import Data.Text (pack, intercalate)
import Data.Text.IO (writeFile)
import Formatting
import Protolude hiding ((%), intercalate)
import Data.List ((!!))
import Data.TDigest
import System.Random.MWC.Probability
import Options.Generic
import qualified Control.Foldl as L
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Storable as S
import Chart
```

The examples below mostly use `Perf.Cycles`

. There is also a monad layer in `Perf`

which has been used in `other/summing.lhs`

.

`import Perf`

All the imports that are needed for charts

```
data Opts = Opts
{ runs :: Maybe Int -- <?> "number of runs"
, sumTo :: [Double] -- <?> "sum to this number"
, chartNum :: Maybe Int
, truncAt :: Maybe Double
}
deriving (Generic, Show)
instance ParseRecord Opts
```

```
main :: IO ()
main = do
o :: Opts <- getRecord "a random bit of text"
let n = fromMaybe 10000 (runs o)
let as = case sumTo o of
[] -> [1,10,100,1000,10000]
x -> x
let trunc = fromMaybe 5 (truncAt o)
let numChart = min (length as) $ fromMaybe 3 (chartNum o)
```

For reference, based on a 2.6G machine one cycle is = 0.38 𝛈s

`tick_`

taps the register twice to get a sense of the cost.

```
onetick <- tick_
ticks' <- replicateM 10 tick_
avtick <- replicateM 1000000 tick_
let average cs = L.fold ((/) <$> L.sum <*> L.genericLength) cs
writeFile "other/onetick.md" $ code
[ "one tick_: " <> pack (show onetick) <> " cycles"
, "next 10: " <> pack (show ticks')
, "average over 1m: " <>
pack (show $ average (fromIntegral <$> avtick)) <> " cycles"
]
```

```
one tick_: 44 cycles
next 10: [28,24,28,24,28,24,28,24,28,24]
average over 1m: 20.985862 cycles
```

Before we actually measure something, lets take a closer look at tick_.

A pattern I see on my machine are shifts by multiples of 4, which correspond to roughly the L1 cache latency.

It pays to look at the whole distribution, and a compact way of doing that is to calculate quantiles:

```
-- warmup 100
xs <- replicateM 10000 tick_
writeFile "other/tick_.md" $ code $
(["[min, 10th, 20th, .. 90th, max]:"] :: [Text]) <>
[mconcat (sformat (" " % prec 3) <$> deciles 5 (fromIntegral <$> xs))]
```

```
[min, 10th, 20th, .. 90th, max]:
18.0 19.5 20.5 21.6 28.1 193
```

The important cycle count for most work is around the 30th to 50th percentile, where you get a clean measure, hopefully free of GC activity and cache miss-fires.

The quantile print of tick_ sometimes shows a 12 to 14 point jump around the 90th percential, and this is in the zone of an L2 access. Without a warmup, one or more larger values occur at the start, and often are in the zone of an L2 miss. Sometimes there's also a few large hiccoughs at around 2k cycles.

Let's measure something. The simplest something I could think of was summing.

The helper function `reportQuantiles`

utilises `spins`

which takes n measurements of a function application over a range of values to apply.

```
let f :: Double -> Double
f x = foldl' (+) 0 [1..x]
_ <- warmup 100
(cs,_) <- spins n tick f as
reportQuantiles cs as "other/spin.md"
```

```
1.00e0: 406 424 425 427 456 2.88e5
1.00e1: 151 157 157 160 165 4.25e4
1.00e2: 124 125 128 146 173 3.83e3
1.00e3: 121 124 141 168 187 640
1.00e4: 121 132 142 168 186 278
```

Each row represents summing to a certain point: 1 up to 10000, and each column is a decile: min, 10th .. 90th, max. The values in each cell are the number of cycles divided by the number of runs and the number of sumTo.

The first row (summing to 1) represents the cost of setting up the sum, so we're looking at about (692 - 128) = 560 cycles every run.

Overall, the computation looks like it's running in O(n) where n is the number of runs * nuber of sumTo. Specifically, I would write down the order at about:

`123 * o(n) + 560 * o(1)`

The exception is the 70th to 90th zone, where the number of cycles creeps up to 194 for 1k sumTo at the 90th percentile.

Charting the 1k sumTo:

```
let xs_0 = fromIntegral <$> take 10000 (cs!!numChart) -- summing to 1000
let xs1 = (\x -> min x (trunc*deciles 5 xs_0 !! 5)) <$> xs_0
fileSvg "other/spin1k.svg" (750,250) $ pad 1.1 $ histLine xs1
```

Switching to a solo experiment gives:

```
stack exec "ghc" -- -O2 -rtsopts examples/summing.lhs
./examples/summing +RTS -s -RTS --runs 10000 --sumTo 1000 --chart --chartName other/sum1e3.svg --truncAt 4
```

Executable complexity has changed the profile of the overall computation. Both measurements have the same value, however, up to around the 30th percentile.

Using vector to sum:

```
let asv :: [V.Vector Double] = (\x -> V.generate (floor x) fromIntegral) <$> as
let sumv :: V.Vector Double -> Double
sumv x = V.foldl (+) 0 x
(csv,_) <- spins n tick sumv asv
reportQuantiles csv as "other/vectorGen.md"
```

```
1.00e0: 54.0 61.1 63.0 69.3 70.8 1.82e5
1.00e1: 18.4 19.9 20.2 20.6 21.3 1.14e5
1.00e2: 14.8 16.1 16.5 16.9 17.4 1.05e4
1.00e3: 15.2 15.5 15.9 17.9 21.4 1.85e3
1.00e4: 16.0 16.9 19.6 23.4 36.5 191
```

To avoid ghc (and fusion) trickiness, it's often a good idea to use random numbers instead of simple progressions. One day, a compiler will discover `x(x+1)/2`

and then we'll be in real trouble.

```
mwc <- create
!asr <- (fmap V.fromList <$>) <$> sequence $ (\x -> samples x uniform mwc) . floor <$> as
void $ warmup 100
(csr,_) <- spins n tick sumv asr
reportQuantiles csr as "other/vectorr.md"
```

```
1.00e0: 21.0 32.2 33.5 38.0 44.3 1.39e6
1.00e1: 7.50 9.61 9.78 9.93 10.1 4.88e5
1.00e2: 6.66 7.01 7.25 7.47 7.98 1.83e4
1.00e3: 6.58 6.84 6.90 7.12 9.20 1.47e3
1.00e4: 8.88 9.16 9.38 9.91 12.4 240
```

Charting the 1k sumTo:

```
let csr0 = fromIntegral <$> take 10000 (csr!!numChart) -- summing to 1000
let csr1 = (\x -> min x (trunc*deciles 5 csr0 !! 5)) <$> csr0
fileSvg "other/vector1k.svg" (750,250) $ pad 1.1 $ histLine csr1
```

```
!asu <- (fmap U.fromList <$>) <$> sequence $ (\x -> samples x uniform mwc) . floor <$> as
let sumu :: U.Vector Double -> Double
sumu x = U.foldl (+) 0 x
void $ warmup 100
(csu,_) <- spins n tick sumu asu
reportQuantiles csu as "other/vectoru.md"
```

```
1.00e0: 16.0 28.6 29.6 30.4 35.8 1.02e6
1.00e1: 3.40 5.28 5.51 7.53 8.16 1.78e5
1.00e2: 2.46 4.13 4.16 4.20 4.21 1.62e4
1.00e3: 2.19 2.55 2.56 3.05 3.06 1.36e3
1.00e4: 2.17 2.17 2.52 3.01 3.32 162
```

Charting the 1k sumTo:

```
let csu0 = fromIntegral <$> take 10000 (csu!!numChart) -- summing to 1000
let csu1 = (\x -> min x (trunc*deciles 5 csu0 !! 5)) <$> csu0
fileSvg "other/vectoru1k.svg" (750,250) $ pad 1.1 $ histLine csu1
```

```
!ass <- (fmap S.fromList <$>) <$> sequence $ (\x -> samples x uniform mwc) . floor <$> as
let sums :: S.Vector Double -> Double
sums x = S.foldl (+) 0 x
void $ warmup 100
(css,_) <- spins n tick sums ass
reportQuantiles css as "other/vectors.md"
```

```
1.00e0: 12.0 19.5 20.3 22.8 53.7 7.52e5
1.00e1: 2.20 3.40 4.66 4.87 6.53 1.01e5
1.00e2: 2.34 2.46 2.53 2.54 2.55 9.88e3
1.00e3: 2.19 2.20 2.24 2.27 2.27 1.04e3
1.00e4: 2.17 2.17 2.52 3.01 3.01 98.7
```

Charting the 1k sumTo:

```
let css0 = fromIntegral <$> take 10000 (css!!numChart) -- summing to 1000
let css1 = (\x -> min x (trunc*deciles 5 css0 !! 5)) <$> css0
fileSvg "other/vectors1k.svg" (750,250) $ pad 1.1 $ histLine css1
```

These functions attempt to discriminate between cycles used to compute `f a`

(ie to apply the function f), and cycles used to force `a`

. In experiments so far, this act of observation tends to change the number of cycles.

Separation of the `f`

and `a`

effects in `f a`

```
void $ warmup 100
(csr2,_) <- spins n tick sumv asr
(csvf,_) <- spins n tickf sumv asr
(csva,_) <- spins n ticka sumv asr
(csvfa,_) <- spins n tickfa sumv asr
reportQuantiles csr2 as "other/vectorr2.md"
reportQuantiles csvf as "other/vectorf.md"
reportQuantiles csva as "other/vectora.md"
reportQuantiles (fmap fst <$> csvfa) as "other/vectorfaf.md"
reportQuantiles (fmap snd <$> csvfa) as "other/vectorfaa.md"
```

a full tick

```
1.00e0: 24.0 32.1 33.2 34.0 34.6 7.21e5
1.00e1: 7.40 8.65 8.86 8.97 9.22 1.36e5
1.00e2: 6.74 7.14 7.25 7.27 7.31 1.48e4
1.00e3: 6.69 6.70 6.73 6.74 6.92 1.64e3
1.00e4: 6.52 6.54 6.91 7.94 9.04 161
```

just function application

```
1.00e0: 14.0 29.0 29.7 30.4 35.8 1.59e6
1.00e1: 5.20 8.13 8.23 8.88 9.21 1.44e5
1.00e2: 7.09 7.23 7.25 7.26 7.29 1.42e4
1.00e3: 6.53 7.81 9.05 9.05 9.14 1.92e3
1.00e4: 6.52 6.53 6.71 7.59 9.04 197
```

just forcing `a`

:

```
1.00e0: 24.0 25.7 26.9 28.0 29.1 256
1.00e1: 1.80 2.01 2.15 2.64 3.13 22.6
1.00e2: 0.180 0.196 0.206 0.216 0.283 2.32
1.00e3: 0.0120 0.0198 0.0220 0.0229 0.0279 0.128
1.00e4: 0.00120 0.00459 0.00524 0.00612 0.00725 0.107
```

the f splice of f a

```
1.00e0: 16.0 29.3 30.3 33.1 34.5 1.45e6
1.00e1: 6.80 7.98 8.13 8.23 8.49 1.44e5
1.00e2: 6.74 7.92 8.05 9.48 9.51 6.09e5
1.00e3: 6.53 6.71 7.60 8.76 9.06 2.58e3
1.00e4: 6.50 6.52 6.91 9.01 9.03 277
```

the a slice of fa

```
1.00e0: 12.0 18.7 21.0 21.8 22.5 136
1.00e1: 1.20 1.48 2.11 2.18 2.25 12.6
1.00e2: 0.120 0.210 0.234 0.278 0.283 3.92
1.00e3: 0.0120 0.0207 0.0222 0.0277 0.0284 0.276
1.00e4: 0.00120 0.00540 0.00625 0.00718 0.00806 0.102
```

` pure ()`

```
showxs :: [Double] -> Double -> Text
showxs qs m =
sformat (" " % Formatting.expt 2) m <> ": " <>
mconcat (sformat (" " % prec 3) <$> ((/m) <$> qs))
deciles :: (Foldable f) => Int -> f Double -> [Double]
deciles n xs =
(\x -> fromMaybe 0 $
quantile x (tdigest xs :: TDigest 25)) <$>
((/fromIntegral n) . fromIntegral <$> [0..n]) :: [Double]
reportQuantiles :: [[Cycles]] -> [Double] -> FilePath -> IO ()
reportQuantiles css as name =
writeFile name $ code $ zipWith showxs (deciles 5 . fmap fromIntegral <$> css) as
code :: [Text] -> Text
code cs = "\n~~~\n" <> intercalate "\n" cs <> "\n~~~\n"
histLine :: [Double] -> Chart' a
histLine xs =
lineChart (repeat (LineConfig 0.002 (Color 0 0 1 0.1))) widescreen
(zipWith (\x y -> [V2 x 0,V2 x y]) [0..] xs) <>
axes
( chartAspect .~ widescreen
$ chartRange .~ Just
(Rect $ V2
(Range (0.0,fromIntegral $ length xs))
(Range (0,L.fold (L.Fold max 0 identity) xs)))
$ def)
```