736 reads

Are Mutable References Fast?

by Jonathan FischoffMay 27th, 2017

Too Long; Didn't Read

I was profiling a simple loop (years ago). It was similar to:

featured image - Are Mutable References Fast?

I was profiling a simple loop (years ago). It was similar to:

countForever ref = forever $ modifyIORef' ref (+1)

I counted four assembly instructions.

Addition takes ~300 picoseconds, but there is memory access so let’s say two nanoseconds

To profile, I wrote:

iorefLoop = doref <- newIORef (0 :: Int)replicateM_ 100000000 $ modifyIORef' ref (+1)

I expected this to take ~ 0.2 seconds, but …

$ bench ./bin/ioref-loop

benchmarking ./bin/ioref-looptime 332.3 ms (330.4 ms .. 333.8 ms)1.000 R² (1.000 R² .. 1.000 R²)mean 334.4 ms (333.2 ms .. 335.9 ms)std dev 2.822 ms (2.182 ms .. 3.942 ms)

Was this a good time? Was I expecting something unrealistic from Haskell? As a sanity check, I wrote an analogous program in C:

int main (int argc, const char** argv){int ref = 0;while (ref < 100000000){ref++;}}

I compiled and ran it like:

$ gcc -O2 ./src/CLoop.c -o ./bin/CLoop && bench ./bin/CLoop

benchmarking ./bin/CLooptime 3.999 ms (3.794 ms .. 4.235 ms)0.980 R² (0.972 R² .. 0.992 R²)mean 3.940 ms (3.838 ms .. 4.093 ms)std dev 593.3 μs (422.8 μs .. 816.0 μs)variance introduced by outliers: 90% (severely inflated)

This was much faster than I had expected.

Maybe bench isn’t accurate when programs are fast?

I went on IRC (different times because GHC makes faster code now):

jfischoff ~ Is there anyway to make modifyIORef' faster? jfischoff ~ I am surprised that in a second I was only able to           ↪ update this ref 100 million times:           ↪ timeout 1000000 $ forever $ modifyIORef' x (1+)jfischoff ~ where as c++ was able to do the same in 4 millisecondsglguy     ~ c++ was able to do 1 update every 0.04 nanoseconds?glguy     ~ an update rate of 25 gigahertz?dv-       ~ gcc probably just replaced it with a constantjfischoff ~ dv-: perhapsglguy     ~ That or C++ unlocks the fast mode of an Intel processor

Burn.

$ gcc -O2 CLoop.c -S

Here’s the assembly:

; CLoop.s; Notice, there are no jumps.; There is no looping._main: ## @main.cfi_startproc## BB#0:pushq %rbpLtmp0:.cfi_def_cfa_offset 16Ltmp1:.cfi_offset %rbp, -16movq %rsp, %rbpLtmp2:.cfi_def_cfa_register %rbpxorl %eax, %eaxpopq %rbpretq.cfi_endproc

dv- was right. The compiler got rid of the loop, I’m assuming, because I wasn’t using the result. I added a volatile keyword to prevent optimizations:

//CountForver.cint main (int argc, const char** argv){volatile int ref = 0;while (ref < 100000000){ref++;}}

Compiling:

$ gcc -O2 ./src/CLoopVolatile.c -o ./bin/CLoopVolatile$ bench ./bin/CLoopVolatile

benchmarking ./bin/CLoopVolatile

time 158.2 ms (156.7 ms .. 159.5 ms)1.000 R² (0.999 R² .. 1.000 R²)mean 159.1 ms (158.4 ms .. 160.2 ms)std dev 2.423 ms (1.755 ms .. 3.772 ms)

The assembly of the loop is:

loop:incl -4(%rbp) # Increment the value at the address stored# in rbp displaced by 4.movl -4(%rbp), %eax # move the value at the address stored in rbp# displaced by 4 to eax.cmpl $100000000, %eax # Compare to 100000000 to eax.jl loop # Jump to the start of the loop if the eax# was less than when compared above.

So, incrementing C’s mutable refs is over twice as fast as IORef? Maybe, though, handwritten assembly could be faster? Maybe the volatile keyword threw off optimizations too much?

; AsmLoop.asmEXTERN _exitGLOBAL start

SECTION .dataalign 8iterationCount DD 100000000result DD 0

SECTION .textstart:; Copy the iteration count to the ecx register; which is used by the loopnz instructionmov ecx, [iterationCount]

loopBlock:inc dword [result] ; Increment the value at the address of resultloopnz loopBlock ; Decrement the ecx counter and loop until ecx; is zero

exitBlock:push dword 0 ; Set the exit code to zeromov eax, 1 ; Place the system call number (exit) in the eax regsub esp, 4 ; I have to add some dummy space to the stack for; some reasonint 0x80 ; Exit

Compiling:

$ nasm -fmacho src/AsmLoop.asm$ ld -o ./bin/AsmLoop -macosx_version_min 10.7.0 ./src/AsmLoop.o$ bench ./bin/AsmLoop

benchmarking ./bin/AsmLooptime 185.2 ms (183.7 ms .. 186.9 ms)1.000 R² (0.999 R² .. 1.000 R²)mean 187.0 ms (185.9 ms .. 188.5 ms)std dev 3.132 ms (2.229 ms .. 4.269 ms)

Wow, my l33t skillz increased the time by around 15% (three years ago this version was faster … just saying).

So IORef is about two times slower than naive approaches in C and assembly. What gives? To the core!

$ stack exec ghc-core -- --no-syntax -- -O2 \-dsuppress-all src/IORefLoop.hs

Looking through the core I saw:

...case readMutVar# ipv1_aVW w2_a35mof _ { (# ipv2_aWt, ipv3_aWu #) ->case ipv3_aWu of _ { I# x_aUW ->case writeMutVar# ipv1_aVW (I# (+# x_aUW 1#)) ipv2_aWt...

I find GHC core almost unreadable.

One trick is to try to ignore most of the case statements. The first and third case statements are not for scrutinizing alternatives, but are to ensure proper sequencing of IO actions. Also, renaming the generated variables helps.

Here is a cleaned up version of what was above:

...case readMutVar# ioRef token of{ (# token', x #) ->case x of{ I# unbox -> case writeMutVar# ioRef (I# (+# unbox 1#)) token'...

The second case statement is unboxing an Int to a primitive unboxed Int#,

case x of{ I# unbox

and then boxing when setting.

case writeMutVar# ioRef (I# (+# unbox 1#)) token'

I# is the Int constructor (the # means it is a compiler primitive). It wraps or boxes an unpacked, unboxed, "machine" int. Most of the time, GHC can unbox primitives automatically, but it can't in this case. Boxing/unboxing could be the root of the slowdown.

We need an unboxed mutable reference. I can think of two options: Ptr Int and MutableByteArray.

First the Ptr version:

main = alloca $ \ptr -> dopoke ptr (0 :: Int)replicateM_ 100000000 $ doi <- peek ptrlet !i' = i + 1poke ptr i'

Running, I get:

bench ./bin/ptr-loop

benchmarking ./bin/ptr-looptime 175.2 ms (174.1 ms .. 176.2 ms)1.000 R² (0.999 R² .. 1.000 R²)mean 174.7 ms (174.1 ms .. 175.2 ms)std dev 1.626 ms (1.372 ms .. 2.089 ms)

This is good! Almost twice as fast IORef.

Now the MutableByteBuffer version:

main = domba <- newAlignedPinnedByteArray 4 8replicateM_ 100000000 $ doi <- readByteArray mba 0 :: IO Intlet !i' = i + 1writeByteArray mba 0 i'

Running:

bench ./bin/mutable-byte-buffer-loop

benchmarking ./bin/mutable-byte-buffer-looptime 175.2 ms (173.5 ms .. 177.3 ms)0.999 R² (0.999 R² .. 1.000 R²)mean 177.4 ms (176.5 ms .. 178.4 ms)std dev 2.423 ms (1.987 ms .. 2.919 ms)

Basically identical.

I’ve made progress, but I don’t want to use either Ptr or MutableByteBuffer. They are not very safe (Ptr is arguably safe enough, but I prefer not to manually allocate memory directly if I can help it).

If only there were a safe wrapper around one of the types? There is! Vector is a safe wrapper around MutableByteBuffer. Here is the Vector version:

main = domba <- M.new 1replicateM_ 100000000 $ doi <- M.read mba 0 :: IO Intlet !i' = i + 1M.write mba 0 i'

Running:

bench ./bin/vector-loop

benchmarking ./bin/vector-looptime 178.6 ms (177.3 ms .. 179.9 ms)1.000 R² (0.999 R² .. 1.000 R²)mean 178.2 ms (177.1 ms .. 179.3 ms)std dev 3.021 ms (2.082 ms .. 4.288 ms)

Nice! With only a small increase in time, we get a safe interface for fast, unboxed mutable references. It is a little odd to use a collection type for a single element, but it’s not terrible, either.

What’s on the Table

The CLoopVolatile and the AsmLoop.asm are not the fastest way one could write a loop that increments a number. That would involve putting the reference in a register for the duration of the loop, and other tricks like loop unrolling. Just to give a sense of how much faster things could be here is a some simple assembly example that does that.

; FastAsmLoop.asmEXTERN _exitGLOBAL start

SECTION .textstart:mov rcx, 0mov rax, 100000000

loopBlock:inc qword rcx ;cmp rcx, rax ;jl loopBlock ;

exitBlock:mov rax, 0x2000001 ; exitmov rdi, 0syscall

Compiling

$ nasm -fmacho64 src/FastAsmLoop.asm$ ld -o ./bin/FastAsmLoop -macosx_version_min 10.7.0 \./src/FastAsmLoop.o$ bench ./bin/FastAsmLoop;

benchmarking ./bin/FastAsmLooptime 38.38 ms (37.96 ms .. 39.15 ms)0.999 R² (0.995 R² .. 1.000 R²)mean 38.22 ms (37.98 ms .. 38.87 ms)std dev 715.8 μs (282.6 μs .. 1.261 ms)

I barely believe these numbers, but I think it shows that GHC still is leaving some performance on the table.

Something else worth noting, this isn’t how a tight loop would be written in Haskell anyway. One would not need a mutable reference. In a future post I would like to do more direct comparison between Haskell and assembly similar to above. The C code was really just a baseline that lead me to look into what GHC was doing.

Update: Michael Snoyman alerted me to a package of his that provides an unboxed mutable reference, among other nice things: https://www.stackage.org/haddock/lts-8.16/mutable-containers-0.3.3/Data-Mutable.html#t:URef

The a project with this code that can also produce these timings can be found here: https://github.com/jfischoff/are-mutable-references-in-haskell-fast

Hacker Noon is how hackers start their afternoons. We’re a part of the @AMIfamily. We are now accepting submissions and happy to discuss advertising & sponsorship opportunities.

To learn more, read our about page, like/message us on Facebook, or simply, tweet/DM @HackerNoon.

If you enjoyed this story, we recommend reading our latest tech stories and trending tech stories. Until next time, don’t take the realities of the world for granted!