After a long stretch of not publishing anything on this blog, I am coming back for … however many articles I can manage. Today I want to talk about a feature of the AArch64 ISA that people often overlook but compilers use a lot. It’s a good, short story about what made Arm even better and more “CISCy” when it comes to conditional moves. The story of csinc deserves an article like this.
You have probably heard of cmov
Traditionally, when you encounter conditional moves in the literature, it is about the x86 instruction cmov. It’s a nice feature that enables better performance in low-level optimization. Say you merge 2 arrays: you can compare the numbers and pick one depending on the outcome of the compare instruction (more precisely, on the flags it sets):
while ((pos1 < size1) & (pos2 < size2)) {
  v1 = input1[pos1];
  v2 = input2[pos2];
  output_buffer[pos++] = (v1 <= v2) ? v1 : v2;
  pos1 = (v1 <= v2) ? pos1 + 1 : pos1;
  pos2 = (v1 >= v2) ? pos2 + 1 : pos2;
}

cmpl    %r14d, %ebp     # compare %ebp with %r14d (computes %ebp - %r14d) and set the flags
setbe   %bl             # %bl = 1 if %ebp is below or equal (CF or ZF set), else 0
cmovbl  %ebp, %r14d     # move %ebp into %r14d if %ebp is below (CF set)
If branches are unpredictable, for instance when you merge 2 arrays of random integers, conditional move instructions bring significant speed-ups over the branchy version because they remove the branch misprediction penalty. A lot has been written about this on Lemire’s blog. Much engineering has gone into this area, including Agner Fog’s work and profile-guided optimizations that choose between cmov and a branch. Conditional moves are a huge part of modern software: pretty much anything you run likely contains them.
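To make the merge loop concrete, here is a minimal self-contained sketch in C. The function name `merge_branchless`, the `uint32_t` element type and the tail-copy loops are assumptions made for illustration; they are not part of the snippet above. Whether the ternaries actually become cmov (or csel on Arm) depends on the compiler and flags, so it is worth checking the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper around the merge loop from the text.
 * Equal values are written once and both cursors advance, exactly as the
 * original ternaries do. output_buffer must have room for size1 + size2
 * values. Returns the number of values written. */
static size_t merge_branchless(const uint32_t *input1, size_t size1,
                               const uint32_t *input2, size_t size2,
                               uint32_t *output_buffer) {
    assert(output_buffer != NULL);
    size_t pos1 = 0, pos2 = 0, pos = 0;

    /* Bitwise & instead of && avoids a short-circuit branch. */
    while ((pos1 < size1) & (pos2 < size2)) {
        uint32_t v1 = input1[pos1];
        uint32_t v2 = input2[pos2];
        /* Each ternary is a candidate for a conditional move/select. */
        output_buffer[pos++] = (v1 <= v2) ? v1 : v2;
        pos1 = (v1 <= v2) ? pos1 + 1 : pos1;
        pos2 = (v1 >= v2) ? pos2 + 1 : pos2;
    }

    /* Assumption: copy whatever is left in either input. */
    while (pos1 < size1) output_buffer[pos++] = input1[pos1++];
    while (pos2 < size2) output_buffer[pos++] = input2[pos2++];
    return pos;
}
```

Calling it on two sorted arrays that share a value writes that value once, because both cursors advance when the values are equal.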
What about Arm?
AArch64 is no exception and has conditional move instructions as well. The immediate equivalent, if you Google it, is csel, which stands for conditional select. There is almost no difference from cmov, except that you pass the condition as an operand and name a separate destination register (with cmov, the destination is left unchanged if the condition is not met). To my eye it is a bit more intuitive to read.
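A rough way to see the difference is to model both instructions as plain C functions. The names `model_cmov`, `model_csel` and `min_u64` are hypothetical and only mirror the operand semantics; they are not intrinsics, and a compiler is free to lower the last function however it likes.

```c
#include <assert.h>
#include <stdint.h>

/* Model of x86 cmov: the destination is overwritten only when the
 * condition holds, otherwise it keeps its previous value. */
static uint64_t model_cmov(uint64_t dst, uint64_t src, int cond) {
    return cond ? src : dst;
}

/* Model of AArch64 csel Xd, Xn, Xm, cond: Xd = cond ? Xn : Xm.
 * The destination is a separate register, so both inputs survive. */
static uint64_t model_csel(uint64_t xn, uint64_t xm, int cond) {
    return cond ? xn : xm;
}

/* Typical source-level shape that maps onto either instruction. */
static uint64_t min_u64(uint64_t a, uint64_t b) {
    return model_csel(a, b, a <= b);
}
```

With cmov, the "else" value has to be in the destination already; with csel, both candidates are explicit operands.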
When I was studying the structure of this instruction in the optimization guide, I noticed the family included several variations: csel, csinc, csinv and csneg.

I was intrigued by the existence of the other forms, as they give compilers and engineers more opportunities when writing software. For example, csinc Xd, Xa, Xb, cond (conditional select increment) means that if the condition holds, Xd = Xa, otherwise Xd = Xb + 1. In merging 2 arrays, the line:
pos1 = (v1 <= v2) ? pos1 + 1 : pos1;
can be compiled into:
csinc X0, X0, X0, #condition_of_v1_greater_than_v2
where X0 is the register for pos1. The condition is inverted because the incremented operand is the one picked when the condition fails; the alias cinc X0, X0, #condition_of_v1_less_equal_v2 expresses the same thing with the original condition.
csneg and csinv are similar and perform conditional negation and bitwise inversion.
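The semantics of these variations can be sketched as small C models. All function names here are hypothetical. Note that in each instruction the modified operand (incremented, inverted or negated) is the second one, selected when the condition is false, which is why `advance_pos1` passes the inverted comparison.

```c
#include <assert.h>
#include <stdint.h>

/* csinc Xd, Xn, Xm, cond : Xd = cond ? Xn : Xm + 1 */
static uint64_t model_csinc(uint64_t xn, uint64_t xm, int cond) {
    return cond ? xn : xm + 1;
}

/* csinv Xd, Xn, Xm, cond : Xd = cond ? Xn : ~Xm */
static uint64_t model_csinv(uint64_t xn, uint64_t xm, int cond) {
    return cond ? xn : ~xm;
}

/* csneg Xd, Xn, Xm, cond : Xd = cond ? Xn : -Xm (two's complement) */
static uint64_t model_csneg(uint64_t xn, uint64_t xm, int cond) {
    return cond ? xn : (uint64_t)0 - xm;
}

/* pos1 = (v1 <= v2) ? pos1 + 1 : pos1, expressed through csinc.
 * The increment happens when the condition is false, so we pass v1 > v2. */
static uint64_t advance_pos1(uint64_t pos1, uint32_t v1, uint32_t v2) {
    return model_csinc(pos1, pos1, v1 > v2);
}
```

The point of the instruction family is that the extra +1, bitwise NOT or negation comes at no additional instruction cost on top of the select.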
For example, clang recognizes this sequence, whereas GCC does not.

Where can this be useful otherwise?
Interestingly enough, in compression! You might have heard of Snappy, the old Google compression library, which has been outperformed by LZ4 many times. For x86, the difference in speed, even with the latest version of clang, is quite big. For example, on my Intel Xeon 2.00GHz server I get 2721MB/s of decompression for LZ4 and 2172MB/s for Snappy, which is a 25% gap.
For Snappy to reach even that level of decompression speed, engineers needed to write very subtle code to get the compiler to generate cmov.
For Arm, the csinc instruction was used because of the nature of the format:

In short, the last 2 bits of the byte that opens a block say what to do and which memory to copy: 00 means a literal, whose length is stored as len-1. With careful optimization of conditional moves, we can get this +1 back for free through csinc.
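The shape of that computation can be sketched as follows. This is a simplified illustration, not Snappy's actual decoder: the helper name `bytes_after_tag` and the `copy_advance` parameter are hypothetical, and it only covers short literals where the upper six bits of the tag hold len-1 directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how far to advance after reading a tag byte.
 * copy_advance stands in for whatever the decoder would use for
 * non-literal tags; it is an assumption of this sketch. */
static size_t bytes_after_tag(uint8_t tag, size_t copy_advance) {
    size_t tag_type = tag & 3;               /* low two bits: element type */
    size_t len_minus_one = (size_t)tag >> 2; /* short literal: len - 1     */

    /* Select between "x + 1" and another value: on AArch64 a compiler
     * may lower this to a single csinc instead of an add plus a csel. */
    return (tag_type == 0) ? len_minus_one + 1 : copy_advance;
}
```

On x86 the same source needs a separate add (or lea) before the cmov, which is the extra work the format happens to avoid on Arm.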

On Google T2A instances I got 3048MB/s decompression for LZ4 and 2839MB/s for Snappy, which is only a 7% gap. If I enable LZ4_FAST_DEC_LOOP, LZ4 reaches 3233MB/s, which still makes a 13% gap, but not the 25% seen on x86.
In conclusion, conditional select instructions for Arm deserve attention and awareness:
- csel, csinc and the others have the same latency and throughput, meaning they are as cheap as the usual csel on almost all modern Arm processors, including Apple M1 and M2.
- Compilers do recognize them (in my experience, clang did better than GCC, see above), so there is no need to do anything special; just be aware that some formats might work better for Arm than for x86.
To sum up, contrary to what the CISC vs RISC debate about the x86 and Arm ISAs suggests, the latter has surprisingly rich conditional instructions, which are more flexible than the traditionally discussed ones.
You can read more about this topic in other blogs, in the Arm reference guide, and in the Microsoft blog post on AArch64 conditional execution.