A new Mag­num fea­ture provides ef­fi­cient com­pile-time and runtime CPU de­tec­tion and dis­patch on x86, ARM and WebAssembly. The core idea be­hind al­lows adding new vari­ants without hav­ing to write any dis­patch­ing code.

Among key as­pects dif­fer­en­ti­at­ing between “per­form­ance is a joke for this pro­ject!” and “wow they really mean it” — be­sides fa­vor­ing data-ori­ented design over ab­stract fact­ory proxy man­ager del­eg­ates — is use of SIMD in­struc­tions such as AVX or NEON, and ul­ti­mately de­tect­ing and pick­ing the best op­tim­ized im­ple­ment­a­tion for giv­en hard­ware at runtime.

For over a dec­ade, Mag­num could af­ford to not both­er, partly be­cause the ac­tu­al heavy lift­ing such as phys­ics or tex­ture com­pres­sion was of­f­loaded to 3rd party lib­rar­ies, and partly be­cause if you really needed to per­form things like fast mul­ti­plic­a­tion of large matrices, you could al­ways seam­lessly del­eg­ate to Ei­gen. But as all new Mag­num APIs are data-ori­ented, de­signed with batch op­er­a­tions in mind, I just couldn’t ig­nore the ex­tra per­form­ance that SIMD in­struc­tions could bring.

Core idea — in­her­it­ance-based over­load res­ol­u­tion

What triggered this whole pro­cess was that I real­ized I could make use of C++ in­her­it­ance to pick among can­did­ate over­loads. Loosely cit­ing the C++ ref­er­en­ce on Over­load Res­ol­u­tion:

F1 is de­term­ined to be a bet­ter func­tion than F2 if […] there is at least one ar­gu­ment of F1 whose im­pli­cit con­ver­sion is bet­ter than the cor­res­pond­ing im­pli­cit con­ver­sion for that ar­gu­ment of F2.

If two con­ver­sion se­quences are in­dis­tin­guish­able be­cause they have the same rank, the fol­low­ing ad­di­tion­al rules ap­ply:

  • If Mid is de­rived (dir­ectly or in­dir­ectly) from Base, and Derived is de­rived (dir­ectly or in­dir­ectly) from Mid, then Derived to Mid is bet­ter than Derived to Base.

For a prac­tic­al ex­ample, let’s as­sume we’re about to im­ple­ment a mem­r­chr() func­tion, be­cause, com­pared to std::mem­chr(), it’s for some reas­on not a widely avail­able API and we want to have a fast im­ple­ment­a­tion every­where. Tak­ing just x86 in­to ac­count for sim­pli­city, we’ll have an AVX2 vari­ant, a slightly slower SSE2 vari­ant and a scal­ar fall­back, dis­tin­guished from each oth­er not by a name but by a tag, sim­il­arly to what tags like NoIn­it or ValueIn­it do in Con­tain­ers::Ar­ray con­struct­ors:

struct ScalarT {};
struct Sse2T: ScalarT {};

struct Avx2T: AvxT {};
struct Avx512T: Avx2T{};

const char* memrchr(ScalarT, const char* data, std::size_t size, char c);
const char* memrchr(Sse2T, const char* data, std::size_t size, char c);
const char* memrchr(Avx2T, const char* data, std::size_t size, char c);

Then, an ap­pro­pri­ate tag would be typedef‘d de­pend­ing on what par­tic­u­lar COR­RADE_TAR­GET_* pre­pro­cessor vari­ables are defined, with an in­stance of that tag ex­posed through a constexpr vari­able for easy use:

#ifdef CORRADE_TARGET_AVX512
typedef Avx512T DefaultCpuT;
#elif defined(CORRADE_TARGET_AVX2)
typedef Avx2T DefaultCpuT;

#elif defined(CORRADE_TARGET_SSE2)
typedef Sse2T DefaultCpuT;
#else
typedef ScalarT DefaultCpuT;
#endif

constexpr DefaultCpuT DefaultCpu;

Ul­ti­mately, to pick the right over­load at com­pile time, the DefaultCpu tag gets passed along­side oth­er para­met­ers. On an AVX-512 ma­chine the AVX2 im­ple­ment­a­tion gets chosen, as Avx2T is the closest avail­able base of Avx512T. On an AVX ma­chine, the SSE2 im­ple­ment­a­tion gets chosen in­stead — Avx2T is a sub­class of AvxT, so it’s ruled out, and the closest base is Sse2T. In the rare scen­ario where not even SSE2 is present, it falls back to the ScalarT vari­ant.

const char* found = memrchr(DefaultCpu, );

Pretty neat, huh? This makes com­pile-time dis­patch ex­tremely easy to per­form, and what I es­pe­cially like about it is that adding a new vari­ant is lit­er­ally just one more over­load. No re­dund­ant boil­er­plate, no manu­ally man­aged dis­patch tables, no sur­face for bugs to creep in.

Now, what’s left is “just” the runtime dis­patch.

A year passed by …

… dur­ing which I ba­sic­ally aban­doned the whole idea.1 Reas­on is that the very fea­ture that makes it work at com­pile time — dif­fer­ent func­tion sig­na­tures — is what makes runtime dis­patch really pain­ful. For every such func­tion, one would have to manu­ally write a snip­pet like this, adding sig­ni­fic­ant main­ten­ance and runtime over­head. Didn’t I want to avoid ex­actly this in the first place?

const char* memrchr(const char* data, std::size_t size, char c) {
    if(targetIsAvx2)
        return memrchr(Avx2T{}, data, size, c);
    if(targetIsSse2)
        return memrchr(Sse2T{}, data, size, c);
    // plus #ifdefs and branches for ARM NEON, WebAssembly SIMD, ...
    return memrchr(ScalarT{}, data, size, c);
}

Also, ori­gin­ally I planned to use CPU fea­ture dis­patch for sig­ni­fic­ant chunks of code like cal­cu­lat­ing mesh nor­mals or res­iz­ing im­ages, to min­im­ize the im­pact of such dis­patch over­head. But ended up here, want­ing to use it for a memrchr() im­ple­ment­a­tion! Which means that any sort of over­head that’s big­ger than a reg­u­lar func­tion call is not go­ing to cut it. Es­pe­cially not a gi­ant if cas­cade with a po­ten­tially ex­pens­ive ar­gu­ment passthrough.

How the grownups do it

For­tu­nately, dur­ing re­cent perf in­vest­ig­a­tions and code pro­fil­ing ses­sions I dis­covered the GNU IFUNC at­trib­ute, and found out that it’s even The Solu­tion used by glibc it­self to dis­patch to ar­chi­tec­ture-spe­cif­ic vari­ants of std::mem­chr() or std::mem­cmp(). Can’t really do much bet­ter than that, right?

Which led me to a con­clu­sion that the ideal way to per­form runtime CPU fea­ture dis­patch would be to:

  1. De­clare a func­tion (point­er) in the head­er, with the ac­tu­al com­plex­ity and ar­chi­tec­ture-spe­cif­ic code hid­den away in­to a source file. To min­im­ize func­tion call over­head, avoid passing heavy classes with re­dund­ant state — ideally just built­in types. In case of our ex­ample, it would be extern const char*(*memrchr)(const char*, std::size_t, char).
  2. If the func­tion is meant to be called from with­in a high­er-level API and not dir­ectly (such as in this case, where it’d be ex­posed through Con­tain­ers::StringView::find­Last(char)), make that API the smal­lest pos­sible wrap­per that’s in­lined in the head­er. That way the com­piler can in­line the wrap­per on the call site, turn­ing it in­to just a single call to the func­tion (point­er), in­stead of two nes­ted calls.

Mean­ing, if I want to have a fast dis­patch, I have to find a solu­tion that doesn’t in­volve ex­pens­ive ar­gu­ment passthrough. Out of des­per­a­tion I even thought of Em­bra­cing the Dark­ness and reinterpret_cast<>‘ing the vari­ants to a com­mon func­tion point­er type, but dis­carded that idea upon dis­cov­er­ing that WebAssembly checks that a func­tion is only called with a match­ing type, pre­vent­ing such hack from work­ing there. Not to men­tion the pub­lic hu­mi­li­ation, ad­dress san­it­izer and stat­ic ana­lys­is com­plaints.

# # #

Func­tion cur­ry­ing to the res­cue

The Heureka Mo­ment came to me when I was check­ing if C++ could do func­tion cur­ry­ing, and the solu­tion isn’t that much com­plex than the ori­gin­al idea. First let me show how the memrchr() ex­ample from the top would be re­writ­ten in a way that ac­tu­ally works with both a com­pile-time and a runtime dis­patch, and uses an ac­tu­al Cor­rade::Cpu lib­rary:

#include <Corrade/Cpu.h>

using namespace Corrade;

CORRADE_ALWAYS_INLINE auto memrchrImplementation(Cpu::ScalarT) {
    return +[](const char* data, std::size_t size, char c) {  };
}

CORRADE_ALWAYS_INLINE auto memrchrImplementation(Cpu::Sse2T) {
    return +[](const char* data, std::size_t size, char c) {  };
}

CORRADE_ALWAYS_INLINE auto memrchrImplementation(Cpu::Avx2T) {
    return +[](const char* data, std::size_t size, char c) {  };
}

CORRADE_CPU_DISPATCHER_BASE(memrchrImplementation)

The only dif­fer­ence com­pared to the pre­vi­ous at­tempt is that the ar­chi­tec­ture-spe­cif­ic vari­ants are now re­turn­ing a lambda2 that con­tains the ac­tu­al code in­stead … and then there’s an opaque macro. Since non-cap­tur­ing lambdas are just like reg­u­lar func­tions, there isn’t any ex­tra over­head from put­ting the code there in­stead. The wrap­per func­tion, tak­ing the tag, is how­ever marked as COR­RADE_AL­WAYS_IN­LINE, mean­ing it op­tim­izes down to ac­cess­ing a func­tion point­er dir­ectly. Thus per­form­ing

const char* found = memrchrImplementation(Cpu::DefaultBase)();

— where Cpu::De­fault­Base is an ali­as to the highest base in­struc­tion set en­abled at com­pile time — is equi­val­ent to the pre­vi­ous at­tempt, but with the im­port­ant dif­fer­ence that it’s pos­sible to sep­ar­ate the com­pile-time dis­patch from the ac­tu­al func­tion call.

Which is what makes pos­sible to im­ple­ment a zero-over­head runtime dis­patch as well. And that’s what the COR­RADE_CPU_DIS­PATCH­ER­_­BASE() macro is for — here is the x86 vari­ant of it. It uses Cpu::Fea­tures, which I didn’t talk about yet, but suf­fice to say it’s sim­il­ar to a Con­tain­ers::Enum­Set, con­vert­ing the com­pile-time tags to a bit­field-like value that can be op­er­ated with at runtime:

#define CORRADE_CPU_DISPATCHER_BASE(function)                               \
    auto function(Cpu::Features features) {                                 \
        if(features & Cpu::Avx512f)                                         \
            return function(Cpu::Avx512f);                                  \
        if(features & Cpu::Avx2)                                            \
            return function(Cpu::Avx2);                                     \
        …                                                                   \
        if(features & Cpu::Sse2)                                            \
            return function(Cpu::Sse2);                                     \
        return function(Cpu::Scalar);                                       \
    }

The macro just stamps out checks for all pos­sible CPU fea­tures, from most ad­vanced to least, and then calls the func­tion with each of those tags. Thus — like be­fore — Cpu::Avx2 and above branches will re­solve to memrchrImplementation(Cpu::Avx2T), all branches be­low in­clud­ing Cpu::Sse2 will re­solve to memrchrImplementation(Cpu::Sse2T), and the memrchrImplementation(Cpu::ScalarT) fall­back gets used if noth­ing else matches.

Quite a lot of branch­ing, eh? Not really. Be­cause of the COR­RADE_AL­WAYS_IN­LINE, this com­piles down to re­turn­ing a set of func­tion point­ers, and with even the low­est op­tim­iz­a­tions en­abled the com­piler sees that sev­er­al branches re­turn the same point­er and col­lapses them to­geth­er. To proof such a wild claim, here’s the as­sembly of the memrchrImplementation(Cpu::Features) func­tion that this macro gen­er­ated, with GCC and -O1:

0x00001131:  lea    0x175(%rip),%rax <memrchrImplementation(Corrade::Cpu::Avx2T)>
0x00001138:  test   $0xc0,%dil
0x0000113c:  je     0x113f
0x0000113e:  ret
0x0000113f:  test   $0x3f,%dil
0x00001143:  lea    0x15f(%rip),%rax <memrchrImplementation(Corrade::Cpu::Sse2T)>
0x0000114a:  lea    0x154(%rip),%rdx <memrchrImplementation(Corrade::Cpu::ScalarT)>
0x00001151:  cmove  %rdx,%rax
0x00001155:  jmp    0x113e

Clang does the same, MS­VC is dif­fer­ent in that the COR­RADE_AL­WAYS_IN­LINE won’t work when op­tim­iz­a­tions are dis­abled, lead­ing to the if cas­cade in­deed be­ing a bunch of sad func­tion calls. Yet even that isn’t a prob­lem, as the dis­patch­er func­tion is meant to only be called once in a pro­gram life­time. But I’m skip­ping ahead.

The CPU usu­ally doesn’t change un­der our hands

Now comes the ac­tu­al runtime CPU fea­ture de­tec­tion. Which, for x86, is done via the CPUID in­struc­tion, and ex­posed through Cpu::runtime­Fea­tures(). There’s not much else to say, ex­cept that a lot of curs­ing went in­to mak­ing that code port­able. An im­port­ant as­sump­tion is that the set of CPU fea­tures doesn’t change dur­ing pro­gram life­time,3 and so we can query them and dis­patch just once, cach­ing the res­ult. Which is what COR­RADE_CPU_DIS­PATCHED_­POINT­ER() is for:

CORRADE_CPU_DISPATCHED_POINTER(memrchrImplementation,
    const char*(*memrchr)(const char*, std::size_t, char))

In­tern­ally, it simply as­signs the res­ult of a call to memrchrImplementation(Cpu::Features), defined with the COR­RADE_CPU_DIS­PATCH­ER­_­BASE() macro above, to a memrchr func­tion point­er:

#define CORRADE_CPU_DISPATCHED_POINTER(dispatcher, ...)                     \
    __VA_ARGS__ = dispatcher(Corrade::Cpu::runtimeFeatures());

Since this all hap­pens in a glob­al con­struct­or, the memrchr func­tion point­er is ready for use without any ex­pli­cit setup call. All that’s needed is ex­pos­ing the point­er in some pub­lic head­er as an extern:

extern const char*(*memrchr)(const char*, std::size_t, char);



const char* found = memrchr();

The ma­gic of IFUNC

This sounds pretty much ideal already, so what does IFUNC ac­tu­ally bring to the table? In short, it turns the point­er in­to a reg­u­lar func­tion. Which means, in­stead of first hav­ing to load the value of the func­tion point­er from some­where and only then call­ing it, it’s like any oth­er dy­nam­ic lib­rary func­tion call.4

Us­age-wise it’s not much dif­fer­ent from the func­tion point­er ap­proach. A dis­patch­er func­tion is as­so­ci­ated with a func­tion pro­to­type, which the dy­nam­ic load­er then uses to as­sign the pro­to­type a con­crete func­tion point­er. This all hap­pens dur­ing early star­tup, so the dis­patch­er code can’t really do much — es­pe­cially not call­ing in­to ex­tern­al lib­rar­ies, which may not even be there yet at that point. Prac­tic­ally it means Cpu::runtime­Fea­tures() has to be fully in­lined in or­der to be us­able in this con­text.

Here’s how the above would look with COR­RADE_CPU_DIS­PATCHED_I­FUNC() in­stead:

CORRADE_CPU_DISPATCHED_IFUNC(memrchrImplementation,
    const char* memrchr(const char*, std::size_t, char))

In the macro im­ple­ment­a­tion, a func­tion is an­not­ated with __attribute__((ifunc)) car­ry­ing a name of the dis­patch­er func­tion. The dis­patch­er func­tion gets called with no ar­gu­ments, so the macro cre­ates a memrchrImplementation() wrap­per that del­eg­ates to memrchrImplementation(Cpu::Features). What the com­piler manu­al doesn’t say is that the dis­patch­er func­tion has to have C link­age in or­der to be found.

#define CORRADE_CPU_DISPATCHED_IFUNC(dispatcher, ...)                       \
    extern "C" { static auto dispatcher() {                                 \
        return dispatcher(Corrade::Cpu::runtimeFeatures());                 \
    }}                                                                      \
    __VA_ARGS__ __attribute__((ifunc(#dispatcher)));

The only down­side of this ap­proach is that it’s a glibc-spe­cif­ic fea­ture, thus mainly just Linux (and An­droid, as I’ll de­tail later). Apart from low-level glibc code us­ing it, this is also the back­bone of GCC’s and Clang’s func­tion multi-ver­sion­ing. So far I’m not aware of any­thing sim­il­arly con­veni­ent on ma­cOS or Win­dows, ex­cept maybe for the cpu_dispatch at­trib­ute in Clang and ICC. But that one, as far as I can tell, dis­patches on every call, and at least in case of ICC is lim­ited only to In­tel pro­cessors. No ARM but no AMD either.

Even less over­head, please?

In cer­tain cases it may be de­sir­able to not go through a dy­nam­ic­ally dis­patched func­tion at all in or­der to get be­ne­fits from in­ter­pro­ced­ur­al op­tim­iz­a­tions and LTO. In that case, the dis­patch­er can se­lect an over­load at com­pile time us­ing Cpu::De­fault­Base, sim­il­arly to what the very first ex­ample was show­ing:

const char* memrchr(const char* data, std::size_t size, char c) {
    return memrchrImplementation(Cpu::DefaultBase)(data, size, c);
}

In my ex­per­i­ments at least, with com­piler op­tim­iz­a­tions en­abled, the whole re­turned lambda gets in­lined here, re­mov­ing any re­main­ing ar­gu­ment-passing over­head. Thus be­ing identic­al to the ideal case where the high-level func­tion would dir­ectly con­tain op­tim­ized AVX or SSE code.

Over­head com­par­is­on

The fol­low­ing plot com­pares the three ap­proaches. Apart from the weird out­lier with a func­tion point­er in a dy­nam­ic lib­rary that I can’t ex­plain, it shows that a reg­u­lar func­tion call is the clear win­ner in case the code can be com­piled dir­ectly for the hard­ware the it will run on. IFUNC ul­ti­mately isn’t any faster than reg­u­lar point­ers, but isn’t really slower either, in this mi­crobench­mark at least. I sup­pose in real-world scen­ari­os it could be­ne­fit at least from cache loc­al­ity with oth­er dy­nam­ic func­tions.

1.26 ± 0.06 ns 1.51 ± 0.06 ns 1.52 ± 0.08 ns 1.52 ± 0.07 ns 1.25 ± 0.05 ns 1.5 ± 0.06 ns 2.5 ± 0.08 ns 0.0 0.5 1.0 1.5 2.0 2.5 ns Regular function Function pointer IFUNC function Regular function Function pointer IFUNC function Dispatch on every call in a dynamic library in a dynamic library in a dynamic library in a dynamic library Dispatch overhead, Linux x86-64

Com­pil­ing dif­fer­ent func­tions for dif­fer­ent tar­gets

In or­der to use in­trins­ics for a par­tic­u­lar CPU in­struc­tion set, GCC and Clang re­quire the code to be com­piled with a cor­res­pond­ing tar­get op­tion, such as -mavx2 for AVX2. Such re­quire­ment makes sense, as it al­lows the com­piler to per­form its own op­tim­iz­a­tions on top of the ex­pli­cit in­trins­ics calls. Hav­ing a sep­ar­ate file for every vari­ant would be quite im­prac­tic­al though, for­tu­nately one can use __attribute__((target)) to en­able in­struc­tion sets just on par­tic­u­lar func­tions, al­low­ing all vari­ants to live in the same file.5

This is ex­posed via mac­ros such as COR­RADE_EN­ABLE_AVX2, and ad­di­tion­ally each macro is defined only if com­pil­ing for a match­ing ar­chi­tec­ture and the com­piler sup­ports giv­en in­struc­tion set. Which can be con­veni­ently used to guard the code to be only com­piled where it makes sense:

#ifdef CORRADE_ENABLE_AVX2
CORRADE_ALWAYS_INLINE auto memrchrImplementation(Cpu::Avx2T) {
    return +[](const char* data, std::size_t size, char c) CORRADE_ENABLE_AVX2 {  };
}
#endif

MS­VC, on the oth­er hand, doesn’t re­quire any op­tion in or­der to use any in­trins­ics, so there the mac­ros are empty. It how­ever also means that it will only ap­ply the baseline op­tim­iz­a­tions, so for ex­ample ex­tract­ing all AVX+ code to a file with /arch:AVX en­abled might have some perf be­ne­fits.5

^ v ^

And then the real­ity hits

So far, to make the code shine, I was hold­ing back on a cer­tain im­port­ant as­pect of the in­struc­tion set land­scape. In that it’s def­in­itely not a lin­ear se­quence of in­struc­tion sets where each is a clear su­per­set of the pre­vi­ous one. Rather, it looks more like this:

x86 instruction family tree FMA FMA F16C F16C AVX AVX FMA->AVX F16C->AVX LZCNT LZCNT POPCNT POPCNT SSE3 SSE3 BMI1 BMI1 SSE2 SSE2 SSE3->SSE2 SSSE3 SSSE3 SSSE3->SSE3 SSE41 SSE4.1 SSE41->SSSE3 SSE42 SSE4.2 SSE42->SSE41 AVX->SSE42 AVX2 AVX2 AVX2->AVX AVX512F AVX512F AVX512CD AVX512F->AVX2 AVX512VLDQBW AVX512VL AVX512DQ AVX512BW AVX512VLDQBW->AVX512F VBMI2 VBMI2 BITALG VPCLMULQDQ GFNI VAES VBMI2->AVX512VLDQBW SSE4a SSE4a SSE4a->SSE3 SSE5 SSE5 SSE5->SSSE3 XOP XOP XOP->AVX FMA4 FMA4 FMA4->AVX AVX512ERPF AVX512ER AVX512PF AVX512ERPF->AVX512F FMAPS 4FMAPS 4VNNIW VPOPCNTDQ FMAPS->AVX512ERPF

Please note there’s many in­struc­tion sets still omit­ted and the hier­archy likely con­tains severe er­rors.

With the passing of time many of these branches for­tu­nately went aban­doned, how­ever there’s still many cases where any or­der­ing is just im­possible. For ex­ample, there are AVX CPUs that have F16C but not FMA, there are AVX CPUs with FMA but no F16C, and there’s ap­par­ently even a VIA CPU with AVX2 but no FMA. To ac­count for all these cases, I ended up dif­fer­en­ti­at­ing between a base in­struc­tion set that has clear or­der­ing and ex­tra in­struc­tion sets that have no re­la­tions what­so­ever. For AVX-512 I’m not 100% sure yet, but I’m hop­ing it’ll even­tu­ally con­verge as well.

x86 instruction family tree FMA FMA F16C F16C AVX AVX FMA->AVX F16C->AVX LZCNT LZCNT POPCNT POPCNT SSE3 SSE3 BMI1 BMI1 SSE2 SSE2 SSE3->SSE2 SSSE3 SSSE3 SSSE3->SSE3 SSE41 SSE4.1 SSE41->SSSE3 SSE42 SSE4.2 SSE42->SSE41 AVX->SSE42 AVX2 AVX2 AVX2->AVX AVX512F AVX512F AVX512CD AVX512F->AVX2 AVX512VLDQBW AVX512VL AVX512DQ AVX512BW AVX512VLDQBW->AVX512F VBMI2 VBMI2 BITALG VPCLMULQDQ GFNI VAES VBMI2->AVX512VLDQBW SSE4a SSE4a SSE4a->SSE3 SSE5 SSE5 SSE5->SSSE3 XOP XOP XOP->AVX FMA4 FMA4 FMA4->AVX AVX512ERPF AVX512ER AVX512PF AVX512ERPF->AVX512F FMAPS 4FMAPS 4VNNIW VPOPCNTDQ FMAPS->AVX512ERPF

base sets · ex­pec­ted con­tinu­ation · ex­tra sets · aban­doned sets

To make this work, the dis­patch has to se­lect among vari­ants with both the base in­struc­tion set and all ex­tra sets sup­por­ted, and these two pri­or­ity rules:

  • If one vari­ant has the base in­struc­tion set more ad­vanced than an­oth­er, it’s pre­ferred (which is same as be­fore). For ex­ample, if there’s an AVX-512 vari­ant and an AVX+FMA vari­ant, AVX-512 gets picked.
  • Oth­er­wise, if both vari­ants have the same base in­struc­tion set, a vari­ant that uses more ex­tra in­struc­tion sets is pre­ferred. For ex­ample, if there’s a plain AVX vari­ant and an AVX+FMA vari­ant, AVX+FMA gets picked.

There’s de­lib­er­ately no or­der­ing defined among the ex­tra in­struc­tion sets. Thus hav­ing for ex­ample a SSE2+POP­CNT and a SSE2+LZCNT vari­ant would lead to an am­bi­gu­ity if the ma­chine sup­por­ted both POP­CNT and LZCNT. This can only be re­solved by the im­ple­ment­a­tion it­self, by provid­ing a SSE2+POP­CNT+LZCNT vari­ant and del­eg­at­ing to whichever is the bet­ter op­tion in that case.

Now comes the pile of nasty tem­plates

Let’s say there’s a Tag<bits> type that stores the bit­mask of both the base in­struc­tion set and the ex­tra sets. An ex­tra in­struc­tion set is al­ways rep­res­en­ted by just a single unique bit, while a base in­struc­tion set con­tains one unique bit for it­self plus also all bits for the in­struc­tion sets it’s based on. Thus, for ex­ample, SSE2 would be Tag<0b00000001>, SSE3 Tag<0b00000011>, SSSE3 Tag<0b00000111> and so on, while POP­CNT would be Tag<0b01000000> and LZCNT Tag<0b10000000>. Then, com­bined, SSE3+POP­CNT would be Tag<0b01000011>. Fi­nally, the class provides a con­ver­sion op­er­at­or to an­oth­er Tag<> that’s en­abled only if the res­ult­ing bit mask is a sub­set:

template<unsigned bits> struct Tag {
    template<
        unsigned otherBits,
        class = std::enable_if_t<(otherBits & bits) == otherBits>
    > operator Tag<otherBits>() const {  }
};

The pri­or­ity or­der­ing is defined primar­ily by the in­dex of the base in­struc­tion set, and sec­ond­ar­ily by the count of ex­tra in­struc­tion sets. Which, giv­en an up­per bound on the count of ex­tra in­struc­tion sets, can be rep­res­en­ted with a single num­ber — for ex­ample, SSE2 would have a pri­or­ity num­ber 100, SSE3 200, SSSE3 300 and so on, SSE2+LZCNT+POP­CNT would be 102 and SSE3+POP­CNT 201. Here’s where the core idea of in­her­it­ance hier­archy gets re­used again, just in a cal­cu­lated man­ner:

template<unsigned i> struct Priority: Priority<i - 1> {};
template<> struct Priority<0> {};

I claim no in­ven­tion here — cred­it goes to @sthalik who sug­ges­ted the pri­or­ity tag idea to me.

To make these work to­geth­er, the dis­patch has to use two func­tion ar­gu­ments — one is where the can­did­ate fil­ter­ing hap­pens, and the oth­er or­ders by pri­or­ity. In the fol­low­ing ex­ample, call­ing with Tag<0b11000011> and Priority<202> (SSE3+LZCNT+POP­CNT), would pick the SSE3+POP­CNT over­load:

void foo(Tag<0b00000000>, Priority<000>) {  } /* scalar */
void foo(Tag<0b00000011>, Priority<200>) {  } /* SSE3 */
void foo(Tag<0b01000011>, Priority<201>) {  } /* SSE3+POPCNT */
void foo(Tag<0b10000111>, Priority<301>) {  } /* SSSE3+LZCNT */

Be­cause I was un­able to fig­ure out a way that wouldn’t in­volve us­ing two para­met­ers, the user-fa­cing API hides this whole pro­cess be­hind two mac­ros — COR­RADE_CPU_­DE­CLARE() and COR­RADE_CPU_SE­LECT(). To­geth­er with an abil­ity to com­bine the Cpu tags with bit­wise op­er­a­tions, the pro­cess of de­clar­a­tion and com­pile-time dis­patch looks like this:

void foo(CORRADE_CPU_DECLARE(Cpu::Scalar)) {  }
void foo(CORRADE_CPU_DECLARE(Cpu::Sse3)) {  }
void foo(CORRADE_CPU_DECLARE(Cpu::Sse3|Cpu::Popcnt)) {  }
void foo(CORRADE_CPU_DECLARE(Cpu::Ssse3|Cpu::Lzcnt)) {  }



foo(CORRADE_CPU_SELECT(Cpu::Sse3|Cpu::Lzcnt|Cpu::Popcnt));

Fi­nally, the runtime dis­patch­er then has to take the ex­tra in­struc­tion sets in­to ac­count as well. Im­pli­citly con­sid­er­ing all pos­sible com­bin­a­tions would how­ever lead to a lot of work for the com­piler. As a tradeoff, since in prac­tice the vari­ants usu­ally need at most one or two ex­tra sets, the COR­RADE_CPU_DIS­PATCH­ER() macro re­quires you to list them ex­pli­citly:

CORRADE_CPU_DISPATCHER(foo, Cpu::Popcnt, Cpu::Lzcnt)

This macro pro­duces a foo(Cpu::Features) which can be again used with COR­RADE_CPU_DIS­PATCHED_­POINT­ER() or COR­RADE_CPU_DIS­PATCHED_I­FUNC(). For brev­ity I glossed over the nas­ti­er im­ple­ment­a­tion de­tails, if you’re in­ter­ested please dive in­to the source.

~ ~ ~

Prac­tic­al us­ab­il­ity veri­fic­a­tion

To en­sure san­ity of this design, I pro­ceeded with us­ing the Cpu APIs in a prac­tic­al scen­ario — im­ple­ment­ing a std::mem­chr() al­tern­at­ive to use in Con­tain­ers::StringView::find(). Turns out I was lucky, hav­ing not really any pri­or ex­per­i­ence writ­ing in­trins­ics for any ar­chi­tec­ture, I man­aged to pick a suf­fi­ciently hard prob­lem where

  • com­piler autovec­tor­iz­a­tion doesn’t do a “good enough” job,
  • the stand­ard im­ple­ment­a­tion is well op­tim­ized and not easy to beat,
  • the al­gorithm needs to have spe­cial­ized hand­ling for small and large in­puts in or­der to be fast,
  • I get to use also the ex­tra in­struc­tion sets such as BMI1,
  • and the ideal way is sig­ni­fic­antly dif­fer­ent between x86, ARM and WebAssembly.

Apart from hav­ing the API us­ab­il­ity con­firmed (and a ton of ugly com­piler bugs dis­covered), I real­ized that in or­der to really make the most of every plat­form, it doesn’t make sense to try to come up with an in­struc­tion-level ab­strac­tion API like is for ex­ample in SIMD every­where. So that’s some­thing Mag­num will not have. The build­ing blocks need to be much more high level.

Since I’m very new to all this, I won’t em­bar­rass my­self fur­ther by try­ing to pre­tend I know what I’m talk­ing about. Hope­fully in a later post — here’s just a sneak peek at the code that’s now in Cor­rade mas­ter:

1.0 ± 0.0 time relative to std::memchr() 1.2098 ± 0.0518 time relative to std::memchr() 0.8908 ± 0.072 time relative to std::memchr() 0.7013 ± 0.0065 time relative to std::memchr() 0.1369 ± 0.0274 time relative to std::memchr() 0.0 0.2 0.4 0.6 0.8 1.0 1.2 time relative to std::memchr() std::memchr() StringView::find() StringView::find() StringView::find() StringView::find() baseline x86 SSE2+BMI1 (Linux) x86 AVX2+BMI1 (Linux) ARM NEON (Apple M1) WASM SIMD128 (Node.js) Character lookup

CPU fea­tures and dis­patch on ARM plat­forms

In­struc­tion-set-wise, ARM has the ba­sic­ally ubi­quit­ous NEON. To­geth­er with a few re­cog­nized ex­ten­sions, it’s de­tec­ted as Cpu::Neon, Cpu::NeonFma and Cpu::NeonFp16. Apart from NEON the main in­struc­tions sets are SVE and SVE2 and Apple’s own pro­pri­et­ary AMX (which doesn’t even have any pub­licly avail­able com­piler sup­port yet). I’ll wire up de­tec­tion for these once they be­come more com­mon.

Com­pared to the x86 CPUID in­struc­tion, runtime de­tec­tion of ARM fea­tures has to rely on OS-spe­cif­ic calls such as get­auxv­al() on Linux or sy­sctlby­name() on Apple plat­forms. What’s nice is that both Linux and An­droid (API 18+) sup­port IFUNC dis­patch on ARM as well. It’s com­plic­ated by the fact that getauxval() is a call to an ex­tern­al lib­rary that’s not avail­able at the time IFUNC point­ers get re­solved so in­stead it’s fed to the re­solv­er from out­side. An­droid how­ever ad­op­ted such be­ha­vi­or only since API 30 (An­droid 11). To ac­count for this, the COR­RADE_CPU_DIS­PATCHED_I­FUNC() macro is spe­cial-cased for ARM and en­abled only on glibc and An­droid 30+. Over­head-wise, IFUNCs on ARM are com­par­able to x86.

Prac­tic­al us­age of ARM NEON in­trins­ics high­lighted two things. First, it’s a very dif­fer­ent in­struc­tion set from x86, so na­ively try­ing to emu­late in­struc­tions like movemask is not go­ing to be fast. In­stead, us­ing fea­tures unique to NEON can yield sig­ni­fic­ant spee­dups. Second, in my ex­per­i­ments at least, GCC seems to be sig­ni­fic­antly worse than Clang in deal­ing with ARM in­trins­ics. The stand­ard std::mem­chr() im­ple­ment­a­tion is usu­ally writ­ten in plain as­sembly, and while I was able to get my in­trins­ics to a com­par­able speed us­ing Clang, it was plain im­possible with GCC. I won­der wheth­er it’s the cause or the con­sequence of An­droid and ma­cOS/iOS, the ma­jor ARM plat­forms, nowadays ex­clus­ively us­ing Clang. Or maybe I’m just do­ing some­thing wrong, again this is all new to me.

WebAssembly 128-bit SIMD

WebAssembly cur­rently provides a single 128-bit SIMD in­struc­tion set, de­tec­ted as Cpu::Sim­d128. The situ­ation is a bit spe­cif­ic, as se­cur­ity on the web will al­ways have a pri­or­ity over per­form­ance. WebAssembly mod­ules are stat­ic­ally veri­fied up­front and if they fail the val­id­a­tion — for ex­ample be­cause an un­known in­struc­tion was en­countered — they’re re­jec­ted. Since WebAssembly SIMD is re­l­at­ively new, this means that a bin­ary can either use any SIMD in­struc­tions any­where, or nowhere at all, to pass val­id­a­tion on VMs that don’t know about SIMD yet.

While that makes any sort of runtime dis­patch rather use­less, per­haps the more con­cern­ing is­sue is that runtime dis­patch is pro­hib­it­ively ex­pens­ive — go­ing through an ar­bit­rary func­tion point­er has a fourty times big­ger over­head than call­ing a func­tion dir­ectly. The fol­low­ing num­bers are from Node.js, but both Chro­mi­um and Fire­fox showed a sim­il­ar dis­par­ity.

0.3 ± 0.04 ns 0.31 ± 0.05 ns 0.5 ± 0.03 ns 21.43 ± 0.51 ns 21.76 ± 0.64 ns 0 5 10 15 20 ns Function Function pointer Function Function pointer Dispatch on every call in a library in a library in a library Dispatch overhead, Emscripten and WebAssembly

Hon­estly I hope I just for­got to en­able some op­tim­iz­a­tion. This is too much.

While there’s cur­rently no way to per­form runtime de­tec­tion of sup­por­ted CPU fea­tures, there’s a Fea­ture De­tec­tion pro­pos­al aim­ing to cov­er this gap. But even then, un­less the over­head situ­ation im­proves, runtime dis­patch will only be use­ful for heav­ier chunks of code — def­in­itely not for things like memchr().

Prac­tic­al us­age of WebAssembly SIM­D128 in­trins­ics only fur­ther em­phas­ised that the dif­fer­ence between x86 and ARM isn’t some­thing to be ig­nored. I have two vari­ants of my code where us­ing was­m_i8x16_any_true() in­stead of was­m_i8x16_bit­mask() makes code 20% faster on ARM, while at the same time mak­ing it al­most 30% slower on x86. I could use a hack to dis­tin­guish between x86 and ARM and then choose an ap­pro­pri­ate vari­ant at runtime, but again, con­sid­er­ing the over­head of runtime dis­patch this would be far worse than hav­ing just a single non-dis­patched vari­ant.

For now at least, giv­en there are users with 99% of their audi­en­ce on mo­bile plat­forms, I’m con­sid­er­ing provid­ing a com­pile-time op­tion to prefer ARM-friendly vari­ants. Which would re­use the concept of ex­tra in­struc­tion sets de­scribed above, for ex­ample as Cpu::Simd128|Cpu::WasmArm. And once the over­head situ­ation im­proves, the vari­ants can dir­ectly work with a runtime dis­patch.

POWER, RISC-V

I … have no idea, sorry. There’s COR­RADE_TAR­GET_­POWER­PC but that’s about it. It doesn’t how­ever mean the code will stop com­pil­ing on these plat­forms — since none of the CORRADE_ENABLE_* mac­ros is defined there, the code will fall back to Cpu::Scal­ar vari­ants, be­hav­ing just like be­fore the CPU dis­patch was in­tro­duced.

Next steps

While the Cpu lib­rary can be con­sidered stable and ready for pro­duc­tion use, Cor­rade it­self has runtime CPU dis­patch en­abled only on Linux if IFUNC is de­tec­ted to be avail­able. Oth­er plat­forms have it cur­rently off by de­fault, un­til I’m 100% con­fid­ent about the over­head. In any case, you can use the COR­RADE_BUILD_CPU_RUNTIME_DIS­PATCH and COR­RADE_CPU_USE_I­FUNC CMake op­tions to toggle this func­tion­al­ity at build time.

For the ac­tu­al SIMD-en­abled al­gorithms, of course the use­ful­ness of nar­rowly beat­ing std::mem­chr() is du­bi­ous. But the code can now be trivi­ally mod­i­fied to im­ple­ment the above-men­tioned mem­r­chr() as well as mak­ing op­tim­ized two- and more-char­ac­ter vari­ants. That’s where the ac­tu­al gains will be, as multi-char­ac­ter look­up is a com­mon bot­tle­neck in JSON (glTF) or OBJ pars­ing, and stand­ard lib­rar­ies don’t really have a ded­ic­ated tool for that.

Among oth­er fea­tures that were planned to use SIMD someday were des­per­ately wait­ing for the CPU dis­patch to get ready are — hier­arch­ic­al trans­form­a­tion cal­cu­la­tions, batch frust­um cull­ing and oth­er bulk pro­cessing util­it­ies to make use in your ECS work­flows. Ex­cit­ing times ahead!

Single-head­er im­ple­ment­a­tion

Like with vari­ous Con­tain­ers im­ple­ment­a­tions, this fea­ture is self-con­tained enough to make sense as a stan­dalone lib­rary. To make it pos­sible to use it without re­ly­ing on the whole of Cor­rade, it’s provided as a single-head­er lib­rary in the Mag­num Singles re­pos­it­ory. To use it, simply down­load the file and #include it your pro­ject:

Lib­rary LoC Pre­pro­cessed LoC De­scrip­tion
Cor­radeCpu.h new 1646 2085 See the Cpu namespace docs

To put the head­er size in­to per­spect­ive, just #include <immintrin.h> for all AVX in­trins­ics yields 34 thou­sand pre­pro­cessed lines on GCC. Com­pared to when I was bitch­ing about <memory> be­ing heavy, this is a whole oth­er level. And hon­estly I don’t know what could fix this, there’s simply a lot in AVX-512.

. : .

1.
^ The ori­gin­al pro­to­type ap­peared in mosra/cor­rade#115 in March 2021, which was a spin-off of mosra/mag­num#306, dat­ing back to Janu­ary 2019. Yup, cer­tain fea­tures take time to brew.
2.
^ The + be­fore the lambda “de­cays” it in­to a reg­u­lar func­tion point­er, and that’s what the func­tion re­turns. I used C++14 for brev­ity, but the Cpu lib­rary works in plain C++11 as well, just with a bit more typ­ing in­stead of the auto re­turn type. I’m happy be­ing the cave­man that’s still us­ing C++11.
3.
^ Ex­cept for that one case where ARM big.LITTLE cores each re­por­ted a slightly dif­fer­ent in­struc­tion set, lead­ing to crashes when you quer­ied CPU fea­tures on the more ad­vanced core and ended up run­ning ad­vanced in­struc­tions on the core that didn’t have them.
4.
^ The ex­plan­a­tion is grossly over­sim­pli­fied, but dy­nam­ic lib­rary func­tion call is the key point here. Calls in­to dy­nam­ic lib­rar­ies are in­dir­ect, mean­ing they’re kinda like a func­tion point­er as well. The dif­fer­ence is, how­ever, that they’re re­solved at load time, con­stant for the ap­plic­a­tion life­time and all con­tained in a single place, which has the po­ten­tial of re­du­cing over­head com­pared to hav­ing to fetch a func­tion point­er value from a ran­dom loc­a­tion be­fore every call.
5.
^ a b Again I’m just scratch­ing the sur­face here. En­abling AVX2 for the whole file may cause AVX2 to be used also in func­tions where you use just SSE2 in­trins­ics and thus crash­ing on ma­chines without AVX2 sup­port. That is rather ob­vi­ous, but sim­il­ar cases could hap­pen also in case LTO com­bines code from mul­tiple files or when a single tem­plate func­tion gets com­piled for dif­fer­ent in­struc­tion sets and then the linker de­du­plic­ates it. Com­puters are hard, ap­par­ently, and I hope most of this got solved since.