Aug 02, 2022

Convenient CPU feature detection and dispatch

A new Magnum feature provides efficient compile-time and runtime CPU detection and dispatch on x86, ARM and WebAssembly. The core idea behind allows adding new variants without having to write any dispatching code.

Among key aspects differentiating between “performance is a joke for this project!” and “wow they really mean it” — besides favoring data-oriented design over abstract factory proxy manager delegates — is use of SIMD instructions such as AVX or NEON, and ultimately detecting and picking the best optimized implementation for given hardware at runtime.

For over a decade, Magnum could afford to not bother, partly because the actual heavy lifting such as physics or texture compression was offloaded to 3rd party libraries, and partly because if you really needed to perform things like fast multiplication of large matrices, you could always seamlessly delegate to Eigen. But as all new Magnum APIs are data-oriented, designed with batch operations in mind, I just couldn’t ignore the extra performance that SIMD instructions could bring.

Core idea — inheritance-based overload resolution

What triggered this whole process was that I realized I could make use of C++ inheritance to pick among candidate overloads. Loosely citing the C++ reference on Overload Resolution:

F1 is determined to be a better function than F2 if […] there is at least one argument of F1 whose implicit conversion is better than the corresponding implicit conversion for that argument of F2.

If two conversion sequences are indistinguishable because they have the same rank, the following additional rules apply:

If Mid is derived (directly or indirectly) from Base, and Derived is derived (directly or indirectly) from Mid, then Derived to Mid is better than Derived to Base.

For a practical example, let’s assume we’re about to implement a memrchr() function, because, compared to std::memchr(), it’s for some reason not a widely available API and we want to have a fast implementation everywhere. Taking just x86 into account for simplicity, we’ll have an AVX2 variant, a slightly slower SSE2 variant and a scalar fallback, distinguished from each other not by a name but by a tag, similarly to what tags like NoInit or ValueInit do in Containers::Array constructors:

struct ScalarT {};
struct Sse2T: ScalarT {};
…
struct Avx2T: AvxT {};
struct Avx512T: Avx2T{};

const char* memrchr(ScalarT, const char* data, std::size_t size, char c);
const char* memrchr(Sse2T, const char* data, std::size_t size, char c);
const char* memrchr(Avx2T, const char* data, std::size_t size, char c);

Then, an appropriate tag would be typedef‘d depending on what particular CORRADE_TARGET_* preprocessor variables are defined, with an instance of that tag exposed through a constexpr variable for easy use:

#ifdef CORRADE_TARGET_AVX512
typedef Avx512T DefaultCpuT;
#elif defined(CORRADE_TARGET_AVX2)
typedef Avx2T DefaultCpuT;
…
#elif defined(CORRADE_TARGET_SSE2)
typedef Sse2T DefaultCpuT;
#else
typedef ScalarT DefaultCpuT;
#endif

constexpr DefaultCpuT DefaultCpu;

Ultimately, to pick the right overload at compile time, the DefaultCpu tag gets passed alongside other parameters. On an AVX-512 machine the AVX2 implementation gets chosen, as Avx2T is the closest available base of Avx512T. On an AVX machine, the SSE2 implementation gets chosen instead — Avx2T is a subclass of AvxT, so it’s ruled out, and the closest base is Sse2T. In the rare scenario where not even SSE2 is present, it falls back to the ScalarT variant.

const char* found = memrchr(DefaultCpu, …);

Pretty neat, huh? This makes compile-time dispatch extremely easy to perform, and what I especially like about it is that adding a new variant is literally just one more overload. No redundant boilerplate, no manually managed dispatch tables, no surface for bugs to creep in.

Now, what’s left is “just” the runtime dispatch.

A year passed by …

… during which I basically abandoned the whole idea.1 Reason is that the very feature that makes it work at compile time — different function signatures — is what makes runtime dispatch really painful. For every such function, one would have to manually write a snippet like this, adding significant maintenance and runtime overhead. Didn’t I want to avoid exactly this in the first place?

const char* memrchr(const char* data, std::size_t size, char c) {
    if(targetIsAvx2)
        return memrchr(Avx2T{}, data, size, c);
    if(targetIsSse2)
        return memrchr(Sse2T{}, data, size, c);
    // plus #ifdefs and branches for ARM NEON, WebAssembly SIMD, ...
    return memrchr(ScalarT{}, data, size, c);
}

Also, originally I planned to use CPU feature dispatch for significant chunks of code like calculating mesh normals or resizing images, to minimize the impact of such dispatch overhead. But ended up here, wanting to use it for a memrchr() implementation! Which means that any sort of overhead that’s bigger than a regular function call is not going to cut it. Especially not a giant if cascade with a potentially expensive argument passthrough.

How the grownups do it

Fortunately, during recent perf investigations and code profiling sessions I discovered the GNU IFUNC attribute, and found out that it’s even The Solution used by glibc itself to dispatch to architecture-specific variants of std::memchr() or std::memcmp(). Can’t really do much better than that, right?

Which led me to a conclusion that the ideal way to perform runtime CPU feature dispatch would be to:

Declare a function (pointer) in the header, with the actual complexity and architecture-specific code hidden away into a source file. To minimize function call overhead, avoid passing heavy classes with redundant state — ideally just builtin types. In case of our example, it would be extern const char*(*memrchr)(const char*, std::size_t, char).
If the function is meant to be called from within a higher-level API and not directly (such as in this case, where it’d be exposed through Containers::StringView::findLast(char)), make that API the smallest possible wrapper that’s inlined in the header. That way the compiler can inline the wrapper on the call site, turning it into just a single call to the function (pointer), instead of two nested calls.

Meaning, if I want to have a fast dispatch, I have to find a solution that doesn’t involve expensive argument passthrough. Out of desperation I even thought of Embracing the Darkness and reinterpret_cast<>‘ing the variants to a common function pointer type, but discarded that idea upon discovering that WebAssembly checks that a function is only called with a matching type, preventing such hack from working there. Not to mention the public humiliation, address sanitizer and static analysis complaints.

# # #

Function currying to the rescue

The Heureka Moment came to me when I was checking if C++ could do function currying, and the solution isn’t that much more complex than the original idea. First let me show how the memrchr() example from the top would be rewritten in a way that actually works with both a compile-time and a runtime dispatch, and uses an actual Corrade::Cpu library:

#include <Corrade/Cpu.h>

using namespace Corrade;

CORRADE_ALWAYS_INLINE auto memrchrImplementation(Cpu::ScalarT) {
    return +[](const char* data, std::size_t size, char c) { … };
}

CORRADE_ALWAYS_INLINE auto memrchrImplementation(Cpu::Sse2T) {
    return +[](const char* data, std::size_t size, char c) { … };
}

CORRADE_ALWAYS_INLINE auto memrchrImplementation(Cpu::Avx2T) {
    return +[](const char* data, std::size_t size, char c) { … };
}

CORRADE_CPU_DISPATCHER_BASE(memrchrImplementation)

The only difference compared to the previous attempt is that the architecture-specific variants are now returning a lambda2 that contains the actual code instead … and then there’s an opaque macro. Since non-capturing lambdas are just like regular functions, there isn’t any extra overhead from putting the code there instead. The wrapper function, taking the tag, is however marked as CORRADE_ALWAYS_INLINE, meaning it optimizes down to accessing a function pointer directly. Thus performing

const char* found = memrchrImplementation(Cpu::DefaultBase)(…);

— where Cpu::DefaultBase is an alias to the highest base instruction set enabled at compile time — is equivalent to the previous attempt, but with the important difference that it’s possible to separate the compile-time dispatch from the actual function call.

Which is what makes possible to implement a zero-overhead runtime dispatch as well. And that’s what the CORRADE_CPU_DISPATCHER_BASE() macro is for — here is the x86 variant of it. It uses Cpu::Features, which I didn’t talk about yet, but suffice to say it’s similar to a Containers::EnumSet, converting the compile-time tags to a bitfield-like value that can be operated with at runtime:

#define CORRADE_CPU_DISPATCHER_BASE(function)                               \
    auto function(Cpu::Features features) {                                 \
        if(features & Cpu::Avx512f)                                         \
            return function(Cpu::Avx512f);                                  \
        if(features & Cpu::Avx2)                                            \
            return function(Cpu::Avx2);                                     \
        …                                                                   \
        if(features & Cpu::Sse2)                                            \
            return function(Cpu::Sse2);                                     \
        return function(Cpu::Scalar);                                       \
    }

The macro just stamps out checks for all possible CPU features, from most advanced to least, and then calls the function with each of those tags. Thus — like before — Cpu::Avx2 and above branches will resolve to memrchrImplementation(Cpu::Avx2T), all branches below including Cpu::Sse2 will resolve to memrchrImplementation(Cpu::Sse2T), and the memrchrImplementation(Cpu::ScalarT) fallback gets used if nothing else matches.

Quite a lot of branching, eh? Not really. Because of the CORRADE_ALWAYS_INLINE, this compiles down to returning a set of function pointers, and with even the lowest optimizations enabled the compiler sees that several branches return the same pointer and collapses them together. To proof such a wild claim, here’s the assembly of the memrchrImplementation(Cpu::Features) function that this macro generated, with GCC and -O1:

0x00001131:  lea    0x175(%rip),%rax <memrchrImplementation(Corrade::Cpu::Avx2T)…>
0x00001138:  test   $0xc0,%dil
0x0000113c:  je     0x113f
0x0000113e:  ret
0x0000113f:  test   $0x3f,%dil
0x00001143:  lea    0x15f(%rip),%rax <memrchrImplementation(Corrade::Cpu::Sse2T)…>
0x0000114a:  lea    0x154(%rip),%rdx <memrchrImplementation(Corrade::Cpu::ScalarT)…>
0x00001151:  cmove  %rdx,%rax
0x00001155:  jmp    0x113e

Clang does the same, MSVC is different in that the CORRADE_ALWAYS_INLINE won’t work when optimizations are disabled, leading to the if cascade indeed being a bunch of sad function calls. Yet even that isn’t a problem, as the dispatcher function is meant to only be called once in a program lifetime. But I’m skipping ahead.

The CPU usually doesn’t change under our hands

Now comes the actual runtime CPU feature detection. Which, for x86, is done via the CPUID instruction, and exposed through Cpu::runtimeFeatures(). There’s not much else to say, except that a lot of cursing went into making that code portable. An important assumption is that the set of CPU features doesn’t change during program lifetime,3 and so we can query them and dispatch just once, caching the result. Which is what CORRADE_CPU_DISPATCHED_POINTER() is for:

CORRADE_CPU_DISPATCHED_POINTER(memrchrImplementation,
    const char*(*memrchr)(const char*, std::size_t, char))

Internally, it simply assigns the result of a call to memrchrImplementation(Cpu::Features), defined with the CORRADE_CPU_DISPATCHER_BASE() macro above, to a memrchr function pointer:

#define CORRADE_CPU_DISPATCHED_POINTER(dispatcher, ...)                     \
    __VA_ARGS__ = dispatcher(Corrade::Cpu::runtimeFeatures());

Since this all happens in a global constructor, the memrchr function pointer is ready for use without any explicit setup call. All that’s needed is exposing the pointer in some public header as an extern:

extern const char*(*memrchr)(const char*, std::size_t, char);

…

const char* found = memrchr(…);

The magic of IFUNC

This sounds pretty much ideal already, so what does IFUNC actually bring to the table? In short, it turns the pointer into a regular function. Which means, instead of first having to load the value of the function pointer from somewhere and only then calling it, it’s like any other dynamic library function call.4

Usage-wise it’s not much different from the function pointer approach. A dispatcher function is associated with a function prototype, which the dynamic loader then uses to assign the prototype a concrete function pointer. This all happens during early startup, so the dispatcher code can’t really do much — especially not calling into external libraries, which may not even be there yet at that point. Practically it means Cpu::runtimeFeatures() has to be fully inlined in order to be usable in this context.

Here’s how the above would look with CORRADE_CPU_DISPATCHED_IFUNC() instead:

CORRADE_CPU_DISPATCHED_IFUNC(memrchrImplementation,
    const char* memrchr(const char*, std::size_t, char))

In the macro implementation, a function is annotated with __attribute__((ifunc)) carrying a name of the dispatcher function. The dispatcher function gets called with no arguments, so the macro creates a memrchrImplementation() wrapper that delegates to memrchrImplementation(Cpu::Features). What the compiler manual doesn’t say is that the dispatcher function has to have C linkage in order to be found.

#define CORRADE_CPU_DISPATCHED_IFUNC(dispatcher, ...)                       \
    extern "C" { static auto dispatcher() {                                 \
        return dispatcher(Corrade::Cpu::runtimeFeatures());                 \
    }}                                                                      \
    __VA_ARGS__ __attribute__((ifunc(#dispatcher)));

The only downside of this approach is that it’s a glibc-specific feature, thus mainly just Linux (and Android, as I’ll detail later). Apart from low-level glibc code using it, this is also the backbone of GCC’s and Clang’s function multi-versioning. So far I’m not aware of anything similarly convenient on macOS or Windows, except maybe for the cpu_dispatch attribute in Clang and ICC. But that one, as far as I can tell, dispatches on every call, and at least in case of ICC is limited only to Intel processors. No ARM but no AMD either.

Even less overhead, please?

In certain cases it may be desirable to not go through a dynamically dispatched function at all in order to get benefits from interprocedural optimizations and LTO. In that case, the dispatcher can select an overload at compile time using Cpu::DefaultBase, similarly to what the very first example was showing:

const char* memrchr(const char* data, std::size_t size, char c) {
    return memrchrImplementation(Cpu::DefaultBase)(data, size, c);
}

In my experiments at least, with compiler optimizations enabled, the whole returned lambda gets inlined here, removing any remaining argument-passing overhead. Thus being identical to the ideal case where the high-level function would directly contain optimized AVX or SSE code.

Overhead comparison

The following plot compares the three approaches. Apart from the weird outlier with a function pointer in a dynamic library that I can’t explain, it shows that a regular function call is the clear winner in case the code can be compiled directly for the hardware the it will run on. IFUNC ultimately isn’t any faster than regular pointers, but isn’t really slower either, in this microbenchmark at least. I suppose in real-world scenarios it could benefit at least from cache locality with other dynamic functions.

Testing considerations

There’s one hidden benefit with the function pointer approach — being able to change the pointer later at runtime. Together with publicly exposing the memrchrImplementation(Cpu::Features) dispatcher this is very useful for testing, as a test can go through all variants and verify each without having to recompile for a different target. Not only that, it’s even possible for a benchmark to supply alternative implementations to compare against the builtin ones. This is how a test case verifying all three alternatives could look like with Corrade’s TestSuite:

const struct {
    const char* name;
    Cpu::Features features;
} TestData[]{
    {"scalar", Cpu::Scalar},
    {"SSE2", Cpu::Sse2},
    {"AVX2", Cpu::Avx2},
};

void MemrchrTest::test() {
    auto&& data = TestData[testCaseInstanceId()];
    setTestCaseDescription(data.name);

    memrchr = memrchrImplementation(data.features);

    …
}

Compiling different functions for different targets

In order to use intrinsics for a particular CPU instruction set, GCC and Clang require the code to be compiled with a corresponding target option, such as -mavx2 for AVX2. Such requirement makes sense, as it allows the compiler to perform its own optimizations on top of the explicit intrinsics calls. Having a separate file for every variant would be quite impractical though, fortunately one can use __attribute__((target)) to enable instruction sets just on particular functions, allowing all variants to live in the same file.5

This is exposed via macros such as CORRADE_ENABLE_AVX2, and additionally each macro is defined only if compiling for a matching architecture and the compiler supports given instruction set. Which can be conveniently used to guard the code to be only compiled where it makes sense:

#ifdef CORRADE_ENABLE_AVX2
CORRADE_ALWAYS_INLINE auto memrchrImplementation(Cpu::Avx2T) {
    return +[](const char* data, std::size_t size, char c) CORRADE_ENABLE_AVX2 { … };
}
#endif

MSVC, on the other hand, doesn’t require any option in order to use any intrinsics, so there the macros are empty. It however also means that it will only apply the baseline optimizations, so for example extracting all AVX+ code to a file with /arch:AVX enabled might have some perf benefits.5

^ v ^

And then the reality hits

So far, to make the code shine, I was holding back on a certain important aspect of the instruction set landscape. In that it’s definitely not a linear sequence of instruction sets where each is a clear superset of the previous one. Rather, it looks more like this:

Please note there’s many instruction sets still omitted and the hierarchy likely contains severe errors.

With the passing of time many of these branches fortunately went abandoned, however there’s still many cases where any ordering is just impossible. For example, there are AVX CPUs that have F16C but not FMA, there are AVX CPUs with FMA but no F16C, and there’s apparently even a VIA CPU with AVX2 but no FMA. To account for all these cases, I ended up differentiating between a base instruction set that has clear ordering and extra instruction sets that have no relations whatsoever. For AVX-512 I’m not 100% sure yet, but I’m hoping it’ll eventually converge as well.

base sets · expected continuation · extra sets · abandoned sets

To make this work, the dispatch has to select among variants with both the base instruction set and all extra sets supported, and these two priority rules:

If one variant has the base instruction set more advanced than another, it’s preferred (which is same as before). For example, if there’s an AVX-512 variant and an AVX+FMA variant, AVX-512 gets picked.
Otherwise, if both variants have the same base instruction set, a variant that uses more extra instruction sets is preferred. For example, if there’s a plain AVX variant and an AVX+FMA variant, AVX+FMA gets picked.

There’s deliberately no ordering defined among the extra instruction sets. Thus having for example a SSE2+POPCNT and a SSE2+LZCNT variant would lead to an ambiguity if the machine supported both POPCNT and LZCNT. This can only be resolved by the implementation itself, by providing a SSE2+POPCNT+LZCNT variant and delegating to whichever is the better option in that case.

Now comes the pile of nasty templates

Let’s say there’s a Tag<bits> type that stores the bitmask of both the base instruction set and the extra sets. An extra instruction set is always represented by just a single unique bit, while a base instruction set contains one unique bit for itself plus also all bits for the instruction sets it’s based on. Thus, for example, SSE2 would be Tag<0b00000001>, SSE3 Tag<0b00000011>, SSSE3 Tag<0b00000111> and so on, while POPCNT would be Tag<0b01000000> and LZCNT Tag<0b10000000>. Then, combined, SSE3+POPCNT would be Tag<0b01000011>. Finally, the class provides a conversion operator to another Tag<> that’s enabled only if the resulting bit mask is a subset:

template<unsigned bits> struct Tag {
    template<
        unsigned otherBits,
        class = std::enable_if_t<(otherBits & bits) == otherBits>
    > operator Tag<otherBits>() const { … }
};

The priority ordering is defined primarily by the index of the base instruction set, and secondarily by the count of extra instruction sets. Which, given an upper bound on the count of extra instruction sets, can be represented with a single number — for example, SSE2 would have a priority number 100, SSE3 200, SSSE3 300 and so on, SSE2+LZCNT+POPCNT would be 102 and SSE3+POPCNT 201. Here’s where the core idea of inheritance hierarchy gets reused again, just in a calculated manner:

template<unsigned i> struct Priority: Priority<i - 1> {};
template<> struct Priority<0> {};

I claim no invention here — credit goes to @sthalik who suggested the priority tag idea to me.

To make these work together, the dispatch has to use two function arguments — one is where the candidate filtering happens, and the other orders by priority. In the following example, calling with Tag<0b11000011> and Priority<202> (SSE3+LZCNT+POPCNT), would pick the SSE3+POPCNT overload:

void foo(Tag<0b00000000>, Priority<000>) { … } /* scalar */
void foo(Tag<0b00000011>, Priority<200>) { … } /* SSE3 */
void foo(Tag<0b01000011>, Priority<201>) { … } /* SSE3+POPCNT */
void foo(Tag<0b10000111>, Priority<301>) { … } /* SSSE3+LZCNT */

Because I was unable to figure out a way that wouldn’t involve using two parameters, the user-facing API hides this whole process behind two macros — CORRADE_CPU_DECLARE() and CORRADE_CPU_SELECT(). Together with an ability to combine the Cpu tags with bitwise operations, the process of declaration and compile-time dispatch looks like this:

void foo(CORRADE_CPU_DECLARE(Cpu::Scalar)) { … }
void foo(CORRADE_CPU_DECLARE(Cpu::Sse3)) { … }
void foo(CORRADE_CPU_DECLARE(Cpu::Sse3|Cpu::Popcnt)) { … }
void foo(CORRADE_CPU_DECLARE(Cpu::Ssse3|Cpu::Lzcnt)) { … }

…

foo(CORRADE_CPU_SELECT(Cpu::Sse3|Cpu::Lzcnt|Cpu::Popcnt));

Finally, the runtime dispatcher then has to take the extra instruction sets into account as well. Implicitly considering all possible combinations would however lead to a lot of work for the compiler. As a tradeoff, since in practice the variants usually need at most one or two extra sets, the CORRADE_CPU_DISPATCHER() macro requires you to list them explicitly:

CORRADE_CPU_DISPATCHER(foo, Cpu::Popcnt, Cpu::Lzcnt)

This macro produces a foo(Cpu::Features) which can be again used with CORRADE_CPU_DISPATCHED_POINTER() or CORRADE_CPU_DISPATCHED_IFUNC(). For brevity I glossed over the nastier implementation details, if you’re interested please dive into the source.

~ ~ ~

Practical usability verification

To ensure sanity of this design, I proceeded with using the Cpu APIs in a practical scenario — implementing a std::memchr() alternative to use in Containers::StringView::find(). Turns out I was lucky, having not really any prior experience writing intrinsics for any architecture, I managed to pick a sufficiently hard problem where

compiler autovectorization doesn’t do a “good enough” job,
the standard implementation is well optimized and not easy to beat,
the algorithm needs to have specialized handling for small and large inputs in order to be fast,
I get to use also the extra instruction sets such as BMI1,
and the ideal way is significantly different between x86, ARM and WebAssembly.

Apart from having the API usability confirmed (and a ton of ugly compiler bugs discovered), I realized that in order to really make the most of every platform, it doesn’t make sense to try to come up with an instruction-level abstraction API like is for example in SIMD everywhere. So that’s something Magnum will not have. The building blocks need to be much more high level.

Since I’m very new to all this, I won’t embarrass myself further by trying to pretend I know what I’m talking about. Hopefully in a later post — here’s just a sneak peek at the code that’s now in Corrade master:

CPU features and dispatch on ARM platforms

Instruction-set-wise, ARM has the basically ubiquitous NEON. Together with a few recognized extensions, it’s detected as Cpu::Neon, Cpu::NeonFma and Cpu::NeonFp16. Apart from NEON the main instructions sets are SVE and SVE2 and Apple’s own proprietary AMX (which doesn’t even have any publicly available compiler support yet). I’ll wire up detection for these once they become more common.

Compared to the x86 CPUID instruction, runtime detection of ARM features has to rely on OS-specific calls such as getauxval() on Linux or sysctlbyname() on Apple platforms. What’s nice is that both Linux and Android (API 18+) support IFUNC dispatch on ARM as well. It’s complicated by the fact that getauxval() is a call to an external library that’s not available at the time IFUNC pointers get resolved so instead it’s fed to the resolver from outside. Android however adopted such behavior only since API 30 (Android 11). To account for this, the CORRADE_CPU_DISPATCHED_IFUNC() macro is special-cased for ARM and enabled only on glibc and Android 30+. Overhead-wise, IFUNCs on ARM are comparable to x86.

Practical usage of ARM NEON intrinsics highlighted two things. First, it’s a very different instruction set from x86, so naively trying to emulate instructions like movemask is not going to be fast. Instead, using features unique to NEON can yield significant speedups. Second, in my experiments at least, GCC seems to be significantly worse than Clang in dealing with ARM intrinsics. The standard std::memchr() implementation is usually written in plain assembly, and while I was able to get my intrinsics to a comparable speed using Clang, it was plain impossible with GCC. I wonder whether it’s the cause or the consequence of Android and macOS/iOS, the major ARM platforms, nowadays exclusively using Clang. Or maybe I’m just doing something wrong, again this is all new to me.

WebAssembly 128-bit SIMD

WebAssembly currently provides a single 128-bit SIMD instruction set, detected as Cpu::Simd128. The situation is a bit specific, as security on the web will always have a priority over performance. WebAssembly modules are statically verified upfront and if they fail the validation — for example because an unknown instruction was encountered — they’re rejected. Since WebAssembly SIMD is relatively new, this means that a binary can either use any SIMD instructions anywhere, or nowhere at all, to pass validation on VMs that don’t know about SIMD yet.

While that makes any sort of runtime dispatch rather useless, perhaps the more concerning issue is that runtime dispatch is prohibitively expensive — going through an arbitrary function pointer has a fourty times bigger overhead than calling a function directly. The following numbers are from Node.js, but both Chromium and Firefox showed a similar disparity.

Honestly I hope I just forgot to enable some optimization. This is too much.

While there’s currently no way to perform runtime detection of supported CPU features, there’s a Feature Detection proposal aiming to cover this gap. But even then, unless the overhead situation improves, runtime dispatch will only be useful for heavier chunks of code — definitely not for things like memchr().

Practical usage of WebAssembly SIMD128 intrinsics only further emphasised that the difference between x86 and ARM isn’t something to be ignored. I have two variants of my code where using wasm_i8x16_any_true() instead of wasm_i8x16_bitmask() makes code 20% faster on ARM, while at the same time making it almost 30% slower on x86. I could use a hack to distinguish between x86 and ARM and then choose an appropriate variant at runtime, but again, considering the overhead of runtime dispatch this would be far worse than having just a single non-dispatched variant.

For now at least, given there are users with 99% of their audience on mobile platforms, I’m considering providing a compile-time option to prefer ARM-friendly variants. Which would reuse the concept of extra instruction sets described above, for example as Cpu::Simd128|Cpu::WasmArm. And once the overhead situation improves, the variants can directly work with a runtime dispatch.

POWER, RISC-V

I … have no idea, sorry. There’s CORRADE_TARGET_POWERPC but that’s about it. It doesn’t however mean the code will stop compiling on these platforms — since none of the CORRADE_ENABLE_* macros is defined there, the code will fall back to Cpu::Scalar variants, behaving just like before the CPU dispatch was introduced.

Next steps

While the Cpu library can be considered stable and ready for production use, Corrade itself has runtime CPU dispatch enabled only on Linux if IFUNC is detected to be available. Other platforms have it currently off by default, until I’m 100% confident about the overhead. In any case, you can use the CORRADE_BUILD_CPU_RUNTIME_DISPATCH and CORRADE_CPU_USE_IFUNC CMake options to toggle this functionality at build time.

For the actual SIMD-enabled algorithms, of course the usefulness of narrowly beating std::memchr() is dubious. But the code can now be trivially modified to implement the above-mentioned memrchr() as well as making optimized two- and more-character variants. That’s where the actual gains will be, as multi-character lookup is a common bottleneck in JSON (glTF) or OBJ parsing, and standard libraries don’t really have a dedicated tool for that.

Among other features that were planned to use SIMD someday were desperately waiting for the CPU dispatch to get ready are — hierarchical transformation calculations, batch frustum culling and other bulk processing utilities to make use in your ECS workflows. Exciting times ahead!

Single-header implementation

Like with various Containers implementations, this feature is self-contained enough to make sense as a standalone library. To make it possible to use it without relying on the whole of Corrade, it’s provided as a single-header library in the Magnum Singles repository. To use it, simply download the file and #include it your project:

Library	LoC	Preprocessed LoC	Description
CorradeCpu.hpp new	1646	2085	See the Cpu namespace docs

To put the header size into perspective, just #include <immintrin.h> for all AVX intrinsics yields 34 thousand preprocessed lines on GCC. Compared to when I was bitching about <memory> being heavy, this is a whole other level. And honestly I don’t know what could fix this, there’s simply a lot in AVX-512.

. : .

1.: ^ The original prototype appeared in mosra/corrade#115 in March 2021, which was a spin-off of mosra/magnum#306, dating back to January 2019. Yup, certain features take time to brew.
2.: ^ The + before the lambda “decays” it into a regular function pointer, and that’s what the function returns. I used C++14 for brevity, but the Cpu library works in plain C++11 as well, just with a bit more typing instead of the auto return type. I’m happy being the caveman that’s still using C++11.
3.: ^ Except for that one case where ARM big.LITTLE cores each reported a slightly different instruction set, leading to crashes when you queried CPU features on the more advanced core and ended up running advanced instructions on the core that didn’t have them.
4.: ^ The explanation is grossly oversimplified, but dynamic library function call is the key point here. Calls into dynamic libraries are indirect, meaning they’re kinda like a function pointer as well. The difference is, however, that they’re resolved at load time, constant for the application lifetime and all contained in a single place, which has the potential of reducing overhead compared to having to fetch a function pointer value from a random location before every call.
5.: ^ a b Again I’m just scratching the surface here. Enabling AVX2 for the whole file may cause AVX2 to be used also in functions where you use just SSE2 intrinsics and thus crashing on machines without AVX2 support. That is rather obvious, but similar cases could happen also in case LTO combines code from multiple files or when a single template function gets compiled for different instruction sets and then the linker deduplicates it. Computers are hard, apparently, and I hope most of this got solved since.