During the past four months, Magnum began its adventure into the Python world. Not just with some autogenerated bindings and not just with some autogenerated Sphinx docs — that simply wouldn’t be Magnum enough. Brace yourselves, this article will show you everything.

The new Magnum Python bindings, while still labeled experimental, already give you a package usable in real workflows — a NumPy-compatible container library, graphics-oriented math classes and functions, OpenGL buffer, mesh, shader and texture APIs, image and mesh data import and an SDL / GLFW application class with key and mouse events. Head over to the installation documentation to get it yourself; if you are on ArchLinux or use Homebrew, packages are already there, waiting for you:

brew tap mosra/magnum
brew install --HEAD corrade magnum magnum-plugins magnum-bindings

And of course it has all the goodies you’d expect from a “Python-native” library — full slicing support, errors reported through Python exceptions instead of return codes (or hard asserts) and properties instead of setters/getters where it makes sense. To give you a quick overview of how it looks and how it is used, the first few examples are ported to it:

Enter pybind11

I discovered pybind11 by a lucky accident in early 2018 and immediately had to try it. Learning the basics and exposing some minimal matrix/vector math took me about two hours. It was extreme fun and I have to thank all pybind11 developers for making it so straightforward to use.

py::class_<Vector3>(m, "Vector3")
    .def_static("x_axis", &Vector3::xAxis, py::arg("length") = 1.0f)
    .def_static("y_axis", &Vector3::yAxis, py::arg("length") = 1.0f)
    .def_static("z_axis", &Vector3::zAxis, py::arg("length") = 1.0f)
    .def(py::init<Float, Float, Float>())
    .def(py::self == py::self)
    .def(py::self != py::self)
    .def("is_zero", &Vector3::isZero)
    .def("is_normalized", &Vector3::isNormalized)
    .def(py::self += py::self)
    .def(py::self + py::self)
    .def(py::self *= Float{})
    .def(py::self * Float{})
    .def(py::self *= py::self)

That’s what it took to bind a vector class.

However, different things took priority and so the prototype got shelved until it was revived again this year. But I learned one main thing — even the math classes alone were so useful that I kept the built Python module around and used it from time to time as an enhanced calculator. Now, with the magnum.math module being almost complete, it’s an everyday tool I use for quick calculations. Feel free to do the same.

>>> from magnum import *
>>> Matrix3.rotation(Deg(45))
Matrix(0.707107, -0.707107, 0,
       0.707107, 0.707107, 0,
       0, 0, 1)

Quick, where are the minus signs in a 2D rotation matrix?

What Python APIs (and docs) could learn from C++

Every time someone told me they’re using NumPy for “doing math quickly in Python”, I assumed it was the reasonable thing to do — until I actually tried to use it. I get that my use case of 4×4 matrices at most might not align well with NumPy’s goals, but the problem is, as far as I know, there’s no full-featured math library for Python that would give me the whole package¹ including quaternions or 2D/3D transformation matrices.

As an exercise, for a usability comparison I tried to express the rotation matrix shown in the box above in SciPy / NumPy. It took me a good half an hour of staring at the docs of scipy.spatial.transform.Rotation until I ultimately decided it’s not worth my time. The overarching problem I have with all those APIs is that it’s not clear at all what types I’m expected to feed to them, and the provided example code looks like I’m supposed to do half of the calculations myself anyway.

>>> from scipy.spatial.transform import Rotation as R
>>> r = R.from_quat([0, 0, np.sin(np.pi/4), np.cos(np.pi/4)])

Rotation.from_quat(quat, normalized=False)

quat : array_like, shape (N, 4) or (4,)
    Each row is a (possibly non-unit norm) quaternion in scalar-last (x, y, z, w) format.


Type information in the SciPy documentation is vague at best. Also, I’d like something that would make the quaternion for me, as well.

To avoid the type confusion, with the Magnum Python bindings I decided to use strong types where possible — so instead of a single dynamic matrix / vector type akin to numpy.ndarray, there’s a clear distinction between matrices and vectors of different sizes. So if you do Matrix4x3() @ Matrix2x4(), the docs of Matrix4x3.__matmul__() will tell you the result is Matrix2x3. For NumPy itself, there’s a proposal for improved type annotations at numpy/numpy#7370 which would help a lot, but the documentation tools have to make use of that. More on that below.

One little thing from the C++ API with a big impact is strongly-typed angles. You no longer need to remember that trig functions use radians internally while HSV colors or OpenAL juggle with degrees instead — simply use whatever you please. So Python got Deg and Rad as well. Python doesn’t have any user-defined literals (and I’m not aware of any proposal to add them), however there’s a way to make Python recognize them. I’m not yet sure if this amount of magic is wise to apply, but I might try it out once.
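To illustrate the idea, strongly-typed angles can be sketched in a few lines of pure Python — a hypothetical toy version, not Magnum’s actual implementation:

```python
import math

class Rad(float):
    """Toy strongly-typed angle in radians (illustration only)."""
    def __new__(cls, value):
        # constructing from Deg converts; plain numbers pass through
        if isinstance(value, Deg):
            return super().__new__(cls, math.radians(float(value)))
        return super().__new__(cls, value)

class Deg(float):
    """Toy strongly-typed angle in degrees (illustration only)."""
    def __new__(cls, value):
        if isinstance(value, Rad):
            return super().__new__(cls, math.degrees(float(value)))
        return super().__new__(cls, value)

def sin(angle: Rad) -> float:
    # the API takes an explicit angle type, so there's nothing to remember
    return math.sin(Rad(angle))

print(sin(Rad(Deg(90.0))))  # 1.0
```

Since the conversion happens in the constructor, passing degrees where radians are expected is impossible by accident — exactly the property the C++ API has.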

¹ As /u/NihonNukite pointed out on Reddit, there’s Pyrr that provides the above missing functionality, fully integrated with numpy. The only potential downside is that it’s all pure Python, not optimized native code.

Hard things are suddenly easy if you use a different language

>>> a = Vector4(1.5, 0.3, -1.0, 1.0)
>>> b = Vector4(7.2, 2.3, 1.1, 0.0)
>>> a.wxy = b.xwz
>>> a
Vector(0, 1.1, -1, 7.2)

If you ever used GLSL or any other shader language, you probably fell in love with vector swizzles right at the moment you saw them … and then became sad after realizing that such APIs are practically impossible² to have in C++. Swizzle operations are nevertheless useful and assigning each component separately would be a pain, so Magnum provides Math::gather() and Math::scatter() that allow you to express the above:

a = Math::scatter<'w', 'x', 'y'>(a, Math::gather<'x', 'w', 'z'>(b));

Verbose³ but practically possible. The point is, however, that the above is implementable very easily in Python using __getattr__() and __setattr__() … and a ton of error checking on top.
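A toy swizzle implementation — hypothetical, just to show the principle, without the error checking the real bindings do — fits in a screenful:

```python
class Vec:
    _components = 'xyzw'

    def __init__(self, *values):
        object.__setattr__(self, '_data', list(values))

    def __getattr__(self, name):
        # any combination of x, y, z, w reads the corresponding components
        if name and all(c in self._components for c in name):
            return Vec(*(self._data[self._components.index(c)] for c in name))
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # writing a swizzle scatters the value into the named components
        if name and all(c in self._components for c in name):
            for c, v in zip(name, value._data):
                self._data[self._components.index(c)] = v
        else:
            object.__setattr__(self, name, value)

a = Vec(1.5, 0.3, -1.0, 1.0)
b = Vec(7.2, 2.3, 1.1, 0.0)
a.wxy = b.xwz
print(a._data)  # [0.0, 1.1, -1.0, 7.2]
```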

² GLM does have those, if you enable GLM_FORCE_SWIZZLE, but doing so adds three seconds⁴ to the compilation time of each file that includes GLM headers. I’d say that makes swizzles possible in theory but such overhead makes them practically useless.
³ Math functions are functions and so do not mutate their arguments, that’s why the final self-assignment. It would of course be better to be able to write Math::gather<"wxy">(b) or at least Math::gather<'wxy'>(b), but C++ insists on the first being impossible and the second being unportable. And creating a user-defined literal just to specify a swizzle seems excessive.
⁴ I did a couple of benchmarks for a yet-to-be-published article comparing math library implementations, and this was a shocker. The only other library that could come close was Boost.Geometry, with two seconds per file.

… but on the contrary, C++ has it easier with overloads

I was very delighted upon discovering that pybind11 supports function overloads just like that — if you bind more than one function of the same name, it’ll take a typeless (*args, **kwargs) and dispatch to the correct overload based on argument types. It’s probably not blazingly fast (and in some cases you could probably beat its speed by doing the dispatch yourself), but it’s there and much better than having to invent new names for overloaded functions (and constructors!). With the new typing module, it’s possible to achieve a similar thing in pure Python using the @overload decorator — though only for documentation purposes, you’re still responsible for implementing the type dispatch yourself. In the case of math.dot() implemented in pure Python, this could look like:

from typing import overload

@overload
def dot(a: Quaternion, b: Quaternion) -> float: ...
@overload
def dot(a: Vector2, b: Vector2) -> float: ...
def dot(a, b):
    # actual implementation
    ...

What was actually hard, though, was the following, looking completely ordinary to a C++ programmer:

>>> a = Matrix3.translation((4.0, 2.0))
>>> a
Matrix(1, 0, 4,
       0, 1, 2,
       0, 0, 1)
>>> a.translation = Vector2(5.0, 3.0)
>>> a
Matrix(1, 0, 5,
       0, 1, 3,
       0, 0, 1)

Is the Python language police going to arrest me now?

While the case of Matrix3.scaling() vs. mat.scaling() — where the former returns a scaling Matrix3 and the latter a scaling Vector3 out of a scaling matrix — was easier and could be done via a dispatch based on argument types (“if the first argument is an instance of Matrix3, behave like the member function”), in the case of Matrix3.translation() it’s either a static method or an instance property. Ultimately I managed to solve it by supplying a custom metaclass that does the correct dispatch when encountering access to the translation attribute.
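In pure Python the dual behavior can be sketched with a custom descriptor that checks whether it’s being accessed through the class or through an instance — a simplified stand-in for the metaclass dispatch, using hypothetical names:

```python
class static_or_property:
    """Acts like a static method on the class, like a property on instances."""
    def __init__(self, static_impl, getter, setter):
        self._static, self._getter, self._setter = static_impl, getter, setter

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self._static      # class access: Matrix3.translation(...)
        return self._getter(obj)     # instance access: mat.translation

    def __set__(self, obj, value):
        self._setter(obj, value)     # instance access: mat.translation = ...

class Matrix3:
    def __init__(self, t=(0.0, 0.0)):
        self._t = list(t)

    translation = static_or_property(
        lambda vector: Matrix3(vector),     # the static "constructor" variant
        lambda self: tuple(self._t),        # the property getter
        lambda self, value: self._t.__setitem__(slice(None), value))

a = Matrix3.translation((4.0, 2.0))   # calls the static variant
a.translation = (5.0, 3.0)            # calls the property setter
print(a.translation)  # (5.0, 3.0)
```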

But yeah, while almost anything is possible in Python, it could give a hand here — am I the first person ever that needs this functionality?

Zero-copy data transfer

One very important part of Python is the Buffer Protocol. It allows zero-copy sharing of arbitrarily shaped data between C and Python — simple tightly-packed linear arrays, 2D matrices, or a green channel of a lower right quarter of an image flipped upside down. Having full support for the buffer protocol was among the reasons why Containers::StridedArrayView went through a major redesign earlier this year. This strided array view is now exposed to Python as containers.StridedArrayView1D (or MutableStridedArrayView1D, and their 2D, 3D and 4D variants) and thanks to the buffer protocol it can be seamlessly converted from and to numpy.array() (and Python’s own memoryview as well). Transitively that means you can unleash numpy-based Python algorithms directly on data coming out of ImageView2D.pixels() and have the modifications immediately reflected back in C++.
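The same zero-copy, write-through behavior can be demonstrated with Python’s builtin buffer-protocol types alone — this is what the conversions between the Magnum views, memoryview and numpy arrays build on:

```python
data = bytearray(16)          # storage owned by Python
view = memoryview(data)       # buffer protocol handle, no copy made

quarter = view[12:]           # slicing the view is zero-copy as well
quarter[0] = 255              # write through the slice...

print(data[12])  # 255 -- ...and the owner sees the change immediately
```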

Because, again, having a specialized type with further restrictions makes the code easier to reason about, containers.ArrayView (and its mutable variant) is exposed as well. This one works only with linear, tightly-packed memory and is thus suitable for taking views onto bytes or bytearray, file contents and such. Both the strided and linear array views of course support the full Python slicing API. As an example, here’s how you can read an image in Python, pass its contents to a Magnum importer and get the raw pixel data back:

import numpy as np
from magnum import trade

def consume_pixels(pixels: np.ndarray):
    ... # process the raw pixel data here

importer: trade.AbstractImporter = ...
with open(filename, 'rb') as f:
    importer.open_data(f.read())
image: trade.ImageData2D = importer.image2d(0)

# green channel of a lower right quarter of a 256x256 image flipped upside down
consume_pixels(np.array(image.pixels, copy=False)[128:, 128:, 1][::-1])

Just one question left — who owns the memory here, then? To answer that, let’s dive into Python’s reference counting.

Reference counting

In C++, views are among the more dangerous containers, as they reference data owned by something else. There you’re expected to ensure the data owner stays in scope for at least as long as the view on it. The same goes for other types — for example, a GL::Mesh may reference a bunch of GL::Buffers, or a Trade::AbstractImporter loaded from a plugin needs its plugin manager to be alive to keep the plugin library loaded.

(Reference hierarchy: pixels references image, image references f, and importer references both manager and f.)

The dim dashed lines show additional potential dependencies that would appear with future zero-copy plugin implementations — when the file format allows it, these would reference the data in f directly instead of storing a copy themselves.

However, imposing similar constraints on Python users would be daring too much, so all exposed Magnum types that refer to external data implement reference counting under the hood. The designated way of doing this with pybind11 is wrapping everything in std::shared_ptr. On the other hand, Magnum is free of any shared pointers by design, and adding them back just to make Python happy would make everyone else angry in exchange. What Magnum does instead is extending the so-called holder type in pybind11 (which doesn’t have to be std::shared_ptr; std::unique_ptr or custom pointer types are fine as well) and storing references to instance dependencies inside it.

The straightforward way of doing this would be to take GL::Mesh, subclass it into a PyMesh, store buffer references inside it and then expose PyMesh as gl.Mesh instead. But compared to the holder type approach this has a serious disadvantage in that every API that works with meshes would suddenly need to work with PyMesh instead, and that’s not always possible.

For testing and debugging purposes, references to memory owners or other data are always exposed through the API — see for example ImageView2D.owner or gl.Mesh.buffers.

Zero-waste data slicing

One thing I got used to, especially when writing parsers, is to continually slice the input data view as the algorithm consumes its prefix. Consider the following Python code, vaguely resembling an OBJ parser:

view = containers.ArrayView(data)
while view:
    # Comment, ignore until EOL
    if view[0] == '#':
        while view and view[0] != '\n':
            view = view[1:]
    # Vertex / face
    elif view[0] == 'v':
        view = self.parse_vertex(view)
    elif view[0] == 'f':
        view = self.parse_face(view)

On every operation, the view gets some prefix chopped off. While not a problem in C++, in Python this would generate an impressively long reference chain, preserving all intermediate views from all loop iterations.

(Each slice references the previous one: sliceN → … → slice4 → slice3 → slice2 → slice1 → view → data.)

While the views are generally smaller than the data they refer to, with big files it could easily happen that the overhead of the views becomes larger than the parsed file itself. To avoid such endless growth, slicing operations on views always refer to the original data owner, allowing the intermediate views to be collected. In other words, in terms of containers.ArrayView.owner, view[:].owner is view.owner always holds.

(All slices reference the data owner directly: view → data, slice1 … sliceN → data.)
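A pure-Python view behaving this way only needs to route every slice back to the root owner — a toy sketch of the idea, not the actual containers.ArrayView implementation:

```python
class View:
    """Toy owner-preserving array view (illustration only)."""
    def __init__(self, data, start=None, stop=None):
        # `data` may be the actual owner or another View to slice from
        if isinstance(data, View):
            self.owner = data.owner            # refer to the root owner...
            base = data._start
        else:
            self.owner = data
            base = 0
        self._start = base + (start or 0)
        self._stop = base + (stop if stop is not None else len(data))

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, s):
        # ...so intermediate views can be garbage-collected; integer
        # indexing is omitted for brevity, only slices are handled
        start, stop, _ = s.indices(len(self))
        return View(self, start, stop)

view = View(b'0123456789')
sliced = view[1:][1:][1:]          # three temporary views, none kept alive
assert sliced.owner is view.owner  # every slice points at the original data
print(bytes(sliced.owner[sliced._start:sliced._stop]))  # b'3456789'
```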

The less-than-great aspects of pybind11

Throwing C++ exceptions is actually really slow

While I was aware there’s some overhead involved with C++’s throw, I never guessed the overhead would be so big. In most cases this wouldn’t be a problem, as exceptions are exceptional, but there’s one little corner of Python where you have to use them — iteration. In order to iterate anything, Python calls __getitem__() with an increasing index and, instead of checking against __len__(), simply waits until it raises IndexError. This is also how conversion from/to lists is done and how numpy.array() populates the array from a list-like type, unless the type supports the Buffer Protocol. Bindings for Magnum’s Vector3.__getitem__() originally looked like this:

.def("__getitem__", [](const T& self, std::size_t i) -> typename T::Type {
    if(i >= T::Size) throw pybind11::index_error{};
    return self[i];
})

Plain and simple and seemingly not a perf problem at all … until you start measuring:

Cost of raising an exception:

  * pure Python, raise IndexError(): 0.1356 ± 0.0049 µs
  * pybind11, throw pybind11::index_error{}: 3.4824 ± 0.1484 µs
  * pybind11, throw pybind11::error_already_set{}: 2.607 ± 0.1367 µs
  * pybind11 / CPython, PyErr_SetString(): 0.4181 ± 0.0363 µs
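The iteration protocol that triggers all those exceptions is easy to reproduce in pure Python — a hypothetical class with nothing but __getitem__() is happily consumed by list() until the IndexError hits:

```python
class Triple:
    def __init__(self, x, y, z):
        self._data = (x, y, z)

    def __getitem__(self, i):
        # Python's legacy iteration calls this with i = 0, 1, 2, ... and
        # stops only when IndexError is raised -- there's no __len__() check
        if i >= 3:
            raise IndexError(i)
        return self._data[i]

print(list(Triple(1.0, 2.0, 3.0)))  # [1.0, 2.0, 3.0]
```

So every conversion of a three-element vector to a list pays for one exception, on top of the three successful calls.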

This is further blown out of proportion in the case of numpy.array() — looking at the sources of PyArray_FromAny(), it’s apparently hitting the out-of-bounds condition three times — first when checking for dimensions, second when calculating a common type for all elements and third when doing the actual copy of the data. This was most probably not worth optimizing assuming sane exception performance, however combined with pybind11 it leads to a massive slowdown:

Constructing numpy.ndarray:

  * from a list: 0.5756 ± 0.0313 µs
  * from Vector3, throw pybind11::index_error{}: 17.2296 ± 1.1786 µs
  * from Vector3, throw pybind11::error_already_set{}: 14.2204 ± 0.3782 µs
  * from Vector3, PyErr_SetString(): 6.3909 ± 0.1217 µs
  * from Vector3, buffer protocol: 0.6411 ± 0.0368 µs

As hinted by the plots above, there are a few possible ways of countering the inefficiency:

  1. A lot of overhead in pybind11 is related to exception translation, which can be sidestepped by calling PyErr_SetString() and telling pybind11 an error is already set and it only needs to propagate it:

    .def("__getitem__", [](const T& self, std::size_t i) -> typename T::Type {
        if(i >= T::Size) {
            PyErr_SetString(PyExc_IndexError, "");
            throw pybind11::error_already_set{};
        }
        return self[i];
    })

    As seen above, this results in a moderate improvement, with exceptions taking ~1 µs less to throw (though for numpy.array() it doesn’t help much). This is what Magnum Bindings globally switched to after discovering the perf difference, and apart from PyErr_SetString() there’s also PyErr_Format(), able to stringify Python objects directly using "%A" — hard to beat that with any third-party solution.

  2. Even with the above, the throw and the whole exception bubbling inside pybind11 is still responsible for quite a lot, so the next step is to only call PyErr_SetString() and return nothing to pybind11 to indicate we want to raise an exception instead:

    .def("__getitem__", [](const T& self, std::size_t i) -> pybind11::object {
        if(i >= T::Size) {
            PyErr_SetString(PyExc_IndexError, "");
            return pybind11::object{};
        }
        return pybind11::cast(self[i]);
    })

    This results in quite a significant improvement, reducing the exception overhead from about 3.5 µs to 0.4 µs. It however relies on a patch that’s not merged yet (see pybind/pybind11#1853) and it requires the bound API to return a typeless object instead of a concrete value, as there’s no other way to express a “null” value.

  3. If there’s a possibility to use the Buffer Protocol, prefer it over __getitem__(). At first I was skeptical about this idea, because the buffer protocol setup with pointers and shapes and formats and sizes and all the related error checking certainly feels heavier than simply iterating a three-element array. But the relative heaviness of exceptions makes it a winner. pybind11 has builtin support, so why not use it. Well, except …

Let’s allocate a bunch of vectors and strings to do a zero-copy data transfer

Python’s Buffer Protocol, mentioned above, is a really nice approach for data transfers with minimal overhead — if used correctly. Let’s look again at the case of calling numpy.array() above:

Creating numpy.array() from a list-like type:

  * from a list: 0.5756 ± 0.0313 µs
  * from array.array: 0.4766 ± 0.0263 µs
  * from Vector3, pybind11::buffer: 0.6411 ± 0.0368 µs
  * from Vector3, Py_buffer: 0.5552 ± 0.0294 µs

It’s clear that converting a pure Python list to a numpy.array() is, even with all the exceptions involved, still faster than using pybind11’s buffer protocol implementation to convert a Vector3 to it. In comparison, array.array() (which implements the Buffer Protocol as well, only natively in plain C) is quite speedy, so there’s definitely something fishy in pybind11.

struct buffer_info {
    void *ptr;
    ssize_t itemsize;
    std::string format;
    ssize_t ndim;
    std::vector<ssize_t> shape;
    std::vector<ssize_t> strides;
};

Oh, so that’s why.

The std::vector allocations (and possibly std::string as well, if the format specifier is too long for the small string optimization) in pybind11::buffer_info add to the overhead, so I decided to sidestep pybind11 altogether and interface directly with the underlying Python C API instead. Because the Py_buffer structure is quite flexible, I ended up pointing its members to statically defined data for each matrix / vector type, making the buffer protocol operation completely allocation-less. In the case of containers.ArrayView and its strided equivalent, the structure points to their internal members, so nothing needs to be allocated even in the case of containers.StridedArrayView4D. Additionally, operating directly with the C API allowed me to correctly propagate readonly properties and the above-mentioned data ownership as well.

Become a 10× programmer with this one weird trick

Compile times with pybind11 are something I can’t get used to at all. Maybe this is nothing extraordinary when you do a lot of Modern C++, but an incremental build of a single file taking 20 seconds is a bit too much for my taste. In comparison, I can recompile the full Magnum (without tests) in 15 seconds. This gets a lot worse when building Release, due to -flto being passed to the compiler — then an incremental build of that same file takes 90 seconds⁵, a large part of the time spent in Link-Time Optimization.

Fortunately, by another lucky accident, I recently discovered that GCC’s -flto flag has a parallel option⁶ — so if you have 8 cores, -flto=8 will make the LTO step run eight times faster, turning the above 90 seconds into a slightly-less-horrific 35 seconds. Imagine that. This however has a dangerous consequence — the buildsystem is not aware of the LTO parallelism, so it will happily schedule 8 parallelized link jobs at once, bringing your machine to a grinding halt unless you have 32 GB RAM and most of it free. If you use Ninja, it has job pools where you can tell it not to fire up more than one such link job at once, but as far as my understanding of this feature goes, this will not affect the scheduling of compile and link jobs in parallel.

Once Clang 9 is out (and once I get some free time), I want to unleash the new -ftime-trace option on the pybind11 code to see if there’s any low-hanging fruit. But unfortunately, in the long term I’m afraid I’ll need to replace even more parts of pybind11 to bring compile times back into sane bounds.

⁵ To give a perspective, the cover image of this article (at the top) is generated from the preprocessed output of the file that takes 90 seconds to build. About 1%, the few faded lines in the front, is the actual bindings code. The rest — as far as your eyes can see — is STL and pybind11 headers.
⁶ It’s currently opt-in, but GCC 10 is scheduled to have it enabled by default. If you are on Clang, it has ThinLTO, however I was not able to convince it to run in parallel for me.

Everyone “just uses Sphinx”. You?

The obvious first choice when it comes to documenting Python code is Sphinx — everything including the standard library uses it and I don’t even remember seeing a single Python library that doesn’t. However, if you clicked on any of the above doc links, you probably realized that … no, Magnum is not using it.

Ever since the documentation search got introduced early last year, many developers quickly became addicted to it. Whipping up some Sphinx docs, where both search performance and result relevance are extremely underwhelming, would effectively undo all the usability progress Magnum has made until now, so the only option was to bring the search to Python as well.

magnum.Matrix4 documentation
Type annotations are central to the Python doc generator.

While at it, I made the doc generator aware of all kinds of type annotations, properly crosslinking everything to the corresponding type definitions. And not just local types — similarly to Doxygen’s tagfiles, Sphinx has Intersphinx, so linking to 3rd party library docs (such as NumPy) or even the standard library is possible as well. Conversely, the m.css Python doc generator exports an Intersphinx inventory file, so no matter whether you use vanilla Sphinx or m.css, you can link to m.css-generated Python docs from your own documentation as well.

If you want to try it out on your project, head over to the m.css website for a brief introduction. Compared to Sphinx it behaves more like Doxygen, as it implicitly browses your module hierarchy and generates a dedicated page for each class and module. The way Sphinx does it was interesting for me at first, but over time I realized it needs quite a lot of effort from the developer’s side to organize well — and from the documentation reader’s side, it can lead to things being harder to find than they should be (for example, docs for str.splitlines() are buried somewhere in the middle of a kilometer-long page documenting all builtin types).

The doc generator resembles Sphinx, but I decided to experiment with a clean slate first instead of making it 100% compatible — some design decisions in Sphinx itself are historical (such as type annotations in the doc block instead of in the code itself) and it didn’t make sense to me to port those over. At least for now; full Sphinx compatibility is not completely out of the question.

What’s next?

The work is far from done — apart from exposing APIs that are not exposed yet (which is mostly routine work), I’m still not quite satisfied with the performance of bound types, so on my roadmap is trying to expose a basic type using pure C Python APIs in the most efficient way possible and then comparing how long it takes to instantiate that type and call methods on it. One of the things to try is the vectorcall call protocol that’s new in Python 3.8 (PEP 590), and the research couldn’t be complete without also trying a similar thing in MicroPython.

~ ~ ~