During the past four months, Magnum began its adventure into the Python world. Not just with some autogenerated bindings and not just with some autogenerated Sphinx docs — that simply wouldn't be Magnum enough. Brace yourselves, this article will show you everything.

The new Magnum Python bindings, while still labeled experimental, already give you a package usable in real workflows — a NumPy-compatible container library, graphics-oriented math classes and functions, OpenGL buffer, mesh, shader and texture APIs, image and mesh data import and an SDL / GLFW application class with key and mouse events. Head over to the installation documentation to get it yourself; if you are on ArchLinux or use Homebrew, packages are already there, waiting for you:

brew tap mosra/magnum
brew install --HEAD corrade magnum magnum-plugins magnum-bindings

And of course it has all the goodies you'd expect from a "Python-native" library — full slicing support, errors reported through Python exceptions instead of return codes (or hard asserts) and properties instead of setters/getters where it makes sense. To give you a quick overview of how it looks and how it is used, the first few examples are ported to it:

Enter pybind11

I discovered pybind11 by a lucky accident in early 2018 and immediately had to try it. Learning the basics and exposing some minimal matrix/vector math took me about two hours. It was extremely fun and I have to thank all the pybind11 developers for making it so straightforward to use.

py::class_<Vector3>(m, "Vector3")
    .def_static("x_axis", &Vector3::xAxis, py::arg("length") = 1.0f)
    .def_static("y_axis", &Vector3::yAxis, py::arg("length") = 1.0f)
    .def_static("z_axis", &Vector3::zAxis, py::arg("length") = 1.0f)
    .def(py::init<Float, Float, Float>())
    .def(py::init<Float>())
    .def(py::self == py::self)
    .def(py::self != py::self)
    .def("is_zero", &Vector3::isZero)
    .def("is_normalized", &Vector3::isNormalized)
    .def(py::self += py::self)
    .def(py::self + py::self)
    .def(py::self *= Float{})
    .def(py::self * Float{})
    .def(py::self *= py::self);

That's what it took to bind a vector class.

However, different things took priority and so the prototype got shelved until it was revived again this year. But I learned one main thing — even just the math classes alone were so useful that I kept the built Python module around and used it from time to time as an enhanced calculator. Now, with the magnum.math module being almost complete, it's an everyday tool I use for quick calculations. Feel free to do the same.

>>> from magnum import *
>>> Matrix3.rotation(Deg(45))
Matrix(0.707107, -0.707107, 0,
       0.707107, 0.707107, 0,
       0, 0, 1)

Quick, where are the minus signs in a 2D rotation matrix?

What Python APIs (and docs) could learn from C++

Every time someone told me they're using NumPy for "doing math quickly in Python", I assumed it was the reasonable thing to do — until I actually tried to use it. I get that my use case of 4×4 matrices at most might not align well with NumPy's goals, but the problem is, as far as I know, there's no full-featured math library for Python that would give me the whole package¹ including quaternions or 2D/3D transformation matrices.

As an exercise in usability comparison, I tried to express the rotation matrix shown in the box above in SciPy / NumPy. It took me a good half an hour of staring at the docs of scipy.spatial.transform.Rotation until I ultimately decided it's not worth my time. The overarching problem I have with all those APIs is that it's not at all clear what types I'm expected to feed them, and the provided example code looks like I'm supposed to do half of the calculations myself anyway.

>>> import numpy as np
>>> from scipy.spatial.transform import Rotation as R
>>> r = R.from_quat([0, 0, np.sin(np.pi/4), np.cos(np.pi/4)])

Rotation.from_quat(quat, normalized=False)

Parameters:

quat: array_like, shape (N, 4) or (4,)
Each row is a (possibly non-unit norm) quaternion in scalar-last (x, y, z, w) format.

scipy.spatial.transform.Rotation.from_quat()

Type information in the SciPy documentation is vague at best. Also, I'd like something that would make the quaternion for me as well.

To avoid the type confusion, with the Magnum Python bindings I decided to use strong types where possible — so instead of a single dynamic matrix / vector type akin to numpy.ndarray, there's a clear distinction between matrices and vectors of different sizes. So if you do Matrix4x3() @ Matrix2x4(), the docs of Matrix4x3.__matmul__() will tell you the result is Matrix2x3. For NumPy itself, there's a proposal for improved type annotations at numpy/numpy#7370 which would help a lot, but the documentation tools would have to make use of it. More on that below.
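
The idea can be sketched in plain Python with hypothetical stub classes (not the actual Magnum bindings): because every matrix size is its own type, the signature of __matmul__ alone documents what a multiplication returns.

```python
# Hypothetical sketch, not the actual Magnum bindings: one class per matrix
# size, so the signature of __matmul__ documents the result type exactly.
class Matrix2x3:
    """A matrix with 2 columns and 3 rows."""

class Matrix2x4:
    """A matrix with 2 columns and 4 rows."""

class Matrix4x3:
    """A matrix with 4 columns and 3 rows."""

    def __matmul__(self, other: "Matrix2x4") -> "Matrix2x3":
        # A 3x4 matrix times a 4x2 matrix gives a 3x2 matrix -- in the
        # columns-first naming used here, Matrix4x3 @ Matrix2x4 -> Matrix2x3.
        if not isinstance(other, Matrix2x4):
            return NotImplemented
        return Matrix2x3()

result = Matrix4x3() @ Matrix2x4()
assert isinstance(result, Matrix2x3)
```

A mismatched multiplication fails loudly with a TypeError instead of silently producing a wrongly-shaped array.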

One little thing with a big impact in the C++ API is strongly-typed angles. You no longer need to remember that trig functions use radians internally while HSV colors or OpenAL juggle with degrees instead — simply use whatever you please. So Python got Deg and Rad as well. Python doesn't have user-defined literals (and I'm not aware of any proposal to add them), however there's a way to make Python recognize them. I'm not yet sure if this amount of magic is wise to apply, but I might try it out once.
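
A minimal sketch of how such strongly-typed angles can work, using hypothetical pure-Python Deg and Rad classes rather than the actual magnum types — functions accept either unit and convert internally, while bare floats are rejected so the unit is always explicit at the call site:

```python
import math

# Hypothetical sketch of strongly-typed angles, not the actual magnum classes.
class Rad:
    def __init__(self, value):
        self.value = float(value)

class Deg:
    def __init__(self, value):
        self.value = float(value)

def to_radians(angle):
    # Accept either unit explicitly; reject bare floats so the caller
    # always spells out which unit is meant.
    if isinstance(angle, Rad):
        return angle.value
    if isinstance(angle, Deg):
        return math.radians(angle.value)
    raise TypeError('expected Deg or Rad, got {}'.format(type(angle).__name__))

def sin(angle):
    return math.sin(to_radians(angle))

assert abs(sin(Deg(30.0)) - 0.5) < 1e-6
assert abs(sin(Rad(math.pi / 6)) - 0.5) < 1e-6
```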

1.
^ As /u/NihonNukite pointed out on Reddit, there's Pyrr that provides the above missing functionality, fully integrated with NumPy. The only potential downside is that it's all pure Python, not optimized native code.

Hard things are suddenly easy if you use a different language

>>> a = Vector4(1.5, 0.3, -1.0, 1.0)
>>> b = Vector4(7.2, 2.3, 1.1, 0.0)
>>> a.wxy = b.xwz
>>> a
Vector(0, 1.1, -1, 7.2)

If you ever used GLSL or any other shader language, you probably fell in love with vector swizzles the moment you saw them … and then became sad after realizing that such APIs are practically impossible² to have in C++. Swizzle operations are nevertheless useful and assigning each component separately would be a pain, so Magnum provides Math::gather() and Math::scatter() that allow you to express the above:

a = Math::scatter<'w', 'x', 'y'>(a, Math::gather<'x', 'w', 'z'>(b));

Verbose³ but practically possible. The point is, however, that the above is implementable very easily in Python using __getattr__() and __setattr__() … plus a ton of error checking on top.
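
A minimal sketch of how that could look, with a hypothetical Vec class and most of that error checking omitted:

```python
# Minimal sketch of GLSL-style swizzles via __getattr__/__setattr__; real
# bindings would need considerably more error checking than shown here.
class Vec:
    _components = 'xyzw'

    def __init__(self, *values):
        # bypass our custom __setattr__ for the internal storage
        object.__setattr__(self, '_data', list(values))

    def __getattr__(self, name):
        # b.xwz -> a new Vec picking components x, w, z in that order
        if all(c in self._components for c in name):
            return Vec(*(self._data[self._components.index(c)] for c in name))
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # a.wxy = ... -> scatter the right-hand side into components w, x, y
        if all(c in self._components for c in name):
            for c, v in zip(name, value._data):
                self._data[self._components.index(c)] = v
        else:
            object.__setattr__(self, name, value)

a = Vec(1.5, 0.3, -1.0, 1.0)
b = Vec(7.2, 2.3, 1.1, 0.0)
a.wxy = b.xwz                           # same operation as in the box above
assert a._data == [0.0, 1.1, -1.0, 7.2]
```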

2.
^ GLM does have those, if you enable GLM_FORCE_SWIZZLE, but doing so adds three seconds⁴ to the compilation time of each file that includes GLM headers. I'd say that makes swizzles possible in theory but such overhead makes them practically useless.
3.
^ Math functions are functions and so do not mutate their arguments, hence the final self-assignment. It would of course be better to be able to write Math::gather<"wxy">(b) or at least Math::gather<'wxy'>(b) but C++ insists on the first being impossible and the second being unportable. And creating a user-defined literal just to specify a swizzle seems excessive.
4.
^ I did a couple of benchmarks for a yet-to-be-published article comparing math library implementations, and this was a shocker. The only other library that came close was Boost.Geometry, with two seconds per file.

… but on the contrary, C++ has it easier with overloads

I was very delighted to discover that pybind11 supports function overloads just like that — if you bind more than one function of the same name, it'll take a typeless (*args, **kwargs) and dispatch to the correct overload based on argument types. It's probably not blazingly fast (and in some cases you could probably beat its speed by doing the dispatch yourself), but it's there and much better than having to invent new names for overloaded functions (and constructors!). With the new typing module, it's possible to achieve a similar thing in pure Python using the @overload decorator — though only for documentation purposes; you're still responsible for implementing the type dispatch yourself. In the case of math.dot() implemented in pure Python, this could look like:

from typing import overload

@overload
def dot(a: Quaternion, b: Quaternion) -> float:
    ...
@overload
def dot(a: Vector2, b: Vector2) -> float:
    ...
def dot(a, b):
    ...  # actual implementation, dispatching on the argument types

What was actually hard, though, was the following, which looks completely ordinary to a C++ programmer:

>>> a = Matrix3.translation((4.0, 2.0))
>>> a
Matrix(1, 0, 4,
       0, 1, 2,
       0, 0, 1)
>>> a.translation = Vector2(5.0, 3.0)
>>> a
Matrix(1, 0, 5,
       0, 1, 3,
       0, 0, 1)

Is the Python language police going to arrest me now?

The case of Matrix3.scaling() vs. mat.scaling() — where the former returns a scaling Matrix3 and the latter the scaling Vector3 out of a scaling matrix — was easier and could be done just via a dispatch based on argument types ("if the first argument is an instance of Matrix3, behave like the member function"). In the case of Matrix3.translation(), however, it's either a static method or an instance property. Ultimately I managed to solve it by supplying a custom metaclass that does the correct dispatch when encountering access to the translation attribute.

But yeah, while almost anything is possible in Python, the language could give a hand here — am I the first person ever who needs this functionality?
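
For illustration, here's one way to sketch the dispatch in plain Python using a custom descriptor — a simplified stand-in, since the actual bindings use a custom metaclass and support assignment as well:

```python
# Sketch of "static method on the class, property on the instance" with a
# plain descriptor; the actual bindings solve this via a custom metaclass.
class static_or_property:
    def __init__(self, static_fn, instance_fn):
        self.static_fn = static_fn
        self.instance_fn = instance_fn

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self.static_fn       # accessed as Matrix3.translation(...)
        return self.instance_fn(obj)    # accessed as mat.translation

class Matrix3:
    def __init__(self, third_column):
        self._translation = tuple(third_column)

    translation = static_or_property(
        # Matrix3.translation(vector) builds a translation matrix ...
        lambda vector: Matrix3(vector),
        # ... while mat.translation extracts the translation vector
        lambda self: self._translation)

mat = Matrix3.translation((4.0, 2.0))   # static-method flavor
assert mat.translation == (4.0, 2.0)    # property flavor
```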

Zero-copy data transfer

One very important part of Python is the Buffer Protocol. It allows zero-copy sharing of arbitrarily shaped data between C and Python — simple tightly-packed linear arrays, 2D matrices, or a green channel of a lower right quarter of an image flipped upside down. Having full support for the buffer protocol was among the reasons why Containers::StridedArrayView went through a major redesign earlier this year. This strided array view is now exposed to Python as containers.StridedArrayView1D (or MutableStridedArrayView1D, and their 2D, 3D and 4D variants) and thanks to the buffer protocol it can be seamlessly converted from and to numpy.array() (and Python's own memoryview as well). Transitively that means you can unleash NumPy-based Python algorithms directly on data coming out of ImageView2D.pixels() and have the modifications immediately reflected back in C++.
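
The zero-copy semantics are easy to demonstrate with Python builtins alone — a memoryview shares the memory of its owner, much like the Magnum views do, just without the arbitrary strides and dimensions:

```python
# Zero-copy sharing through the buffer protocol, shown with builtins only:
# a memoryview over a bytearray references the same memory, no copies made.
data = bytearray(b'\x00' * 8)
view = memoryview(data)

view[0] = 0xff          # writing through the view ...
assert data[0] == 0xff  # ... is immediately visible in the owner

data[7] = 0x7f          # and vice versa
assert view[7] == 0x7f

# slicing a memoryview is zero-copy as well -- still the same memory
half = view[4:]
half[0] = 0x01
assert data[4] == 0x01
```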

Because, again, having a specialized type with further restrictions makes the code easier to reason about, containers.ArrayView (and its mutable variant) is exposed as well. This one works only with linear tightly packed memory and thus is suitable for taking views onto bytes or bytearray, file contents and such. Both the strided and linear array views of course support the full Python slicing API. As an example, here's how you can read an image in Python, pass its contents to a Magnum importer and get the raw pixel data back:

from magnum import trade

def consume_pixels(pixels: np.ndarray):
    ...

importer: trade.AbstractImporter = \
    trade.ImporterManager().load_and_instantiate('AnyImageImporter')
with open(filename, 'rb') as f:
    importer.open_data(f.read())
image: trade.ImageData2D = importer.image2d(0)
# green channel of a lower right quarter of a 256x256 image flipped upside down
consume_pixels(image.pixels[:127:-1, 128:, 1:2])

Just one question left — who owns the memory here, then? To answer that, let's dive into Python's reference counting.

Reference counting

In C++, views are among the more dangerous containers, as they reference data owned by something else. There you're expected to ensure the data owner stays in scope for at least as long as the view onto it. The same goes for other types — for example, a GL::Mesh may reference a bunch of GL::Buffers, or a Trade::AbstractImporter loaded from a plugin needs its plugin manager to be alive to keep the plugin library loaded.

(Diagram: reference hierarchy — pixels references image, image references f, and importer references both its manager and f.)
Reference hierarchy

The dim dashed lines show additional potential dependencies that would appear with future zero-copy plugin implementations — when the file format allows it, these would reference the data in f directly instead of storing a copy themselves.

However, imposing similar constraints on Python users would be daring too much, so all exposed Magnum types that refer to external data implement reference counting under the hood. The designated way of doing this with pybind11 is wrapping everything in std::shared_ptr. On the other hand, Magnum is free of any shared pointers by design, and adding them back just to make Python happy would make everyone else angry in exchange. What Magnum does instead is extend the so-called holder type in pybind11 (which doesn't have to be std::shared_ptr; std::unique_ptr or a custom pointer type is fine as well) and store references to instance dependencies inside it.

The straightforward way of doing this would be to take GL::Mesh, subclass it into a PyMesh, store buffer references inside it and then expose PyMesh as gl.Mesh instead. But compared to the holder type approach this has a serious disadvantage: every API that works with meshes would suddenly need to work with PyMesh instead, and that's not always possible.

For testing and debugging purposes, references to memory owners or other data are always exposed through the API — see for example ImageView2D.owner or gl.Mesh.buffers.

Zero-waste data slicing

One thing I got used to, especially when writing parsers, is continually slicing the input data view as the algorithm consumes its prefix. Consider the following Python code, vaguely resembling an OBJ parser:

view = containers.ArrayView(data)
while view:
    if view[0] == '#':    # comment, ignore until EOL
        while view and view[0] != '\n':
            view = view[1:]
    elif view[0] == 'v':  # vertex
        view = self.parse_vertex(view)
    elif view[0] == 'f':  # face
        view = self.parse_face(view)
    ...

On every operation, the view gets some prefix chopped off. While not a problem in C++, this would generate an impressively long reference chain in Python, preserving all intermediate views from all loop iterations.

(Diagram: a chain of references — sliceN → … → slice4 → slice3 → slice2 → slice1 → view → data.)

While the views are generally smaller than the data they refer to, with big files it could easily happen that the overhead of the views becomes larger than the parsed file itself. To avoid such endless growth, slicing operations on views always refer to the original data owner, allowing the intermediate views to be collected. In other words, for containers.ArrayView.owner, view[:].owner is view.owner always holds.
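
The trick can be sketched in a few lines of plain Python, with a hypothetical stand-in for the actual containers.ArrayView:

```python
# Hypothetical sketch of owner-preserving slicing; the real
# containers.ArrayView does this on top of the buffer protocol.
class ArrayView:
    def __init__(self, data, owner=None):
        self.data = data
        # slices always point back at the original owner, so the
        # intermediate views can be garbage-collected
        self.owner = data if owner is None else owner

    def __getitem__(self, index):
        return ArrayView(self.data[index], self.owner)

data = b'v 1.0 2.0 3.0\n'
view = ArrayView(data)
sliced = view[2:][4:][4:]          # chop prefixes off, parser-style
assert bytes(sliced.data) == b'3.0\n'
assert sliced.owner is view.owner  # no chain of intermediate references
```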

(Diagram: every slice references the owner directly — slice1 through sliceN and view all point at data.)

The less-than-great aspects of pybind11

Throwing C++ exceptions is actually really slow

While I was aware there's some overhead involved with C++'s throw, I never guessed the overhead would be so big. In most cases this would not be a problem, as exceptions are exceptional, but there's one little corner of Python where you have to use them — iteration. In order to iterate anything, Python calls __getitem__() with an increasing index and, instead of checking against __len__(), simply waits until it raises IndexError. This is also how conversion from/to lists is done and how numpy.array() populates the array from a list-like type, unless the type supports the Buffer Protocol. Bindings for Magnum's Vector3.__getitem__() originally looked like this:

.def("__getitem__", [](const T& self, std::size_t i) -> typename T::Type {
    if(i >= T::Size) throw pybind11::index_error{};
    return self[i];
})

Plain and simple and seemingly not a perf problem at all … until you start measuring:

(Plot: cost of raising an exception — pure Python raise IndexError(): 0.1356 ± 0.0049 µs; pybind11 throw pybind11::index_error{}: 3.4824 ± 0.1484 µs; pybind11 throw pybind11::error_already_set{}: 2.607 ± 0.1367 µs; pybind11 / CPython PyErr_SetString(): 0.4181 ± 0.0363 µs.)

This is further blown out of proportion in the case of numpy.array() — looking at the sources of PyArray_FromAny(), it's apparently hitting the out-of-bounds condition three times: first when checking for dimensions, second when calculating a common type for all elements and third when doing the actual copy of the data. This was most probably not worth optimizing assuming sane exception performance, however combined with pybind11 it leads to a massive slowdown:

(Plot: constructing numpy.ndarray — from a list: 0.5756 ± 0.0313 µs; from Vector3 via throw pybind11::index_error{}: 17.2296 ± 1.1786 µs; via throw pybind11::error_already_set{}: 14.2204 ± 0.3782 µs; via PyErr_SetString(): 6.3909 ± 0.1217 µs; via the buffer protocol: 0.6411 ± 0.0368 µs.)

As hinted by the plots above, there are a few possible ways of countering the inefficiency:

  1. A lot of overhead in pybind11 is related to exception translation, which can be sidestepped by calling PyErr_SetString() and telling pybind11 an error is already set and it only needs to propagate it:

    .def("__getitem__", [](const T& self, std::size_t i) -> typename T::Type {
        if(i >= T::Size) {
            PyErr_SetString(PyExc_IndexError, "");
            throw pybind11::error_already_set{};
        }
        return self[i];
    })

    As seen above, this results in a moderate improvement, with exceptions taking ~1 µs less to throw (though for numpy.array() it doesn't help much). This is what Magnum Bindings globally switched to after discovering the perf difference, and apart from PyErr_SetString() there's also PyErr_Format(), able to stringify Python objects directly using "%A" — hard to beat that with any third-party solution.

  2. Even with the above, the throw and the whole exception bubbling inside pybind is still responsible for quite a lot, so the next step is to only call PyErr_SetString() and return nothing to pybind to indicate we want to raise an exception instead:

    .def("__getitem__", [](const T& self, std::size_t i) -> pybind11::object {
        if(i >= T::Size) {
            PyErr_SetString(PyExc_IndexError, "");
            return pybind11::object{};
        }
        return pybind11::cast(self[i]);
    })

    This results in quite a significant improvement, reducing the exception overhead from about 3.5 µs to 0.4 µs. It however relies on a patch that's not merged yet (see pybind/pybind11#1853) and it requires the bound API to return a typeless object instead of a concrete value, as there's no other way to express a "null" value.

  3. If there's a possibility to use the Buffer Protocol, prefer it over __getitem__(). At first I was skeptical about this idea, because the buffer protocol setup with pointers and shapes and formats and sizes and all the related error checking certainly feels heavier than simply iterating a three-element array. But the relative heaviness of exceptions makes it a winner. Pybind has built-in support, so why not use it. Well, except …

Let's allocate a bunch of vectors and strings to do a zero-copy data transfer

Python's Buffer Protocol, mentioned above, is a really nice approach for data transfers with minimal overhead — if used correctly. Let's look again at the case of calling numpy.array() above:

(Plot: creating numpy.array() from a list-like type — from a list: 0.5756 ± 0.0313 µs; from array.array: 0.4766 ± 0.0263 µs; from Vector3 via pybind11::buffer: 0.6411 ± 0.0368 µs; from Vector3 via a raw Py_buffer: 0.5552 ± 0.0294 µs.)

It's clear that converting a pure Python list to a numpy.array() is, even with all the exceptions involved, still faster than using pybind's buffer protocol implementation to convert a Vector3 to it. In comparison, array.array() (which implements the Buffer Protocol as well, only natively in plain C) is quite speedy, so there's definitely something fishy in pybind11.

struct buffer_info {
    void *ptr;
    ssize_t itemsize;
    std::string format;
    ssize_t ndim;
    std::vector<ssize_t> shape;
    std::vector<ssize_t> strides;
};

Oh, so that's why.

All the std::string and std::vector allocations in pybind11::buffer_info add to the overhead, so I decided to sidestep pybind11 altogether and interface directly with the underlying Python C API instead. Because the Py_buffer structure is quite flexible, I ended up pointing its members to statically defined data for each matrix / vector type, making the buffer protocol operation completely allocation-less. In the case of containers.ArrayView and its strided equivalent, the structure points to their internal members, so nothing needs to be allocated even for a containers.StridedArrayView4D. Additionally, operating directly with the C API allowed me to correctly propagate readonly properties and the above-mentioned data ownership as well.

Become a 10× programmer with this one weird trick

Compile times with pybind are something I can't get used to at all. Maybe this is nothing extraordinary when you do a lot of Modern C++, but an incremental build of a single file taking 20 seconds is a bit too much for my taste. In comparison, I can recompile the whole of Magnum (without tests) in 15 seconds. This gets a lot worse when building Release, due to -flto being passed to the compiler — then an incremental build of that same file takes 90 seconds⁵, with a large part of the time spent in Link-Time Optimization.

Fortunately, by another lucky accident, I recently discovered that GCC's -flto flag has a parallel option⁶ — so if you have 8 cores, -flto=8 will make the LTO step run eight times faster, turning the above 90 seconds into slightly-less-horrific 35 seconds. Imagine that. This however has a dangerous consequence — the buildsystem is not aware of the LTO parallelism, so it will happily schedule 8 parallelized link jobs at once, bringing your machine to a grinding halt unless you have 32 GB of RAM and most of it free. If you use Ninja, it has job pools where you can tell it not to fire up more than one such link job at once, but as far as my understanding of this feature goes, this will not affect the scheduling of compile and link jobs in parallel.

Once Clang 9 is out (and once I get some free time), I want to unleash the new -ftime-trace option on the pybind code to see if there's any low-hanging fruit. But unfortunately, in the long term I'm afraid I'll need to replace even more parts of pybind to bring compile times back within sane bounds.

5.
^ To give some perspective, the cover image of this article (at the top) is generated from the preprocessed output of the file that takes 90 seconds to build. About 1% — a few faded lines in the front — is the actual bindings code. The rest, as far as your eyes can see, is STL and pybind11 headers.
6.
^ It's currently opt-in, but GCC 10 is scheduled to have it enabled by default. If you are on Clang, it has ThinLTO, however I was not able to convince it to run in parallel for me.

Everyone "just uses Sphinx". You?

The obvious first choice when it comes to documenting Python code is Sphinx — everything including the standard library uses it and I don't even remember seeing a single Python library that doesn't. However, if you clicked on any of the above doc links, you probably realized that … no, Magnum is not using it.

Ever since documentation search was introduced early last year, many developers quickly became addicted to it. Whipping up some Sphinx docs, where both search performance and result relevance are extremely underwhelming, would effectively undo all the usability progress Magnum has made until now, so the only option was to bring the search to Python as well.

magnum.Matrix4 documentation
Type annotations are central to the Python doc generator.

While at it, I made the doc generator aware of all kinds of type annotations, properly crosslinking everything to the corresponding type definitions. And not just local types — similarly to Doxygen's tagfiles, Sphinx has Intersphinx, so linking to third-party library docs (such as NumPy) or even the standard library is possible as well. Conversely, the m.css Python doc generator exports an Intersphinx inventory file, so no matter whether you use vanilla Sphinx or m.css, you can link to m.css-generated Python docs from your own documentation as well.

If you want to try it out on your project, head over to the m.css website for a brief introduction. Compared to Sphinx it behaves more like Doxygen, as it implicitly browses your module hierarchy and generates a dedicated page for each class and module. The way Sphinx does it was interesting to me at first, but over time I realized it needs quite a lot of effort on the developer's side to organize well — and on the documentation reader's side, it can lead to things being harder to find than they should be (for example, the docs for str.splitlines() are buried somewhere in the middle of a kilometer-long page documenting all builtin types).

The doc generator resembles Sphinx, but I decided to experiment with a clean slate first instead of making it 100% compatible — some design decisions in Sphinx are historical (such as type annotations in the doc block instead of in the code itself) and it didn't make sense to me to port those over. At least for now; full Sphinx compatibility is not completely out of the question.

What’s next?

The work is far from done — apart from exposing APIs that are not exposed yet (which is mostly routine work), I'm still not quite satisfied with the performance of bound types, so on my roadmap is exposing a basic type using the pure C Python API in the most efficient way possible and then comparing how long it takes to instantiate that type and call methods on it. One of the things to try is the vectorcall protocol that's new in Python 3.8 (PEP 590), and the research wouldn't be complete without also trying a similar thing in MicroPython.

~ ~ ~