Mag­num docs were miss­ing the search func­tion­al­i­ty for some time be­cause I want­ed to im­ple­ment it prop­er­ly for the new theme. The prop­er im­ple­men­ta­tion is now ready.

Doxygen search GIF

“How hard can it be?”

That’s what I told to my­self about two weeks ago. I was so un­hap­py with the re­spon­sive­ness and us­abil­i­ty of the stock Doxy­gen search that I usu­al­ly went straight to Google in­stead. With that ex­pe­ri­ence, for the new theme I want­ed a so­lu­tion that:

  • Is fast and sta­ble, with­out re­sults be­ing re­ordered as they ap­pear
  • Is small enough to not need any com­plex serv­er-side so­lu­tion
  • Is show­ing rel­e­vant re­sults first, so I don’t need to scroll through end­less lists to find what I’m look­ing for
  • Is us­able for API dis­cov­ery (im­me­di­ate looka­head au­to­com­ple­tion in­stead of be­ing on­ly a full-text search)
  • Has an UX that al­lows me to pop it up any­where on a page, nav­i­gate us­ing just the key­board and re­turn back to the page with­out los­ing con­text

With not even 500 lines of pure JavaScript and rough­ly half a megabyte bi­na­ry con­tain­ing search da­ta I’m able to search among all 7500 sym­bols of the Mag­num en­gine faster than I can type.

The idea

The Trie search struc­ture seemed to fit the case for looka­head au­to­com­ple­tion, how­ev­er build­ing a huge tree struc­ture in JavaScript didn’t re­al­ly get my hopes for bet­ter per­for­mance up. First hit when search­ing for “javascript fast tries” was this blog post by John Re­sig (the au­thor of jQuery). I was quite hor­ri­fied that even the fast im­ple­men­ta­tion still in­volved build­ing a tree struc­ture un­til I saw that the ar­ti­cle was writ­ten in 2011 — in the age be­fore asm.js, typed ar­rays and Em­scripten.

Phew, okay. I de­cid­ed to drop “javascript” from my search query and a pa­per about Tight­ly packed tries fi­nal­ly popped up in the re­sults. This start­ed to look like the right di­rec­tion to fol­low.

Ini­tial im­ple­men­ta­tion

The ba­sic idea is: dur­ing out­put gen­er­a­tion in the m.css Doxy­gen theme, take the in­put search da­ta and pre­pro­cess them to a trie struc­ture us­ing sim­ple nest­ed Python dicts. Then, once ful­ly pop­u­lat­ed, lin­earize the trie to a bi­na­ry file. The bi­na­ry file will lat­er be load­ed us­ing JavaScript in the brows­er and search per­formed di­rect­ly on it with­out build­ing any new struc­ture out of it to min­i­mize ini­tial start­up time and mem­o­ry us­age.

Each node in the trie will store a list of re­sults (rep­re­sent­ed by sim­ple in­te­ger in­dices to an ex­ter­nal ar­ray, ex­pla­na­tion be­low) and a map from char­ac­ters to child nodes. I set a rea­son­able lim­it of 64k sym­bols in to­tal, max bi­na­ry file size to 256 MB, max 256 re­sults and at most 256 child 8-bit char­ac­ters for each node. With that, the bi­na­ry se­ri­al­iza­tion for­mat for the trie is rough­ly as fol­lows:

Field size Con­tent
LaTeX Math 32b Root node off­set
LaTeX Math 8b Node N re­sult count LaTeX Math r
LaTeX Math 8b Node N child count LaTeX Math c
LaTeX Math r \times 16b Node N re­sult in­dices
LaTeX Math 8b + 24b Node N child 1 char­ac­ter + byte off­set
LaTeX Math 8b + 24b Node N child 2 char­ac­ter + byte off­set
LaTeX Math 8b + 24b Node N child LaTeX Math c char­ac­ter + byte off­set

The trie will be tra­versed in post-or­der when se­ri­al­iz­ing, which means that each node off­set in the file is al­ready known when se­ri­al­iz­ing its par­ent. The on­ly ex­cep­tion is the root node off­set, which is up­dat­ed at the start of the file as the last step.

Now, what ac­tu­al­ly will be saved in­to the trie? Let’s start with a very sim­pli­fied case: hav­ing the Mag­num and Mag­num::Math names­paces, Mag­num::Math::Vec­tor and Mag­num::Math::Range class­es and just three func­tions Mag­num::Math::Vec­tor::min(), Mag­num::Math::Range::min() and Mag­num::Math::min(). The stock Doxy­gen search al­lows me to on­ly search us­ing the leaf name — if I want to look up Math::min(), en­ter­ing min() gives too many re­sults and I need to fil­ter the re­sults man­u­al­ly, which is a waste of brain mat­ter. The goal here is to be able to search us­ing any pre­fix, which means that each sym­bol has to be saved in­to the trie with all its pos­si­ble pre­fix­es.

Hav­ing good de­bug vi­su­al­iza­tion is key to un­der­stand­ing the da­ta. A tu­ple con­tain­ing the above sev­en sym­bols in­sert­ed with dif­fer­ent pre­fix­es looks like this — each col­umn is one tree depth lev­el (go­ing to the next col­umn in­volves search­ing for the cor­rect let­ter) and par­tic­u­lar leaf nodes con­tain re­sult IDs (shown in cyan). The search is case-in­sen­si­tive, so all strings are nor­mal­ized to low­er­case:

magnum [0]
|||   ::math [1]
|||         ::vector [2]
|||           |     ::min [3]
|||           range [4]
|||           |    ::min [5]
|||           min [6]
||th [1]
||| ::vector [2]
|||   |     ::min [3]
|||   range [4]
|||   |    ::min [5]
|||   min() [6]
|in [3, 5, 6]
vector [2]
|     ::min() [3]
range [4]
|    ::min() [5]

Sub­tree merg­ing

There’s one prob­lem with the above ap­proach — al­most all sym­bols are con­tained in the trie two or more times, de­pend­ing on their depth. With thou­sands of sym­bols this would be a sig­nif­i­cant size in­crease. Thanks to the de­bug vi­su­al­iza­tion, the so­lu­tion was im­me­di­ate­ly vis­i­ble — what if I would sim­ply elim­i­nate sub­trees that look the same?

Dur­ing se­ri­al­iza­tion the tree is tra­versed in post-or­der, which means that the fi­nal bi­na­ry form of a node is known when se­ri­al­iz­ing its par­ent. Then it’s enough to main­tain a lookup ta­ble of se­ri­al­ized sub­trees. When a new se­ri­al­ized sub­tree is al­ready in the lookup ta­ble, it’s not writ­ten to the file but the byte off­set of the al­ready ex­ist­ing in­stance is used in­stead. This of course re­quires that sub­trees with the same da­ta are se­ri­al­ized the same, in­de­pen­dent on their po­si­tion in the file — in or­der to achieve that, I had to switch from rel­a­tive child off­sets (as sug­gest­ed in the pa­per for space sav­ing) to ab­so­lute off­sets.

Com­par­ing with the vi­su­al­iza­tion above, # de­notes where a du­pli­cate sub­tree was elim­i­nat­ed by this method:

magnum [0]
|||   ::math [1]
|||         ::vector [2]
|||           |     ::min [3]
|||           range [4]
|||           |    ::min [5]
|||           min [6]
||t#
|in [3, 5, 6]
v#
r#

With the sub­tree merg­ing in place, the se­ri­al­ized size of this min­i­mal case went down from 592 bytes to not even a half: 290 bytes. In the full Mag­num search da­ta, this saved over half a megabyte. A nice thing about this ap­proach is that it’s com­plete­ly lan­guage-obliv­i­ous — it just works on the bi­na­ry da­ta.

Re­sult map

So far there was just the trie that con­tained re­sult in­dices such as [3, 5, 6]. That’s ob­vi­ous­ly not enough to show a link with search re­sult to the us­er. Next to the trie there has to be a map from those in­dices to ac­tu­al re­sult ti­tles and URLs.

Each ti­tle and URL is a string of a dif­fer­ent length and in or­der to make the lookup a con­stant-time op­er­a­tion, there needs to be first a ta­ble map­ping an in­dex to byte off­set with the string da­ta. Be­cause in the trie da­ta I’m us­ing on­ly 24 bits for byte off­set, I can do the same here and use the re­main­ing 8 bits for var­i­ous ad­di­tion­al in­for­ma­tion, such as sym­bol type.

Field size Con­tent
LaTeX Math 8b + 24b Re­sult 0 flags + byte off­set LaTeX Math o_0
LaTeX Math 8b + 24b Re­sult 1 flags + byte off­set LaTeX Math o_1
LaTeX Math 8b + 24b Re­sult N flags + byte off­set LaTeX Math o_n
LaTeX Math 32b File size LaTeX Math l
LaTeX Math o_1 - o_0 Re­sult 0 ti­tle, '\0', URL
LaTeX Math l - o_n Re­sult N ti­tle, '\0', URL

Look­ing up a re­sult LaTeX Math r in­volves read­ing da­ta from byte off­set range giv­en by in­dex LaTeX Math r and LaTeX Math r + 1 . The ti­tle and URL is null-sep­a­rat­ed. Re­sult da­ta cor­re­spond­ing to the above sim­pli­fied case look like this when vi­su­al­ized:

0: Magnum [type=NAMESPACE] -> namespaceMagnum.html
1: Magnum::Math [type=NAMESPACE] -> namespaceMagnum_1_1Math.html
2: Magnum::Math::Vector [type=CLASS] -> classMagnum_1_1Math_1_1Vector.html
3: Magnum::Math::Vector::min() [type=FUNC] ->
        classMagnum_1_1Math_1_1Vector.html#af029f9f7810201f0bd8d9580af273bde
4: Magnum::Math::Range [type=CLASS] -> classMagnum_1_1Math_1_1Range.html
5: Magnum::Math::Range::min() [type=FUNC] ->
        classMagnum_1_1Math_1_1Range.html#ad4919361a2086212fac96da0221e4dcd
6: Magnum::Math::min() [type=FUNC] ->
        namespaceMagnum_1_1Math.html#ae22ef0cb2a5a5e4c5e626a3df670be21

Re­sult map pre­fix merg­ing

Again, thanks to the da­ta vi­su­al­iza­tion it’s im­me­di­ate­ly clear that there’s a lot of re­dun­dant in­for­ma­tion that’s neg­a­tive­ly af­fect­ing da­ta size. The re­dun­dant in­for­ma­tion is main­ly in pre­fix­es, so I reused the al­ready im­ple­ment­ed trie to per­form the op­ti­miza­tion. The fol­low­ing vi­su­al­iza­tion shows the re­sult — the de­tails are way too bor­ing to ex­plain in full here, but ba­si­cal­ly ev­ery en­try ti­tle can be op­tion­al­ly pre­fixed by an­oth­er en­try ti­tle (de­not­ed by prefix=M) and take the first [:N] char­ac­ters from that en­try URL, re­cur­sive­ly.

0: Magnum [type=NAMESPACE] -> namespaceMagnum.html
1: ::Math [prefix=0[:15], type=NAMESPACE] -> _1_1Math.html
2: ::Vector [prefix=1[:0], type=CLASS] -> classMagnum_1_1Math_1_1Vector.html
3: ::min() [prefix=2[:34], type=FUNC] -> #af029f9f7810201f0bd8d9580af273bde
4: ::Range [prefix=1[:0], type=CLASS] -> classMagnum_1_1Math_1_1Range.html
5: ::min() [prefix=4[:33], type=FUNC] -> #ad4919361a2086212fac96da0221e4dcd
6: ::min() [prefix=1[:28], type=FUNC] -> #ae22ef0cb2a5a5e4c5e626a3df670be21

In this par­tic­u­lar case the map size went down from 494 bytes to 321 bytes. In case of Mag­num da­ta, the sav­ing was over a megabyte.

Looka­head bar­ri­ers

Look­ing at the above, some­thing seems off. It’s def­i­nite­ly not the us­er’s ex­pec­ta­tion to get to know about all sym­bols that ev­er ex­ist­ed when en­ter­ing a bunch of char­ac­ters. For ex­am­ple, when I’m search­ing for a names­pace, I am not in­ter­est­ed in all its mem­bers. Or when a some pre­fix is ap­pear­ing both in func­tion name and en­clos­ing names­pace, I’ll get the re­sult twice in the list.

So how to solve that? When adding sym­bols to the trie, let’s add bar­ri­ers to sig­nal the looka­head au­to­com­ple­tion that it should not dig deep­er look­ing for pos­si­ble re­sults. How­ev­er, I don’t want to pre­vent API dis­cov­ery, so when one en­ters e.g. Magnum:, the search should go past the bar­ri­er and show all mem­bers of the names­pace. (But not nest­ed mem­bers!)

In the fol­low­ing vi­su­al­iza­tion, the $ char­ac­ters de­note a looka­head bar­ri­er. If the au­to­com­ple­tion reach­es a sym­bol that’s in front of it, it will not at­tempt to con­tin­ue fur­ther.

magnum [0]
|||   :$
|||    :math [1]
|||         :$
|||          :vector [2]
|||           |     :$
|||           |      :min [3]
|||           range [4]
|||           |    :$
|||           |     :min [5]
|||           min [6]
||th [1]
||| :$
|||  :vector [2]
|||   |     :$
|||   |      :min [3]
|||   range [4]
|||   |    :$
|||   |     :min [5]
|||   min [6]
|in [3, 5, 6]
vector [2]
|     :$
|      :min [3]
range [4]
|    :$
|     :min [5]

With these bar­ri­ers in place, search­ing for m gives the fol­low­ing out­put — unique sym­bols start­ing with m, sort­ed from short­est to long­est:

[{title: 'Magnum::Math::min()',
  url: 'namespaceMagnum_1_1Math.html#ae22ef0cb2a5a5e4c5e626a3df670be21'},
 {title: 'Magnum::Math::Range::min()',
  url: 'classMagnum_1_1Math_1_1Range.html#a22af2191e4ab88b45f082ef14aa45185'},
 {title: 'Magnum::Math::Vector::min()',
  url: 'classMagnum_1_1Math_1_1Vector.html#af029f9f7810201f0bd8d9580af273bde'},
 {title: 'Magnum::Math',
  url: 'namespaceMagnum_1_1Math.html'},
 {title: 'Magnum',
  url: 'namespaceMagnum.html'}]

Search­ing for math gives just this:

[{title: 'Magnum::Math',
  url: 'namespaceMagnum_1_1Math.html'}]

While search­ing for math: pro­vides a list of all its im­me­di­ate mem­bers:

[{title: 'Magnum::Math::min()',
  url: 'namespaceMagnum_1_1Math.html#ae22ef0cb2a5a5e4c5e626a3df670be21'},
 {title: 'Magnum::Math::Range',
  url: 'classMagnum_1_1Math_1_1Range.html'},
 {title: 'Magnum::Math::Vector',
  url: 'classMagnum_1_1Math_1_1Vector.html'}]

Ex­tra good­ies

Just repli­cat­ing the stock search func­tion­al­i­ty and mak­ing it faster would be bor­ing, so I added some cher­ries on top. First is prop­a­gat­ing in­fo about dep­re­cat­ed and deleted func­tions to search re­sults (and mem­ber list­ing as well) to save some the users ex­tra clicks when try­ing to find a func­tion that’s ac­tu­al­ly callable. You can see it in the GIF above.

The sec­ond fea­ture is abil­i­ty to spec­i­fy ad­di­tion­al search key­words for par­tic­u­lar sym­bols. This is use­ful es­pe­cial­ly in the OpenGL wrap­ping lay­er of Mag­num — one can search for a well-known OpenGL sym­bol to get to know what API it cor­re­sponds to. This is ex­posed through cus­tom @m_keywords, @m_keyword and @m_enum_values_as_keywords com­mands, head over to the docs for de­tails. Self-ex­plana­to­ry ex­am­ple:

/**
 * @brief Set texture storage
 *
 * @m_keywords{glTexStorage2D() glTextureStorage2D()}
 */
Texture2D& Texture2D::setStorage(...);

/**
 * @brief Renderer feature
 *
 * @m_enum_values_as_keywords
 */
enum class RendererFeature: GLenum {
    /** Depth test */
    DepthTest = GL_DEPTH_TEST,

    ...
};

Fi­nal im­ple­men­ta­tion

In this ar­ti­cle I tried to ex­plain main­ly the key as­pects of the search im­ple­men­ta­tion . The rest is im­por­tant too, but would make the ar­ti­cle too long and bor­ing.

  • In or­der to prop­er­ly high­light the cur­rent­ly typed part of the search re­sult, the trie lookup is count­ing num­ber of suf­fix char­ac­ters when gath­er­ing the re­sults. This is ex­tend­ed fur­ther in or­der to dis­play func­tion ar­gu­ments lists and const / r-val­ue over­loads (which are not part of the trie da­ta).
  • Due to re­stric­tions of Chro­mi­um-based browsers, it’s not pos­si­ble to down­load da­ta us­ing XMLHttpRequest when served from a lo­cal file-sys­tem. Be­cause that’s a very com­mon use case, I had to work around this by con­vert­ing the bi­na­ry to a Base85-en­cod­ed *.js file that’s load­ed di­rect­ly via <script>.
  • A lot of ef­for went in­to client-side UX, like key­board nav­i­ga­tion, cut­ting off over­ly long pre­fix­es on the prop­er side, avoid­ing page jumps etc. There are still some known is­sues left to be re­solved.

Some stats

With the cur­rent state of Mag­num doc­u­men­ta­tion, where there is 7565 unique sym­bols and I’m in the mid­dle of adding @m_keywords to all OpenGL wrap­per APIs, the da­ta size is as fol­lows, de­pend­ing on what fea­tures are en­abled:

Da­ta type Un­com­pressed Gzipped
Base­line, bi­na­ry 2635.6 kB 884.2 kB
Sub­tree merg­ing, bi­na­ry 2033.3 kB 537.8 kB
Sub­tree and pre­fix merg­ing, bi­na­ry 959.3 kB 501.5 kB
Sub­tree and pre­fix merg­ing, base85 1202.5 kB 752.5 kB

As the docs are ac­cessed through a web­serv­er, I can use the bi­na­ry and en­able serv­er-side gzip com­pres­sion. That gives slight­ly be­low half a megabyte, which, in com­par­i­son to sizes of con­tem­po­rary web­sites, is more than okay. The doc­u­men­ta­tion is up­dat­ed not more than once a week, so the browsers will usu­al­ly just hap­pi­ly serve this dataset from a cache.

One im­por­tant thing to note: the pre­fix merg­ing doesn’t give re­al­ly that much ad­van­tage when com­par­ing the gzipped size, as the da­ta are al­ready very nice­ly com­press­ible.

Bonus: Uni­code

Some­one might say that the above re­stric­tion to 8-bit char­ac­ters makes this not Uni­code-aware and thus use­less be­yond sim­ple ASCII sym­bols. Well, no. As­sum­ing the in­put is UTF-8-en­cod­ed, this is how a trie for two great Czech words “hýž­dě” and “hárá” will look like:

h0xc3
  0xbd
   0xc5
  | 0xbe
  |  d0xc4
  |    0x9b
  |      [13]
  0xa1
   r0xc3
  |  0xa1
  |    [42]

The im­ple­men­ta­tion sim­ply takes the UTF-8 rep­re­sen­ta­tion of the word “hýž­dě” ['h', 0xc3, 0xbd, 0xc5, 0xbe, 'd', 0xc4, 0x9b] byte-by-byte and in­serts that in­to the trie, not giv­ing much thought to what­ev­er en­cod­ing giv­en string is in. This al­so has the ef­fect that UTF-8 rep­re­sen­ta­tion of “ý” and “á” is [0xc3, 0xbd] and [0xc3, 0xa1] re­spec­tive­ly, shar­ing the same first byte 0xc3, which is then a sin­gle node in the trie. That means there’s no need to have more that 8 bits to rep­re­sent a char­ac­ter in the trie.

Try it your­self

Oh, did I men­tion that this whole thing is ful­ly open-source and ready for you to try it out? Head over to the full doc­u­men­ta­tion of the m.css Doxy­gen theme for de­tails. Bug re­ports very wel­come, and if you like what you see, con­sid­er mak­ing a do­na­tion. Thank you!