Mag­num docs were miss­ing the search func­tion­al­ity for some time be­cause I wanted to im­ple­ment it prop­erly for the new theme. The prop­er im­ple­ment­a­tion is now ready.

Doxygen search GIF

“How hard can it be?”

That’s what I told to my­self about two weeks ago. I was so un­happy with the re­spons­ive­ness and us­ab­il­ity of the stock Doxy­gen search that I usu­ally went straight to Google in­stead. With that ex­per­i­ence, for the new theme I wanted a solu­tion that:

  • Is fast and stable, without res­ults be­ing re­ordered as they ap­pear
  • Is small enough to not need any com­plex serv­er-side solu­tion
  • Is show­ing rel­ev­ant res­ults first, so I don’t need to scroll through end­less lists to find what I’m look­ing for
  • Is us­able for API dis­cov­ery (im­me­di­ate looka­head auto­com­ple­tion in­stead of be­ing only a full-text search)
  • Has an UX that al­lows me to pop it up any­where on a page, nav­ig­ate us­ing just the key­board and re­turn back to the page without los­ing con­text

With not even 500 lines of pure JavaS­cript and roughly half a mega­byte bin­ary con­tain­ing search data I’m able to search among all 7500 sym­bols of the Mag­num en­gine faster than I can type.

The idea

The Trie search struc­ture seemed to fit the case for looka­head auto­com­ple­tion, how­ever build­ing a huge tree struc­ture in JavaS­cript didn’t really get my hopes for bet­ter per­form­ance up. First hit when search­ing for “javas­cript fast tries” was this blog post by John Resig (the au­thor of jQuery). I was quite hor­ri­fied that even the fast im­ple­ment­a­tion still in­volved build­ing a tree struc­ture un­til I saw that the art­icle was writ­ten in 2011 — in the age be­fore asm.js, typed ar­rays and Em­scripten.

Phew, okay. I de­cided to drop “javas­cript” from my search query and a pa­per about Tightly packed tries fi­nally popped up in the res­ults. This star­ted to look like the right dir­ec­tion to fol­low.

Ini­tial im­ple­ment­a­tion

The ba­sic idea is: dur­ing out­put gen­er­a­tion in the m.css Doxy­gen theme, take the in­put search data and pre­pro­cess them to a trie struc­ture us­ing simple nes­ted Py­thon dicts. Then, once fully pop­u­lated, lin­ear­ize the trie to a bin­ary file. The bin­ary file will later be loaded us­ing JavaS­cript in the browser and search per­formed dir­ectly on it without build­ing any new struc­ture out of it to min­im­ize ini­tial star­tup time and memory us­age.

Each node in the trie will store a list of res­ults (rep­res­en­ted by simple in­teger in­dices to an ex­tern­al ar­ray, ex­plan­a­tion be­low) and a map from char­ac­ters to child nodes. I set a reas­on­able lim­it of 64k sym­bols in total, max bin­ary file size to 256 MB, max 256 res­ults and at most 256 child 8-bit char­ac­ters for each node. With that, the bin­ary seri­al­iz­a­tion format for the trie is roughly as fol­lows:

Field size Con­tent
32b Root node off­set
8b Node N res­ult count r
8b Node N child count c
r \times 16b Node N res­ult in­dices
8b + 24b Node N child 1 char­ac­ter + byte off­set
8b + 24b Node N child 2 char­ac­ter + byte off­set
8b + 24b Node N child c char­ac­ter + byte off­set

The trie will be tra­versed in post-or­der when seri­al­iz­ing, which means that each node off­set in the file is already known when seri­al­iz­ing its par­ent. The only ex­cep­tion is the root node off­set, which is up­dated at the start of the file as the last step.

Now, what ac­tu­ally will be saved in­to the trie? Let’s start with a very sim­pli­fied case: hav­ing the Mag­num and Mag­num::Math namespaces, Mag­num::Math::Vec­tor and Mag­num::Math::Range classes and just three func­tions Mag­num::Math::Vec­tor::min(), Mag­num::Math::Range::min() and Mag­num::Math::min(). The stock Doxy­gen search al­lows me to only search us­ing the leaf name — if I want to look up Math::min(), en­ter­ing min() gives too many res­ults and I need to fil­ter the res­ults manu­ally, which is a waste of brain mat­ter. The goal here is to be able to search us­ing any pre­fix, which means that each sym­bol has to be saved in­to the trie with all its pos­sible pre­fixes.

Hav­ing good de­bug visu­al­iz­a­tion is key to un­der­stand­ing the data. A tuple con­tain­ing the above sev­en sym­bols in­ser­ted with dif­fer­ent pre­fixes looks like this — each column is one tree depth level (go­ing to the next column in­volves search­ing for the cor­rect let­ter) and par­tic­u­lar leaf nodes con­tain res­ult IDs (shown in cy­an). The search is case-in­sens­it­ive, so all strings are nor­mal­ized to lower­case:

magnum [0]
|||   ::math [1]
|||         ::vector [2]
|||           |     ::min [3]
|||           range [4]
|||           |    ::min [5]
|||           min [6]
||th [1]
||| ::vector [2]
|||   |     ::min [3]
|||   range [4]
|||   |    ::min [5]
|||   min() [6]
|in [3, 5, 6]
vector [2]
|     ::min() [3]
range [4]
|    ::min() [5]

Sub­tree mer­ging

There’s one prob­lem with the above ap­proach — al­most all sym­bols are con­tained in the trie two or more times, de­pend­ing on their depth. With thou­sands of sym­bols this would be a sig­ni­fic­ant size in­crease. Thanks to the de­bug visu­al­iz­a­tion, the solu­tion was im­me­di­ately vis­ible — what if I would simply elim­in­ate sub­trees that look the same?

Dur­ing seri­al­iz­a­tion the tree is tra­versed in post-or­der, which means that the fi­nal bin­ary form of a node is known when seri­al­iz­ing its par­ent. Then it’s enough to main­tain a look­up table of seri­al­ized sub­trees. When a new seri­al­ized sub­tree is already in the look­up table, it’s not writ­ten to the file but the byte off­set of the already ex­ist­ing in­stance is used in­stead. This of course re­quires that sub­trees with the same data are seri­al­ized the same, in­de­pend­ent on their po­s­i­tion in the file — in or­der to achieve that, I had to switch from re­l­at­ive child off­sets (as sug­ges­ted in the pa­per for space sav­ing) to ab­so­lute off­sets.

Com­par­ing with the visu­al­iz­a­tion above, # de­notes where a du­plic­ate sub­tree was elim­in­ated by this meth­od:

magnum [0]
|||   ::math [1]
|||         ::vector [2]
|||           |     ::min [3]
|||           range [4]
|||           |    ::min [5]
|||           min [6]
||t#
|in [3, 5, 6]
v#
r#

With the sub­tree mer­ging in place, the seri­al­ized size of this min­im­al case went down from 592 bytes to not even a half: 290 bytes. In the full Mag­num search data, this saved over half a mega­byte. A nice thing about this ap­proach is that it’s com­pletely lan­guage-ob­li­vi­ous — it just works on the bin­ary data.

Res­ult map

So far there was just the trie that con­tained res­ult in­dices such as [3, 5, 6]. That’s ob­vi­ously not enough to show a link with search res­ult to the user. Next to the trie there has to be a map from those in­dices to ac­tu­al res­ult titles and URLs.

Each title and URL is a string of a dif­fer­ent length and in or­der to make the look­up a con­stant-time op­er­a­tion, there needs to be first a table map­ping an in­dex to byte off­set with the string data. Be­cause in the trie data I’m us­ing only 24 bits for byte off­set, I can do the same here and use the re­main­ing 8 bits for vari­ous ad­di­tion­al in­form­a­tion, such as sym­bol type.

Field size Con­tent
8b + 24b Res­ult 0 flags + byte off­set o_0
8b + 24b Res­ult 1 flags + byte off­set o_1
8b + 24b Res­ult N flags + byte off­set o_n
32b File size l
o_1 - o_0 Res­ult 0 title, '\0', URL
l - o_n Res­ult N title, '\0', URL

Look­ing up a res­ult r in­volves read­ing data from byte off­set range giv­en by in­dex r and r + 1 . The title and URL is null-sep­ar­ated. Res­ult data cor­res­pond­ing to the above sim­pli­fied case look like this when visu­al­ized:

0: Magnum [type=NAMESPACE] -> namespaceMagnum.html
1: Magnum::Math [type=NAMESPACE] -> namespaceMagnum_1_1Math.html
2: Magnum::Math::Vector [type=CLASS] -> classMagnum_1_1Math_1_1Vector.html
3: Magnum::Math::Vector::min() [type=FUNC] ->
        classMagnum_1_1Math_1_1Vector.html#af029f9f7810201f0bd8d9580af273bde
4: Magnum::Math::Range [type=CLASS] -> classMagnum_1_1Math_1_1Range.html
5: Magnum::Math::Range::min() [type=FUNC] ->
        classMagnum_1_1Math_1_1Range.html#ad4919361a2086212fac96da0221e4dcd
6: Magnum::Math::min() [type=FUNC] ->
        namespaceMagnum_1_1Math.html#ae22ef0cb2a5a5e4c5e626a3df670be21

Res­ult map pre­fix mer­ging

Again, thanks to the data visu­al­iz­a­tion it’s im­me­di­ately clear that there’s a lot of re­dund­ant in­form­a­tion that’s neg­at­ively af­fect­ing data size. The re­dund­ant in­form­a­tion is mainly in pre­fixes, so I re­used the already im­ple­men­ted trie to per­form the op­tim­iz­a­tion. The fol­low­ing visu­al­iz­a­tion shows the res­ult — the de­tails are way too bor­ing to ex­plain in full here, but ba­sic­ally every entry title can be op­tion­ally pre­fixed by an­oth­er entry title (de­noted by prefix=M) and take the first [:N] char­ac­ters from that entry URL, re­curs­ively.

0: Magnum [type=NAMESPACE] -> namespaceMagnum.html
1: ::Math [prefix=0[:15], type=NAMESPACE] -> _1_1Math.html
2: ::Vector [prefix=1[:0], type=CLASS] -> classMagnum_1_1Math_1_1Vector.html
3: ::min() [prefix=2[:34], type=FUNC] -> #af029f9f7810201f0bd8d9580af273bde
4: ::Range [prefix=1[:0], type=CLASS] -> classMagnum_1_1Math_1_1Range.html
5: ::min() [prefix=4[:33], type=FUNC] -> #ad4919361a2086212fac96da0221e4dcd
6: ::min() [prefix=1[:28], type=FUNC] -> #ae22ef0cb2a5a5e4c5e626a3df670be21

In this par­tic­u­lar case the map size went down from 494 bytes to 321 bytes. In case of Mag­num data, the sav­ing was over a mega­byte.

Looka­head bar­ri­ers

Look­ing at the above, some­thing seems off. It’s def­in­itely not the user’s ex­pect­a­tion to get to know about all sym­bols that ever ex­is­ted when en­ter­ing a bunch of char­ac­ters. For ex­ample, when I’m search­ing for a namespace, I am not in­ter­ested in all its mem­bers. Or when a some pre­fix is ap­pear­ing both in func­tion name and en­clos­ing namespace, I’ll get the res­ult twice in the list.

So how to solve that? When adding sym­bols to the trie, let’s add bar­ri­ers to sig­nal the looka­head auto­com­ple­tion that it should not dig deep­er look­ing for pos­sible res­ults. How­ever, I don’t want to pre­vent API dis­cov­ery, so when one enters e.g. Magnum:, the search should go past the bar­ri­er and show all mem­bers of the namespace. (But not nes­ted mem­bers!)

In the fol­low­ing visu­al­iz­a­tion, the $ char­ac­ters de­note a looka­head bar­ri­er. If the auto­com­ple­tion reaches a sym­bol that’s in front of it, it will not at­tempt to con­tin­ue fur­ther.

magnum [0]
|||   :$
|||    :math [1]
|||         :$
|||          :vector [2]
|||           |     :$
|||           |      :min [3]
|||           range [4]
|||           |    :$
|||           |     :min [5]
|||           min [6]
||th [1]
||| :$
|||  :vector [2]
|||   |     :$
|||   |      :min [3]
|||   range [4]
|||   |    :$
|||   |     :min [5]
|||   min [6]
|in [3, 5, 6]
vector [2]
|     :$
|      :min [3]
range [4]
|    :$
|     :min [5]

With these bar­ri­ers in place, search­ing for m gives the fol­low­ing out­put — unique sym­bols start­ing with m, sor­ted from shortest to longest:

[{title: 'Magnum::Math::min()',
  url: 'namespaceMagnum_1_1Math.html#ae22ef0cb2a5a5e4c5e626a3df670be21'},
 {title: 'Magnum::Math::Range::min()',
  url: 'classMagnum_1_1Math_1_1Range.html#a22af2191e4ab88b45f082ef14aa45185'},
 {title: 'Magnum::Math::Vector::min()',
  url: 'classMagnum_1_1Math_1_1Vector.html#af029f9f7810201f0bd8d9580af273bde'},
 {title: 'Magnum::Math',
  url: 'namespaceMagnum_1_1Math.html'},
 {title: 'Magnum',
  url: 'namespaceMagnum.html'}]

Search­ing for math gives just this:

[{title: 'Magnum::Math',
  url: 'namespaceMagnum_1_1Math.html'}]

While search­ing for math: provides a list of all its im­me­di­ate mem­bers:

[{title: 'Magnum::Math::min()',
  url: 'namespaceMagnum_1_1Math.html#ae22ef0cb2a5a5e4c5e626a3df670be21'},
 {title: 'Magnum::Math::Range',
  url: 'classMagnum_1_1Math_1_1Range.html'},
 {title: 'Magnum::Math::Vector',
  url: 'classMagnum_1_1Math_1_1Vector.html'}]

Ex­tra good­ies

Just rep­lic­at­ing the stock search func­tion­al­ity and mak­ing it faster would be bor­ing, so I ad­ded some cher­ries on top. First is propagat­ing info about de­prec­ated and deleted func­tions to search res­ults (and mem­ber list­ing as well) to save some the users ex­tra clicks when try­ing to find a func­tion that’s ac­tu­ally callable. You can see it in the GIF above.

The second fea­ture is abil­ity to spe­cify ad­di­tion­al search keywords for par­tic­u­lar sym­bols. This is use­ful es­pe­cially in the OpenGL wrap­ping lay­er of Mag­num — one can search for a well-known OpenGL sym­bol to get to know what API it cor­res­ponds to. This is ex­posed through cus­tom @m_keywords, @m_keyword and @m_enum_values_as_keywords com­mands, head over to the docs for de­tails. Self-ex­plan­at­ory ex­ample:

/**
 * @brief Set texture storage
 *
 * @m_keywords{glTexStorage2D() glTextureStorage2D()}
 */
Texture2D& Texture2D::setStorage(...);

/**
 * @brief Renderer feature
 *
 * @m_enum_values_as_keywords
 */
enum class RendererFeature: GLenum {
    /** Depth test */
    DepthTest = GL_DEPTH_TEST,

    ...
};

Fi­nal im­ple­ment­a­tion

In this art­icle I tried to ex­plain mainly the key as­pects of the search im­ple­ment­a­tion . The rest is im­port­ant too, but would make the art­icle too long and bor­ing.

  • In or­der to prop­erly high­light the cur­rently typed part of the search res­ult, the trie look­up is count­ing num­ber of suf­fix char­ac­ters when gath­er­ing the res­ults. This is ex­ten­ded fur­ther in or­der to dis­play func­tion ar­gu­ments lists and const / r-value over­loads (which are not part of the trie data).
  • Due to re­stric­tions of Chro­mi­um-based browsers, it’s not pos­sible to down­load data us­ing XMLHttpRequest when served from a loc­al file-sys­tem. Be­cause that’s a very com­mon use case, I had to work around this by con­vert­ing the bin­ary to a Base85-en­coded *.js file that’s loaded dir­ectly via <script>.
  • A lot of ef­for went in­to cli­ent-side UX, like key­board nav­ig­a­tion, cut­ting off overly long pre­fixes on the prop­er side, avoid­ing page jumps etc. There are still some known is­sues left to be re­solved.

Some stats

With the cur­rent state of Mag­num doc­u­ment­a­tion, where there is 7565 unique sym­bols and I’m in the middle of adding @m_keywords to all OpenGL wrap­per APIs, the data size is as fol­lows, de­pend­ing on what fea­tures are en­abled:

Data type Un­com­pressed Gzipped
Baseline, bin­ary 2635.6 kB 884.2 kB
Sub­tree mer­ging, bin­ary 2033.3 kB 537.8 kB
Sub­tree and pre­fix mer­ging, bin­ary 959.3 kB 501.5 kB
Sub­tree and pre­fix mer­ging, base85 1202.5 kB 752.5 kB

As the docs are ac­cessed through a web­serv­er, I can use the bin­ary and en­able serv­er-side gzip com­pres­sion. That gives slightly be­low half a mega­byte, which, in com­par­is­on to sizes of con­tem­por­ary web­sites, is more than okay. The doc­u­ment­a­tion is up­dated not more than once a week, so the browsers will usu­ally just hap­pily serve this data­set from a cache.

One im­port­ant thing to note: the pre­fix mer­ging doesn’t give really that much ad­vant­age when com­par­ing the gzipped size, as the data are already very nicely com­press­ible.

Bo­nus: Uni­code

Someone might say that the above re­stric­tion to 8-bit char­ac­ters makes this not Uni­code-aware and thus use­less bey­ond simple AS­CII sym­bols. Well, no. As­sum­ing the in­put is UTF-8-en­coded, this is how a trie for two great Czech words “hýž­dě” and “hárá” will look like:

h0xc3
  0xbd
   0xc5
  | 0xbe
  |  d0xc4
  |    0x9b
  |      [13]
  0xa1
   r0xc3
  |  0xa1
  |    [42]

The im­ple­ment­a­tion simply takes the UTF-8 rep­res­ent­a­tion of the word “hýž­dě” ['h', 0xc3, 0xbd, 0xc5, 0xbe, 'd', 0xc4, 0x9b] byte-by-byte and in­serts that in­to the trie, not giv­ing much thought to whatever en­cod­ing giv­en string is in. This also has the ef­fect that UTF-8 rep­res­ent­a­tion of “ý” and “á” is [0xc3, 0xbd] and [0xc3, 0xa1] re­spect­ively, shar­ing the same first byte 0xc3, which is then a single node in the trie. That means there’s no need to have more that 8 bits to rep­res­ent a char­ac­ter in the trie.

Try it your­self

Oh, did I men­tion that this whole thing is fully open-source and ready for you to try it out? Head over to the full doc­u­ment­a­tion of the m.css Doxy­gen theme for de­tails. Bug re­ports very wel­come, and if you like what you see, con­sider mak­ing a dona­tion. Thank you!