New Application implementation for Emscripten

If you build your Magnum apps for the web, you can now make use of a new feature-packed, smaller and more power-efficient application implementation. It uses the Emscripten HTML5 APIs directly instead of going through compatibility layers.

Un­til now, the Plat­form::Sdl2Ap­plic­a­tion was the go-to solu­tion for most plat­forms in­clud­ing the web and mo­bile. How­ever, not every­body needs all the fea­tures SDL provides and, es­pe­cially on Em­scripten, apart from sim­pli­fy­ing port­ing it doesn’t really add any­thing ex­tra on top. On the con­trary, the ad­di­tion­al lay­er of trans­la­tion between HTM­L5 and SDL APIs in­creases the ex­ecut­able size and makes some fea­tures un­ne­ces­sar­ily hard to ac­cess.

To solve that, the new Platform::EmscriptenApplication, contributed in mosra/magnum#300 by @Squareys, uses the Emscripten HTML5 APIs directly, opening new possibilities while making the code smaller and more efficient.

“SDL2” vs SDL2

Since there’s some confusion about SDL among Emscripten users, let’s clarify that first. Using SDL in Emscripten is actually possible in two ways. The implicit support, implemented in library_sdl.js, gives you a slightly strange hybrid of SDL1 and SDL2 in a relatively small package. Not all SDL2 APIs are present there, but it has enough of SDL2 to make it a viable alternative to the SDL2 everyone is used to. This is what Platform::Sdl2Application is using.

The other way is a “full SDL2”, available if you pass -s USE_SDL=2 to the linker. Two years ago we tried to remove all Emscripten-specific workarounds from Platform::Sdl2Application by switching to this full SDL2, but quickly realized it was a bad decision: in total it removed 30 lines of code, but made the resulting code almost 600 kB larger. The very minor improvements in code maintainability didn’t warrant a size increase that serious. For the record, the original pull request is archived at mosra/magnum#218.

The SDL-free Em­scrip­tenAp­plic­a­tion

All application implementations in Magnum strive for almost full API compatibility, with the goal of making it possible to use the implementation that’s optimal for the chosen platform and use case. This was already the case with Platform::GlfwApplication and Platform::Sdl2Application, where switching from one to the other is in 90% of cases just a matter of using a different #include and passing a different component to CMake’s find_package().

The new Platform::EmscriptenApplication continues in this fashion, and we ported all existing examples and tools that formerly used Platform::Sdl2Application to it to ensure it works across a broad range of use cases. Apart from that, the new implementation fixes some long-standing issues such as miscalculated event coordinates on mobile web browsers or the Delete key leaking through text input events.

Power-ef­fi­cient idle be­ha­vi­or

Since the very be­gin­ning, all Mag­num ap­plic­a­tion im­ple­ment­a­tions de­fault to re­draw­ing only when needed in or­der to save power — be­cause Mag­num is not just for games that have to an­im­ate some­thing every frame, it doesn’t make sense to use up all sys­tem re­sources by de­fault. While this is simple to im­ple­ment ef­fi­ciently on desktop apps where the ap­plic­a­tion has the full con­trol over the main loop (and thus can block in­def­in­itely wait­ing for an in­put event), it’s harder in the call­back-based browser en­vir­on­ment.

The original Platform::Sdl2Application makes use of emscripten_set_main_loop(), which periodically calls window.requestAnimationFrame() in order to maintain a steady frame rate. For apps that redraw only when needed, this means the callback gets called 60 times per second only to be a no-op. While that’s significantly more efficient than drawing everything each time, the browser still has to wake up 60 times per second to do nothing.

Platform::EmscriptenApplication instead makes use of requestAnimationFrame() directly: the next animation frame is implicitly scheduled, but cancelled again after the draw event if the app doesn’t wish to redraw immediately again. That takes the best of both worlds: redraws are still VSync’d, but the browser is not looping needlessly if the app just wants to wait with a redraw until the next input event. To give you some numbers, below is a ten-second output of Chrome’s performance monitor comparing the SDL and Emscripten app implementations waiting for an input event. You can reproduce this with the Magnum Player: no matter how complex an animated scene you throw at it, if you pause the animation it will use as much CPU as a plain static text web page.
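The difference between the two scheduling strategies can be illustrated with a small model. This is only a sketch that counts wakeups over a series of VSync ticks; it doesn't use any Magnum or Emscripten API, and the `WakeupCounts` and `simulateIdleApp()` names are made up for the illustration.

```cpp
#include <set>

// Hypothetical model, not Magnum or Emscripten API: counts how often the
// app callback runs over a number of VSync ticks for both strategies.
struct WakeupCounts {
    int mainLoop;   // callback fired unconditionally on every tick
    int onDemand;   // callback fired only when a redraw was requested
};

WakeupCounts simulateIdleApp(int ticks, const std::set<int>& inputEventTicks) {
    WakeupCounts counts{0, 0};
    bool redrawRequested = true; // the very first frame always gets drawn

    for(int tick = 0; tick < ticks; ++tick) {
        // Strategy 1: a fixed main loop wakes up on every tick, mostly
        // just to find out there's nothing to do
        ++counts.mainLoop;

        // Strategy 2: an input event arriving on this tick requests a
        // redraw; without a request no frame is pending and nothing runs
        if(inputEventTicks.count(tick)) redrawRequested = true;
        if(redrawRequested) {
            ++counts.onDemand;
            // The app drew once and didn't ask for another redraw, so the
            // implicitly scheduled next frame is treated as cancelled
            redrawRequested = false;
        }
    }
    return counts;
}
```

Calling `simulateIdleApp(600, {120, 300})`, which corresponds to ten seconds at 60 Hz with two input events, gives 600 wakeups for the fixed main loop and three for the on-demand variant: one for the initial frame and one per event.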

DPI aware­ness re­vis­ited

Arguably to simplify porting, the Emscripten SDL emulation recalculates all input event coordinates to match framebuffer pixels. The actual DPI scaling (or device pixel ratio) is then exposed through dpiScaling(), making it behave the same as Linux, Windows and Android on high-DPI screens. In contrast, HTML5 APIs behave like macOS / iOS, and Platform::EmscriptenApplication follows that behavior: framebufferSize() thus matches device pixels, while windowSize() (to which all events are related) is smaller on HiDPI systems. For more information, check out the DPI awareness docs.
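In practice this means an event position has to be scaled before it can be used to address framebuffer pixels. Below is a minimal sketch of that conversion, assuming the two sizes come from windowSize() and framebufferSize(); the `Vec2` type and the `windowToFramebuffer()` helper are hypothetical stand-ins, not Magnum API.

```cpp
// Hypothetical two-component vector standing in for a real math type
struct Vec2 {
    float x, y;
};

// Scales a position given in window units (the units events are reported
// in) to framebuffer pixels, using the ratio of the two sizes
Vec2 windowToFramebuffer(Vec2 position, Vec2 windowSize, Vec2 framebufferSize) {
    return {position.x*framebufferSize.x/windowSize.x,
            position.y*framebufferSize.y/windowSize.y};
}
```

With an 800×600 window backed by a 1600×1200 framebuffer, an event at (100, 50) ends up at framebuffer pixel (200, 100). On a system without HiDPI scaling both sizes are equal and the function returns the position unchanged.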

Ex­ecut­able size sav­ings

Be­cause we didn’t end up us­ing the heavy­weight “full SDL2” in the first place, the dif­fer­ence in ex­ecut­able size is noth­ing ex­treme — in total, in a Re­lease WebAssembly build, the JS size got smal­ler by about 20 kB, while the WASM file stays roughly the same.

[Chart: download size (*.js, *.wasm). Sdl2Application with -s USE_SDL=2: 111.9 kB JS, 731.2 kB WASM. Sdl2Application with -s USE_SDL=1: 74.4 kB JS, 226.3 kB WASM. EmscriptenApplication: 52.1 kB JS, 226.0 kB WASM.]

Min­im­al runtime, or brain sur­gery with a chain­saw

On the other hand, since the new application doesn’t use any of the emscripten_set_main_loop() APIs from library_browser.js, it’s a good candidate for playing with the relatively recent MINIMAL_RUNTIME feature of Emscripten. Now, while Magnum is moving in the right direction, it’s not yet in a state where this would “just work”. Supporting MINIMAL_RUNTIME requires either moving fast and breaking lots of things, or having the APIs slowly evolve into a state that makes it possible. Because reliable backwards compatibility and a painless upgrade path are valuable assets in our portfolio, we chose the latter. It will eventually happen, but not right now. Another reason is that while Magnum itself can be highly optimized to be compatible with minimal runtime, the usual application code is not able to satisfy those requirements without removing and rewriting most third-party dependencies.

That be­ing said, why not spend one af­ter­noon with a chain­saw and try de­mol­ish­ing the code to see what could come out? It’s how­ever im­port­ant to note that MINIMAL_RUNTIME is still a very fresh fea­ture and thus it’s very likely that a lot of code will simply not work with it. All the dis­covered prob­lems are lis­ted be­low be­cause at this point there are no res­ults at all when googling them, so hope­fully this helps oth­er people stuck in sim­il­ar places:

  • std::getenv() or the environ variable (used by Utility::Arguments) results in writeAsciiToMemory() being called, which is right now explicitly disabled for minimal runtime (and thus you either get a failure at runtime or the Closure Compiler complaining about these names being undefined). Since Emscripten’s environment is just a bunch of hardcoded values and Magnum is using Node.js APIs to get the real values for command-line apps anyway, the solution is to simply not use those functions.
  • Right now, Magnum is using C++ iostreams in three isolated places (Utility::Debug being the most prominent) and those uses are gradually being phased out. On Emscripten, using anything that even remotely touches them will make the backend emit calls to llvm_stacksave() and llvm_stackrestore(). The JavaScript implementations then call stackSave() and stackRestore(), which however do not get pulled in with MINIMAL_RUNTIME, again resulting in an error either at runtime every time you call into JS (so also all emscripten_set_mousedown_callback() functions) or when you use the Closure Compiler. After wasting a few hours trying to convince Emscripten to emit these two by adding _llvm_stacksave__deps: ['$stackSave'], the ultimate solution was to kill everything stream-related. Considering everyone who’s interested in MINIMAL_RUNTIME probably did that already, it explains why this is another ungoogleable error.
  • If you use C++ streams, the gen­er­ated JS driver file con­tains a full JavaS­cript im­ple­ment­a­tion of strftime() and the only way to get rid of it is re­mov­ing all stream us­age as well. Grep your JS file for Monday — if it’s there, you have a prob­lem.
  • JavaScript Emscripten APIs like dynCall() or allocate() are not available, and putting them into EXTRA_EXPORTED_RUNTIME_METHODS or RUNTIME_FUNCS_TO_IMPORT either didn’t do anything or moved the error to a different place. For the former it was possible to work around it by directly calling one of its specializations (in that particular case dynCall_ii()); the latter resulted in a frustrated tableflip and the relevant piece of code getting cut off.
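For the first item in the list, "not using those functions" can be as simple as compiling the std::getenv() call out on Emscripten. The following is a rough sketch of that idea and not what Utility::Arguments actually does; the `environmentOr()` helper is a hypothetical name, while `__EMSCRIPTEN__` is the macro the Emscripten compiler predefines.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the value of an environment variable, or
// the fallback if it isn't set. On Emscripten the std::getenv() call is
// compiled out entirely, so nothing environment-related gets pulled in.
std::string environmentOr(const char* name, const std::string& fallback) {
#ifdef __EMSCRIPTEN__
    static_cast<void>(name); // unused on this target
    return fallback;
#else
    const char* value = std::getenv(name);
    if(!value) return fallback;
    return value;
#endif
}
```

On native platforms the behavior stays the same as before, so the guard only affects the web build.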

Below is a breakdown of various optimizations on a minimal application that does just a framebuffer clear, each step chopping another bit off the total download size. All sizes are uncompressed, built in Release mode with -Oz, --llvm-lto 1 and --closure 1. Later on in the process, the experimental WebAssembly support in Bloaty McBloatFace was used to discover which functions contribute the most to the final code size.

| Operation | JS size | WASM size |
|---|---|---|
| Initial state | 52.1 kB | 226.3 kB |
| Enabling minimal runtime [1] | 36.3 kB | 224.5 kB |
| Additional slimming flags [2] | 35.7 kB | 224.5 kB |
| Disabling filesystem [3] | 19.4 kB | 224.5 kB |
| Chopping off all C++ stream usage | 14.7 kB | 83.6 kB |
| Enabling CORRADE_NO_ASSERT | 14.7 kB | 75.4 kB |
| Removing a single use of std::sort() [4] | 14.7 kB | 69.3 kB |
| Removing one std::unordered_map [4] | 14.7 kB | 62.6 kB |
| Using emmalloc instead of dlmalloc [5] | 14.7 kB | 56.3 kB |
| Removing all printf() usage [6] | 14.7 kB | 44 kB (estimate) |
[Chart: download size (*.js, *.wasm) after each of the steps listed in the table above.]
1. -s MINIMAL_RUNTIME=2 -s ENVIRONMENT=web -lGL, plus temporarily also enabling -s IGNORE_CLOSURE_COMPILER_ERRORS=1 in order to make the Closure Compiler survive the undefined variable errors caused by iostreams and the other issues mentioned above.
2. -s SUPPORT_ERRNO=0 -s GL_EMULATE_GLES_VERSION_STRING_FORMAT=0 -s GL_EXTENSIONS_IN_PREFIXED_FORMAT=0 -s GL_SUPPORT_AUTOMATIC_ENABLE_EXTENSIONS=0 -s GL_TRACK_ERRORS=0 -s DISABLE_DEPRECATED_FIND_EVENT_TARGET_BEHAVIOR=1, basically disabling what’s enabled by default. In particular, GL_EXTENSIONS_IN_PREFIXED_FORMAT=0 is not supported by Magnum right now, causing it to not report any extensions, but that can be easily fixed. The result of disabling all these is … underwhelming.
3. -s FILESYSTEM=0 makes Emscripten not emit any filesystem-related code. Magnum provides filesystem access through various APIs (Utility::Directory, GL::Shader::addFile(), Trade::AbstractImporter::openFile(), …) and at the moment there’s no possibility to compile all these out, so this is a nuclear option that works.
4. GL::Context uses a std::sort() and a std::unordered_map to check for extension presence and print their list in the engine startup log. It was frightening to see the removal of a single std::sort() cause a 10% drop in executable size. Since WebGL has roughly two dozen extensions (compared to > 200 on desktop and ES), maybe a space-efficient alternative implementation could be done for this target instead.
5. Doug Lea’s malloc() is a general-purpose allocator, used by glibc among others. It’s very performant and a good choice for code that does many small allocations (std::unordered_map, I’m looking at you). The downside is its larger size, and code doing fewer, larger allocations might want to use -s MALLOC=emmalloc instead. We don’t pretend Magnum is at that state yet, but other projects successfully switched to it, shaving more bytes off the download size.
6. After removing all of the above, std::printf() internals started appearing at the top of Bloaty’s size report, totalling about 10% of the executable size. Magnum doesn’t use it anywhere directly and all transitive usage of it was killed together with iostreams; further digging revealed that it gets called from libc++’s abort_message(), for example when aborting due to a pure virtual function call. Independent measurement showed that std::printf() is around 12 kB of additional code compared to std::puts(), mainly due to the inherent complexity of floating-point string conversion. It’s planned to use the much simpler and smaller Ryū algorithm for Magnum’s std::printf() replacement, additionally ensuring that float-to-string conversions can be DCE-d when not used. We might be looking into patching Emscripten’s libc++ to not use the expensive implementation in its abort messages.

While all of the above size re­duc­tions were done in a hack-and-slash man­ner, the fi­nal ex­ecut­able still ini­tial­izes and ex­ecutes prop­erly, clear­ing the frame­buf­fer and re­act­ing to in­put events. For ref­er­ence, check out diffs of the chainsaw-surgery branches in cor­rade and mag­num.

The above is definitely not all that can be done. Especially considering that removing two uses of semi-heavy STL APIs cut the WASM size by roughly 17% (from 75.4 kB to 62.6 kB), there is most probably more such low-hanging fruit. The above tasks were added to mosra/magnum#293 (if not there already) and will get gradually integrated into master.
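As an example of such low-hanging fruit, the extension presence check could, for the couple dozen extensions WebGL has, be done with a plain linear scan over an array of names instead of a sorted list or a hash map. The sketch below only shows the idea and is not how GL::Context is implemented; `hasExtension()` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns true if name is among the given extension
// strings. A linear scan needs no sorting and no hashing, which is fine
// for a list of a few dozen entries.
bool hasExtension(const char* const* extensions, std::size_t count, const char* name) {
    for(std::size_t i = 0; i != count; ++i)
        if(std::strcmp(extensions[i], name) == 0) return true;
    return false;
}
```

For a list containing `"OES_texture_float"` and `"EXT_color_buffer_float"`, a query for the first name returns true and a query for `"WEBGL_lose_context"` returns false. The trade-off is linear lookup time, which would be less attractive for the more than 200 extensions on desktop.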

Con­clu­sion

Bright times ahead! The new Platform::EmscriptenApplication is the first step to truly minimal WebAssembly builds, and the above hints that it’s possible to have download sizes not too far from code carefully written in plain C. To give a fair comparison, the basic framebuffer clear sample from @floooh’s Sokol Samples is 42 kB in total, while the above equivalent is roughly 59 kB. Using C++(11), but not overusing it, and that’s just the beginning.