New Application implementation for Emscripten

If you build your Mag­num apps for the web, you can now make use of a new fea­ture-packed, small­er and more pow­er-ef­fi­cient ap­pli­ca­tion im­ple­men­ta­tion. It is us­ing the Em­scripten HTM­L5 APIs di­rect­ly in­stead of go­ing through com­pat­i­bil­i­ty lay­ers.

Un­til now, the Plat­form::Sdl2Ap­pli­ca­tion was the go-to so­lu­tion for most plat­forms in­clud­ing the web and mo­bile. How­ev­er, not ev­ery­body needs all the fea­tures SDL pro­vides and, es­pe­cial­ly on Em­scripten, apart from sim­pli­fy­ing port­ing it doesn’t re­al­ly add any­thing ex­tra on top. On the con­trary, the ad­di­tion­al lay­er of trans­la­tion be­tween HTM­L5 and SDL APIs in­creas­es the ex­e­cutable size and makes some fea­tures un­nec­es­sar­i­ly hard to ac­cess.

To solve that, the new Plat­form::Em­scripte­nAp­pli­ca­tion, con­trib­uted in mosra/mag­num#300 by @Squareys, is us­ing Em­scripten HTM­L5 APIs di­rect­ly, open­ing new pos­si­bil­i­ties while mak­ing the code small­er and more ef­fi­cient.

“SDL2” vs SDL2

Since there’s some con­fu­sion about SDL among Em­scripten users, let’s clar­i­fy that first. Us­ing SDL in Em­scripten is ac­tu­al­ly pos­si­ble in two ways — the im­plic­it sup­port, im­ple­ment­ed in li­brary_s­dl.js, gives you a slight­ly strange hy­brid of SDL1 and SDL2 in a rel­a­tive­ly small pack­age. Not all SDL2 APIs are present there, on the oth­er hand it has enough from SDL2 to make it a vi­able al­ter­na­tive to the SDL2 ev­ery­one is used to. This is what Plat­form::Sdl2Ap­pli­ca­tion is us­ing.

The oth­er way is a “full SDL2”, avail­able if you pass -s USE_SDL=2 to the link­er. Two years ago we tried to re­move all Em­scripten-spe­cif­ic work­arounds from Plat­form::Sdl2Ap­pli­ca­tion by switch­ing to this full SDL2, but quick­ly re­al­ized it was a bad de­ci­sion — in to­tal it re­moved 30 lines of code, but caused the re­sult­ing code to be al­most 600 kB larg­er. The size in­crease was so se­ri­ous that it didn’t war­rant the very mi­nor im­prove­ments in code main­tain­abil­i­ty. For the record, the orig­i­nal pull re­quest is archived at mosra/mag­num#218.

The SDL-free Em­scripte­nAp­pli­ca­tion

All application implementations in Magnum strive for almost full API compatibility, with the goal of making it possible to use the implementation that is optimal for the chosen platform and use case. This was already the case with Platform::GlfwApplication and Platform::Sdl2Application, where switching from one to the other is in 90% of cases just a matter of using a different #include and passing a different component to CMake's find_package().

The new Plat­form::Em­scripte­nAp­pli­ca­tion con­tin­ues in this fash­ion and we port­ed all ex­ist­ing ex­am­ples and tools that for­mer­ly used Plat­form::Sdl2Ap­pli­ca­tion to it to en­sure it works in broad use cas­es. Apart from that, the new im­ple­men­ta­tion fix­es some of the long-stand­ing is­sues like mis­cal­cu­lat­ed event co­or­di­nates on mo­bile web browsers or the Delete key leak­ing through text in­put events.

Pow­er-ef­fi­cient idle be­hav­ior

Since the very be­gin­ning, all Mag­num ap­pli­ca­tion im­ple­men­ta­tions de­fault to re­draw­ing on­ly when need­ed in or­der to save pow­er — be­cause Mag­num is not just for games that have to an­i­mate some­thing ev­ery frame, it doesn’t make sense to use up all sys­tem re­sources by de­fault. While this is sim­ple to im­ple­ment ef­fi­cient­ly on desk­top apps where the ap­pli­ca­tion has the full con­trol over the main loop (and thus can block in­def­i­nite­ly wait­ing for an in­put event), it’s hard­er in the call­back-based brows­er en­vi­ron­ment.

The original Platform::Sdl2Application makes use of emscripten_set_main_loop(), which periodically calls window.requestAnimationFrame() in order to maintain a steady frame rate. For apps that redraw only when needed, this means the callback is called 60 times per second only to be a no-op. While that's significantly more efficient than drawing everything each time, it still means the browser has to wake up 60 times per second to do nothing.

Platform::EmscriptenApplication instead makes use of requestAnimationFrame() directly: the next animation frame is implicitly scheduled, but cancelled again after the draw event if the app doesn't wish to redraw again immediately. That takes the best of both worlds, as redraws are still VSync'd but the browser is not looping needlessly if the app just wants to wait for the next input event before redrawing. To give you some numbers, below is a ten-second output of Chrome's performance monitor comparing the SDL and Emscripten app implementations waiting for an input event. You can reproduce this with the Magnum Player: no matter how complex an animated scene you throw at it, if you pause the animation it will use as much CPU as a plain static text web page.
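The difference between the two scheduling strategies can be sketched as a small simulation. This is not the actual Magnum or Emscripten code: the Wakeups struct and both functions are hypothetical names made up for illustration, the browser is reduced to a loop over display refreshes, and input events are given as a set of frame indices.

```cpp
#include <cassert>
#include <set>

// Hypothetical bookkeeping for the simulation, not a Magnum type
struct Wakeups {
    int callbacks; // how many times the frame callback ran
    int draws;     // how many of those runs actually drew something
};

// Strategy A: the frame callback is rescheduled unconditionally, so it runs
// on every display refresh and is a no-op unless a redraw was requested
Wakeups alwaysLooping(int frames, const std::set<int>& inputFrames) {
    assert(frames >= 0);
    Wakeups w{0, 0};
    bool redrawRequested = true; // initial draw on startup
    for(int frame = 0; frame != frames; ++frame) {
        if(inputFrames.count(frame)) redrawRequested = true;
        ++w.callbacks;
        if(redrawRequested) {
            ++w.draws;
            redrawRequested = false;
        }
    }
    return w;
}

// Strategy B: a frame is scheduled only when a redraw is wanted. After the
// draw, the implicitly scheduled next frame is cancelled again, modelled
// here by resetting the flag.
Wakeups onDemand(int frames, const std::set<int>& inputFrames) {
    assert(frames >= 0);
    Wakeups w{0, 0};
    bool scheduled = true; // initial draw on startup
    for(int frame = 0; frame != frames; ++frame) {
        if(inputFrames.count(frame)) scheduled = true;
        if(!scheduled) continue; // nothing scheduled, the browser stays idle
        ++w.callbacks;
        ++w.draws;
        scheduled = false;
    }
    return w;
}
```

Assuming a 60 Hz display, ten seconds correspond to 600 refreshes. With two input events in that window, the first strategy runs the callback 600 times for three draws, while the second runs it only three times.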

DPI aware­ness re­vis­it­ed

Ar­guably to sim­pli­fy port­ing, the Em­scripten SDL em­u­la­tion re­cal­cu­lates all in­put event co­or­di­nates to match frame­buffer pix­els. The ac­tu­al DPI scal­ing (or de­vice pix­el ra­tio) is then be­ing ex­posed through dpiS­cal­ing(), mak­ing it be­have the same as Lin­ux, Win­dows and An­droid on high-DPI screens. In con­trast, HTM­L5 APIs be­have like mac­OS / iOS and Plat­form::Em­scripte­nAp­pli­ca­tion fol­lows that be­hav­ior — frame­buf­fer­Size() thus match­es de­vice pix­els while win­dow­Size() (to which all events are re­lat­ed) is small­er on HiD­PI sys­tems. For more in­for­ma­tion, check out the DPI aware­ness docs.
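If application code needs an event position in framebuffer pixels, it can rescale the window-relative coordinates itself. Below is a minimal sketch of that conversion; Point and windowToFramebuffer() are hypothetical names, not Magnum APIs, and in a real application the two sizes would presumably come from windowSize() and framebufferSize().

```cpp
#include <cassert>

// Hypothetical 2D value type used only for this illustration
struct Point { double x, y; };

// Rescales a position given in window coordinates (to which events are
// related) to framebuffer pixels, using the ratio of the two sizes
Point windowToFramebuffer(Point position, Point windowSize, Point framebufferSize) {
    assert(windowSize.x > 0.0 && windowSize.y > 0.0);
    return {position.x*framebufferSize.x/windowSize.x,
            position.y*framebufferSize.y/windowSize.y};
}
```

For example, with an 800x600 window backed by a 1600x1200 framebuffer (a device pixel ratio of 2), an event at (100, 50) maps to framebuffer pixel (200, 100).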

Ex­e­cutable size sav­in­gs

Be­cause we didn’t end up us­ing the heavy­weight “full SDL2” in the first place, the dif­fer­ence in ex­e­cutable size is noth­ing ex­treme — in to­tal, in a Re­lease We­bAssem­bly build, the JS size got small­er by about 20 kB, while the WASM file stays rough­ly the same.

[Chart: Download size (*.js, *.wasm). Bars in the order Sdl2Application with -s USE_SDL=2, Sdl2Application with -s USE_SDL=1, EmscriptenApplication. JS sizes: 111.9 kB, 74.4 kB, 52.1 kB. WASM sizes: 731.2 kB, 226.3 kB, 226.0 kB.]

Min­i­mal run­time, or brain surgery with a chain­saw

On the other hand, since the new application doesn't use any of the emscripten_set_main_loop() APIs from library_browser.js, it is a good candidate for playing with the relatively recent MINIMAL_RUNTIME feature of Emscripten. Now, while Magnum is moving in the right direction, it's not yet in a state where this would "just work". Supporting MINIMAL_RUNTIME requires either moving fast and breaking lots of things or having the APIs slowly evolve into a state that makes it possible. Because reliable backwards compatibility and a painless upgrade path are valuable assets in our portfolio, we chose the latter. It will eventually happen, but not right now. Another reason is that while Magnum itself can be highly optimized to be compatible with minimal runtime, the usual application code is not able to satisfy those requirements without removing and rewriting most third-party dependencies.

That be­ing said, why not spend one af­ter­noon with a chain­saw and try de­mol­ish­ing the code to see what could come out? It’s how­ev­er im­por­tant to note that MINIMAL_RUNTIME is still a very fresh fea­ture and thus it’s very like­ly that a lot of code will sim­ply not work with it. All the dis­cov­ered prob­lems are list­ed be­low be­cause at this point there are no re­sults at all when googling them, so hope­ful­ly this helps oth­er peo­ple stuck in sim­i­lar places:

  • Using std::getenv() or the environ variable (used by Utility::Arguments) results in writeAsciiToMemory() being called, which is right now explicitly disabled for minimal runtime (and thus you either get a failure at runtime or the Closure Compiler complaining about these names being undefined). Since Emscripten's environment is just a bunch of hardcoded values and Magnum is using Node.js APIs to get the real values for command-line apps anyway, the solution is to simply not use those functions.
  • Right now, Magnum is using C++ iostreams in three isolated places (Utility::Debug being the most prominent) and those uses are gradually being phased out. On Emscripten, using anything that even remotely touches them will make the backend emit calls to llvm_stacksave() and llvm_stackrestore(). The JavaScript implementations then call stackSave() and stackRestore(), which however do not get pulled in with MINIMAL_RUNTIME, again resulting in either a runtime error every time you call into JS (so also all emscripten_set_mousedown_callback() functions) or an error when you use the Closure Compiler. After wasting a few hours trying to convince Emscripten to emit these two by adding _llvm_stacksave__deps: ['$stackSave'], the ultimate solution was to kill everything stream-related. Considering everyone who's interested in MINIMAL_RUNTIME probably did that already, it explains why this is another ungoogleable error.
  • If you use C++ streams, the gen­er­at­ed JS driv­er file con­tains a full JavaScript im­ple­men­ta­tion of strftime() and the on­ly way to get rid of it is re­mov­ing all stream us­age as well. Grep your JS file for Monday — if it’s there, you have a prob­lem.
  • JavaScript Emscripten APIs like dynCall() or allocate() are not available, and putting them into either EXTRA_EXPORTED_RUNTIME_METHODS or RUNTIME_FUNCS_TO_IMPORT either didn't do anything or moved the error to a different place. For the former it was possible to work around it by directly calling one of its specializations (in that particular case dynCall_ii()), the latter resulted in a frustrated tableflip and the relevant piece of code getting cut out.
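To illustrate what "killing everything stream-related" can look like in application code, here is a minimal sketch of printing a number without iostreams and without the printf() family. It is not how Magnum does it: formatInt() and printSize() are hypothetical helpers, the code assumes a 32-bit int, and it relies only on std::fputs() from the C standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Writes the decimal form of `value` into `buffer` as a null-terminated
// string and returns its length. The buffer has to hold at least 12 chars
// (sign, ten digits and the terminator for a 32-bit int).
std::size_t formatInt(int value, char* buffer) {
    assert(buffer != nullptr);
    // Work on the unsigned magnitude so the most negative int doesn't overflow
    unsigned int magnitude = value < 0 ?
        0u - static_cast<unsigned int>(value) : static_cast<unsigned int>(value);

    // Digits come out least significant first, so collect them reversed
    char reversed[10];
    std::size_t digits = 0;
    do {
        reversed[digits++] = char('0' + magnitude % 10);
        magnitude /= 10;
    } while(magnitude);

    std::size_t size = 0;
    if(value < 0) buffer[size++] = '-';
    while(digits) buffer[size++] = reversed[--digits];
    buffer[size] = '\0';
    return size;
}

// Prints "<label><kilobytes> kB" followed by a newline to standard output
void printSize(const char* label, int kilobytes) {
    char number[12];
    formatInt(kilobytes, number);
    std::fputs(label, stdout);
    std::fputs(number, stdout);
    std::fputs(" kB\n", stdout);
}
```

A call such as printSize("JS size: ", 15) then produces its output without pulling in any stream or formatting machinery from the standard library.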

Be­low is a break­down of var­i­ous op­ti­miza­tions on a min­i­mal ap­pli­ca­tion that does just a frame­buffer clear, each step chop­ping an­oth­er bit off the to­tal down­load size. All sizes are un­com­pressed, built in Re­lease mode with -Oz, --llvm-lto 1 and --closure 1. Lat­er on in the process, Bloaty McBloat­Face ex­per­i­men­tal We­bAssem­bly sup­port was used to dis­cov­er what func­tions con­trib­ute the most to fi­nal code size.

Operation                                  JS size    WASM size
Initial state                              52.1 kB    226.3 kB
Enabling minimal runtime [1]               36.3 kB    224.5 kB
Additional slimming flags [2]              35.7 kB    224.5 kB
Disabling filesystem [3]                   19.4 kB    224.5 kB
Chopping off all C++ stream usage          14.7 kB    83.6 kB
Enabling CORRADE_NO_ASSERT                 14.7 kB    75.4 kB
Removing a single use of std::sort() [4]   14.7 kB    69.3 kB
Removing one std::unordered_map [4]        14.7 kB    62.6 kB
Using emmalloc instead of dlmalloc [5]     14.7 kB    56.3 kB
Removing all printf() usage [6]            14.7 kB    44 kB (estimate)
[Chart: Download size (*.js, *.wasm) after each of the operations listed in the table above.]
1. -s MINIMAL_RUNTIME=2 -s ENVIRONMENT=web -lGL, plus temporarily also enabling -s IGNORE_CLOSURE_COMPILER_ERRORS=1 in order to make the Closure Compiler survive undefined variable errors caused by iostreams and the other issues mentioned above.
2. -s SUPPORT_ERRNO=0 -s GL_EMULATE_GLES_VERSION_STRING_FORMAT=0 -s GL_EXTENSIONS_IN_PREFIXED_FORMAT=0 -s GL_SUPPORT_AUTOMATIC_ENABLE_EXTENSIONS=0 -s GL_TRACK_ERRORS=0 -s DISABLE_DEPRECATED_FIND_EVENT_TARGET_BEHAVIOR=1, basically disabling what's enabled by default. In particular, GL_EXTENSIONS_IN_PREFIXED_FORMAT=0 is not supported by Magnum right now, causing it to not report any extensions, but that can be easily fixed. The result of disabling all these is … underwhelming.
3. -s FILESYSTEM=0 makes Emscripten not emit any filesystem-related code. Magnum provides filesystem access through various APIs (Utility::Directory, GL::Shader::addFile(), Trade::AbstractImporter::openFile(), …) and at the moment there's no possibility to compile all these out, so this is a nuclear option that works.
4. GL::Context uses a std::sort() and a std::unordered_map to check for extension presence and print their list in the engine startup log. It was frightening to see the removal of a single std::sort() causing a 10% drop in executable size. Since WebGL has roughly two dozen extensions (compared to > 200 on desktop and ES), maybe a space-efficient alternative implementation could be done for this target instead.
5. Doug Lea's malloc() is a general-purpose allocator, used by glibc among others. It's very performant and a good choice for code that does many small allocations (std::unordered_map, I'm looking at you). The downside is its larger size, and code doing fewer larger allocations might want to use -s MALLOC=emmalloc instead. We don't pretend Magnum is at that state yet, but other projects successfully switched to it, shaving more bytes off the download size.
6. After removing all of the above, std::printf() internals started appearing at the top of Bloaty's size report, totalling about 10% of the executable size. Magnum doesn't use it anywhere directly and all transitive usage of it was killed together with iostreams; further digging revealed that it gets called from libc++'s abort_message(), for example when aborting due to a pure virtual function call. Independent measurement showed that std::printf() is around 12 kB of additional code compared to std::puts(), mainly due to the inherent complexity of floating-point string conversion. It's planned to use the much simpler and smaller Ryū algorithm for Magnum's std::printf() replacement, additionally ensuring that float-to-string conversions can be DCE-d when not used. We might be looking into patching Emscripten's libc++ to not use the expensive implementation in its abort messages.

While all of the above size re­duc­tions were done in a hack-and-slash man­ner, the fi­nal ex­e­cutable still ini­tial­izes and ex­e­cutes prop­er­ly, clear­ing the frame­buffer and re­act­ing to in­put events. For ref­er­ence, check out diffs of the chainsaw-surgery branch­es in cor­rade and mag­num.

The above is definitely not all that can be done. Especially considering that removing two uses of semi-heavy STL APIs led to almost 20% savings in code size, there is most probably more such low-hanging fruit. The above tasks were added to mosra/magnum#293 (if not there already) and will get gradually integrated into master.
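The relative savings quoted in this article can be checked with a few lines of arithmetic on the sizes from the table. The sketch below is only a helper for such checks; reductionPercent() and totalSize() are hypothetical names introduced here.

```cpp
#include <cassert>

// Relative size reduction when going from `before` to `after`, in percent
double reductionPercent(double before, double after) {
    assert(before > 0.0);
    return (before - after)/before*100.0;
}

// Total download size, as the sum of the JS and WASM file sizes
double totalSize(double js, double wasm) {
    return js + wasm;
}
```

For instance, reductionPercent(75.4, 62.6) compares the WASM size before and after removing the std::sort() and std::unordered_map uses and comes out at roughly 17%, while totalSize(14.7, 44.0) gives the estimated total of 58.7 kB for the final state.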

Con­clu­sion

Bright times ahead! The new Plat­form::Em­scripte­nAp­pli­ca­tion is the first step to tru­ly min­i­mal We­bAssem­bly builds and the above hints that it’s pos­si­ble to have down­load sizes not too far from code care­ful­ly writ­ten in plain C. To give a fair com­par­i­son, the ba­sic frame­buffer clear sam­ple from @floooh’s Sokol Sam­ples is 42 kB in to­tal, while the above equiv­a­lent is rough­ly 59 kB. Us­ing C++(11), but not overus­ing it — and that’s just the be­gin­ning.