Table 1
Overview of libraries for audio analysis and MIR on web clients compared to Essentia.js, including libraries written purely in JS or cross-compiled to Wasm, in terms of their target applications and the number of algorithms suitable for MIR out of the box. *Csound and Faust are very extensive programming languages for audio DSP that require cross-compilation.
| Name | Implementation | MIR algorithms | Applications | Last updated |
|---|---|---|---|---|
| CsoundEmscripten (Lazzarini et al., 2014) | asm.js | * | processing, synthesis | 2021 |
| Meyda (Fiala et al., 2015) | plain JS | ∼20 | analysis | 2021 |
| JS-xtract (Jillings et al., 2016) | plain JS | ∼70 | analysis | 2021 |
| Piper (Thompson et al., 2017) | Wasm | ∼20 | analysis, processing | 2018 |
| Faust (Letz et al., 2017) | Wasm | * | processing, synthesis | 2021 |
| lfo (Matuszewski and Schnell, 2017) | plain JS | ∼15 | analysis, processing | 2017 |
| MMLL (Collins and Knotts, 2019) | plain JS | ∼15 | analysis | 2020 |
| Essentia.js | Wasm | ∼200 | analysis, processing, synthesis | 2021 |

Figure 1
Overview of the Essentia.js library in terms of its abstraction levels.

Listing 1
A simple example of offline audio feature extraction using Essentia.js via ES6 style imports.
Table 2
Transfer learning classifiers.
| Task | Classifier | Classes |
|---|---|---|
| genre | dortmund | alternative, blues, electronic, folkcountry, funksoulrnb, jazz, pop, raphiphop, rock |
| | gtzan | blues, classic, country, disco, hip hop, jazz, metal, pop, reggae, rock |
| | rosamerica | classic, dance, hip hop, jazz, pop, rhythm and blues, rock, speech |
| mood | acoustic | acoustic, non acoustic |
| | aggressive | aggressive, non aggressive |
| | electronic | electronic, non electronic |
| | happy | happy, non happy |
| | party | party, non party |
| | relaxed | relaxed, non relaxed |
| | sad | sad, non sad |
| misc. | danceability | danceable, non danceable |
| | voice/instrum. | voice, instrumental |
| | gender | male, female |
| | tonal/atonal | atonal, tonal |
| | urbansound8k | air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, street music |
| | fs-loop-ds | bass, chords, fx, melody, percussion |
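As an illustration of how the class lists in Table 2 might be used, the following minimal sketch maps a vector of classifier activations to its most likely label. The `CLASS_LABELS` object and the `topClass` helper are hypothetical and not part of Essentia.js; the sketch also assumes that a model's outputs follow the class order printed in the table, which should be checked against each model's metadata.

```javascript
// Class labels copied from Table 2 for two of the classifiers.
// ASSUMPTION: model outputs are ordered as the classes are listed in the table.
const CLASS_LABELS = {
  danceability: ["danceable", "non danceable"],
  "fs-loop-ds": ["bass", "chords", "fx", "melody", "percussion"],
};

// Hypothetical helper (not an Essentia.js API): returns the label with the
// highest activation for the given classifier.
function topClass(classifier, activations) {
  const labels = CLASS_LABELS[classifier];
  if (!labels) {
    throw new Error(`unknown classifier: ${classifier}`);
  }
  if (activations.length !== labels.length) {
    throw new Error(
      `expected ${labels.length} activations, got ${activations.length}`
    );
  }
  // Plain argmax over the activation vector.
  let best = 0;
  for (let i = 1; i < activations.length; i++) {
    if (activations[i] > activations[best]) best = i;
  }
  return { label: labels[best], activation: activations[best] };
}
```

For example, `topClass("danceability", [0.7, 0.3])` selects the first class of the danceability classifier.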
Table 3
The Essentia models. RF: Receptive field, AT: Auto-tagging, TL: Transfer learning.
| Model | RF (s) | Params. | Size (MB) | Purpose |
|---|---|---|---|---|
| MusiCNN | 3 | 787K | 3.1 | AT/TL |
| VGG | 3 | 605K | 2.4 | AT/TL |
| VGGish | 1 | 62M | 276 | TL |
| TempoCNN | 12 | [27K–1.2M] | [0.1–4.7] | Tempo |

Figure 2
Activations for the MSD and MTT auto-tagging taxonomies and all the transfer learning classifiers for Bohemian Rhapsody by Queen.

Listing 2
Example of offline audio feature extraction for the MusiCNN-based models using essentia.js-model via ES6 style imports.

Listing 3
Example of inference of MusiCNN-based models from the feature input computed in Listing 2 using essentia.js-model via ES6 style imports.

Figure 3
Essentia.js demo applications: (a) real-time mel-spectrogram (top-left), pitch estimation (top-right), HPCP (bottom-left) and music auto-tagging (bottom-right), (b) five Essentia.js transfer learning models for mood classification, (c) industrial application by SonoSuite for audio problem detection.
Table 4
Platform versions for each device used in the JavaScript benchmarks.
| Device | Chrome | Firefox | Node.js |
|---|---|---|---|
| Linux | 89.0.4389.114 (64-bit) | 87.0 (64-bit) | 14.15.1 |
| macOS | 89.0.4389.114 (64-bit) | 87.0 (64-bit) | 14.13.0 |
| Android | 92.0.4484.6 | Nightly 210421 | – |
| iOS | 87.0.4280.77 | 33.0 | – |

Figure 4
Mean execution time (in seconds) for common audio features on a 30-second music track. If the standard deviation is smaller than 0.005 it is not printed. The algorithms marked with * in (a) are only available in Essentia.js.

Figure 5
Mean execution time (in seconds) for Essentia.js model algorithms on a 30-second music track, comparing two TensorFlow.js back ends: (a) Wasm and (b) WebGL. If the standard deviation is smaller than 0.005 it is not printed. N.C. (not computed) marks values for which the benchmark suite was unable to complete execution. (a) CPU acceleration (Wasm on browsers). (b) GPU acceleration with WebGL. Since not all Android and iOS devices support WebGL or have powerful enough GPUs, only browser benchmarks on Linux and macOS are shown.
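The reporting convention used in the captions of Figures 4 and 5 (mean execution time, with the standard deviation omitted when it is smaller than 0.005) can be sketched as follows. The `summarizeTimings` helper is hypothetical and not taken from the benchmark suite; the use of the population standard deviation and of two decimal places are assumptions, since the captions do not specify either.

```javascript
// Hypothetical helper, not part of Essentia.js or its benchmark suite.
// Summarizes repeated timing measurements (in seconds) as "mean" or
// "mean ± std", hiding the standard deviation below a threshold.
function summarizeTimings(samples, stdThreshold = 0.005) {
  if (samples.length === 0) {
    throw new Error("at least one timing sample is required");
  }
  const mean = samples.reduce((sum, t) => sum + t, 0) / samples.length;
  // ASSUMPTION: population standard deviation (divide by N, not N - 1).
  const variance =
    samples.reduce((sum, t) => sum + (t - mean) ** 2, 0) / samples.length;
  const std = Math.sqrt(variance);
  // ASSUMPTION: values are reported with two decimal places.
  const label =
    std < stdThreshold
      ? mean.toFixed(2)
      : `${mean.toFixed(2)} ± ${std.toFixed(2)}`;
  return { mean, std, label };
}
```

Calling `summarizeTimings` on the per-run timings of one algorithm on one platform yields the text for a single cell of such a figure.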
