Audio and Music Analysis on the Web using Essentia.js

Albin Correya; Jorge Marcos-Fernández; Luis Joglar-Ongay; Pablo Alonso-Jiménez; Xavier Serra; Dmitry Bogdanov

doi:10.5334/tismir.111

Transactions of the International Society for Music Information Retrieval

Volume 4 (2021): Issue 1

By: Albin Correya, Jorge Marcos-Fernández, Luis Joglar-Ongay, Pablo Alonso-Jiménez, Xavier Serra and Dmitry Bogdanov

Open Access

Nov 2021

Figures & Tables

Table 1

Overview of libraries for audio analysis and MIR on web clients compared to Essentia.js, including libraries written purely in JS or cross-compiled for Wasm, in terms of their target applications and the number of algorithms suitable for MIR out of the box. *Csound and Faust are very extensive programming languages for audio DSP which require cross-compilation.

Name	Implementation	MIR algorithms	Applications	Last updated
CsoundEmscripten (Lazzarini et al., 2014)	asm.js⁴	*	processing, synthesis	2021
Meyda (Fiala et al., 2015)	plain JS	∼20	analysis	2021
JS-xtract (Jillings et al., 2016)	plain JS	∼70	analysis	2021
Piper (Thompson et al., 2017)	Wasm	∼20	analysis, processing	2018
Faust (Letz et al., 2017)	Wasm	*	processing, synthesis	2021
lfo (Matuszewski and Schnell, 2017)	plain JS	∼15	analysis, processing	2017
MMLL (Collins and Knotts, 2019)	plain JS	∼15	analysis	2020
Essentia.js	Wasm	∼200	analysis, processing, synthesis	2021

Figure 1

Overview of the *Essentia.js* library in terms of its abstraction levels.

Listing 1

A simple example of offline audio feature extraction using *Essentia.js* via ES6 style imports.

Table 2

Transfer learning classifiers.

	Task	Classes
genre	dortmund	alternative, blues, electronic, folkcountry, funksoulrnb, jazz, pop, raphiphop, rock
	gtzan	blues, classic, country, disco, hip hop, jazz, metal, pop, reggae, rock
	rosamerica	classic, dance, hip hop, jazz, pop, rhythm and blues, rock, speech
mood	acoustic	acoustic, non acoustic
	aggressive	aggressive, non aggressive
	electronic	electronic, non electronic
	happy	happy, non happy
	party	party, non party
	relaxed	relaxed, non relaxed
	sad	sad, non sad
misc.	danceability	danceable, non danceable
	voice/instrum.	voice, instrumental
	gender	male, female
	tonal/atonal	atonal, tonal
	urbansound8k	air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, street music
	fs-loop-ds	bass, chords, fx, melody, percussion
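
Each classifier in Table 2 produces one activation per class. The sketch below shows one way a client could turn such activations into a label. It is an illustration only: the `CLASSES` map, the `topClass` helper, and the activation values are hypothetical, and it assumes the activations arrive in the same order as the class names listed in Table 2, which the table itself does not state.

```javascript
// Class lists copied from Table 2 (two of the classifiers, for brevity).
const CLASSES = {
  danceability: ["danceable", "non danceable"],
  gtzan: ["blues", "classic", "country", "disco", "hip hop",
          "jazz", "metal", "pop", "reggae", "rock"],
};

// Hypothetical helper: return the class with the highest activation.
// Assumes `activations` holds one score per class, in Table 2 order.
function topClass(task, activations) {
  const labels = CLASSES[task];
  if (!labels) throw new Error(`unknown task: ${task}`);
  if (activations.length !== labels.length) {
    throw new Error(`expected ${labels.length} activations, got ${activations.length}`);
  }
  let best = 0;
  for (let i = 1; i < activations.length; i++) {
    if (activations[i] > activations[best]) best = i;
  }
  return { label: labels[best], activation: activations[best] };
}

// Made-up activations, for illustration only.
console.log(topClass("danceability", [0.82, 0.18]).label);
```

For the binary classifiers a threshold on a single activation would work equally well; the arg-max form is used here because it also covers the multi-class genre tasks.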

Table 3

The Essentia models. RF: Receptive field, AT: Auto-tagging, TL: Transfer learning.

Model	RF (s)	Params.	Size (MB)	Purpose
MusiCNN	3	787K	3.1	AT/TL
VGG	3	605K	2.4	AT/TL
VGGish	1	62M	276	TL
TempoCNN	12	[27K–1.2M]	[0.1–4.7]	Tempo
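
As a rough worked example of what the receptive fields in Table 3 imply, the sketch below counts how many full input windows fit in a track of a given length. It assumes non-overlapping windows, which is a simplification made here and not something the table specifies; `fullWindows` is a hypothetical helper name.

```javascript
// Receptive fields in seconds, copied from Table 3.
const RECEPTIVE_FIELD_S = { MusiCNN: 3, VGG: 3, VGGish: 1, TempoCNN: 12 };

// Number of full windows that fit in a track.
// Assumption of this sketch: windows do not overlap.
function fullWindows(model, trackSeconds) {
  const rf = RECEPTIVE_FIELD_S[model];
  if (rf === undefined) throw new Error(`unknown model: ${model}`);
  return Math.floor(trackSeconds / rf);
}

// Window counts for a 30-second track.
for (const model of Object.keys(RECEPTIVE_FIELD_S)) {
  console.log(model, fullWindows(model, 30));
}
```

With overlapping windows the counts would be higher, so these figures are best read as a lower bound on the number of predictions per track.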

Figure 2

Activations for the MSD and MTT auto-tagging taxonomies and all the transfer learning classifiers for *Bohemian Rhapsody* by *Queen*.

Listing 2

Example of offline audio feature extraction for the MusiCNN-based models using *essentia.js-model* via ES6 style imports.

Listing 3

Example of inference of MusiCNN-based models from the feature input computed in Listing 2 using *essentia.js-model* via ES6 style imports.

Figure 3

*Essentia.js* demo applications: (a) real-time mel-spectrogram (top-left), pitch estimation (top-right), HPCP (bottom-left) and music auto-tagging (bottom-right); (b) five *Essentia.js* transfer learning models for mood classification; (c) industrial application by SonoSuite for audio problem detection.

Table 4

Platform versions for each device used in the JavaScript benchmarks.

Device	Chrome	Firefox	Node.js
Linux	89.0.4389.114 (64-bit)	87.0 (64-bit)	14.15.1
macOS	89.0.4389.114 (64-bit)	87.0 (64-bit)	14.13.0
Android	92.0.4484.6	Nightly 210421	–
iOS	87.0.4280.77	33.0	–

Figure 4

Mean execution time (in seconds) for common audio features on a 30-second music track. If the standard deviation is smaller than 0.005 it is not printed. The algorithms marked with * in (a) are only available in *Essentia.js*.
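
The reporting rule in this caption can be sketched as follows. This is a minimal illustration, not the benchmark suite's own code: the helper names, the use of the sample standard deviation, the two-decimal output, and the "mean ± std" layout are all assumptions made here. A threshold of 0.005 is consistent with two-decimal reporting, since any smaller deviation would round to 0.00.

```javascript
// Arithmetic mean of an array of timings (seconds).
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample standard deviation (n - 1 denominator); an assumption of this sketch.
function sampleStd(xs) {
  if (xs.length < 2) return 0;
  const m = mean(xs);
  const ss = xs.reduce((acc, x) => acc + (x - m) ** 2, 0);
  return Math.sqrt(ss / (xs.length - 1));
}

// Format repeated timings as "mean" or "mean ± std",
// omitting the deviation when it falls below the threshold.
function formatTiming(runs, threshold = 0.005) {
  const m = mean(runs).toFixed(2);
  const s = sampleStd(runs);
  return s < threshold ? m : `${m} ± ${s.toFixed(2)}`;
}

// Made-up timings, for illustration only.
console.log(formatTiming([0.5, 0.5, 0.5]));
console.log(formatTiming([1, 2, 3]));
```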

Figure 5

Mean execution time (in seconds) for *Essentia.js* model algorithms on a 30-second music track, comparing the TensorFlow.js back ends (a) Wasm and (b) WebGL. If the standard deviation is smaller than 0.005 it is not printed. *N.C.* stands for values that were not computed because the benchmark suite was unable to complete execution. (a) CPU acceleration (Wasm on browsers). (b) GPU acceleration with WebGL. Since not all Android and iOS devices support WebGL or have powerful enough GPUs, only browser benchmarks on Linux and macOS are shown.


DOI: https://doi.org/10.5334/tismir.111 | Journal eISSN: 2514-3298

Language: English

Submitted on: Apr 24, 2021

Accepted on: Sep 2, 2021

Published on: Nov 22, 2021

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: Software, web audio, audio analysis, music signal processing, music audio classification, deep learning

© 2021 Albin Correya, Jorge Marcos-Fernández, Luis Joglar-Ongay, Pablo Alonso-Jiménez, Xavier Serra, Dmitry Bogdanov, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
