
Figure 1
From top to bottom: (a) guitar sheet music and tablature generated by Guitar Pro (https://www.guitar-pro.com); (b) spectrogram of the example guitar phrase; (c) discrete note events, pitch contour, and the intervals of playing techniques. In (c), ‘B’ denotes Bend, ‘R’ Release, ‘V’ Vibrato, ‘H’ Hammer-on, ‘P’ Pull-off, and ‘S’ Slide (see Section 1.1 for definitions). From left to right, the first marker, a ‘B’, shows a note gradually bent a whole step from A4 to B4. The second, an ‘R’, is a note pre-bent to B4 (i.e., bent before being sounded) and then released to A4 after the note is played. The third, another ‘B’, shows a half-step bend. The two ‘V’s are, respectively, a subtle vibrato with a small extent and a wide vibrato with a large extent. The last ‘S’ is a slide-out.

Figure 2
Flowchart of the proposed model for solo guitar transcription. The three stages are: (1) melody extraction, (2) playing technique detection and note tracking, and (3) note merging.

Figure 3
A running example of TENT in action. (a) Cut the melody into sub-melodies: the upper half shows the mel-spectrogram and the lower half the pitch contour; the vertical dashed line marks the border between the two sub-melodies. (b) Find local extrema and calculate the slope of each pattern: the vertical arrows indicate local maxima or minima, and the interval between two adjacent extrema is referred to as a pattern; ‘GS’ denotes the slope of a pattern. (c) Get the trend: the blue line represents the trend labels, with values 1, 0, or –1. (d) Get techniques by rules and a classifier: in this example, a Bend (B), a Vibrato (V), and a Slide-out (SO) are detected; Bend is decided by a CNN classifier, whereas Vibrato and Slide-out are recognized by rules. (e) Get the initial estimate of note tracking: the pink highlights are the estimated note events before merging. (f) Merge note events: the adjusted red highlights are the final note events.
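The pattern-and-trend extraction of panels (b)–(c) can be sketched as follows. This is an illustrative re-implementation, not the authors' code: the slope definition (semitones per frame between adjacent extrema) and the endpoint handling are assumptions.

```python
import numpy as np

def trend_labels(pitch, alpha=0.05):
    """Quantize a pitch contour (in semitones) into trend labels 1/0/-1.

    A minimal sketch of Figure 3(b)-(c): local extrema split the contour
    into patterns, and each pattern's slope is compared against the
    threshold alpha (a hypothetical re-implementation).
    """
    pitch = np.asarray(pitch, dtype=float)
    diff = np.sign(np.diff(pitch))
    # Indices of local maxima/minima, with the endpoints included.
    extrema = [0] + [i for i in range(1, len(pitch) - 1)
                     if diff[i - 1] != diff[i] and diff[i] != 0] + [len(pitch) - 1]
    trend = np.zeros(len(pitch), dtype=int)
    for a, b in zip(extrema[:-1], extrema[1:]):
        slope = (pitch[b] - pitch[a]) / max(b - a, 1)  # semitones per frame
        if slope > alpha:
            trend[a:b] = 1      # ascending pattern
        elif slope < -alpha:
            trend[a:b] = -1     # descending pattern
    return trend
```

A steadily rising contour then yields a run of 1s, a flat contour all 0s, and a bend-release a run of 1s followed by –1s, which is what the rules and the classifier operate on.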
Table 1
Architecture of the four neural-network-based classifiers we implemented and evaluated for playing technique recognition. Notation: ‘Conv(number of filters, filter size, stride)’ denotes a convolutional layer and its hyperparameters; ‘MaxPool(stride)’ and ‘MeanPool(stride)’ are max-pooling and mean-pooling layers; ‘FC(number of neurons)’ is a fully-connected layer. All convolutional and fully-connected layers use a dropout rate of 0.5. The four models have around 2.9M, 2.3M, 7.5M, and 11M parameters, respectively.
| CNN2 | CNN3 | MLP4 | MLP7 |
|---|---|---|---|
| Conv (256,3,1) | Conv (256,3,1) | FC (1800) | FC (1800) |
| MaxPool (2) | MaxPool (2) | FC (1800) | FC (1800) |
| Conv (128,3,1) | Conv (128,3,1) | FC (900) | FC (1800) |
| MeanPool (2) | MaxPool (2) | FC (900) | FC (1800) |
| FC (1800) | Conv (128,3,1) | FC (4) | FC (900) |
| FC (900) | MeanPool (2) | | FC (900) |
| FC (4) | FC (1800) | | FC (900) |
| | FC (900) | | FC (4) |
| | FC (4) | | |
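As one concrete reading of the CNN2 column, the stack can be sketched in PyTorch. The 1-D convolution along time, the input patch size (41 MFCC+PC dimensions by 9 frames), and the padding are assumptions, since Table 1 does not specify them:

```python
import torch
import torch.nn as nn

class CNN2(nn.Module):
    """Sketch of the CNN2 column in Table 1 (not the authors' code).

    Assumes 1-D convolution along time with D=41 input feature channels
    (MFCC+PC) and T=9 frames per segment; both are hypothetical choices.
    """
    def __init__(self, in_dim=41, n_frames=9, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_dim, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(), nn.Dropout(0.5),
            nn.MaxPool1d(2),
            nn.Conv1d(256, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(), nn.Dropout(0.5),
            nn.AvgPool1d(2),  # 'MeanPool' in Table 1
        )
        flat = 128 * (n_frames // 4)  # time axis halved by each pooling layer
        self.classifier = nn.Sequential(
            nn.Linear(flat, 1800), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1800, 900), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(900, n_classes),
        )

    def forward(self, x):
        # x: (batch, in_dim, n_frames) -> (batch, n_classes) logits
        h = self.features(x)
        return self.classifier(h.flatten(1))
```

The exact parameter count depends on the assumed patch length, so it need not match the ~2.9M figure in the caption.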
Table 2
Performance of CNN2 using different input features on the G&N-Seg dataset. The numbers in parentheses in the last three columns of this table, and of Tables 3 and 4, are the standard deviations of the results over the five folds of cross-validation.
| Features (#Dim) | Precision | Recall | F-score |
|---|---|---|---|
| MFCC (39) | 0.771 (±0.059) | 0.752 (±0.058) | 0.759 (±0.057) |
| MFCC+PC (41) | 0.809 (±0.054) | 0.792 (±0.050) | 0.797 (±0.050) |
| LMSpec (128) | 0.718 (±0.044) | 0.670 (±0.028) | 0.686 (±0.032) |
| LMSpec+PC (130) | 0.720 (±0.045) | 0.667 (±0.023) | 0.683 (±0.029) |
| MFCC+LMSpec (167) | 0.769 (±0.035) | 0.749 (±0.034) | 0.757 (±0.034) |
| All (169) | 0.778 (±0.040) | 0.760 (±0.029) | 0.767 (±0.033) |
Table 3
Performance of different models on the G&N-Seg dataset using MFCC+PC as the input features.
| Model | Precision | Recall | F-score |
|---|---|---|---|
| CNN2 | 0.809 (±0.054) | 0.792 (±0.050) | 0.797 (±0.050) |
| CNN3 | 0.812 (±0.053) | 0.784 (±0.049) | 0.793 (±0.050) |
| MLP4 | 0.778 (±0.057) | 0.753 (±0.055) | 0.762 (±0.053) |
| MLP7 | 0.780 (±0.053) | 0.759 (±0.052) | 0.765 (±0.049) |
Table 4
Per-class results of the CNN2 model with MFCC+PC input features.
| Technique | Precision | Recall | F-score |
|---|---|---|---|
| Bend/Release | 0.714 (±0.088) | 0.829 (±0.108) | 0.767 (±0.086) |
| Hammer-on | 0.644 (±0.331) | 0.723 (±0.222) | 0.681 (±0.281) |
| Normal | 0.933 (±0.044) | 0.832 (±0.070) | 0.880 (±0.045) |
| Pull-off | 0.665 (±0.089) | 0.810 (±0.118) | 0.730 (±0.074) |
| Slide | 0.383 (±0.113) | 0.394 (±0.159) | 0.388 (±0.132) |
| All | 0.809 (±0.054) | 0.792 (±0.050) | 0.797 (±0.050) |

Figure 4
Confusion matrix of the 10 CNN2 MFCC+PC models, aggregated over all the evaluation runs. Each cell shows the total number of classifications and the corresponding recall rate.
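The aggregation behind Figure 4 can be sketched with a generic confusion-matrix helper (not the authors' evaluation code): each row collects the counts for one ground-truth class, and the diagonal fraction of a row is that class's recall.

```python
def confusion_with_recall(y_true, y_pred, labels):
    """Build a confusion matrix (rows: ground truth, columns: predictions)
    and the per-class recall, as reported in Figure 4. A generic sketch."""
    idx = {c: i for i, c in enumerate(labels)}
    n = len(labels)
    mat = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        mat[idx[t]][idx[p]] += 1
    # Recall of class i = correct predictions / all ground-truth examples.
    recall = [row[i] / sum(row) if sum(row) else 0.0
              for i, row in enumerate(mat)]
    return mat, recall
```

Summing the matrices of the 10 models before computing recall gives the aggregated figure.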
Table 5
Sensitivity test for three parameters of TENT: the slope parameter α, and the first and second playing-technique candidate thresholds β1 and β2 (in semitones). We report the F-scores of playing technique recognition in the note tracking experiment, as well as the onset+F0 F-score for note tracking, on the G&N dataset. Here, the default values of (α, β1, β2) are (0.5, 0.3, 0.5); when one parameter is varied, the other two are kept at their defaults. Note that we use (0.05, 0.3, 0.5) in the other experiments in the paper.
| | α = 0.05 | α = 0.5 | α = 0.95 | β1 = 0.1 | β1 = 0.3 | β1 = 0.5 | β2 = 0.3 | β2 = 0.5 | β2 = 0.8 |
|---|---|---|---|---|---|---|---|---|---|
| Bend+Release | 0.606 | 0.602 | 0.602 | 0.570 | 0.602 | 0.585 | 0.633 | 0.602 | 0.561 |
| Hammer-on | 0.606 | 0.544 | 0.556 | 0.561 | 0.544 | 0.561 | 0.333 | 0.544 | 0.500 |
| Pull-off | 0.577 | 0.578 | 0.526 | 0.527 | 0.578 | 0.570 | 0.527 | 0.578 | 0.574 |
| Slide | 0.414 | 0.355 | 0.362 | 0.327 | 0.355 | 0.333 | 0.021 | 0.355 | 0.491 |
| Vibrato | 0.675 | 0.688 | 0.711 | 0.463 | 0.688 | 0.648 | 0.652 | 0.688 | 0.588 |
| Note Tracking (onset+F0) | 0.859 | 0.858 | 0.856 | 0.835 | 0.858 | 0.851 | 0.834 | 0.858 | 0.862 |
Table 6
Performance of different methods for note tracking on the G&N dataset, in terms of precision, recall, and F-score. The evaluated methods are Peeters (2006), pYIN (Mauch and Dixon, 2014), Melodia (Salamon and Gómez, 2012), the note tracking function of Tony (Mauch et al., 2015), and the proposed TENT model.
| Melody extraction method | Note tracking method | Onset+F0+Offset Prec. | Onset+F0+Offset Recall | Onset+F0+Offset F1 | Onset+F0 Prec. | Onset+F0 Recall | Onset+F0 F1 | Onset Prec. | Onset Recall | Onset F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Peeters | Tony | 0.384 | 0.645 | 0.481 | 0.517 | 0.869 | 0.648 | 0.542 | 0.912 | 0.680 |
| pYIN | Tony | 0.538 | 0.659 | 0.592 | 0.674 | 0.826 | 0.742 | 0.729 | 0.893 | 0.803 |
| Melodia | Tony | 0.499 | 0.671 | 0.573 | 0.652 | 0.877 | 0.748 | 0.679 | 0.913 | 0.779 |
| Peeters | TENT | 0.477 | 0.562 | 0.516 | 0.661 | 0.778 | 0.715 | 0.696 | 0.819 | 0.752 |
| pYIN | TENT | 0.518 | 0.607 | 0.559 | 0.743 | 0.872 | 0.802 | 0.779 | 0.913 | 0.841 |
| Melodia | TENT | 0.679 | 0.697 | 0.688 | 0.841 | 0.877 | 0.858 | 0.879 | 0.903 | 0.891 |
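The onset+F0 criterion behind the middle columns can be sketched as a simple note-matching routine. The 50 ms onset tolerance and quarter-tone pitch tolerance are assumed defaults (in the style of common transcription evaluation), not necessarily the exact settings used here:

```python
def onset_f0_fscore(ref, est, onset_tol=0.05, pitch_tol=0.5):
    """Simplified onset+F0 note-matching precision/recall/F-score.

    ref and est are lists of (onset_in_seconds, midi_pitch) note events.
    A pair matches when the onsets are within onset_tol seconds and the
    pitches within pitch_tol semitones; each estimated note may match at
    most once. A greedy sketch of the criterion, not the exact
    evaluation code used in the paper.
    """
    matched, used = 0, set()
    for r_on, r_p in ref:
        for j, (e_on, e_p) in enumerate(est):
            if j in used:
                continue
            if abs(r_on - e_on) <= onset_tol and abs(r_p - e_p) <= pitch_tol:
                matched += 1
                used.add(j)
                break
    prec = matched / len(est) if est else 0.0
    rec = matched / len(ref) if ref else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

The onset-only columns drop the pitch condition, and the onset+F0+offset columns additionally require offsets to agree.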

Figure 5
In these examples, the upper image in each panel is the power spectrogram, and the red lines in the lower graph are the pitch contours. The characters above the pitch contours, together with the vertical dotted lines, are the ground-truth labels; the characters below are the predicted labels. The pink horizontal highlights behind the pitch contours (best viewed in color) represent the note events.
