
TENT: Technique-Embedded Note Tracking for Real-World Guitar Solo Recordings

Open Access | Jul 2019

Figures & Tables

Figure 1

From top to bottom: (a) guitar sheet music and tablature generated by Guitar Pro (https://www.guitar-pro.com); (b) spectrogram of the example guitar phrase; (c) discrete note events, pitch contour, and the intervals of the playing techniques. In (c), ‘B’ denotes Bend, ‘R’ Release, ‘V’ Vibrato, ‘H’ Hammer-on, ‘P’ Pull-off, and ‘S’ Slide (see Section 1.1 for definitions). From left to right, the first ‘B’ marks a note bent gradually up a whole step, from A4 to B4. The second mark, ‘R’, is a note pre-bent to B4, i.e., bent before being sounded, and then released to A4 after the note is played. The third mark, ‘B’, is a half-step bend. The two ‘V’s are, respectively, a subtle vibrato of small extent and a wide vibrato of large extent. The last ‘S’ is a slide-out.
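The bend intervals described in the caption follow directly from the logarithmic pitch scale: the signed distance in semitones between two frequencies is 12·log2(f2/f1). A small sketch (the function name is mine, not from the paper) confirming that A4 → B4 is a full, i.e. two-semitone, bend and A4 → A♯4 a half-step bend:

```python
import math

def semitones(f_from_hz: float, f_to_hz: float) -> float:
    """Signed interval in semitones between two frequencies."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)

# A4 = 440 Hz, B4 ~= 493.88 Hz: a full (whole-step) bend spans two semitones;
# A4 -> A#4 (~466.16 Hz) is the half-step bend.
full_bend = semitones(440.0, 493.88)
half_bend = semitones(440.0, 466.16)
print(round(full_bend, 2), round(half_bend, 2))  # → 2.0 1.0
```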

Figure 2

Flowchart of the proposed model for solo guitar transcription. The three stages are: (1) melody extraction, (2) playing technique detection and note tracking, and (3) note merging.

Figure 3

A running example of TENT in action. (a) Cut the melody into sub-melodies: the upper half shows the mel-spectrogram and the lower half the pitch contour; the vertical dashed line marks the border between the two sub-melodies. (b) Find local extrema and compute the slope of each pattern: the vertical arrows indicate local maxima and minima, and the interval between two adjacent extrema is referred to as a pattern; GS denotes the slope of a pattern. (c) Get the trend: the blue line shows the trend labels, taking values 1, 0, or –1. (d) Get techniques by rules and a classifier: in this example, a Bend (B), a Vibrato (V), and a Slide-out (SO) are detected; Bend is decided by a CNN classifier, whereas Vibrato and Slide-out are recognized by rules. (e) Get the initial note-tracking estimate: the pink highlights are the estimated note events before merging. (f) Merge note events: the adjusted red highlights are the final note events.
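The extrema-and-slope steps in panels (b) and (c) can be sketched as follows. This is a minimal illustration only: the slope threshold (named `alpha` here, matching the sensitivity test in Table 5) and the semitones-per-frame units are assumptions, and the paper's full rules differ in detail.

```python
def local_extrema(contour):
    """Indices where the contour changes direction, plus the endpoints."""
    idx = [0]
    for i in range(1, len(contour) - 1):
        if (contour[i] - contour[i - 1]) * (contour[i + 1] - contour[i]) < 0:
            idx.append(i)
    idx.append(len(contour) - 1)
    return idx

def trend_labels(contour, alpha=0.5):
    """Label each inter-extrema pattern 1 (up), -1 (down), or 0 (flat)
    by comparing its slope (semitones per frame here) against alpha."""
    labels = []
    ext = local_extrema(contour)
    for a, b in zip(ext, ext[1:]):
        slope = (contour[b] - contour[a]) / (b - a)
        labels.append(1 if slope > alpha else -1 if slope < -alpha else 0)
    return labels

# A bend-and-release-like contour: up two semitones, then back down.
print(trend_labels([0, 1, 2, 1, 0], alpha=0.3))  # → [1, -1]
```

An upward pattern followed by a downward one (as here) is the kind of shape the rules and the CNN classifier would then resolve into a specific technique such as Bend plus Release.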

Table 1

Architecture of the four neural-network-based classifiers we implemented and evaluated for playing technique recognition. Notation: ‘Conv(number of filters, filter size, stride)’ denotes a convolutional layer and its hyperparameters; ‘MaxPool(stride)’ and ‘MeanPool(stride)’ are max-pooling and mean-pooling layers; ‘FC(number of neurons)’ is a fully-connected layer. All convolutional and fully-connected layers use a dropout rate of 0.5. The total numbers of parameters for the four models are around 2.9M, 2.3M, 7.5M, and 11M, respectively.

| CNN2 | CNN3 | MLP4 | MLP7 |
|---|---|---|---|
| Conv (256,3,1) | Conv (256,3,1) | FC (1800) | FC (1800) |
| MaxPool (2) | MaxPool (2) | FC (1800) | FC (1800) |
| Conv (128,3,1) | Conv (128,3,1) | FC (900) | FC (1800) |
| MeanPool (2) | MaxPool (2) | FC (900) | FC (1800) |
| FC (1800) | Conv (128,3,1) | FC (4) | FC (900) |
| FC (900) | MeanPool (2) | | FC (900) |
| FC (4) | FC (1800) | | FC (900) |
| | FC (900) | | FC (4) |
| | FC (4) | | |
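As a sanity check on the quoted parameter counts, the CNN2 column can be tallied by hand. The input shape is not listed in the table, so a 41-dimensional MFCC+PC input (cf. Table 2) over 20 frames, with 'same'-padded stride-1 convolutions along time, is assumed here purely for illustration; under those assumptions the total lands near the quoted 2.9M:

```python
# Assumed input: 41 feature dimensions (channels) x 20 frames.
feat, frames = 41, 20

def conv1d_params(in_ch, out_ch, ksize):
    return out_ch * in_ch * ksize + out_ch   # weights + biases

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

total = 0
total += conv1d_params(feat, 256, 3)         # Conv (256,3,1)
frames //= 2                                 # MaxPool (2)
total += conv1d_params(256, 128, 3)          # Conv (128,3,1)
frames //= 2                                 # MeanPool (2)
total += fc_params(128 * frames, 1800)       # FC (1800) on flattened maps
total += fc_params(1800, 900)                # FC (900)
total += fc_params(900, 4)                   # FC (4)
print(total)                                 # → 2908480, i.e. ~2.9M
```

The agreement does not confirm the paper's exact input configuration, but it shows the layer list in the CNN2 column is consistent with the ~2.9M figure in the caption.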
Table 2

Performance of CNN2 using different input features on the G&N-Seg dataset. The numbers in parentheses in the last three columns of this table, and of Tables 3 and 4, are the standard deviations of the results of five-fold cross-validation.

| Features (#Dim) | Precision | Recall | F-score |
|---|---|---|---|
| MFCC (39) | 0.771 (±0.059) | 0.752 (±0.058) | 0.759 (±0.057) |
| MFCC+PC (41) | 0.809 (±0.054) | 0.792 (±0.050) | 0.797 (±0.050) |
| LMSpec (128) | 0.718 (±0.044) | 0.670 (±0.028) | 0.686 (±0.032) |
| LMSpec+PC (130) | 0.720 (±0.045) | 0.667 (±0.023) | 0.683 (±0.029) |
| MFCC+LMSpec (167) | 0.769 (±0.035) | 0.749 (±0.034) | 0.757 (±0.034) |
| All (169) | 0.778 (±0.040) | 0.760 (±0.029) | 0.767 (±0.033) |
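The ± values above are standard deviations over the five cross-validation folds. With hypothetical per-fold F-scores (the real per-fold numbers are not reported, and whether the paper uses the sample or population standard deviation is not stated; the sample version is assumed here), such an entry is computed as:

```python
from statistics import mean, stdev

# Hypothetical per-fold F-scores for one feature set (five-fold CV).
fold_f1 = [0.83, 0.74, 0.79, 0.81, 0.82]

# Report as "mean (±sample standard deviation)", Table 2 style.
print(f"{mean(fold_f1):.3f} (±{stdev(fold_f1):.3f})")  # → 0.798 (±0.036)
```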
Table 3

Performance of different models on the G&N-Seg dataset using MFCC+PC as the input features.

| Model | Precision | Recall | F-score |
|---|---|---|---|
| CNN2 | 0.809 (±0.054) | 0.792 (±0.050) | 0.797 (±0.050) |
| CNN3 | 0.812 (±0.053) | 0.784 (±0.049) | 0.793 (±0.050) |
| MLP4 | 0.778 (±0.057) | 0.753 (±0.055) | 0.762 (±0.053) |
| MLP7 | 0.780 (±0.053) | 0.759 (±0.052) | 0.765 (±0.049) |
Table 4

Per-class results of the CNN2 model with MFCC+PC features.

| Technique | Precision | Recall | F-score |
|---|---|---|---|
| Bend/Release | 0.714 (±0.088) | 0.829 (±0.108) | 0.767 (±0.086) |
| Hammer-on | 0.644 (±0.331) | 0.723 (±0.222) | 0.681 (±0.281) |
| Normal | 0.933 (±0.044) | 0.832 (±0.070) | 0.880 (±0.045) |
| Pull-off | 0.665 (±0.089) | 0.810 (±0.118) | 0.730 (±0.074) |
| Slide | 0.383 (±0.113) | 0.394 (±0.159) | 0.388 (±0.132) |
| All | 0.809 (±0.054) | 0.792 (±0.050) | 0.797 (±0.050) |
Figure 4

Confusion matrix of the 10 CNN2 MFCC+PC models, aggregated over all the evaluation runs. Each cell shows the total number of classifications and the recall rate.

Table 5

Sensitivity test for three parameters of TENT: the slope parameter α, and the first and second playing-technique candidate thresholds β1 and β2 (in semitones). We report the F-scores of playing technique recognition within the note tracking experiment, as well as the onset+F0 F-score for note tracking, on the G&N dataset. The default values of (α, β1, β2) here are (0.5, 0.3, 0.5); note that we use (0.05, 0.3, 0.5) in the other experiments in the paper.

| | α=0.05 | α=0.5 | α=0.95 | β1=0.1 | β1=0.3 | β1=0.5 | β2=0.3 | β2=0.5 | β2=0.8 |
|---|---|---|---|---|---|---|---|---|---|
| Bend+Release | 0.606 | 0.602 | 0.602 | 0.570 | 0.602 | 0.585 | 0.633 | 0.602 | 0.561 |
| Hammer-on | 0.606 | 0.544 | 0.556 | 0.561 | 0.544 | 0.561 | 0.333 | 0.544 | 0.500 |
| Pull-off | 0.577 | 0.578 | 0.526 | 0.527 | 0.578 | 0.570 | 0.527 | 0.578 | 0.574 |
| Slide | 0.414 | 0.355 | 0.362 | 0.327 | 0.355 | 0.333 | 0.021 | 0.355 | 0.491 |
| Vibrato | 0.675 | 0.688 | 0.711 | 0.463 | 0.688 | 0.648 | 0.652 | 0.688 | 0.588 |
| Note Tracking (onset+F0) | 0.859 | 0.858 | 0.856 | 0.835 | 0.858 | 0.851 | 0.834 | 0.858 | 0.862 |
Table 6

Performance of different methods for note tracking on the G&N dataset, in terms of precision, recall, and F-score. The evaluated methods are Peeters (2006), pYIN (Mauch and Dixon, 2014), Melodia (Salamon and Gómez, 2012), the note tracking function of Tony (Mauch et al., 2015), and the proposed TENT model.

| Melody extraction method | Note tracking method | Onset+F0+Offset Prec. | Recall | F1 | Onset+F0 Prec. | Recall | F1 | Onset Prec. | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Peeters | Tony | 0.384 | 0.645 | 0.481 | 0.517 | 0.869 | 0.648 | 0.542 | 0.912 | 0.680 |
| pYIN | Tony | 0.538 | 0.659 | 0.592 | 0.674 | 0.826 | 0.742 | 0.729 | 0.893 | 0.803 |
| Melodia | Tony | 0.499 | 0.671 | 0.573 | 0.652 | 0.877 | 0.748 | 0.679 | 0.913 | 0.779 |
| Peeters | TENT | 0.477 | 0.562 | 0.516 | 0.661 | 0.778 | 0.715 | 0.696 | 0.819 | 0.752 |
| pYIN | TENT | 0.518 | 0.607 | 0.559 | 0.743 | 0.872 | 0.802 | 0.779 | 0.913 | 0.841 |
| Melodia | TENT | 0.679 | 0.697 | 0.688 | 0.841 | 0.877 | 0.858 | 0.879 | 0.903 | 0.891 |
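Under the onset+F0 criterion in the table, an estimated note counts as correct when both its onset time and pitch fall within tolerance of a not-yet-matched reference note; the customary tolerances are 50 ms and a quarter tone (as in the mir_eval toolkit, which is assumed here, not stated in the caption). A minimal greedy-matching sketch:

```python
def note_f1(ref, est, onset_tol=0.05, pitch_tol=0.5):
    """F-score for notes given as (onset_sec, pitch_in_semitones) pairs,
    with one-to-one greedy matching under onset and pitch tolerances."""
    matched, used = 0, set()
    for r_on, r_pitch in ref:
        for j, (e_on, e_pitch) in enumerate(est):
            if (j not in used and abs(r_on - e_on) <= onset_tol
                    and abs(r_pitch - e_pitch) <= pitch_tol):
                used.add(j)
                matched += 1
                break
    prec = matched / len(est) if est else 0.0
    rec = matched / len(ref) if ref else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

ref = [(0.00, 69.0), (0.50, 71.0), (1.00, 69.0)]
est = [(0.01, 69.2), (0.52, 71.1), (1.30, 64.0)]   # third note is wrong
print(round(note_f1(ref, est), 3))                 # → 0.667
```

The offset condition in the stricter onset+F0+offset criterion would add an analogous tolerance check on note endings; production evaluations should use mir_eval rather than a hand-rolled matcher.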
Figure 5

In these examples, the upper image is the power spectrogram, and the red lines in the lower graph are the pitch contours. Characters above the pitch contours, as well as the vertical dotted lines, are the ground-truth labels; characters below are the predicted labels. Behind the pitch contours, the pink horizontal highlights (best viewed in color) represent the note events.

DOI: https://doi.org/10.5334/tismir.23 | Journal eISSN: 2514-3298
Language: English
Submitted on: Oct 2, 2018
Accepted on: Feb 27, 2019
Published on: Jul 9, 2019
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2019 Ting-Wei Su, Yuan-Ping Chen, Li Su, Yi-Hsuan Yang, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.