
Figure 1
From top to bottom: (a) guitar sheet music and tablature generated by Guitar Pro (https://www.guitar-pro.com); (b) spectrogram of the example guitar phrase; (c) discrete note events, pitch contour, and the intervals of playing techniques. In (c), ‘B’ denotes Bend, ‘R’ Release, ‘V’ Vibrato, ‘H’ Hammer-on, ‘P’ Pull-off, and ‘S’ Slide (see Section 1.1 for definitions). From left to right, the first marker, a ‘B’, shows a note gradually bent a whole step from A4 to B4. The second, an ‘R’, is a note pre-bent to B4 (i.e., bent before being sounded) and then released to A4 after the note is played. The third, another ‘B’, shows a half-step bend. The two ‘V’s are, respectively, a subtle vibrato with a small extent and a wide vibrato with a large extent. The last ‘S’ is a slide-out.

Figure 2
Flowchart of the proposed model for solo guitar transcription. The three stages are: (1) melody extraction, (2) playing technique detection and note tracking, and (3) note merging.

Figure 3
A running example of TENT in action. (a) Cut the melody into sub-melodies: the upper half shows the mel-spectrogram and the lower half the pitch contour; the vertical dashed line marks the border between the two sub-melodies. (b) Find local extrema and calculate the slope of each pattern: the vertical arrows indicate local maxima or minima, and the interval between two adjacent extrema is referred to as a pattern; ‘GS’ denotes the slope of a pattern. (c) Get the trend: the blue line represents the trend labels, with values 1, 0, or –1. (d) Get techniques by rules and a classifier: in this example, a Bend (B), a Vibrato (V), and a Slide-out (SO) are detected; Bend is decided by a CNN classifier, whereas Vibrato and Slide-out are recognized by rules. (e) Get the initial estimate of note tracking: the pink highlights are the estimated note events before merging. (f) Merge note events: the adjusted red highlights are the final note events.
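The pattern-and-trend extraction of panels (b)–(c) can be sketched as follows. This is an illustrative re-implementation, not the authors' code: the slope definition (semitones per frame between adjacent extrema) and the endpoint handling are assumptions.

```python
import numpy as np

def trend_labels(pitch, alpha=0.05):
    """Quantize a pitch contour (in semitones) into trend labels 1/0/-1.

    A minimal sketch of Figure 3(b)-(c): local extrema split the contour
    into patterns, and each pattern's slope is compared against the
    threshold alpha (a hypothetical re-implementation).
    """
    pitch = np.asarray(pitch, dtype=float)
    diff = np.sign(np.diff(pitch))
    # Indices of local maxima/minima, with the endpoints included.
    extrema = [0] + [i for i in range(1, len(pitch) - 1)
                     if diff[i - 1] != diff[i] and diff[i] != 0] + [len(pitch) - 1]
    trend = np.zeros(len(pitch), dtype=int)
    for a, b in zip(extrema[:-1], extrema[1:]):
        slope = (pitch[b] - pitch[a]) / max(b - a, 1)  # semitones per frame
        if slope > alpha:
            trend[a:b] = 1      # ascending pattern
        elif slope < -alpha:
            trend[a:b] = -1     # descending pattern
    return trend
```

A steadily rising contour then yields a run of 1s, a flat contour all 0s, and a bend-release a run of 1s followed by –1s, which is what the rules and the classifier operate on.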
Table 1
Architecture of the four neural-network-based classifiers we implemented and evaluated for playing technique recognition. Notation: ‘Conv(number of filters, filter size, stride)’ denotes a convolutional layer and its hyperparameters; ‘MaxPool(stride)’ and ‘MeanPool(stride)’ are max-pooling and mean-pooling layers; ‘FC(number of neurons)’ is a fully-connected layer. All convolutional and fully-connected layers use a dropout rate of 0.5. The four models have around 2.9M, 2.3M, 7.5M, and 11M parameters, respectively.
| CNN2 | CNN3 | MLP4 | MLP7 |
|---|---|---|---|
| Conv (256,3,1) | Conv (256,3,1) | FC (1800) | FC (1800) |
| MaxPool (2) | MaxPool (2) | FC (1800) | FC (1800) |
| Conv (128,3,1) | Conv (128,3,1) | FC (900) | FC (1800) |
| MeanPool (2) | MaxPool (2) | FC (900) | FC (1800) |
| FC (1800) | Conv (128,3,1) | FC (4) | FC (900) |
| FC (900) | MeanPool (2) | | FC (900) |
| FC (4) | FC (1800) | | FC (900) |
| | FC (900) | | FC (4) |
| | FC (4) | | |
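As one concrete reading of the CNN2 column, the stack can be sketched in PyTorch. The 1-D convolution along time, the input patch size (41 MFCC+PC dimensions by 9 frames), and the padding are assumptions, since Table 1 does not specify them:

```python
import torch
import torch.nn as nn

class CNN2(nn.Module):
    """Sketch of the CNN2 column in Table 1 (not the authors' code).

    Assumes 1-D convolution along time with D=41 input feature channels
    (MFCC+PC) and T=9 frames per segment; both are hypothetical choices.
    """
    def __init__(self, in_dim=41, n_frames=9, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_dim, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(), nn.Dropout(0.5),
            nn.MaxPool1d(2),
            nn.Conv1d(256, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(), nn.Dropout(0.5),
            nn.AvgPool1d(2),  # 'MeanPool' in Table 1
        )
        flat = 128 * (n_frames // 4)  # time axis halved by each pooling layer
        self.classifier = nn.Sequential(
            nn.Linear(flat, 1800), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1800, 900), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(900, n_classes),
        )

    def forward(self, x):
        # x: (batch, in_dim, n_frames) -> (batch, n_classes) logits
        h = self.features(x)
        return self.classifier(h.flatten(1))
```

The exact parameter count depends on the assumed patch length, so it need not match the ~2.9M figure in the caption.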
Table 2
Performance of CNN2 using different input features on the G&N-Seg dataset. The numbers in parentheses in the last three columns of this table, and of Tables 3 and 4, are the standard deviations of the results over the five folds of cross-validation.
| Features (#Dim) | Precision | Recall | F-score |
|---|---|---|---|
| MFCC (39) | 0.771 (±0.059) | 0.752 (±0.058) | 0.759 (±0.057) |
| MFCC+PC (41) | 0.809 (±0.054) | 0.792 (±0.050) | 0.797 (±0.050) |
| LMSpec (128) | 0.718 (±0.044) | 0.670 (±0.028) | 0.686 (±0.032) |
| LMSpec+PC (130) | 0.720 (±0.045) | 0.667 (±0.023) | 0.683 (±0.029) |
| MFCC+LMSpec (167) | 0.769 (±0.035) | 0.749 (±0.034) | 0.757 (±0.034) |
| All (169) | 0.778 (±0.040) | 0.760 (±0.029) | 0.767 (±0.033) |
Table 3
Performance of different models on the G&N-Seg dataset using MFCC+PC as the input features.
| Model | Precision | Recall | F-score |
|---|---|---|---|
| CNN2 | 0.809 (±0.054) | 0.792 (±0.050) | 0.797 (±0.050) |
| CNN3 | 0.812 (±0.053) | 0.784 (±0.049) | 0.793 (±0.050) |
| MLP4 | 0.778 (±0.057) | 0.753 (±0.055) | 0.762 (±0.053) |
| MLP7 | 0.780 (±0.053) | 0.759 (±0.052) | 0.765 (±0.049) |
Table 4
Per-class results of the CNN2 model with MFCC+PC input features.
| Technique | Precision | Recall | F-score |
|---|---|---|---|
| Bend/Release | 0.714 (±0.088) | 0.829 (±0.108) | 0.767 (±0.086) |
| Hammer-on | 0.644 (±0.331) | 0.723 (±0.222) | 0.681 (±0.281) |
| Normal | 0.933 (±0.044) | 0.832 (±0.070) | 0.880 (±0.045) |
| Pull-off | 0.665 (±0.089) | 0.810 (±0.118) | 0.730 (±0.074) |
| Slide | 0.383 (±0.113) | 0.394 (±0.159) | 0.388 (±0.132) |
| All | 0.809 (±0.054) | 0.792 (±0.050) | 0.797 (±0.050) |

Figure 4
Confusion matrix of the 10 CNN2 MFCC+PC models, aggregated over all the evaluation runs. Each cell shows the total number of classifications and the corresponding recall rate.
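The aggregation behind Figure 4 can be sketched with a generic confusion-matrix helper (not the authors' evaluation code): each row collects the counts for one ground-truth class, and the diagonal fraction of a row is that class's recall.

```python
def confusion_with_recall(y_true, y_pred, labels):
    """Build a confusion matrix (rows: ground truth, columns: predictions)
    and the per-class recall, as reported in Figure 4. A generic sketch."""
    idx = {c: i for i, c in enumerate(labels)}
    n = len(labels)
    mat = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        mat[idx[t]][idx[p]] += 1
    # Recall of class i = correct predictions / all ground-truth examples.
    recall = [row[i] / sum(row) if sum(row) else 0.0
              for i, row in enumerate(mat)]
    return mat, recall
```

Summing the matrices of the 10 models before computing recall gives the aggregated figure.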
Table 5
Sensitivity test for three parameters of TENT: the slope parameter α, and the first and second playing-technique candidate thresholds β1 and β2 (in semitones). We report the F-scores of playing technique recognition in the note tracking experiment, as well as the onset+F0 F-score for note tracking, on the G&N dataset. Here, the default values of (α, β1, β2) are (0.5, 0.3, 0.5); when one parameter is varied, the other two are kept at their defaults. Note that we use (0.05, 0.3, 0.5) in the other experiments in the paper.
| | α = 0.05 | α = 0.5 | α = 0.95 | β1 = 0.1 | β1 = 0.3 | β1 = 0.5 | β2 = 0.3 | β2 = 0.5 | β2 = 0.8 |
|---|---|---|---|---|---|---|---|---|---|
| Bend+Release | 0.606 | 0.602 | 0.602 | 0.570 | 0.602 | 0.585 | 0.633 | 0.602 | 0.561 |
| Hammer-on | 0.606 | 0.544 | 0.556 | 0.561 | 0.544 | 0.561 | 0.333 | 0.544 | 0.500 |
| Pull-off | 0.577 | 0.578 | 0.526 | 0.527 | 0.578 | 0.570 | 0.527 | 0.578 | 0.574 |
| Slide | 0.414 | 0.355 | 0.362 | 0.327 | 0.355 | 0.333 | 0.021 | 0.355 | 0.491 |
| Vibrato | 0.675 | 0.688 | 0.711 | 0.463 | 0.688 | 0.648 | 0.652 | 0.688 | 0.588 |
| Note Tracking (onset+F0) | 0.859 | 0.858 | 0.856 | 0.835 | 0.858 | 0.851 | 0.834 | 0.858 | 0.862 |
Table 6
Performance of different methods for note tracking on the G&N dataset, in terms of precision, recall, and F-score. The evaluated methods are Peeters (2006), pYIN (Mauch and Dixon, 2014), Melodia (Salamon and Gómez, 2012), the note tracking function of Tony (Mauch et al., 2015), and the proposed TENT model.
| Melody extraction method | Note tracking method | Onset+F0+Offset Prec. | Onset+F0+Offset Recall | Onset+F0+Offset F1 | Onset+F0 Prec. | Onset+F0 Recall | Onset+F0 F1 | Onset Prec. | Onset Recall | Onset F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Peeters | Tony | 0.384 | 0.645 | 0.481 | 0.517 | 0.869 | 0.648 | 0.542 | 0.912 | 0.680 |
| pYIN | Tony | 0.538 | 0.659 | 0.592 | 0.674 | 0.826 | 0.742 | 0.729 | 0.893 | 0.803 |
| Melodia | Tony | 0.499 | 0.671 | 0.573 | 0.652 | 0.877 | 0.748 | 0.679 | 0.913 | 0.779 |
| Peeters | TENT | 0.477 | 0.562 | 0.516 | 0.661 | 0.778 | 0.715 | 0.696 | 0.819 | 0.752 |
| pYIN | TENT | 0.518 | 0.607 | 0.559 | 0.743 | 0.872 | 0.802 | 0.779 | 0.913 | 0.841 |
| Melodia | TENT | 0.679 | 0.697 | 0.688 | 0.841 | 0.877 | 0.858 | 0.879 | 0.903 | 0.891 |
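The onset+F0 criterion behind the middle columns can be sketched as a simple note-matching routine. The 50 ms onset tolerance and quarter-tone pitch tolerance are assumed defaults (in the style of common transcription evaluation), not necessarily the exact settings used here:

```python
def onset_f0_fscore(ref, est, onset_tol=0.05, pitch_tol=0.5):
    """Simplified onset+F0 note-matching precision/recall/F-score.

    ref and est are lists of (onset_in_seconds, midi_pitch) note events.
    A pair matches when the onsets are within onset_tol seconds and the
    pitches within pitch_tol semitones; each estimated note may match at
    most once. A greedy sketch of the criterion, not the exact
    evaluation code used in the paper.
    """
    matched, used = 0, set()
    for r_on, r_p in ref:
        for j, (e_on, e_p) in enumerate(est):
            if j in used:
                continue
            if abs(r_on - e_on) <= onset_tol and abs(r_p - e_p) <= pitch_tol:
                matched += 1
                used.add(j)
                break
    prec = matched / len(est) if est else 0.0
    rec = matched / len(ref) if ref else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

The onset-only columns drop the pitch condition, and the onset+F0+offset columns additionally require offsets to agree.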

Figure 5
In these examples, the upper image in each panel is the power spectrogram, and the red lines in the lower graph are the pitch contours. The characters above the pitch contours, together with the vertical dotted lines, are the ground-truth labels; the characters below are the predicted labels. The pink horizontal highlights behind the pitch contours (best viewed in color) represent the note events.
