
Figure 1
Basic building blocks of the Transformer, the bi-directional Transformer for chord recognition (BTC), and the Harmony Transformer (HT) models, all built on multi-head attention (MHA) and feed-forward networks (FFN). Note that both the encoder and the decoder consist of repeated layers, which are not shown in the figure.
Table 1
Number of parameters in the MHA and the FFN blocks; h, d, and n stand for the number of heads, the feature size of the partitioned keys (Ki), and the kernel size of the convolution, respectively. We set h = 4, d = 32, and n = 3 for the experiments.
| Computational Block | Parameter | Size | Total |
|---|---|---|---|
| Multi-head Attention | WC | hd × hd | 4h²d² |
| | | hd × d | |
| Fully-connected FFN | W1 | hd × 4hd | 8h²d² |
| | W2 | 4hd × hd | |
| Convolutional FFN | | n × hd × hd | 2nh²d² |
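
The totals in Table 1 can be checked numerically for the stated settings (h = 4, d = 32, n = 3). The sketch below is illustrative only: the function name `param_counts` is hypothetical, biases are ignored, and the breakdown of the MHA and convolutional blocks (three hd × d projections per head plus one hd × hd matrix; two convolutions with n × hd × hd kernels) is an assumption chosen to be consistent with the listed totals.

```python
def param_counts(h: int, d: int, n: int) -> dict:
    """Weight counts (biases ignored) for the blocks listed in Table 1."""
    width = h * d  # model width hd

    # Assumption: h heads, each with three hd x d projections
    # (query, key, value), plus one hd x hd matrix WC.
    mha = 3 * h * (width * d) + width * width

    # Two dense layers: W1 (hd x 4hd) and W2 (4hd x hd).
    fc_ffn = width * (4 * width) + (4 * width) * width

    # Assumption: two 1-D convolutions, each with an n x hd x hd kernel.
    conv_ffn = 2 * (n * width * width)

    return {"mha": mha, "fc_ffn": fc_ffn, "conv_ffn": conv_ffn}


counts = param_counts(h=4, d=32, n=3)
print(counts)
```

Under these assumptions the convolutional FFN with n = 3 needs 6h²d² weights against 8h²d² for the fully-connected FFN, that is, a quarter fewer.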

Figure 2
Improved Harmony Transformer (HT*).
Table 2
Annotated chord qualities and the mapping to the major-minor vocabulary.
| Quality | Major-Minor Mapping |
|---|---|
| Major (M) | M |
| Minor (m) | m |
| Augmented (a) | others |
| Diminished (d) | others |
| Major Seventh (M7) | M |
| Minor Seventh (m7) | m |
| Dominant Seventh (D7) | M |
| Diminished Seventh (d7) | others |
| Half-diminished Seventh (h7) | others |
| Augmented Sixth (a6) | M |
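
The mapping in Table 2 amounts to a simple lookup. A minimal sketch follows; the names `MAJMIN` and `to_majmin` are hypothetical and only the quality abbreviations and targets come from the table.

```python
# Lookup table transcribed from Table 2: annotated quality -> major-minor class.
MAJMIN = {
    "M": "M",        # major
    "m": "m",        # minor
    "a": "others",   # augmented
    "d": "others",   # diminished
    "M7": "M",       # major seventh
    "m7": "m",       # minor seventh
    "D7": "M",       # dominant seventh
    "d7": "others",  # diminished seventh
    "h7": "others",  # half-diminished seventh
    "a6": "M",       # augmented sixth
}


def to_majmin(quality: str) -> str:
    """Map an annotated chord quality to the major-minor vocabulary."""
    return MAJMIN[quality]


print([to_majmin(q) for q in ("D7", "h7", "m7")])
```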

Figure 3
Statistics of the chord quality and degree annotations (some minor cases are omitted).
Table 3
Vocabularies of the functional harmony recognition task. The 21 tonics include {C,D,E,F,G,A,B} by {♮,♯,♭}; the 2 modes are {major, minor}; the 9 primary degrees are {1,2,3,4,5,6,7,b2,b7}; the 14 secondary degrees are {1,2,3,4,5,6,7,#1,#3,#4,b1,b3,b6,b7}; the 4 inversions are {root position,1st,2nd,3rd}.
| Output | Component | Vocabulary Size |
|---|---|---|
| Key | 21 tonics | 42 |
| | 2 modes | |
| Roman Numeral | 9 primary degrees | 5040 |
| | 14 secondary degrees | |
| | 10 qualities | |
| | 4 inversions | |
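
The vocabulary sizes in Table 3 are the products of their component counts: 21 × 2 = 42 keys and 9 × 14 × 10 × 4 = 5040 Roman numerals. The sketch below enumerates them, assuming each vocabulary is the full Cartesian product of its components (which matches the stated sizes); the variable names and the ASCII spellings of the accidentals are illustrative, and the quality abbreviations are taken from Table 2.

```python
from itertools import product

letters = ["C", "D", "E", "F", "G", "A", "B"]
accidentals = ["", "#", "b"]  # natural, sharp, flat
tonics = [letter + acc for letter, acc in product(letters, accidentals)]
modes = ["major", "minor"]

primary = ["1", "2", "3", "4", "5", "6", "7", "b2", "b7"]
secondary = ["1", "2", "3", "4", "5", "6", "7",
             "#1", "#3", "#4", "b1", "b3", "b6", "b7"]
qualities = ["M", "m", "a", "d", "M7", "m7", "D7", "d7", "h7", "a6"]
inversions = [0, 1, 2, 3]  # root position, 1st, 2nd, 3rd

# Assumption: every combination of components is a valid class label.
keys = list(product(tonics, modes))
roman_numerals = list(product(primary, secondary, qualities, inversions))

print(len(tonics), len(keys), len(roman_numerals))
```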
Table 4
Evaluations with the BPS-FH dataset and the Bach Preludes. All the scores (in percentage) are averaged over 4 validation sets; the standard deviations of the scores are also provided.
| Model | Chord Symbol Recognition: Accuracy | Chord Symbol Recognition: Segmentation | Functional Harmony Recognition: Key | Functional Harmony Recognition: Roman numeral | Functional Harmony Recognition: Segmentation |
|---|---|---|---|---|---|
| BPS-FH | | | | | |
| BTC | 82.46±1.55 | 81.30±1.08 | 77.65±1.83 | 37.98±1.34 | 66.73±4.05 |
| BTC-singleBi | 82.16±1.66 | 80.78±1.39 | 75.96±0.79 | 35.77±1.85 | 68.83±1.69 |
| BTC-FC | 82.06±1.83 | 81.24±1.26 | 78.40±2.10 | 37.60±1.76 | 65.56±3.86 |
| HT | 83.19±1.65 | 83.47±1.22 | 77.94±2.24 | 37.00±2.88 | 71.93±2.72 |
| HT-noW | 83.06±1.58 | 83.26±0.71 | 77.13±1.78 | 36.84±2.39 | 73.53±1.26 |
| HT-noReg | 83.19±1.31 | 83.33±1.26 | 76.70±1.26 | 35.33±1.79 | 70.51±1.16 |
| CRNN | 79.79±0.84 | 81.49±1.91 | 75.56±2.84 | 34.83±1.38 | 67.75±3.59 |
| HT* | 83.98±1.08 | 85.09±0.96 | 79.07±2.70 | 41.74±2.63 | 75.50±1.72 |
| Bach Preludes | | | | | |
| BTC | 74.12±0.12 | 77.20±3.64 | 48.63±4.48 | 25.25±1.76 | 64.19±2.08 |
| BTC-singleBi | 75.67±1.42 | 78.85±4.81 | 46.24±5.90 | 23.35±1.99 | 60.40±6.25 |
| BTC-FC | 75.53±1.22 | 77.81±4.51 | 46.05±1.84 | 22.97±2.19 | 57.24±4.17 |
| HT | 77.18±1.24 | 80.46±3.36 | 51.15±2.47 | 23.75±2.20 | 66.82±4.52 |
| HT-noW | 76.51±1.45 | 81.14±3.31 | 48.95±2.88 | 24.99±1.32 | 67.61±4.75 |
| HT-noReg | 76.33±1.23 | 80.76±4.40 | 50.62±3.93 | 23.82±2.43 | 65.23±4.80 |
| CRNN | 69.79±1.15 | 79.47±2.03 | 47.03±6.59 | 18.53±2.23 | 61.79±1.83 |
| HT* | 78.54±2.06 | 83.86±2.24 | 56.28±2.53 | 25.95±1.67 | 73.60±1.80 |

Figure 4
Examples of the attention maps in the MHA units; the color bars show the relative intensity of attention. The input segment is Beethoven’s Piano Sonata No. 1, mm. 1–8. (a) Two attention heads in the intra-MHA of the decoder. The vertical and horizontal axes represent the queries and the keys, respectively; both correspond to the same sequence to be recognized. (b) Two attention heads in the inter-MHA of the decoder. The vertical axis is the decoder sequence to be recognized (queries), and the horizontal axis is the encoder sequence (keys) used for chord change estimation; the positions where the chords change are indicated by vertical dashed lines.
