The Potential of Unsupervised Induction of Harmonic Syntax for Jazz

Open Access | Jun 2025


1 Introduction

Music theorists have argued that, for a wide range of Western tonal music genres, chord sequences that make up the harmony of a piece of music can be ordered in tree structures according to musical relations like prolongation or preparation (Lerdahl and Jackendoff, 1983; Rohrmeier, 2011, 2020; Steedman, 1984; Winograd, 1968). Figure 1 shows a harmonic tree resulting from the parsing of a chord sequence according to syntax rules of a context‑free grammar. Such a grammar consists of substitution rules that can be recursively applied to expand a preliminary sequence containing nonterminal or preterminal symbols until one arrives at a desired sequence consisting entirely of chord symbols.

Figure 1

(a) First bars of lead sheet (from the iRealPro forum2) and (b) syntax tree representing the harmony of ‘Take the “A” train' by Billy Strayhorn. Image based on Harasim (2020). Dm7 is a terminal chord symbol, V/V is a preterminal since it is an abstract symbol representing a single terminal chord, and the highest occurring V is a nonterminal since it represents multiple lower‑level symbols. The turnaround has been omitted.

An example of such a rule is X → V/X X, meaning that the fifth (or dominant) chord relative to any chord X (i.e., in the diatonic series with X as tonic; e.g., G7 relative to CΔ) can prepare chord X. The dominant chord holds musical tension relative to its key, which can be released by playing the tonic chord afterward. Different but similar rules could be given by replacing the 'V' chord above, e.g., by the tritone substitution for the dominant, or by the (half-)diminished seventh chord. In this manner, hierarchical representations of harmony can aid the analysis of music, in a similar manner to Schenkerian analysis, where foreground notes are related to the Ursatz, a deeper structure (Schenker, 1979). Hierarchical analyses of harmony have also been found by cognitive scientists to predict listeners' understanding of music (Herff et al., 2021, 2024).
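To make the rule application concrete, here is a minimal sketch (the FIFTH_ABOVE mapping and function name are our own illustration, not part of any cited system) of recursively applying X → V/X X to a tonic chord:

```python
# Minimal sketch: applying X -> V/X X inserts the dominant seventh of X before X.
# Map each root to the root a perfect fifth above it (its dominant), e.g. C -> G.
FIFTH_ABOVE = {"C": "G", "G": "D", "D": "A", "A": "E", "E": "B", "F": "C"}

def prepare_with_dominant(sequence, index):
    """Apply X -> V/X X: insert the dominant seventh of sequence[index] before it."""
    target_root = sequence[index][0]          # works for single-letter roots
    dominant = FIFTH_ABOVE[target_root] + "7"
    return sequence[:index] + [dominant] + sequence[index:]

seq = ["C"]
seq = prepare_with_dominant(seq, 0)  # ['G7', 'C']        (V  I)
seq = prepare_with_dominant(seq, 0)  # ['D7', 'G7', 'C']  (V/V  V  I)
print(seq)  # ['D7', 'G7', 'C']
```

Applying the rule to the newly inserted dominant yields a secondary dominant (V/V), as in the 'Take the "A" train' example.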

Formal grammars such as context‑free grammars, along with their computational methodologies, were popularized for modeling human language. The question of whether they can model music too raises interesting questions about similar underlying mechanisms in two quite different but both sequential modalities with which humans communicate. Katz and Pesetsky (2011), e.g., propose the identity thesis stating that, conceptually, language and music follow identical grammar rules. The possibility of a shared human brain faculty for hierarchical processing of both language and music has been proposed, but remains a debated issue (Patel, 2003).

In this work, we propose deep learning methods inspired by natural language processing to induce a grammar for chord sequences from jazz pieces. To our knowledge, for the first time, we do this in an entirely unsupervised manner, i.e., from raw textually encoded sequences of chord symbols, without access to annotated parse trees (except for evaluation of sequences unseen during training) and while adding minimal music theoretical knowledge. This allows us to use more data than was possible in prior work: datasets with tree annotations for music pieces contain little more than 100 samples (Hamanaka et al., 2014; Harasim et al., 2020), while those with raw chord sequences have up to 20K samples (de Berardinis et al., 2023).

More specifically, we use neural networks to exploit parameter sharing when estimating rule probabilities for probabilistic context-free grammars (PCFGs). This enables us to induce not only the rule probabilities but also the entire grammar, including the 'meaning' of its symbols and rules. Since the meaning of the rules and symbols is not fixed, the resulting model can parse any sequence, while grammars with hardcoded rules might find some sequences unparsable. PCFGs are generative models (which means, loosely speaking, that one can sample data points, like chord sequences, from them) that model sequential data with recursive grammar rules. This yields a mechanism for unsupervised training: one can simply maximize the likelihood of training sequences, with the parse trees as latent variables. For example, this would not work for the Generative Theory of Tonal Music of Lerdahl and Jackendoff (1983), as it does not define a formal generative grammar (in spite of its name).

Section 2 discusses related work. Section 3 explains the PCFG that is employed, how Kim et al. (2019) implement it with neural networks, and our proposed adaptations specific to jazz harmony. We discuss evaluations and results in Section 4, and Section 5 presents concluding remarks.

2 Related Work

Several authors have defined grammars for music harmony, e.g., for jazz, for classical music, or for Western tonal music in general (Lerdahl and Jackendoff, 1983; Ogura et al., 2020; Rohrmeier and Moss, 2021; Rohrmeier and Neuwirth, 2015; Rohrmeier, 2011; Rohrmeier, 2020; Steedman, 1984; Winograd, 1968). Some have provided implementations of parsers for their grammars (Melkonian, 2019), and some grammars are probabilistic, where the rule probabilities are learned from data (Granroth-Wilding and Steedman, 2012; Harasim et al., 2018; Kirlin and Jensen, 2011) but the rules have to be manually defined and hardcoded into the model. Foscarin et al. (2023) use neural networks to parse chord sequences into dependency and constituent trees, and they explain how these structures, initially conceived for natural language, apply to music. Since they train models in a supervised manner, they are bound to small corpora of tree-annotated music pieces. The Probabilistic Abstract Context-Free Grammar (PACFG) of Harasim et al. (2018) uses manually defined PCFG rules of which the probabilities can be learned from data. The model is abstract in the sense that the rules are invariant to transposition. Rules in our neural PCFGs are learned instead of manually defined, and they can in principle be learned as being transposition invariant. Explicitly imposing this constraint on our models would be an interesting avenue for future work.

In language, there exist several approaches for unsupervised grammar induction with neural networks. One line of work uses probabilistic context‑free grammars. The fact that these grammars are generative allows the use of sentence reconstruction as an objective to induce latent parse trees (Kim et al., 2019; Yang et al., 2021b). The sentence likelihood is given by the partition function, computed using only the inside phase of the inside–outside dynamic programming algorithm that efficiently iterates over the exponentially many possible binary trees over words in a sentence. We start from the neural model proposed for language by Kim et al. (2019), and propose alterations to adapt the model to harmonic sequences. We also do experiments with their compound PCFG, which additionally conditions every rule probability on a sequence‑specific variable computed by a sequence encoder such as a long short‑term memory (LSTM) network (which also has to be trained). Another study uses masked language modeling as the reconstruction objective (Drozdov et al., 2019). Those authors implement the recursion of the parsing algorithm not just for scores but for embedding vectors, and they need to run the outside phase too, both of which entail a high computational cost.

3 Methods

Probabilistic context-free grammars are first defined formally (Section 3.1). Section 3.2 explains how neural networks parameterize the PCFG rule probabilities, making use of multilayer perceptrons (MLPs) and embeddings for symbols in the grammar. We define and compare different ways of arriving at embeddings for chord symbols in Section 3.3. Finally, Section 3.4 explains an optional additional objective, used alongside the objective that maximizes training sequence likelihood, which maximizes the likelihood of the occurrence of short and frequent chord progressions in parse trees.

3.1 PCFG

In this study, we assume that a PCFG generates every sequence in the data. Such a grammar G = (S, N, P, V, R) consists of a set of rules r ∈ R in Chomsky normal form,3 which can be of the form S → A, A → B1 B2, or P → c. Here, S is the start symbol; A is a nonterminal symbol (representing a group of chords) from a collection N of size N_N; P is a preterminal symbol (representing a single chord) from a collection P of size N_P; B1 and B2 are either nonterminal or preterminal; and c is a chord from a vocabulary V of chord symbols of size V. Each rule r is associated with a probability π_r. Probabilities of rules with the same symbol on the left of the arrow sum to 1.

Starting with the start symbol, the rules can be recursively applied to arrive at a binary tree with only chord symbols from V as leaves. The probability of any such parse tree t is given by the product of the probabilities of the rules that t consists of: p_tree(t) = Π_{r∈t} π_r. Given that all sequences are generated by a PCFG, and that any sequence s might have several (exponentially many in the sequence length L) parse trees whose leaves form s, we arrive at a probability distribution over sequences:

(1)  p_seq(s) = Σ_{t∈T(s)} p_tree(t),

where T(s) is the set of trees of which the leaves form s. Given a sequence s, the conditional distribution over parses t for s is given by:

(2)  p_cond(t|s) = 1[t∈T(s)] p_tree(t) / Σ_{t′∈T(s)} p_tree(t′),

where 1[t∈T(s)] equals 1 for t∈T(s) and 0 otherwise.
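The tree probability defined above can be sketched with a toy grammar (all symbols and rule probabilities are invented for illustration):

```python
# Toy sketch of p_tree: the probability of a parse tree is the product of the
# probabilities of the rules it consists of. Rules and probabilities invented.
rule_probs = {
    ("S", ("A",)): 1.0,        # S  -> A
    ("A", ("P1", "P2")): 0.6,  # A  -> P1 P2
    ("P1", "Dm7"): 0.5,        # P1 -> Dm7
    ("P2", "G7"): 0.4,         # P2 -> G7
}

def p_tree(rules):
    """Product of the probabilities of the rules used in a parse tree."""
    prob = 1.0
    for r in rules:
        prob *= rule_probs[r]
    return prob

# The (single) parse of the sequence Dm7 G7 under this toy grammar:
# p_tree = 1.0 * 0.6 * 0.5 * 0.4 = 0.12
tree = [("S", ("A",)), ("A", ("P1", "P2")), ("P1", "Dm7"), ("P2", "G7")]
print(p_tree(tree))
```

Summing p_tree over every parse whose leaves form s gives p_seq(s), as in Equation 1.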

A span (i, j) is part of a parse tree t if there is a subtree of which the leaves form exactly s_i:j, the subsequence of s that runs from position i to j. E.g., Dm7 G7 corresponds to a span in Figure 1, but C6 D7(#11) does not.

3.2 Neural PCFG

This section explains how neural networks are used and trained to compute the rule probabilities πr of a PCFG. We use the parameterization of Kim et al. (2019), in which rule probabilities are computed by MLPs from embeddings representing each root, non‑, or preterminal symbol. Figure 2 gives an overview of the model. Neural networks compute rule probabilities as outputs, so we say they estimate the probabilities. We then induce the rule probabilities from training data, meaning we train the neural networks to provide better estimates.

Figure 2

Graphical diagram of the neural PCFG, after Kim et al. (2019). S is the start symbol; A1, A2 ∈ N are nonterminal; P1, P2, P3 ∈ P are preterminal; and c1, c2, c3 ∈ V are chord symbols. π_r are rule probabilities, computed from embeddings E by neural networks f. Full lines between circles are PCFG rule applications, dotted lines indicate probabilistic dependencies, and full lines between rectangles indicate computations by neural networks.

Neural network.

For an input sequence s ∈ V^L of length L, the network computes the following. To model rules of type P → c, an MLP f_θ projects the set of preterminal embeddings E_P ∈ R^{N_P×h} to a categorical probability distribution over the chord vocabulary (h is the hidden dimension size). The tokens in s index into the resulting probabilities to retrieve a probability per token per preterminal symbol, representing the likelihood of each of the preterminals to have generated each of the tokens, giving π_{P→s} ∈ R^{N_P×L}.4 To model rules of type S → A, a second MLP f_φ projects the start symbol embedding e_S ∈ R^h into a categorical probability distribution over the nonterminal symbols: π_{S→A} ∈ R^{N_N}.

Finally, to model rules of type A → B1 B2, a linear layer f_η projects the set of nonterminal embeddings E_N ∈ R^{N_N×h} to a categorical probability distribution over the product set of symbols: (N∪P)×(N∪P) = M×M. This gives π_{A→B1B2} ∈ R^{N_N×N_M×N_M}, with N_M = N_N + N_P. The MLP input is always the embedding for the symbol on the left of a rule, and the denominator of the softmax (to obtain normalized probabilities) sums over all combinations of symbols on the right of the arrow of a rule:

(3)  π_{P→c} = softmax_{c∈V}(f_θ(E_P)) ∈ R^{N_P×V}
     π_{S→A} = softmax_{A∈N}(f_φ(e_S)) ∈ R^{N_N}
     π_{A→B1B2} = softmax_{B1,B2∈M}(f_η(E_N)) ∈ R^{N_N×N_M×N_M}.
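A minimal sketch of the first of these softmaxes, with toy dimensions and a single random linear layer standing in for the MLP f_θ (all sizes and weights are illustrative):

```python
import math
import random

random.seed(0)

# Toy sketch of the P -> c parameterization: preterminal embeddings are
# projected to logits over the chord vocabulary and normalized with a softmax.
# A random linear layer W stands in for the MLP f_theta; dimensions are shrunk.
N_P, h, V = 4, 8, 6  # preterminals, hidden size, vocabulary size
E_P = [[random.gauss(0, 1) for _ in range(h)] for _ in range(N_P)]
W = [[random.gauss(0, 1) for _ in range(V)] for _ in range(h)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# pi[P][c]: probability that preterminal P emits chord c; each row sums to 1,
# mirroring the normalization over all symbols on the right of the arrow.
pi = [softmax([sum(e[d] * W[d][c] for d in range(h)) for c in range(V)])
      for e in E_P]

print([round(sum(row), 6) for row in pi])  # [1.0, 1.0, 1.0, 1.0]
```

The other two softmaxes in Equation 3 follow the same pattern, with different input embeddings and output shapes.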

Training.

The probability of a sequence of chords s is given by the sum of the probabilities of every parse tree t of which the sequence of leaves is equal to s. This yields a mechanism for training without access to trees annotated by experts: maximize the likelihood of a training sequence by maximizing the sum of the likelihoods of all its possible parse trees. In other words, our objective is to minimize the negative log-likelihood of p_seq(s), as in Equation 1.

The exponential sum can be obtained in polynomial time from the probabilities π_{P→s}, π_{S→A}, and π_{A→B1B2} by using a differentiable dynamic programming algorithm: the inside algorithm (Baker, 1979).5 The parameters to train are the set of input embeddings and the MLP and linear projection parameters.
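As an illustration of how p_seq(s) is computed in polynomial time, the following is a minimal pure-Python inside algorithm over a toy grammar in Chomsky normal form (symbols and probabilities invented; the actual model uses an efficient batched implementation):

```python
# Minimal sketch of the inside algorithm for a toy PCFG in Chomsky normal form.
# All symbols and probabilities are invented; the paper's model computes the
# rule probabilities with neural networks instead.
terminal_rules = {            # P -> c
    ("P1", "Dm7"): 0.9,
    ("P2", "G7"): 0.8,
    ("P3", "C6"): 0.7,
}
binary_rules = {              # A -> B1 B2
    ("A1", "P1", "P2"): 0.5,  # a ii-V group
    ("A2", "A1", "P3"): 0.4,  # ii-V resolving to the tonic
}
start_rules = {"A2": 1.0}     # S -> A

def inside(sequence):
    """Return p_seq(s): the total probability over all parses of `sequence`."""
    n = len(sequence)
    # chart[(i, j)][A] = total probability that A generates sequence[i:j]
    chart = {}
    for i, chord in enumerate(sequence):
        chart[(i, i + 1)] = {P: p for (P, c), p in terminal_rules.items()
                             if c == chord}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):  # split point
                for (A, B1, B2), p in binary_rules.items():
                    left = chart[(i, k)].get(B1, 0.0)
                    right = chart[(k, j)].get(B2, 0.0)
                    if left and right:
                        cell[A] = cell.get(A, 0.0) + p * left * right
            chart[(i, j)] = cell
    return sum(start_rules.get(A, 0.0) * p for A, p in chart[(0, n)].items())

# 1.0 * (0.4 * (0.5 * 0.9 * 0.8) * 0.7) = 0.1008
print(inside(["Dm7", "G7", "C6"]))
```

In practice the chart holds log-probs and the recursion runs over all nonterminals in parallel, but the structure is the same.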

Inference.

With either Viterbi6 (Sakai, 1961) or minimum Bayes risk (MBR) decoding (‘bracketed recall parsing’ in Goodman, 1996), a parse tree is predicted from estimated probabilities πr. In short, Viterbi gives the tree with the highest probability (according to the model) of being entirely correct, while MBR gives the tree with the highest expected number of correct spans.

3.3 Chord embeddings

This section explains a number of ways to obtain π_{P→s} ∈ R^{N_P×L}, which contains, for each of the L chord tokens in the input sequence, the probability for each of the N_P preterminal symbols to have generated that token. π_{P→s} is obtained either from preterminal embeddings E_P (as in the previous section and in Kim et al. (2019)), or, as we propose, from chord embeddings E_V.

Tokenization.

We followed Foscarin et al. (2023) in taking the vocabulary of chords as the Cartesian product of the set of 12 possible root notes, 6 chord forms, and 4 essentials7: V = {root note: A, B, ..., G#} × {form: major, minor, diminished, augmented, suspended, half-diminished} × {essential: major 6th, minor 7th, major 7th, null}, with size V = 288. We dropped the duration and metrical strength dimensions since they are not available for all datasets and did not improve performance.

Half-diminished chords could be encoded as having either diminished or half-diminished form (in both cases with a minor seventh essential), while other combinations (such as an augmented triad extended by a major sixth) do not exist. In the former case, we hope that the model learns from data that both encodings have similar semantics.8 The latter case results in some of the 288 entries never being used in practice, which poses no problem. We employ this scheme for simplicity and comparability with Foscarin et al. (2023), but its clarity could certainly be improved in future work. The ordering in frequency space (A follows G and precedes B) and periodicity (A is as close to G as it is to B) of root notes could also be taken into account.
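The vocabulary construction can be sketched as follows (the root-note spellings are illustrative):

```python
from itertools import product

# Sketch of the Cartesian-product chord vocabulary from the text:
# 12 root notes x 6 forms x 4 essentials = 288 chord symbols.
roots = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]
forms = ["major", "minor", "diminished", "augmented",
         "suspended", "half-diminished"]
essentials = ["major 6th", "minor 7th", "major 7th", None]  # None = no essential

vocab = list(product(roots, forms, essentials))
index = {chord: i for i, chord in enumerate(vocab)}  # chord -> flat token id

print(len(vocab))  # 288
```

Some of these 288 combinations (such as an augmented triad with a major sixth) never occur in the data, as noted above.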

Flat.

As explained in Section 3.2, we represented each chord by its index in the Cartesian product vocabulary V. The sequence's chords index into the probabilities over the entire vocabulary π_{P→c} to obtain π_{P→s} ∈ R^{N_P×L}, giving, per chord, the log-probability that each of the preterminals generated that chord.

Components.

We represented the components of the Cartesian product separately (root note, form, essential) and summed the component log-probs. An MLP f_γ produces three distinct component probabilities (0 < i ≤ 3):

4
(4)  π_{P→C_i} = softmax_{C∈C_i}(f_γ(E_P)) ∈ R^{N_P×N_{C_i}},

where C_i is the component vocabulary (e.g., all possible root notes) and N_{C_i} its cardinality (e.g., 12). The chords in s have three components, each of which corresponds to a log-probability in the corresponding π_{P→C_i}, and these are summed to obtain π_{P→s}.

The probability distribution per preterminal over the chord vocabulary is decomposed into components, where independence of the components is assumed but not necessarily given. For example, in the Flat embedding scheme, one preterminal might prefer all major, minor, and diminished chords in the diatonic triad series of a key, e.g., in the key of C major: C, Dm, Em, ..., Bo. However, in this embedding scheme, one preterminal can only prefer, e.g., major over minor chords, or chords with root C over root D, but not chords in a given key over chords in another key.
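A toy sketch of this factorization: the chord log-prob is the sum of its component log-probs, which implicitly assumes the components are independent (all probabilities below are invented for illustration):

```python
import math

# Sketch of the Components scheme: per-preterminal log-probs over root, form,
# and essential are summed, assuming the three components are independent.
log_p_root = {"D": math.log(0.2), "G": math.log(0.3)}
log_p_form = {"minor": math.log(0.5), "major": math.log(0.4)}
log_p_essential = {"minor 7th": math.log(0.6)}

def chord_log_prob(root, form, essential):
    """Factorized chord log-probability for one preterminal."""
    return log_p_root[root] + log_p_form[form] + log_p_essential[essential]

# log p(Dm7) = log 0.2 + log 0.5 + log 0.6 = log 0.06
print(math.exp(chord_log_prob("D", "minor", "minor 7th")))
```

Under this factorization the preterminal cannot, e.g., assign high probability to D minor and G major jointly while excluding D major, which is exactly the limitation discussed above.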

Comp2flat.

To mitigate this issue, we used three embedding matrices E_{C_i} ∈ R^{N_{C_i}×h} to embed the three components of chord symbols, which we then summed into one embedding of dimension h. We computed this embedding for all 288 chords in the Cartesian product vocabulary: E_V ∈ R^{V×h} (the row rank of E_V is at most the sum of the row ranks of the E_{C_i}). An MLP f_ψ projects E_V to a categorical distribution π_{P→c} per preterminal over the vocabulary:

(5)  E_V[j] = E_{C_0}[root(V[j])] + E_{C_1}[form(V[j])] + E_{C_2}[ext(V[j])],  0 < j ≤ 288
(6)  π_{P→c} = softmax_{c∈V}(f_ψ(E_V))^T ∈ R^{N_P×V},

where E[j] denotes the j'th entry of E. One difference with how π_{P→c} was obtained in Section 3.2 is that, here, the vocabulary embeddings E_V are projected to N_P dimensions, while, before, the preterminal embeddings E_P were projected to V dimensions. This has the advantage of 1) sharing parameters for shared underlying components (e.g., every root note has a single representation instead of one per chord that is built on it) while 2) still allowing probabilistic dependencies between chords, and not just between components, to be modeled.
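A minimal sketch of the summed component embeddings of Equation 5 (dimensions and the tiny vocabulary are illustrative):

```python
import random

random.seed(0)

# Sketch of Comp2flat: each chord embedding is the sum of its root, form, and
# essential embeddings, so a component such as the minor-seventh essential has
# a single shared vector across all chords built on it.
h = 8
E_root = {r: [random.gauss(0, 1) for _ in range(h)] for r in ["C", "D", "G"]}
E_form = {f: [random.gauss(0, 1) for _ in range(h)] for f in ["major", "minor"]}
E_ess = {e: [random.gauss(0, 1) for _ in range(h)] for e in ["minor 7th", None]}

def chord_embedding(root, form, essential):
    """Sum the three component embeddings into one chord embedding."""
    return [a + b + c for a, b, c in
            zip(E_root[root], E_form[form], E_ess[essential])]

# Dm7 and G7 share the exact same "minor 7th" essential vector.
dm7 = chord_embedding("D", "minor", "minor 7th")
g7 = chord_embedding("G", "major", "minor 7th")
print(len(dm7), len(g7))  # 8 8
```

Projecting these 288 summed embeddings through f_ψ then yields a full joint distribution over chords, unlike the factorized Components scheme.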

3.3.1 Chord embeddings comparison

Table 1 shows the validation F1 with annotated trees from the Jazz Harmony Treebank (JHT) (Harasim et al., 2020) for different embedding schemes, after unsupervised training on the iRealPro dataset as reformatted by Harasim et al. (2020). The validation F1 is defined as in Section 4.1 to compute overlap between predicted and annotated trees from data splits as explained in Section 4.2. Results are averaged over three runs. Disentangling the components (Components) does not work significantly better than using a naive flat representation (Flat), presumably because of the wrongfully assumed component independence. Comp2flat mitigates the independence issue and does improve upon Flat.9

Table 1

Chord embeddings comparison on JHT.

Embeddings | val. F1
Flat       | .428
Comp2flat  | .453
Components | .431

3.4 Chord progression objectives

This section introduces optional training objectives, designed to steer the training toward the inclusion of well‑known short and common subsequences. These objectives are used in addition to the maximum‑likelihood objective that was discussed in Section 3.2.

Harmony in Western music, and especially in jazz, is well known to often feature established chord progressions, such as the ii–V–I progression that is prominent in jazz (Broze and Shanahan, 2013; Mauch et al., 2007; Rohrmeier, 2020; Shanahan et al., 2012).

1) We defined a number of relations (Section 3.4.1) or, as a preprocessing step, scanned the training sequences for the most frequent subsequences of chords (Section 3.4.2). 2) During training, we searched each training sequence for subsequences that fit one of the defined relations or that occur in the set of frequent subsequences. We save these subsequences, per training sequence s, as a set of spans given by their start and end indices: Σ(s) = {(i, j)}_k. 3) The inside algorithm builds, in a bottom-up fashion, a chart of log-probs (inside scores) from the estimated rule log-probs π_r as in Equation 3. Each cell corresponds to a given subsequence of chords (a span) and contains the total likelihood of that subsequence having been generated by the grammar from a nonterminal symbol (as the sum of the likelihoods of all possible subtrees for that span). We added a training objective to maximize those inside log-probs in the chart that correspond to one of the spans (i, j) in the saved set Σ(s), which corresponds to maximizing p_seq(s_i:j).

3.4.1 Rule‑based patterns

Simple, short progressions can easily be parsed from raw chord sequences based on simple rules: e.g., a chord with a dominant function is likely to be followed by a chord with a tonic function. The root of the dominant may lie a perfect fifth (V), a major seventh (vii°), or a minor second (♭II) above the tonic note of the key. Furthermore, the spans of chords that make up these progressions likely occur in the harmonic parse tree, since the types of relations underlying the progressions are exactly those relations used to define and build the parse trees (even if the parse trees reflect a wider variety of such relations, on both long and short timescales).

This adds some music‑theoretical bias to the induced rules, while we claimed earlier that our models learn grammars entirely from data, with minimal bias. However, our neural PCFG also learns without these extra objectives, and, moreover, the bias that the objectives add is minimal compared to the bias provided by supervised training or hardcoded rules.

Chord relations.

Chords in short progressions are more likely to have a preparatory relation to the next or a nearby chord than to a further chord, so we saved spans for contiguous strings that satisfy the relations in a left‑branching manner, e.g., V/V – V – I or D7 – G7 – CΔ becomes ((V/V V) I) or ((D7 G7) CΔ), meaning (i,i+1) and (i,i+2) are saved, if i is the index of V/V (or D7).

  • Perfect fifth: We searched for subsequences of chords in which the next chord is a perfect fifth (seven semitones) below the previous chord. The interval of a perfect fifth underlies (parts of) many common progressions, such as sequences of descending fifths, the ii – V – I progression, or V/X – X (sub)progressions that allow for tonicizations through the use of secondary dominants.

  • We allowed ♭II – I, the tritone substitution for the dominant V, as well as vii° – I, the diminished triad on the seventh degree of a major or (harmonic) minor key that can function as a dominant (including the half-diminished seventh chord in major and the diminished seventh chord in minor).

  • It is common for a chord to have a preparatory relationship to another chord that does not immediately succeed it: one example could be the progression ii/V – V/V – ii – V – ... To capture the relationship between V/V and V (rather than between V/V and ii, here carrying the same root), we allow relations between chords separated by at most one other chord. In this case, we save the spans not as entirely left‑branching, since subsequent relations take precedence over longer‑range preparatory relations. For instance, the example would be saved as ((ii/V V/V) (ii V)) and not as (ii/V (V/V (ii V))).
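The left-branching span extraction for descending-fifth chains can be sketched as follows (the pitch-class representation and function name are our own illustration):

```python
# Sketch: scan a sequence of chord roots, given as pitch classes 0-11, for
# maximal chains of descending fifths and save them as left-branching spans
# (i, j), as in the rule-based progression objective.
def fifth_spans(roots):
    """Save (i, j) for every prefix of a maximal descending-fifth chain."""
    spans = set()
    i = 0
    while i < len(roots) - 1:
        j = i
        # next root a perfect fifth (7 semitones) below: -7 mod 12 = +5
        while j + 1 < len(roots) and roots[j + 1] == (roots[j] + 5) % 12:
            j += 1
            spans.add((i, j))
        i = j if j > i else i + 1
    return spans

# D7 G7 C: roots 2, 7, 0 -- saved left-branching as ((D7 G7) C).
print(sorted(fifth_spans([2, 7, 0])))  # [(0, 1), (0, 2)]
```

Only spans anchored at the chain start are saved, matching the left-branching convention ((V/V V) I) described above; the third bullet's lookahead over one intervening chord is omitted from this sketch.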

3.4.2 Statistical patterns

Instead of coding rules that chord subsequences should satisfy in order to be included into the objective, we collect frequently occurring subsequences from training sequences as a preprocessing step. We transpose all chords in the subsequences so the last chord has C as a root, then collect the C‑transposed subsequences with a frequency above a threshold in a set Σstat(s). During training, we apply the objective to the subsequences that occur in Σstat(s) as described above. This approach has the advantage of being entirely unsupervised and requiring no manual work, while inducing minimal bias.
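A minimal sketch of this preprocessing step (the chord representation, maximum length, and threshold are illustrative):

```python
from collections import Counter

# Sketch of the statistical-pattern preprocessing: collect short subsequences,
# transpose each so its last chord has root C (pitch class 0), and keep those
# whose frequency reaches a threshold. A chord is a (root, quality) pair.
def frequent_patterns(sequences, max_len=3, threshold=2):
    counts = Counter()
    for seq in sequences:
        for n in range(2, max_len + 1):
            for i in range(len(seq) - n + 1):
                sub = seq[i:i + n]
                shift = (-sub[-1][0]) % 12  # transpose last root to C
                counts[tuple(((r + shift) % 12, q) for r, q in sub)] += 1
    return {p for p, c in counts.items() if c >= threshold}

# Two songs containing a ii-V-I in different keys (C major and D major).
songs = [
    [(2, "m7"), (7, "7"), (0, "maj7")],  # Dm7 G7 Cmaj7
    [(4, "m7"), (9, "7"), (2, "maj7")],  # Em7 A7 Dmaj7
]
patterns = frequent_patterns(songs)
print(((2, "m7"), (7, "7"), (0, "maj7")) in patterns)  # True
```

Because both ii–V–I occurrences transpose to the same C-rooted pattern, it crosses the threshold even though it never appears twice in the same key.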

3.4.3 Objective function

The objective function in machine learning is minimized during training, in order to find the optimal parameter values. Our chord progression objective minimizes the negative log‑probs corresponding to the progression spans Σrule(s) or Σstat(s):

(7)  L(s, θ) = −Σ_{(i,j)∈Σ(s)} chart(i, j),

where θ are the MLP weights and 'chart' is the chart filled with inside log-probs returned by the inside algorithm (or the chart of marginals returned by the outside algorithm).10

Marginals.

As previously described, the inside algorithm computes log‑probs, which are summed in a recursive bottom‑up fashion to obtain the sequence likelihood pseq(s). During training, we maximized this likelihood, which is the sum of the likelihoods of the exponentially many parse trees with s as leaves.

To compute the minimum risk tree during inference, we needed the posterior expected (anchored) counts (∈ [0, 1]) of rules across all possible parse trees for a given sequence. These counts, the marginals ρ, are obtained by the outside algorithm,11 implemented efficiently by differentiating (using backpropagation) the partition function w.r.t. the inside log-probs (Eisner, 2016): ρ_{i,j} = ∂ log Z / ∂ chart(i, j), with Z = p_seq(s) = Σ_{t∈T(s)} p_tree(t). This backpropagation is executed solely to obtain expected counts and to make a prediction, not to perform a parameter update.

Instead of maximizing a span's inside score as computed by the inside algorithm, we maximized the span's expected count across all global trees for the sequence, obtained by the outside algorithm (the loss in Equation 7 is used).12 The parameters were updated with a second‑order derivative of the partition function. This means we have to differentiate the partition function not just during inference but also (twice instead of once) during training, which comes at significant computational cost.

3.4.4 Progression objective comparison

Table 2 shows validation results obtained with different progression objectives. We tuned the relative loss weight, along with the maximum length n=5 of the frequent subsequences, on validation data. We obtained small improvements with rule‑based spans and small decreases with statistical spans. Maximizing marginals (M) works better than maximizing inside (log‑) probabilities (I) but comes at a higher (training) cost: 12 h for a training run versus 7 h, and 16 GB instead of 4 GB of RTX 3090 GPU memory for batch size 4. The margin loss works best when maximizing log‑probs, but it is not needed for maximizing marginals. We use the rule‑based span loss trained with marginals in subsequent experiments.

Table 2

Validation F1 on JHT for progression losses when training on JHT+iRealPro. I/M distinguishes between inside scores and marginals.

Spans       | I/M | F1   | Δ
No loss     |  –  | .453 |  –
Rule        |  I  | .466 | +.013
Rule        |  M  | .484 | +.03
Statistical |  I  | .438 | −.015
Statistical |  M  | .465 | +.012

4 Results

In this section, the main evaluations and results are discussed. First, Section 4.1 outlines the configuration of the experiments. Then, Section 4.2 describes the datasets that are used. Hyperparameter values are compared in Section 4.3, and the overlap of predicted trees with trees annotated by experts is compared for baselines and proposed models in Section 4.4. Section 4.5 contains more in‑depth analyses of results, such as the discussion of examples of predicted trees.

4.1 Experimental set‑up

We used the efficient neural PCFG implementations of Yang et al. (2021b) in PyTorch (Paszke et al., 2019). Unlike Kim et al. (2019), we found that it does not help to gradually increase the sequence length throughout training. The MLPs used by the neural PCFG consist of two residual layers. We scaled all models so they have 1.2M parameters. We used MBR decoding (Goodman, 1996) since it gave the best results.

We found that neither the compound PCFG proposed by Kim et al. (2019) nor the efficient low-rank approximations proposed by Yang et al. (2021b, a, 2022) or Liu et al. (2023) improved the results. The addition of a sentence-encoding LSTM in the compound PCFG significantly increases the number of parameters, rendering training more difficult. The low-rank models assume that the approximation of the inside algorithm is offset by the gain in efficiency, which allows training on more data. We hypothesize that our dataset, relatively small compared to those in NLP, is not large enough for these trade-offs to be beneficial. We were not able to get the model of Drozdov et al. (2019) to converge. We tried initializing the neural PCFG weights to weights pretrained on language datasets from Kim et al. (2019), but this decreased overlap with annotated trees.

Metrics.

We averaged all results over three runs with the same settings but different random seeds, and report the sequence-level recall and F1 on spans of length ≥ 2. The recall is equal to the 'tree accuracy' and the 'span accuracy' reported by Harasim et al. (2018) and Foscarin et al. (2023), respectively. The F1 is the harmonic mean of the precision and recall. Precision is the proportion of predictions that is correct; recall is the proportion of annotations that is present in the predictions.
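The span metrics can be sketched as follows (toy span sets for illustration):

```python
# Sketch of the span metrics: precision, recall, and their harmonic mean F1,
# computed over sets of predicted and annotated spans.
def span_f1(predicted, annotated):
    correct = len(predicted & annotated)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

pred = {(0, 2), (0, 4), (2, 4)}  # spans of a predicted tree
gold = {(0, 2), (2, 4), (1, 4)}  # spans of the annotated tree
precision, recall, f1 = span_f1(pred, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```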

4.2 Data

All training samples are sequences of chord tokens, e.g., C D7 Dm7 G7 C, with lengths between 4 and 100. We used 120 of the 150 songs of the JHT (Harasim et al., 2020) for (un)supervised training and kept 30 songs and their annotated parse trees for evaluation. Unlike Foscarin et al. (2023), we used the ‘complete’ trees and not the trees with ‘open’ constituents, since this gave better results. The JHT was annotated for songs in the iRealPro dataset (Shanahan et al., 2012), of which we used the remaining 1K songs, as reformatted and published alongside the JHT, for unsupervised training. We additionally extended the unsupervised training set with different subsets of the larger (20K) ChoCo dataset (de Berardinis et al., 2023). During training, we randomly transposed songs to one of 12 possible keys. During evaluation, we used all 12 transpositions of every song to keep the evaluation free of randomness. Hence, we obtained 360 evaluation songs. When we combined training sets, we removed all songs with titles occurring in the evaluation set.

When training unsupervised models, we used the 120 training set trees for validation (these were never seen by the models, which only saw the chord sequences). For supervised models, we simply selected the checkpoint from the epoch with the smallest average training loss, like Foscarin et al. (2023).

JHT annotators omitted turnarounds or appended a final tonic, so, where possible, we did the same with sequences from other datasets.13 We checked (using the annotated key) whether the song ends in a tonic once the last couple of chords are removed; if not, we appended a tonic chord.

Results.

Only using the 120 JHT songs for unsupervised training achieves a validation F1 of 0.392. Additionally using the remaining 1K iRealPro songs for unsupervised training improves the F1 to 0.453. Adding subsets of ChoCo improves results further, but only if we 1) select the jazz subsets—the JAAH (Eremenko et al., 2018), the Jazz Corpus (Granroth‑Wilding and Steedman, 2014), and The Real Book (Mauch et al., 2007), together amounting to 2300 extra songs—and discard the rest, and 2) remove both repeated chords from all training data and subtrees consisting entirely of repeated chords from evaluation trees: 0.476 F1 on validation spans.14

For this reason, all experiments in this study were performed on sequences and trees with repeated chords removed. This makes results not directly comparable to those in prior work, unless explicitly stated.

4.3 Hyperparameters

We used hyperparameters15 from Kim et al. (2019)'s neural PCFG model for language and scaled down the MLP hidden dimension from 256 to 128, since this increased validation F1 (Figure 3b). Figure 3a shows that the number of preterminal and nonterminal symbols is optimal at N_N = 60, N_P = 120; however, to limit computational resource usage, we set N_N = 30, N_P = 60. Most of the computational cost is caused by the forward and backward passes through the dynamic program that computes the log-prob chart from the rule probabilities π_{S→A} ∈ R^{N_N}, π_{P→s} ∈ R^{N_P×L}, and π_{A→B1B2} ∈ R^{N_N×(N_N+N_P)×(N_N+N_P)}, of which all dimensions depend only on N_N, N_P, and the sequence length.

Figure 3

Validation F1 for (a) the number of nonterminal (NT) and preterminal (P) symbols and (b) the hidden dimension of the MLPs used in the N‑PCFG. Shaded region represents standard deviation.

4.4 Comparison to baselines

Table 3 compares the overlap with JHT trees of predicted trees by models proposed in this study to trivial and non‑trivial baselines. N‑PCFG is our neural PCFG model and MuDeP is the supervised dependency parser of Foscarin et al. (2023), trained on the 120 JHT trees only (with complete trees and repeated chords removed, so results are comparable).

Table 3

Test F1 and recall (Re) of our models and baselines. Unsupervised models are on top, supervised models are in the middle, and trivial baselines are below. Best results per category in bold.

Model            Train data    F1     Re
N‑PCFG           JHT           .384   .391
N‑PCFG           +iRealPro     .450   .462
  + Prog. loss                 .461   .476
N‑PCFG           +ChoCo        .474   .482
  + Prog. loss                 .487   .495
SupCon           JHT           .640   .643
MuDeP            JHT           .607   .606
Random           –             .178   .200
Left‑branching   –             .134   .142
Right‑branching  –             .176   .211
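The F1 and recall in Table 3 compare constituent spans; a minimal unlabeled-span version (our sketch, not necessarily the exact evaluation script) is:

```python
def span_f1(predicted, gold):
    """Unlabeled precision, recall, and F1 between two sets of (i, j) spans."""
    pred, gold = set(predicted), set(gold)
    if not pred or not gold:
        return 0.0, 0.0, 0.0
    overlap = len(pred & gold)
    precision, recall = overlap / len(pred), overlap / len(gold)
    if overlap == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Spans over a 4-chord sequence, as half-open (start, end) index pairs:
p, r, f = span_f1({(0, 2), (2, 4), (0, 4)}, {(0, 2), (0, 4), (1, 4)})
# Two of three predicted spans are in the gold set: p = r = f = 2/3.
```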

SupCon is a supervised model like the model of Foscarin et al. (2023), adapted to predict constituency trees instead of dependency trees: it consists of a transformer encoder and a span decoder and is trained by binary cross‑entropy to predict for every span (i,j), from the concatenation of the i'th and the j'th chord's transformer output, whether it is a constituent or not.
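A minimal version of such a span classifier could look as follows (a PyTorch sketch under our own naming and hyperparameter assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class SpanClassifier(nn.Module):
    """Transformer encoder + span decoder: for every span (i, j), predict
    from the concatenated encoder outputs at positions i and j whether the
    span is a constituent, trained with binary cross-entropy."""

    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.span_scorer = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, chord_ids):                 # (batch, T) chord indices
        h = self.encoder(self.embed(chord_ids))   # (batch, T, d_model)
        B, T, d = h.shape
        left = h.unsqueeze(2).expand(B, T, T, d)  # output at position i
        right = h.unsqueeze(1).expand(B, T, T, d) # output at position j
        pair = torch.cat([left, right], dim=-1)   # (B, T, T, 2 * d_model)
        return self.span_scorer(pair).squeeze(-1) # (B, T, T) span logits

model = SpanClassifier(vocab_size=300)
logits = model(torch.randint(0, 300, (2, 8)))     # logits for all spans
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros_like(logits))
```

Decoding a tree from the per-span probabilities then requires a separate step, e.g. selecting a best-scoring set of non-crossing spans with CKY-style dynamic programming.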

The neural PCFGs achieve significant overlap with the annotated trees (compare the low scores of the trivial baselines), both without (minimal inductive bias) and with (small inductive bias) the progression objective. Using more training data and the progression objective improves test scores. However, the gap with the supervised models remains considerable. It is of course possible that the unsupervised models learn a different form of grammar, reflecting a different underlying structure that does not align entirely with the JHT annotations. However, the significant overlap suggests that a PCFG that learns a useful model of harmonic sequences from data uses rules that are at least to some extent similar to the rules used by the JHT annotators.

Abstract PCFG.

Since there is no runnable and publicly available implementation for the abstract PCFG (PACFG) of Harasim et al. (2018), we could only consider the results for the PACFG as reported by Foscarin et al. (2023): their average recall (or ‘arc accuracy’) for our test set was 0.74.16 Training N‑PCFG on open trees and without removing repeated chords (as for the PACFG) and without progression loss resulted in a recall of 0.46. Still, these results are not directly comparable since the PACFG models are computed with a leave‑one‑out strategy and we did not have the resources to do this.

The rules of the PACFG have been defined and hardcoded by music theorists, using the exact principles that were used to annotate the JHT trees.17 It is therefore not surprising that the model achieves high recall on the JHT evaluation set. Songs with chord patterns that do not adhere to the rules, e.g., some of the rock or classical pieces in ChoCo, might not be parsable (as they note in their study). Our N‑PCFG models are more flexible; they can, in principle, parse any sequence of chords from the vocabulary.

4.5 Analysis

Unless otherwise mentioned, the analyses in this section are performed on the best model (in terms of validation F1) trained on all datasets without the progression loss, in order to investigate what the model learns purely from data, with minimal music‑theoretical bias.

Example trees.

Figures 4 and 5 show examples of trees predicted by the best N‑PCFG (without progression loss) and their annotated counterparts from the evaluation set. The model correctly finds many substructures; for example, the second half of ‘Blue Bossa’ in Figure 4. In the first half, D∅7 G7 Cm7 is correctly identified as a ii – V – I progression, but the model fails to recognize Fm7 as predominant—like D∅7—to dominant G7.

Figure 4

‘Blue Bossa’ by Kenny Dorham: (a) prediction and (b) annotation. Lead sheet in Suppl. Figure 9a.

For ‘Sunny’, in Figure 5, the model's prediction analyzed the composition as a series of VI – I alternations (FΔ Am7; these chords occur at the higher levels of the tree), with the dominant E7 in a subordinate role. In the prediction, the VI can be seen as a substitution for I. The annotation, on the other hand, analyzed the composition as a series of dominant–tonic progressions in which FΔ is understood as a preparation for the dominant E7, possibly as a tritone substitution for E7's relative dominant B7. However, in that case, one would expect F7 rather than FΔ, since the former contains A and E♭ as third and seventh, which function as enharmonically equivalent guide tones leading to E7. The subsequence C7 FΔ B∅7 E7 Am7 can be seen as a III (or V/VI) – VI – ii – V – i progression, and whether FΔ (as opposed to F7) is best understood as standing in relation to the tonic or as a preparation of the dominant is debatable. This provides an example where, we argue, the unsupervised model has found another viable interpretation. The model recognizes the dominant–tonic relation between the tritone substitution B♭7 (for E7) and Am7 toward the end but does not assign the preceding C7 FΔ as preparatory to B♭7, consistent with the interpretation of the composition as repeated VI – I alternations.

Figure 5

‘Sunny’ by Bobby Hebb: (a, left) prediction and (b, right) annotation. Lead sheet in Suppl. Figure 9b.

Interpretation of induced nonterminal symbols.

Since we do not impose specific rules on the PCFG, it is free to use rules and symbols as it sees fit. Therefore, the nonterminal (abbreviated NT) symbols (representing groups of chords) do not have an obvious interpretation, but we inspected when rules with any NT on the left‑hand side (e.g., A in A → B C) are used and how often these applications resulted in correctly predicted spans. Figure 6 plots all (used18) NTs on the vertical axes. The horizontal axes show a sample of the most frequent constituents, grouped according to their sequence of leaf symbols (Figure 6b) or according to their ending chord (Figure 6a). The cell color indicates NT precision: the ratio of times that the prediction of a particular NT for a chord group results in a span that overlaps with annotations (versus the total number of times the NT is predicted for that chord group).
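NT precision as described here can be computed from per-span records; a small sketch (the data layout and names are our own):

```python
from collections import defaultdict

def nt_precision(predictions):
    """NT precision per (nonterminal, chord group).

    `predictions` is a list of (nt, chord_group, overlaps_annotation)
    records, one per predicted span; chord_group is a tuple of chord
    symbols (or just the ending chord). Returns {(nt, group): ratio}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for nt, group, overlaps in predictions:
        total[(nt, group)] += 1
        correct[(nt, group)] += bool(overlaps)
    return {key: correct[key] / total[key] for key in total}

records = [
    ("NT-0", ("Gm7", "C7", "F^7"), True),
    ("NT-0", ("Gm7", "C7", "F^7"), True),
    ("NT-0", ("Gm7", "C7", "F^7"), False),
    ("NT-9", ("D%7", "G7"), True),
]
prec = nt_precision(records)
# NT precision: 2/3 for the first chord group, 1.0 for the second.
```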

Figure 6

NT precision versus a sample of (a) span ending chords sorted by frequency and (b) spans of chords. ‘o' indicates a diminished chord, ‘%' a half‑diminished chord, ‘m' a minor chord, ‘^' a major seventh, ‘7’ a minor seventh, and ‘6' a major sixth. Some nonterminals are not used in this plot, which can be explained by the plot not showing all possible spans.

Nonterminals ‘specialize,’ both in progressions and in ending chords, in the sense that they are often (correctly) used for certain groups of chords and not for others. NT‑0 and NT‑2 in Figure 6b, for instance, are used for Gm7 C7 FΔ or C#m7 F#7 BΔ, two ii – V – I progressions. Interestingly, the root notes of the chords in these two progressions are separated by a tritone interval, and the same goes for many other different chord groups modelled by particular NTs (such as NT‑9 and NT‑11, both used for D∅7 G7 or G#7 C#7). NT‑18 and NT‑6 in Figure 6a are used to model groups of chords that end with C, while NT‑17 is used to model groups of chords ending with either D or G# (again, separated from each other by a tritone).

Figure 8a in Appendix A shows the same analysis, but after transposing every span so that its ending chord has C as root. In Figure 8b, spans are not transposed, but chords are replaced by a symbol indicating which of the octatonic Tonfelder scales their notes belong to (or which of the hexatonic scales if they are augmented or include a major seventh, since these chords cannot be formed from notes of a single octatonic Tonfelder scale).19 Some specialization can still be seen: e.g., NT‑19 in Figure 8b models spans that transition from a dominant seventh chord from the first octatonic scale to a dominant seventh chord from the third octatonic scale. However, overall, the plots are less sparse, meaning that NTs model several of the newly formed groups and the groups are modeled by several NTs. Hence, the aggregation (by either C‑transposing or grouping by Tonfelder) does not perfectly align with what NTs model. We conclude that our model does not by itself learn to be invariant to transposition, nor to changes within the same octatonic Tonfelder scale (as opposed to the PACFG (Harasim et al., 2018), which uses hardcoded rules to be invariant to transposition).
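The C-transposition behind Figure 8a is pitch-class arithmetic on chord roots; a minimal sketch (the enharmonic spelling table is our own choice):

```python
# Pitch classes of chord roots; enharmonic spellings map to the same class.
PITCH_CLASSES = {"C": 0, "C#": 1, "Db": 1, "D": 2, "Eb": 3, "E": 4, "F": 5,
                 "F#": 6, "Gb": 6, "G": 7, "Ab": 8, "A": 9, "Bb": 10, "B": 11}
NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

def transpose_span_to_c(roots):
    """Transpose a span of chord roots so that its last root becomes C."""
    shift = -PITCH_CLASSES[roots[-1]] % 12
    return [NAMES[(PITCH_CLASSES[r] + shift) % 12] for r in roots]

print(transpose_span_to_c(["G", "C", "F"]))  # ['D', 'G', 'C']
```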

Induced rules.

Figure 7 shows the rules learned by the model to which it assigns the highest probabilities. All rules of the forms defined in Section 3.1, with any combination of symbols, are allowed and can be used by the model if it assigns them nonzero probability. This results in too many rules to print,20 so we print only the two most likely rules for any symbol on the left. The entire grammar is available in our code repository, and we welcome other researchers to investigate the rules.
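Selecting the two most probable rules per left-hand-side symbol is a top-k lookup in the rule-probability tensor; a sketch with toy sizes (the helper name is ours):

```python
import numpy as np

def top_rules_per_lhs(log_pi_A, k=2):
    """For each nonterminal A on the left-hand side, return the k most
    probable rules A -> B C as (prob, b, c) triples, best first.
    log_pi_A has shape (N_N, M, M) with M = N_N + N_P."""
    n_N, M, _ = log_pi_A.shape
    top = []
    for a in range(n_N):
        flat = log_pi_A[a].ravel()
        best = np.argsort(flat)[::-1][:k]       # indices of k largest probs
        top.append([(float(np.exp(flat[i])), int(i) // M, int(i) % M)
                    for i in best])
    return top

# Toy tensor: 3 nonterminals, 5 symbols; normalize to a distribution per LHS.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5, 5))
log_pi_A = logits - np.log(np.exp(logits).reshape(3, -1).sum(1))[:, None, None]
rules = top_rules_per_lhs(log_pi_A)   # 3 LHS symbols, 2 rules each
```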

Figure 7

Induced rules with probability π_r for N‑PCFG without progression loss. For every symbol on the left of →, starting from the start symbol S, the two rules with the highest estimated probability are printed, and this is repeated for nonterminal symbols that occur on the right of the → in one of the already printed rules (but the set of printed nonterminal symbols is only expanded 3×). A are nonterminal and P preterminal symbols. This is only a fraction of all the rules.

All printed preterminal symbols represent chords whose roots lie a tritone (six semitones) apart (these chords, if we consider their triad form and ignore the optional essential fourth note, do not have any notes in common) and that share form and essential. Some nonterminal rules seem to be used as structural rules rather than representing specific progressions: e.g., A0 → A0 A0, which can be used to create repeated patterns, or A16 → A16 A24, which creates a recursive left‑branching structure.

Rule A0 → P14 P52 can create any of two V – I patterns (G7 C or C#7 G♭) and two ♭II – I patterns (G7 G♭ or C#7 C). In contrast, rule A25 → P58 P28 results in a major chord followed by a dominant seventh chord with its root either a perfect fifth or a semitone higher, which would have been closer to the theory of Rohrmeier (2020) and the JHT if it were reversed (P28 P58; to capture a preparatory V – I relationship). V – I subsequences do not necessarily occur more frequently than I – V subsequences, and the unsupervised model does not a priori know the theory interpreting a V chord as preparing a subsequent I rather than ‘prolonging’ a preceding I. Rules A24 → A19 P38 and A19 → P40 P59 together create ((ii – V) – I) or ((ii – ♭II7) – I) patterns by substituting A19 in A24 → A19 P38 for P40 P59: either Bm7 E7 AΔ or Bm7 B♭7 AΔ or Fm7 B♭7 E♭Δ or Fm7 E7 E♭Δ.

5 Conclusion

We have successfully trained neural PCFGs on chord sequences of jazz compositions in an entirely unsupervised manner. The predictions show significant overlap with expert annotations, and at times offer alternative interpretations of a piece. There remains a considerable gap between supervised models or models that hardcode rules on the one hand and our unsupervised models on the other. We were able to reduce the gap by using more data, by using a smarter embedding scheme, and by adding a small amount of music‑theoretical bias in the form of a learning objective. A lack of diverse annotated data made quantitative evaluation difficult since, as training is unsupervised, it is possible that the model learns different but valid structures. It would be interesting to investigate further to what extent that is the case, as well as how to further improve unsupervised parsing of harmony, especially given the good accuracy of unsupervised parsing for human language.

Acknowledgments

This work is part of the CALCULUS project, which is funded by the ERC Advanced Grant H2020‑ERC‑2017 ADG 788506.21 We thank the reviewers and action editors of the TISMIR journal for their insightful feedback and comments.

Reproducibility

All code, data, and models to reproduce the results described in this article are available at https://github.com/rubencart/jazz-pcfg.

Contributions

R. C.: conceptualization, methodology, software, analysis, investigation, data curation, writing – original draft, and visualization. J. K.: analysis, investigation, and writing – review & editing. M.‑F. M.: conceptualization, methodology, resources, writing – review & editing, supervision, project administration, and funding acquisition.

Competing Interests

The authors have no competing interests to declare.

Notes

[3] Chapter 18 of Jurafsky and Martin (2024) contains an explanation about formal grammars and Chomsky normal form.

[4] We mention probabilities and log‑probs interchangeably. In practice, we perform computations on logarithms of probabilities for numerical stability.

[5] We refer to Manning and Schütze (2001), p. 392, for an explanation of the inside and outside algorithms.

[6] Also known as the CKY algorithm in this context (Sakai, 1961).

[7] While Foscarin et al. (2023) use extension, we use the term essential for a fourth note added to a triad, following Terefenko (2014). Like Foscarin et al. (2023), we omit ninth notes and other essentials besides sixth and seventh notes.

[8] Which seems reasonable, since it is an inevitable consequence of using chord symbols that certain identical sets of notes (ignoring inversions) can be encoded with distinct chord symbols, such as Am7 and C6.

[9] We additionally tested pretrained chord embeddings proposed in literature (Lazzari et al., 2022, 2023), albeit without success.

[10] We experimented with variants of this loss 1) that use a margin proportional to span length or 2) where negative spans that overlap with the rule‑based positive spans are sampled for which the log‑probs are minimized, but this hurt performance.

[11] We refer to Manning and Schütze (2001), p. 392, for an explanation of the inside and outside algorithms.

[12] In practice, we minimize the log‑marginal.

[13] Turnarounds are subsequences of 1–4 chords at the end of a song that are only played to cycle back to the beginning of the song.

[14] JHT songs contain repeated chord symbols if a chord is played during several bars, while most ChoCo songs do not.

[15] Settings for the model or training process: e.g., the learning rate, batch size, or the number of epochs.

[17] In fact, all authors of Harasim et al. (2018) are also authors of the JHT (Harasim et al., 2020).

[18] Some NTs might never be used by the model, or only for longer spans higher up the tree rather than for the shorter spans shown, which explains why the plot only has 19 rows. Note that the numbering does not correspond to the numbering of the nonterminal A symbols in Equation 7.

[19] Tonfelder (‘tone fields’) are collections of pitches as given by grouping nodes on a graph of pitches with major/minor third and perfect fifth relations. They were defined by Simon (1983) (see also Haas, 2004; Rohrmeier and Moss, 2021) and can account for harmony in extended tonality as found in Western music from the Romantic period onward, including in jazz.

[20] N_N rules with S on the left, N_P·V rules with a preterminal on the left, and N_N·(N_N+N_P)² rules with a nonterminal on the left result in ~260K rules for N_N = 30, N_P = 60.

Additional File

The additional file for this article can be found using the links below:

Supplementary Appendix
DOI: https://doi.org/10.5334/tismir.217 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 2, 2024
Accepted on: May 12, 2025
Published on: Jun 20, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Ruben Cartuyvels, John Koslovsky, Marie-Francine Moens, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.