Measure by Measure: Measure-Based Automatic Music Composition with Modern Staff Notation

Yujia Yan; Zhiyao Duan

doi:10.5334/tismir.163

Transactions of the International Society for Music Information Retrieval

Volume 7 (2024): Issue 1

Open Access

Nov 2024

Figures & Tables

Overview of the proposed approach. The measure encoder encodes the entire piece into a grid, where each cell contains a vector summarizing the content of a measure within a part (left) and a matrix capturing the detailed structure of the measure (right). The example score is from J.S. Bach’s *Art of Fugue*.

The common object hierarchy of a single part‑wise measure. This hierarchy specifies the compositional structure of a part‑wise measure, which includes four levels from high to low: measure, voice, chord, and note.
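As a rough illustration, the four-level hierarchy described in this caption could be modeled with nested containers. This is a minimal sketch, not the authors' data structure: the class layout follows the caption (measure, voice, chord, note), but every field name (`pitch`, `duration`, `notes`, `chords`, `voices`) and the helper `count_notes` are assumptions introduced here.

```python
from dataclasses import dataclass, field
from fractions import Fraction
from typing import List


@dataclass
class Note:
    pitch: int  # assumption: MIDI pitch number


@dataclass
class Chord:
    duration: Fraction  # assumption: length in quarter notes
    notes: List[Note] = field(default_factory=list)  # assumption: empty list means a rest


@dataclass
class Voice:
    chords: List[Chord] = field(default_factory=list)


@dataclass
class Measure:
    voices: List[Voice] = field(default_factory=list)


def count_notes(measure: Measure) -> int:
    """Walk the hierarchy from measure down to note and count the notes."""
    return sum(len(chord.notes) for voice in measure.voices for chord in voice.chords)
```

Each level only holds a list of the level below it, so traversing a part-wise measure is a nested iteration over voices, chords, and notes.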

A high‑level illustration of the proposed measure model. The measure/voice encoder and decoder have two outputs/inputs that are vector‑valued and matrix‑valued, respectively.

Example of cross‑beat note splitting, from the left two measures to the right two measures: the note C crossing the (3 + 3)/8 division is split into an eighth note and a quarter note; similarly, the dotted half note is split into two dotted quarter notes.
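The splitting shown in this figure can be sketched as cutting a note wherever it crosses a beat-group boundary. The code below is an illustration under stated assumptions, not the authors' preprocessing code: time is counted in eighth notes from the start of the measure, the function name `split_at_boundaries` is hypothetical, and the resulting segments are assumed to be tied together.

```python
from typing import List, Sequence, Tuple


def split_at_boundaries(
    onset: int, duration: int, boundaries: Sequence[int]
) -> List[Tuple[int, int]]:
    """Split a note into (onset, duration) segments at each boundary it crosses.

    All values are in eighth-note units (an assumption for this sketch).
    A boundary that coincides with the note's onset or end causes no split.
    """
    end = onset + duration
    cuts = sorted(b for b in boundaries if onset < b < end)
    points = [onset] + cuts + [end]
    return [(start, stop - start) for start, stop in zip(points, points[1:])]


# (3 + 3)/8 meter: the only internal boundary is after the third eighth note.
boundaries_33_8 = [3]

# A dotted quarter starting on the third eighth crosses the boundary:
print(split_at_boundaries(2, 3, boundaries_33_8))  # [(2, 1), (3, 2)]: eighth + quarter

# A dotted half filling the measure becomes two dotted quarters:
print(split_at_boundaries(0, 6, boundaries_33_8))  # [(0, 3), (3, 3)]
```

The onset chosen for the first example is an assumption that reproduces the eighth-plus-quarter split named in the caption.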

Linearization of the measure grid for autoregressive modeling using the measure model.
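One way to picture such a linearization is to flatten the part-by-measure grid into a single sequence that an autoregressive model can consume step by step. The caption does not state the traversal order, so the sketch below assumes a measure-major order (all parts of measure 0, then all parts of measure 1, and so on); the function name `linearize_grid` and the `grid[part][measure]` layout are also assumptions.

```python
from typing import Any, List, Sequence, Tuple


def linearize_grid(grid: Sequence[Sequence[Any]]) -> List[Tuple[int, int, Any]]:
    """Flatten grid[part][measure] into a list of (measure, part, cell) triples.

    Assumed order: measure by measure, and within each measure from the
    first part to the last. Every part must have the same number of measures.
    """
    n_parts = len(grid)
    n_measures = len(grid[0]) if n_parts else 0
    if any(len(row) != n_measures for row in grid):
        raise ValueError("all parts must have the same number of measures")
    return [(m, p, grid[p][m]) for m in range(n_measures) for p in range(n_parts)]


# Toy grid with two parts and three measures; the strings stand in for measure contents.
toy_grid = [
    ["A0", "A1", "A2"],
    ["B0", "B1", "B2"],
]
sequence = linearize_grid(toy_grid)
```

With this ordering, each cell is generated after all cells of earlier measures and after the preceding parts of its own measure.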

Some statistics of music pieces in our dataset. Each dot represents a piece. **(a)** Length versus number of parts (staffs). **(b)** Length versus number of note events.

Table 1

Effects of different encoding key functions and memory reader mechanisms on negative log likelihood (nll) and length generation accuracy (len acc) of measure generation.

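| Reader | key($\cdot$) | nll | len acc | nll (aug.) | len acc (aug.) |
|---|---|---|---|---|---|
| position‑dependent attention | identity | 13.49 | 96.99% | 55.94 | 37.25% |
| position‑dependent attention | softmax | 12.13 | 97.38% | 45.53 | 37.24% |
| position‑dependent attention | sigmoid | 14.35 | 95.85% | 77.02 | 35.33% |
| position‑dependent attention | skewgauss | 12.18 | 96.90% | 73.20 | 37.25% |
| position‑dependent only | identity | 13.70 | 97.97% | 69.35 | 38.29% |
| position‑dependent only | softmax | 13.60 | 96.65% | 89.22 | 37.37% |
| position‑dependent only | sigmoid | 12.62 | 97.55% | 56.88 | 36.93% |
| position‑dependent only | skewgauss | 12.33 | 95.48% | 37.82 | 39.91% |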

| Configuration | nll | len acc | nll (aug.) | len acc (aug.) |
|---|---|---|---|---|
| key($\cdot$): softmax | 12.13 | 97.38% | 45.53 | 37.24% |
| −pos | 12.17 | 93.72% | 57.99 | 29.90% |
| key($\cdot$): skewgauss | 12.33 | 95.48% | 37.82 | 39.91% |
| −pos | 13.14 | 92.68% | 54.10 | 30.60% |
| No measure memory matrices | 13.81 | 95.70% | 92.23 | 36.32% |
| −pos | 14.13 | 91.49% | 103.79 | 33.21% |

| Configuration | nll | len acc |
|---|---|---|
| key($\cdot$): softmax | 12.74 | 98.09% |
| −pos | 12.74 | 92.97% |
| key($\cdot$): skewgauss | 12.51 | 97.77% |
| −pos | 12.34 | 96.14% |
| No measure memory matrices | 13.44 | 91.67% |
| −pos | 14.30 | 78.67% |

Subjective listening tests among different self‑reported expertise groups. **(a)** Overall ratings. **(b)** Overall ratings among those with the expertise “none/beginner”. **(c)** Overall ratings among those with the expertise “intermediate/experienced/expert”.

Subjective listening ratings along four aspects across all expertise groups. **(a)** Fluency. **(b)** Expressivity. **(c)** Novelty. **(d)** Organization.

One‑sided Mann–Whitney rank test for the pairwise comparison of ratings: p‑value indicates the probability for the hypothesis that the median of *population A* (row) is larger than the median of *population B* (column). **(a)** Overall. **(b)** Fluency. **(c)** Expressivity. **(d)** Novelty. **(e)** Organization.
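For readers unfamiliar with the test, the Mann–Whitney U statistic behind these comparisons can be computed by counting, over all pairs of ratings, how often a rating from population A exceeds one from population B. The sketch below is a plain-Python illustration with hypothetical names (`mann_whitney_u`, `ratings_a`, `ratings_b`); it counts ties as half a win, which is the usual convention, and it does not compute the p-value. In practice, `scipy.stats.mannwhitneyu(x, y, alternative="greater")` performs the one-sided test directly.

```python
from typing import Sequence


def mann_whitney_u(ratings_a: Sequence[float], ratings_b: Sequence[float]) -> float:
    """U statistic of A over B: pairs where a > b count 1, ties count 0.5."""
    u = 0.0
    for a in ratings_a:
        for b in ratings_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def superiority(ratings_a: Sequence[float], ratings_b: Sequence[float]) -> float:
    """U normalized by the number of pairs: an estimate of P(a > b) + 0.5 * P(a == b)."""
    return mann_whitney_u(ratings_a, ratings_b) / (len(ratings_a) * len(ratings_b))


# Made-up ratings for illustration only; these are not data from the study.
toy_a = [3, 4, 5]
toy_b = [1, 2, 3]
print(mann_whitney_u(toy_a, toy_b))  # 8.5 out of 9 possible pairs
```

The two directions are complementary: the U statistic of A over B and that of B over A always sum to the number of pairs.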

Table 3

Ranking scores (and ranks) by the Bradley–Terry model on the Mann–Whitney U‑statistics.

| Aspect | AR | AR Measure | CSD Measure | Human |
|---|---|---|---|---|
| Overall | 0.210 (4) | 0.218 (3) | 0.254 (2) | 0.318 (1) |
| Fluency | 0.209 (4) | 0.214 (3) | 0.275 (2) | 0.302 (1) |
| Expressivity | 0.234 (3) | 0.206 (4) | 0.235 (2) | 0.325 (1) |
| Novelty | 0.246 (3) | 0.238 (4) | 0.267 (1) | 0.249 (2) |
| Organization | 0.203 (4) | 0.231 (3) | 0.251 (2) | 0.316 (1) |
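Ranking scores of this kind can be obtained by fitting a Bradley–Terry model to a matrix of pairwise "win" counts, here the U statistics between systems. The listing does not say how the authors fitted the model, so the sketch below uses the classical minorization-maximization (Zermelo) iteration as an assumption; the function name `bradley_terry` and the `wins[i][j]` layout are hypothetical. Scores are normalized to sum to 1, which matches the rows of Table 3.

```python
from typing import List, Sequence


def bradley_terry(wins: Sequence[Sequence[float]], iters: int = 200) -> List[float]:
    """Fit Bradley-Terry strengths with the MM (Zermelo) update.

    wins[i][j] is how often item i beat item j (the diagonal is ignored).
    Assumes every item has at least one win and one comparison, so that
    no division by zero occurs. Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j]) for j in range(n) if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p


# Made-up two-item example: item 0 wins 3 of 4 comparisons.
toy_scores = bradley_terry([[0, 3], [1, 0]])
print(toy_scores)
```

Under the model, item i beats item j with probability p_i / (p_i + p_j), so in the two-item toy example the fitted scores reproduce the observed win rate of 3 out of 4.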