
Figure 1
Overview of the proposed approach. The measure encoder encodes the entire piece into a grid, where each cell contains a vector summarizing the content of a measure within a part (left) and a matrix capturing the detailed structure of the measure (right). The example score is from J.S. Bach’s Art of Fugue.

Figure 2
The common object hierarchy of a single part‑wise measure. This hierarchy specifies the compositional structure of a part‑wise measure, which includes four levels from high to low: measure, voice, chord, and note.
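
As a rough illustration of the four-level hierarchy described in this caption, the nesting can be sketched with plain data classes. The attribute names (`pitch`, `duration`) and the convention that an empty chord stands for a rest are assumptions made for this sketch only; the caption itself names just the four levels.

```python
from dataclasses import dataclass, field
from fractions import Fraction
from typing import List


@dataclass
class Note:
    pitch: int  # MIDI pitch number (hypothetical attribute)


@dataclass
class Chord:
    duration: Fraction  # length in whole notes (hypothetical attribute)
    notes: List[Note] = field(default_factory=list)  # empty list = rest (assumption)


@dataclass
class Voice:
    chords: List[Chord] = field(default_factory=list)


@dataclass
class Measure:
    voices: List[Voice] = field(default_factory=list)


def count_notes(measure: Measure) -> int:
    """Walk measure -> voice -> chord -> note and count the leaves."""
    return sum(len(chord.notes) for voice in measure.voices for chord in voice.chords)
```

A part-wise measure with one voice holding a three-note chord followed by a single note would then contain four `Note` leaves.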

Figure 3
A high‑level illustration of the proposed measure model. The measure/voice encoder produces two outputs, one vector‑valued and one matrix‑valued, and the decoder takes the corresponding two inputs.

Figure 4
Example of cross‑beat note splitting, from the left two measures to the right two measures: the note C crossing the (3 + 3)/8 division is split into an eighth note and a quarter note; similarly, the dotted half note is split into two dotted quarter notes.
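
The splitting rule in this caption can be sketched as cutting a note at every beat-group boundary that falls strictly inside it. The function name, the integer tick units (eighth notes here), and the assumed onset of the note C are illustrative choices, not details taken from the paper.

```python
from typing import List, Tuple


def split_at_boundaries(onset: int, duration: int,
                        boundaries: List[int]) -> List[Tuple[int, int]]:
    """Split a note into (onset, duration) pieces at each boundary inside it.

    All values are in the same tick unit, e.g. eighth notes.
    """
    end = onset + duration
    # Keep only boundaries strictly inside the note; ones at its edges do not cut it.
    cuts = [b for b in sorted(boundaries) if onset < b < end]
    points = [onset] + cuts + [end]
    return [(start, stop - start) for start, stop in zip(points, points[1:])]


# A (3 + 3)/8 measure has one internal boundary after the third eighth note.
boundaries = [3]
# Assumed: a dotted quarter starting on the third eighth (onset 2, duration 3).
pieces_c = split_at_boundaries(2, 3, boundaries)
# A dotted half note filling the measure (onset 0, duration 6).
pieces_dotted_half = split_at_boundaries(0, 6, boundaries)
```

Under these assumptions, `pieces_c` is an eighth note followed by a quarter note, and `pieces_dotted_half` is two dotted quarter notes, matching the two cases named in the caption.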

Figure 5
Linearization of the measure grid for autoregressive modeling using the measure model.

Figure 6
Masked conditional specification.

Algorithm 1
Sampling/inference procedure


Figure 7
Some statistics of music pieces in our dataset. Each dot represents a piece. (a) Length versus number of parts (staffs). (b) Length versus number of note events.

Table 1
Effects of different encoding key functions and memory reader mechanisms on negative log likelihood (nll) and length generation accuracy (len acc) of measure generation.
| key() | nll | len acc | nll (aug.) | len acc (aug.) |
|---|---|---|---|---|
| Reader: position‑dependent attention | | | | |
| identity | 13.49 | 55.94 | | |
| softmax | 12.13 | 45.53 | | |
| sigmoid | 14.35 | 77.02 | | |
| skewgauss | 12.18 | 73.20 | | |
| Reader: position‑dependent only | | | | |
| identity | 13.70 | 69.35 | | |
| softmax | 13.60 | 89.22 | | |
| sigmoid | 12.62 | 56.88 | | |
| skewgauss | 12.33 | | | |

| | nll | len acc | nll (aug.) | len acc (aug.) |
|---|---|---|---|---|
| key(): softmax | | | | |
| −pos | 12.17 | 45.53 57.99 | | |
| key(): skewgauss | | | | |
| −pos | 12.33 13.14 | 54.10 | | |
| No measure memory matrices | | | | |
| −pos | 13.81 14.13 | 92.23 103.79 | | |

| | nll | len acc |
|---|---|---|
| key(): softmax | | |
| −pos | 12.74 12.74 | |
| key(): skewgauss | | |
| −pos | 12.51 | |
| No measure memory matrices | | |
| −pos | 13.44 14.30 | |

Figure 8
Subjective listening tests among different self‑reported expertise groups. (a) Overall ratings. (b) Overall ratings among participants reporting “none/beginner” expertise. (c) Overall ratings among participants reporting “intermediate/experienced/expert” expertise.

Figure 9
Subjective listening ratings along four aspects across all expertise groups. (a) Fluency. (b) Expressivity. (c) Novelty. (d) Organization.

Figure 10
One‑sided Mann–Whitney rank test for the pairwise comparison of ratings: each p‑value is computed for the one‑sided alternative hypothesis that the median of population A (row) is larger than the median of population B (column), so a small p‑value is evidence in favor of that alternative. (a) Overall. (b) Fluency. (c) Expressivity. (d) Novelty. (e) Organization.
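
For readers who want to reproduce this kind of pairwise comparison, a one-sided Mann–Whitney test can be sketched as follows. This version uses the normal approximation with a tie correction and no continuity correction, which is an assumption of the sketch; the caption does not state which variant was used, and the rating lists are made-up placeholders.

```python
import math
from collections import Counter
from typing import Sequence


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> float:
    """U statistic for A: pairs with a > b count 1, tied pairs count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def one_sided_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Approximate p-value for the alternative 'A tends to be larger than B'."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    u = mann_whitney_u(a, b)
    mean = n1 * n2 / 2.0
    # Tie correction: sum of t^3 - t over groups of equal values in the pooled sample.
    ties = sum(t ** 3 - t for t in Counter(list(a) + list(b)).values())
    var = n1 * n2 / 12.0 * ((n + 1) - ties / (n * (n - 1)))
    if var <= 0:
        return 1.0  # all pooled ratings identical: no evidence either way
    z = (u - mean) / math.sqrt(var)
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Placeholder ratings on a 1-5 scale (illustrative only, not data from the study).
ratings_a = [4, 5, 5, 4, 3]
ratings_b = [1, 2, 2, 3, 1]
p_a_over_b = one_sided_p_value(ratings_a, ratings_b)
```

Swapping the two arguments gives the p-value for the opposite alternative; under this approximation the two values sum to one.
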
Table 3
Ranking scores (and ranks) by the Bradley–Terry model on the Mann–Whitney U‑statistics.
| | AR | AR Measure | CSD Measure | Human |
|---|---|---|---|---|
| Overall | | | | |
| Fluency | | | | |
| Expressivity | | | | |
| Novelty | | | | |
| Organization | | | | |
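
One plausible way to obtain Bradley–Terry scores from pairwise statistics is to treat each entry `wins[i][j]` as the number of comparisons that system `i` wins against system `j` and fit the strengths with the standard minorization-maximization update. Using the Mann–Whitney U values directly as win counts is an assumption of this sketch, and the matrix below is a made-up placeholder rather than data from the table.

```python
from typing import List


def bradley_terry(wins: List[List[float]], iters: int = 200) -> List[float]:
    """Fit Bradley-Terry strengths by the MM update; scores are normalized to sum to 1.

    wins[i][j] is the (possibly fractional) number of times i beats j.
    Assumes every item has at least one win and one comparison.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom)
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p


# Placeholder pairwise win counts for three hypothetical systems.
example_wins = [
    [0.0, 8.0, 9.0],
    [2.0, 0.0, 7.0],
    [1.0, 3.0, 0.0],
]
scores = bradley_terry(example_wins)
# Rank 1 is the strongest system.
ranks = [sorted(scores, reverse=True).index(s) + 1 for s in scores]
```

Sorting the fitted scores in descending order gives the ranks reported alongside the scores.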


Figure 11
With the measure model, a beat sync error (left, m. 18, where the second violin has an extra beat) is recovered from at the start of the next measure; without the measure model, a similar error (right, m. 41, where all parts have an extra 16th note) shifts all subsequent note onset positions. (a) AR, with measure model. (b) AR, without measure model; output MIDI file imported into Sibelius.
