1 Introduction
String instruments are among the most expressive musical instruments. Their rich harmonic structures, dynamic timbral qualities, and wide range of playing techniques have made them central to musical traditions across cultures and eras (Baines, 1992). These characteristics also make them especially demanding for computational simulation. As a result, accurately reproducing the sound of string instruments is widely regarded as a challenging task in sound synthesis and computational acoustics (Fletcher and Rossing, 1998; Hawthorne et al., 2019; Rossing, 2010).
From early analog systems that used voltage‑controlled circuitry (Moog, 1965) to digital methods such as frequency modulation (FM) (Chowning, 1973), advances in digital signal processing (DSP) have enabled increasingly flexible sound synthesis (Roberts and Mullis, 1987). Physical modeling (PM) enables sound synthesis through simulated physical interactions in real instruments, enhancing realism and expressivity (Rossing, 2010; Smith, 2004). More recently, data‑driven methods have learned to generate sounds directly from audio data, offering a new level of fidelity and adaptability (Engel et al., 2020). These developments now support real‑time, interactive synthesis across virtual and hybrid environments (Puckette, 1996).
Technological advances in sound synthesis have enabled a variety of interactive systems in which users can manipulate sound output in real time through gestures, sensors, controllers, or embodied movement (Miranda and Wanderley, 2006). Such systems are used in areas such as sound art installations (Fraisse et al., 2021), music education (Gehlhaar et al., 2011), instrument restoration (Willemsen et al., 2020), and rehabilitation (Tuominen and Saarni, 2024). In these contexts, synthesis methods should not only provide high audio fidelity but also support detailed control of parameters, the ability to emulate instruments within the same category, and low‑latency responses. As shown in Figure 1, we compare the feedback loops (human control–auditory feedback) of traditional acoustic instruments with interactive systems, emphasizing the aforementioned requirements.

Figure 1
A comparison of human control–sound and audio feedback loops in a traditional grand piano (left) and an interactive system (right). In the traditional setting, the human performer physically interacts with the piano keyboard, triggering its mechanical sound production and receiving immediate acoustic feedback. In an interactive system, human actions (e.g., via a computer keyboard) are mapped to synthesis parameters (e.g., amplitude, frequency), which generate digital audio feedback (e.g., guitar, violin, piano timbre) rendered through audio output devices.
While interactive systems for musical instruments have already been explored, few studies have examined the sound synthesis methods that form their foundation—especially for acoustically complex instruments such as strings. Reviews in this area have emphasized different aspects of interactive music systems. Turchet et al. (2021) review the emergence of music in extended realities; integrate technical, artistic, perceptual, and methodological perspectives; define ‘Musical XR (extended reality)’ through expert insights; and propose a research agenda motivating systematic evaluation frameworks and comparative studies—concerns that we address in this review. Gómez‑Sirvent et al. (2025) review the use of musical instruments in XR environments, identifying applications, technologies, interaction methods, and user experience factors. Serafin et al. (2016) provide an overview of virtual reality musical instruments, focusing on design principles and evaluation methods from a performer’s perspective. Complementing these interaction‑ and application‑oriented perspectives, Hawley (2020) focuses specifically on sound synthesis, contrasting PM with recent advances in deep learning (DL) methods for musical instrument sound synthesis.
To the best of our knowledge, this is the first scoping review centered on string instrument sound synthesis for use in interactive systems, with a focus on the interactivity enabled by the underlying synthesis methods. Following the PRISMA‑ScR (‘Preferred Reporting Items for Systematic Reviews and Meta‑Analyses for Scoping Reviews’) checklist, we identified 67 eligible studies and evaluated them through a three‑tier interactivity scheme and a four‑dimensional evaluation framework. Our analysis reveals systematic trends—such as the predominance of PM for plucked strings and the superiority of neural audio synthesis (NAS) for bowed and hammered strings—while also highlighting gaps in instrument diversity and the documentation of latencies. These findings clarify how tradeoffs in the four‑dimensional evaluation framework manifest across synthesis methods and motivate recommendations regarding cross‑instrument generalization and the standardization of interactivity‑related evaluation.
Guided by these considerations, our review addresses the following research questions:
RQ1: Which types of synthesis methods have been or could be applied to synthesize string instrument sounds within interactive systems?
RQ2: How are interactivity requirements addressed and evaluated across existing string instrument synthesis methods?
RQ3: What are the structural tradeoffs and limitations that shape the suitability of synthesis methods for interactive use?
To address these questions, we structure our article as follows: Section 2 describes how studies were identified and categorized, Section 3 reviews the synthesis methods, and Sections 4 and 5 provide the evaluation framework and discussion of findings. Section 6 concludes the paper, offering directions for future work.
2 Methodology
We adopt a scoping review methodology (Arksey and O’Malley, 2005; Tricco et al., 2018) guided by the PRISMA‑ScR checklist. We chose a scoping design because research on interactive string instrument synthesis is highly heterogeneous: studies differ in their objectives, synthesis methods, and implementation status. Our goal was to map the breadth of existing methods, classify them across different tiers of interactivity, and identify gaps for future work.
2.1 Scope and review criteria
The focus of this review is on string instrument synthesis methods for use in interactive systems. We envision interactive musical systems that empower users to provide live input (e.g., via 3D gestures), receive immediate auditory feedback, and adjust their behavior in response to that feedback. Depending on the intended application—ranging from live performance to sound design—the requirements for response time (latency) and implementation vary substantially. Latency should not be presented as isolated numbers but should be contextualized relative to perceptual thresholds—specifically, distinguishing between the 10‑ms ideal for imperceptible delay and the 30‑ms ceiling generally acceptable for playability (Jack et al., 2016; Wessel and Wright, 2002).
Based on the implementation status observed across the 67 included studies, we classify studies into three mutually exclusive tiers:
Tier 1 covers fully implemented interactive systems where users directly control the synthesis in a closed loop.
Tier 2 includes synthesis methods with reported real‑time implementations intended for interactive use that have not yet been integrated into an actual interactive setup.
Tier 3 refers to offline systems and implementations whose reported latency or processing constraints prevent interactive use.
We considered studies in all three categories to capture the full breadth of technical methods.
To define the instrument scope, we categorize the acoustic instruments by their excitation mechanism: plucked (e.g., guitar, guqin), bowed (e.g., violin, cello), hammered (e.g., piano, dulcimer), and ‘other’ (e.g., the wind‑driven aeolian harp). We explicitly include the piano since its simulation fundamentally concerns string vibration and hammer–string interaction, which is highly relevant to string synthesis research.
2.2 Literature search strategy
We conducted a structured search across five major databases: DBLP, the ACM Digital Library, IEEE Xplore, SpringerLink, and Scopus. The search combined descriptors for instruments (e.g., ‘string instrument,’ ‘violin’), synthesis methods (e.g., ‘physical modeling,’ ‘neural audio synthesis’), and interactivity (e.g., ‘real‑time’, ‘HCI’). We used Boolean operators to combine these sets. We restricted the search to English‑language publications from 1983 to 2025, as 1983 marks the introduction of the Karplus–Strong algorithm (KSA) (Karplus and Strong, 1983), which laid the foundations for modern digital string synthesis.
2.3 Inclusion and exclusion
We selected publications for inclusion if the proposed methods were explicitly designed to synthesize acoustic string instruments and were aligned with one of the three interactivity tiers described in Section 2.1.
Conversely, we applied specific exclusion criteria to keep our study focused. First, we excluded works dealing exclusively with electric string instruments (e.g., electric guitar). Second, we omitted publications focusing solely on signal analysis or recognition without a synthesis component. Third, we excluded synthesis methods based purely on offline sample playback without generative capabilities, as well as studies providing insufficient methodological detail. Finally, regarding data‑driven methods, we excluded general NAS methods unless they provided explicit empirical evidence of training or evaluation on string instrument datasets. Peer‑reviewed journal and conference papers formed the primary sources of the review. We considered preprints only if they provided complete technical descriptions and evaluations of string instruments and if no peer‑reviewed version was available.
2.4 Screening process
We followed a two‑stage screening workflow. First, we removed duplicates and screened titles and abstracts against the eligibility criteria. Second, we assessed the full texts. The database search retrieved 2,360 records (ACM: 540; IEEE: 410; SpringerLink: 630; Scopus: 780), and, through scanning their bibliographies, we identified an additional 54 records. After removing 1,098 duplicates, 1,316 records entered title and abstract screening, of which 1,096 were excluded. We assessed the remaining 220 full texts and excluded 150 for one of the following reasons: no string instrument focus (38), analysis only without synthesis (29), reliance on offline sample playback (41), electric string instruments only (18), insufficient methodological detail (12), not written in English or not peer‑reviewed (7), and full text unavailable (5). As a result, we obtained 67 studies for our scoping review. Specifically, Tier 1 comprised 24 studies, Tier 2 comprised three studies, and Tier 3 comprised 40 studies.1
2.5 Data extraction and categorization
For each included publication, we extracted structured data focusing on synthesis methods and target string instruments. In terms of implementation, we recorded information on real‑time capability and interactivity integration status, which informed our three‑tier categorization. Additionally, we catalogued qualitative and quantitative evidence regarding fidelity, responsiveness, controllability, and adaptability. We provide the complete database in Table S1 of the supplement to support the subsequent comparative analysis. Among the 67 reviewed studies, several employed more than one synthesis method. Consequently, we conducted analyses at the method level (n = 71), exceeding the total number of studies.
3 Sound Synthesis Methods Overview
The production of sound in string instruments arises from a complex interplay of physics and the musician’s performance. When a string is set in motion by an excitation mechanism such as plucking, bowing, or hammering, the excitation shapes the harmonic content and timbre of the instrument, giving each mechanism distinct spectral characteristics. Moreover, the collective behavior of several coupled vibrators in an instrument, together with nonlinear feedback, contributes to its behavior as a complex vibrating system (Fletcher and Rossing, 1998). As illustrated in Figure 2, abstract digital sound synthesis (ADSS) methods appear sporadically from 1990, physical modeling synthesis (PMS) shows dense activity in the early 2000s, and NAS studies surge from 2016 onward, increasingly shaping the field as an important complement to the other two methods.

Figure 2
Temporal trajectory of sound synthesis research. Each point represents a distinct synthesis method reported in the literature, categorized into abstract digital sound synthesis, physical modeling synthesis, and neural audio synthesis.
Inspired by existing taxonomies of sound synthesis (Bilbao, 2009; Hayes et al., 2024; Schwarz, 2007), we organize synthesis methods into parametric and data‑driven categories, as shown in Figure 3.

Figure 3
Taxonomy of sound synthesis methods, organized into two main categories—parametric and data‑driven—following the structure proposed by Schwarz (2007) and Hayes et al. (2024). Within the parametric category, both abstract digital sound synthesis and physical modeling synthesis are included (Bilbao, 2009). The data‑driven category introduces neural audio synthesis. At the third level, we list representative methods for string instrument synthesis within each category.
3.1 Abstract digital sound synthesis
ADSS methods generally lack a physical basis or real‑world model, relying instead on perceptual and mathematical principles (Bilbao, 2009). We classify the following ADSS methods based on their primary components, such as oscillators, filters, and data tables.
3.1.1 Additive synthesis
Additive synthesis (AS) is directly inspired by Fourier’s theorem, which states that any sound can be described as an infinite sum of sine waves (Risset and Mathews, 1969; Smith, 2011). However, realistic sounds often require a large number of sine waves with dynamically changing amplitudes and frequencies, making direct control computationally intensive. Spectral modeling synthesis (SMS) addresses this by separating sinusoidal and noise components via short‑time Fourier transform analysis (Serra and Smith, 1990). AS and SMS have been applied to erhu (Siao et al., 2006) and aeolian harp (Selfridge et al., 2024) synthesis, respectively.
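To make the principle concrete, the following minimal sketch sums ten harmonics with 1/n amplitude weighting; the fundamental, partial count, and weights are illustrative choices of ours, not values drawn from any reviewed study.

```python
import numpy as np

def additive(freqs_hz, amps, dur=1.0, sr=44_100):
    """Sum of sinusoidal partials (Fourier-style additive synthesis)."""
    t = np.arange(int(dur * sr)) / sr
    return sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs_hz, amps))

# A rough string-like tone: ten harmonics of 220 Hz with 1/n amplitudes.
partials = [(220.0 * n, 1.0 / n) for n in range(1, 11)]
tone = additive(*zip(*partials))
```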
3.1.2 Subtractive synthesis
Subtractive synthesis (SuS) shapes harmonically rich waveforms such as sawtooth, square, or triangle waves through digital filtering to produce a diverse range of sounds (Bilbao, 2009) and has been applied to erhu synthesis (Siao et al., 2006).
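A correspondingly minimal subtractive sketch filters a harmonically rich sawtooth through a low‑pass filter; the cutoff frequency and filter order are again illustrative assumptions.

```python
import numpy as np
from scipy import signal

sr, dur, f0 = 44_100, 1.0, 220.0
t = np.arange(int(sr * dur)) / sr
raw = signal.sawtooth(2 * np.pi * f0 * t)          # harmonically rich source
b, a = signal.butter(4, 2_000 / (sr / 2), "low")   # 4th-order low-pass at 2 kHz
tone = signal.lfilter(b, a, raw)                   # subtractive spectral shaping
```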
3.1.3 Frequency modulation synthesis
Similar to AS, frequency modulation synthesis (FMS) relies on the manipulation of sinusoidal waveforms. It uses a carrier–modulator oscillator pair to generate sideband frequency components, which are controlled by the modulation frequency and index (Chowning, 1973). While well‑suited for real‑time inharmonic synthesis, FMS is limited as a standalone method for accurately reproducing the complex and nuanced sounds of acoustic instruments (Uncini, 2022). To address this limitation, Liu (2024) proposes a hybrid method that integrates FMS with wavetable synthesis (WTS) to reproduce expressive elements such as natural vibrato and bow pressure variations based on real violin performance data.
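The core of Chowning’s formulation fits in a few lines. In the sketch below, a non‑integer carrier‑to‑modulator ratio produces the inharmonic sidebands mentioned above; all parameter values are illustrative.

```python
import numpy as np

def fm_tone(fc, fm, index, dur=1.0, sr=44_100):
    """Chowning-style FM: a carrier phase-modulated by one sinusoid.
    Sideband spacing follows fm; sideband strength follows the index."""
    t = np.arange(int(dur * sr)) / sr
    return np.sin(2 * np.pi * fc * t + index * np.sin(2 * np.pi * fm * t))

# A non-integer fc:fm ratio yields an inharmonic, bell/string-like spectrum.
tone = fm_tone(fc=440.0, fm=440.0 * 1.4, index=3.0)
```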
3.1.4 Wavetable synthesis
WTS constitutes an alternative method of generating waveforms without using oscillators or filters. It is based on a prestored table containing arbitrary waveforms, from simple sine waves to more complex shapes, avoiding the direct computation of sine or cosine functions (Mathews, 1963; Mathews et al., 1969). Extensions such as multi‑wavetable methods further mix or switch tables over time to approximate instrument spectra more accurately, as demonstrated for guitar, piano, violin, cello, erhu, and guzheng (Horner et al., 1993; Wun et al., 2001).
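A minimal lookup sketch illustrates the idea: a phase accumulator indexes a prestored single‑cycle table with linear interpolation. The table contents and size here are illustrative.

```python
import numpy as np

def wavetable_tone(table, f0, dur=1.0, sr=44_100):
    """Phase-accumulator lookup with linear interpolation into one table."""
    n = len(table)
    phase = (np.arange(int(dur * sr)) * f0 / sr) % 1.0   # normalized phase
    idx = phase * n
    i0 = idx.astype(int)
    frac = idx - i0
    return (1 - frac) * table[i0 % n] + frac * table[(i0 + 1) % n]

# One period of a bright waveform, prestored in a 2048-sample table.
x = np.linspace(0, 1, 2048, endpoint=False)
table = sum(np.sin(2 * np.pi * k * x) / k for k in range(1, 8))
tone = wavetable_tone(table, f0=220.0)
```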
3.2 Physical modeling synthesis
PMS methods attempt to simulate the acoustic mechanisms underlying the sound production rather than directly modeling the sound signal itself (Uncini, 2022). The underlying physical behavior is defined by a coupled set of partial differential equations (PDEs), which need to be (numerically) solved to produce the output waveform. We classify the PMS methods by their approach to modeling the vibrating system.
3.2.1 Mass‑interaction synthesis
Mass‑interaction synthesis (MIS) models string vibration as a network of masses and springs interacting via Newton’s laws (Cadoz et al., 1993), enabling simulation of linear and nonlinear behaviors such as collisions and friction. Despite being computationally demanding, its modular design suits interactive applications (Villeneuve and Leonard, 2019), and it has been successfully applied to plucked and bowed string instruments (Leonard and Villeneuve, 2019).
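As a toy illustration of the mass–interaction idea, the following sketch integrates a chain of identical point masses coupled by linear springs with light viscous damping, using semi‑implicit Euler; the constants are our own illustrative choices, selected to keep the scheme stable at audio rate, and real MIS systems model far richer nonlinear interactions.

```python
import numpy as np

def mass_spring_string(n=50, k=8e5, m=1e-3, c=5e-3, sr=44_100, dur=1.0):
    """Chain of point masses coupled by springs (clamped ends),
    integrated with semi-implicit Euler; reads out one mass's velocity."""
    dt = 1.0 / sr
    y = np.zeros(n)                 # displacements
    v = np.zeros(n)                 # velocities
    y[n // 4] = 1e-3                # pluck-like initial displacement
    out = np.empty(int(dur * sr))
    for i in range(out.size):
        left = np.concatenate(([0.0], y[:-1]))    # clamped boundary
        right = np.concatenate((y[1:], [0.0]))
        force = k * (left - 2 * y + right) - c * v
        v += dt * force / m                       # Newton's second law
        y += dt * v
        out[i] = v[n // 2]
    return out
```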
3.2.2 Digital waveguide synthesis
Digital waveguide synthesis (DWS) simulates string vibration through two delay lines carrying traveling waves in opposite directions. They are connected with a scattering junction to an excitation mechanism, with digital filters at the terminations modeling boundary conditions and energy loss (Smith, 1992, 2002). It has been applied to various excitation mechanisms, including plucked (e.g., kantele (Erkut et al., 2002), harpsichord (Välimäki et al., 2004)), bowed (Sinclair, 2009), and hammered (e.g., clavichord (Välimäki and Erkut, 2003)).
The KSA (Karplus and Strong, 1983) is a simplified DWS instance that uses a single feedback delay line with a low‑pass filter to produce natural‑sounding plucked string tones. Its low computational cost has made it popular for guitar and guqin synthesis (Ding and Gerhard, 2004), and it has also been adapted for bowed string synthesis (Nichols, 2002).
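The KSA is compact enough to sketch in full. The following implementation uses the common circular‑buffer variant, in which the two‑point average acts as the loss low‑pass filter:

```python
import numpy as np

def karplus_strong(f0, dur=2.0, sr=44_100, seed=0):
    """KSA: a noise burst circulates in a delay line whose length sets the
    pitch; averaging adjacent samples makes higher partials decay faster,
    as on a plucked string."""
    n = int(sr / f0)
    line = np.random.default_rng(seed).uniform(-1, 1, n)   # the "pluck"
    out = np.empty(int(dur * sr))
    for t in range(out.size):
        out[t] = line[t % n]
        line[t % n] = 0.5 * (line[t % n] + line[(t + 1) % n])
    return out

pluck = karplus_strong(f0=196.0)   # roughly the guitar G3 string
```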
3.2.3 Modal synthesis
Modal synthesis (MS) represents vibrating systems as a sum of decoupled resonant modes, each characterized by a frequency, decay rate, and mode shape (Adrien, 1991). Bank et al. (2010) apply MS to both transverse and longitudinal string components in a real‑time piano model, while Woodhouse et al. (2021) use it for banjo tone synthesis. Selfridge et al. (2024) extend MS by incorporating semi‑empirical fluid dynamics to simulate vibrations in an aeolian harp.
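A minimal modal sketch renders a bank of exponentially decaying sinusoids; the slight partial stretching and per‑mode decay rates below are illustrative stand‑ins for values that MS derives from measured or computed mode data.

```python
import numpy as np

def modal_tone(freqs, decays, gains, dur=2.0, sr=44_100):
    """Bank of exponentially decaying sinusoids: one (frequency,
    decay rate, gain) triple per resonant mode."""
    t = np.arange(int(dur * sr)) / sr
    return sum(g * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
               for f, d, g in zip(freqs, decays, gains))

# Slightly stretched partials over 196 Hz; higher modes decay faster.
modes = [(196.0 * j * (1 + 3e-4 * j * j), 1.5 * j, 1.0 / j)
         for j in range(1, 12)]
tone = modal_tone(*zip(*modes))
```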
3.2.4 Direct numerical simulation
Without relying on simplifications or abstract models, direct numerical simulation (DNS) numerically approximates the solutions of the governing PDEs to model musical instruments (Hiller and Ruiz, 1971), usually using the finite difference method (FDM), which discretizes the continuous system on a spatial grid and recursively solves the PDEs (Bilbao, 2009). The advantage of DNS lies in its generality and applicability to systems with strong nonlinearities. However, a critical drawback is the need to carefully manage numerical instabilities.
The finite difference time domain (FDTD) method applies FDM to wave‑propagation problems in the time domain, accurately capturing wave reflections, stiffness, and energy loss. It has been applied to piano synthesis (Bensa et al., 2003) as well as to restoring ancient instruments such as bowed sitar, dulcimer, hurdy‑gurdy (Willemsen et al., 2019), and tromba marina (Willemsen et al., 2020).
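As an illustration of the FDTD update, the following sketch solves the ideal (lossless, stiffness‑free) 1‑D wave equation with clamped ends, choosing the grid so the Courant number stays at or below one, which is the stability condition the need to manage instabilities refers to. Real instrument models add loss, stiffness, and excitation terms.

```python
import numpy as np

def fdtd_string(f0=196.0, dur=1.0, sr=44_100):
    """Explicit FDTD for the ideal 1-D wave equation u_tt = c^2 u_xx with
    clamped ends; the grid satisfies the Courant condition c*dt/dx <= 1."""
    dt = 1.0 / sr
    c = 2.0 * f0                        # unit-length string: f0 = c / 2
    n = int(1.0 / (c * dt))             # coarsest stable grid
    lam2 = (c * dt * n) ** 2            # squared Courant number (dx = 1/n)
    x = np.linspace(0.0, 1.0, n + 1)
    u = 1e-3 * np.sin(np.pi * x)        # smooth initial deflection
    u_prev = u.copy()                   # zero initial velocity
    out = np.empty(int(dur * sr))
    for t in range(out.size):
        u_next = np.zeros_like(u)       # ends remain clamped at zero
        u_next[1:-1] = (2 * u[1:-1] - u_prev[1:-1]
                        + lam2 * (u[2:] - 2 * u[1:-1] + u[:-2]))
        u_prev, u = u, u_next
        out[t] = u[n // 3]              # virtual pickup position
    return out
```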
3.3 Neural audio synthesis
NAS uses DL to create and manipulate audio by training neural networks on audio datasets (Engel et al., 2017; Kalchbrenner et al., 2018). Unlike parametric methods, DL enables models to learn complex audio patterns, generate realistic sounds, and explore novel timbres, though it often demands extensive computational resources and large amounts of data (Natsiou and O’Leary, 2021).
We classify the following NAS methods by their generation paradigm and also include differentiable DSP (DDSP) as a complementary category, which provides physics‑guided structure and can be combined with the aforementioned traditional parametric methods within a hybrid system.
3.3.1 Autoregressive models
Autoregressive (AR) models predict each audio sample or frame from preceding context, requiring sequential inference. At the sample level, WaveNet (van den Oord et al., 2016) generates music audio via dilated causal convolutions, at the cost of slow generation. At the frame level, long short‑term memory controllers drive parametric decoders for bowed‑string synthesis (Yang et al., 2020), with subsequent refinements enhancing perceptual realism (Dai et al., 2021). By predicting whole waveforms, SING (Défossez et al., 2018) substantially accelerates inference while retaining musical expressiveness and has been applied for guitar synthesis.
3.3.2 Variational autoencoders
Unlike AR models that generate audio step by step, variational autoencoders (VAE) encode audio into a continuous latent space regularized for smoothness, enabling controllable and efficient instrument synthesis. Bitton et al. (2020) present neural granular synthesis, learning a latent space of short audio patches for real‑time pitch and timbre control, demonstrated for violin, cello, and piano. Caillon and Esling (2021) introduce RAVE, a real‑time capable model supporting high‑quality reconstruction and timbre transfer.
3.3.3 Generative adversarial networks
Generative adversarial networks (GAN) learn to generate signals through a competition between a generator and a discriminator, producing realistic outputs without requiring labeled data. WaveGAN (Donahue et al., 2019) directly demonstrates raw‑waveform adversarial synthesis, while GANSynth (Engel et al., 2019) generates high‑quality violin and cello tones by modeling spectrogram magnitude and phase, offering faster synthesis and smooth timbre interpolation. GANStrument (Narita et al., 2023) enables one‑shot timbre transfer from a single example into a playable virtual instrument, extended by HyperGANStrument (Zhang and Akama, 2024) with improved pitch consistency. Beyond direct synthesis, MelGAN (Kumar et al., 2019) introduces a fast vocoder for converting spectrograms to waveforms, serving as a component in string synthesis pipelines, while MuseGAN (Dong et al., 2018) focuses on symbolic multi‑track generation that includes string tracks at the score level.
3.3.4 Diffusion models
Diffusion models (DM) generate audio by gradually denoising random noise, offering more stable training and flexible control compared to GAN. Kim et al. (2025) model continuous pitch‑bend information to capture expressive violin performance, while guitar‑specific models apply diffusion outpainting to synthesize coherent acoustic passages (Kim et al., 2024). Early applications also include spectrogram‑based MIDI‑to‑audio synthesis on instrument mixtures, including strings (Hawthorne et al., 2022) and text‑guided orchestral generation with string sections (Huang et al., 2023). Beyond generation, diffusion has been applied to piano performance‑style conditioning (Maman et al., 2024) and strings stem completion (Villa‑Renteria et al., 2025).
3.3.5 Differentiable digital signal processing
DDSP (Engel et al., 2020) integrates classical signal‑processing modules—such as additive harmonic synthesizers, filtered noise, and reverberation—into end‑to‑end trainable neural networks, yielding interpretable models with fine‑grained pitch and loudness control, as demonstrated on violin in the original work. The DDSP‑VST (Virtual Studio Technology) plugin serves as a representative interactive system for NAS.
Applications to string instrument synthesis include timbre transfer for real‑time instrument conversion (Carney et al., 2021; Ganis et al., 2021), while MIDI–DDSP (Wu et al., 2022) maps symbolic inputs to expressive audio through hierarchical control. Applications also include polyphonic guitar synthesis via string‑wise MIDI input (Jonason et al., 2023) and expressive violin modeling through continuous performance parameters (Hung et al., 2024; Jonason et al., 2020). DDSP principles have been extended to FM (Caspe et al., 2022; Ye et al., 2023), wavetable (Shan et al., 2022), and physics‑informed modal string models (Lee et al., 2024), as well as differentiable piano models capturing hammer–string interactions (Renault et al., 2022; Simionato et al., 2024).
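To indicate what the DDSP decoder core computes, the following non‑trainable numpy sketch drives an additive harmonic bank plus noise from frame‑wise f0 and amplitude controls. In DDSP proper, these controls are predicted by a neural network, every module is differentiable, and control upsampling is smoothed rather than the crude sample‑and‑hold used here.

```python
import numpy as np

def harmonic_plus_noise(f0_frames, amp_frames, n_harm=32, hop=64, sr=16_000):
    """Additive harmonic bank plus noise, driven by frame-wise controls."""
    f0 = np.repeat(f0_frames, hop)            # sample-and-hold upsampling
    amp = np.repeat(amp_frames, hop)
    phase = 2 * np.pi * np.cumsum(f0) / sr    # instantaneous phase
    harm = np.zeros_like(phase)
    for k in range(1, n_harm + 1):
        audible = k * f0 < sr / 2             # mute partials above Nyquist
        harm += audible * np.sin(k * phase) / k
    noise = 1e-2 * np.random.default_rng(0).uniform(-1, 1, harm.size)
    return amp * harm + noise

# A two-second swell with 5.5 Hz vibrato, loosely violin-like.
frames = 2 * 16_000 // 64
ft = np.linspace(0.0, 2.0, frames)
audio = harmonic_plus_noise(440 * (1 + 0.006 * np.sin(2 * np.pi * 5.5 * ft)),
                            np.minimum(ft, 0.5))
```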
3.4 Summary of synthesis methods
The synthesis methods differ not only in their modeling approaches but also in the excitation mechanisms they support, as shown in Figure 4. NAS and PMS each account for a substantial share of the reviewed methods (n = 44 and n = 39, respectively), while ADSS appears far less frequently (n = 12). The comparatively high NAS count partly reflects the composition of training datasets, which often pool instruments across categories or apply generic labels, such as ‘Strings’ in the NSynth dataset (Engel et al., 2017), rather than precise excitation mechanisms. While PMS methods—particularly DWS and MIS—dominate plucked instruments and DNS also appears in bowed instruments, NAS—led by DDSP and DM—is the dominant approach for bowed and hammered instruments.

Figure 4
The sound synthesis methods used for various string instruments and their categories, including plucked, bowed, hammered, and other. Generic labels such as ‘Strings’ are preserved from the original studies where specific instruments were not defined or the method was applied to a general string model.
Aligned with these patterns, the distribution of synthesized instruments is also imbalanced. Well‑known instruments such as violin and piano appear in multiple studies, while many traditional or culture‑specific instruments appear only once or twice. We provide full details by instrument, synthesis method, and interactivity tier in Table S1 in the supplement. In the following section, we examine the evaluation of synthesis methods across four dimensions: fidelity, responsiveness, controllability, and adaptability.
4 Evaluation
Given the heterogeneity of the reviewed methods, which range from explicit parametric to data‑driven categories, establishing a direct comparison is not straightforward. To address this, we selectively integrate established evaluation criteria (Castagné and Cadoz, 2003; Jaffe, 1995) for sound synthesis methods and map them onto a unified four‑dimensional evaluation framework, capturing fidelity, responsiveness, controllability, and adaptability. This reorganization streamlines the evaluation framework while aligning the technical assessment with the practical concerns that arise in different tiers of interactive systems.
We first outline the origin and definition of these dimensions and then review existing evaluations of synthesis methods within each dimension, with particular attention given to recurring patterns and characteristics.
4.1 Definition of evaluation dimensions
For consistency, we retain the abbreviations introduced by Castagné and Cadoz (2003): ‘J’ denotes criteria originally proposed by Jaffe (1995), while ‘PM’ denotes those related to physical modeling (PM) methods, with J1–J10 and PM1–PM10 each comprising 10 criteria. We assign each criterion to a single main dimension using a primary‑motivation rule, as shown in Figure 5. When a criterion could plausibly support more than one dimension, we place it where its primary goal or bottleneck lies in interactive use and treat effects on other dimensions as secondary, without double counting. The four dimensions are as follows.

Figure 5
Mapping between a four‑dimensional evaluation framework and established criteria by Castagné and Cadoz (2003) and Jaffe (1995).
Fidelity evaluates the resemblance of synthesized sounds to real instruments, including the ability to preserve source identity under variation (J5), to generate sounds comparable to real instruments (PM2), and to remain plausible under exploratory modeling (PM5).
Responsiveness addresses the computational and temporal requirements for synthesis, including algorithmic efficiency (J6 and PM1), control stream bandwidth and transport constraints (J7), and the minimum unavoidable latency of the technique (J9).
Controllability captures whether performers can shape pitch, dynamics, and timbre in a stable and musically meaningful way, considering intuitive mapping to musical attributes (J1), perceptible effect of parameter changes (J2), physical coherence of parameters where appropriate (J3), and well‑behaved response to controls (J4), complemented by modular construction and incremental modeling (PM6) and by a mental model that supports anticipation and exploration (PM7).
Adaptability refers to the ease of transfer across instrument categories and the transparency of internal structure, including the breadth of representable sound classes (J8), the availability of analysis or parameter‑derivation tools (J10 and PM9), the diversity of instrument categories and mechanisms (PM3), and the depth of structural modeling (PM8).
Following the primary‑motivation rule, we assign J7 to responsiveness, since its central aim is feasible online operation. We assign PM6 primarily to controllability, since it provides clear modules and composition rules for performance and incremental modeling. We treat PM7 as a key support for controllability. Moreover, we deliberately exclude PM4, which concerns generality beyond sound synthesis, including haptic and visual interactions. We also exclude PM10, which concerns the quality of the surrounding musician‑oriented environment, because these pertain to multisensory or ecosystem considerations rather than intrinsic audio synthesis performance along our four dimensions.
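For reference, this assignment can be stated compactly as a lookup table; the snippet below merely restates the mapping of Figure 5 in machine‑readable form and adds no new data.

```python
# Primary-motivation assignment of established criteria to our dimensions
# (J* from Jaffe, 1995; PM* from Castagné and Cadoz, 2003). PM4 and PM10
# are excluded as multisensory/ecosystem concerns.
CRITERIA_TO_DIMENSION = {
    "fidelity":        ("J5", "PM2", "PM5"),
    "responsiveness":  ("J6", "PM1", "J7", "J9"),
    "controllability": ("J1", "J2", "J3", "J4", "PM6", "PM7"),
    "adaptability":    ("J8", "J10", "PM9", "PM3", "PM8"),
}
EXCLUDED = ("PM4", "PM10")
```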
4.2 Fidelity
Fidelity is a key criterion in evaluating instrument sound synthesis. To evaluate fidelity, 41 studies report some form of subjective evaluation, 41 report objective evaluation, 25 employ both, and only 10 report neither. These figures highlight both the central role of fidelity and the substantial variation in the ways it is evaluated. This subsection reviews how such evidence is reported, without assessing the actual quality or outcome of the methods.
We identified both formal and informal subjective evaluation methods, as summarized in Figure 6. Formal methods appear in 22 studies. The most common evaluation involves rating tasks, such as Likert scale questionnaires (5 studies, e.g., Dong et al., 2018) and Mean Opinion Scores (MOS) (5 studies, e.g., Jonason et al., 2023). A more rigorous variation is the Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) (4 studies, e.g., Kim et al., 2025). Additionally, discrimination tasks such as ABX listening tests (5 studies, e.g., Défossez et al., 2018) are used to evaluate fidelity directly. Despite these formal evaluation methods, informal methods remain nearly as prevalent (19 studies), most frequently in the form of informal listening conducted by the authors themselves (e.g., Florens, 2003). Crucially for musical instrument synthesis, where realism of synthesized sound is a primary criterion, only three studies (e.g., Liu, 2024) explicitly involve professional musicians, who provided expert judgments on realism, revealing a significant gap in expert‑based validation.

Figure 6
Distribution of formal and informal subjective evaluation methods across reviewed studies (Quest. = Questionnaires; Music. = Musicians).
For objective evaluation, we identified over 30 distinct metrics reported in the 41 studies, reflecting a wide variety of methods to evaluate fidelity. Spectral analysis—encompassing spectral losses, mel‑spectrogram distances, and related frequency‑domain measures—collectively appears in 16 studies (e.g., Carrillo and Bonada, 2010; Erkut et al., 2001), making it the most frequently reported class of objective metrics. Among standardized single metrics, the most common is Fréchet Audio Distance (FAD) (11 studies, e.g., Baoueb et al., 2024; Hayes et al., 2021). This usage coincides with the rise of NAS, offering a way to measure the distance between feature embeddings to assess similarity without requiring waveform alignment. Its predecessor, Fréchet Inception Distance (FID) (2 studies, e.g., Zhang and Akama, 2024), evaluates spectrogram images rather than learned audio embeddings. The shift from FID to FAD signals a methodological adaptation to the specific nature of audio data. However, error‑based measures—including mean squared error (MSE) (5 studies, e.g., Yang et al., 2020) and root MSE (RMSE) (3 studies, e.g., Bitton et al., 2020)—collectively remain a widespread default (see Figure S1 in the supplement).
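To make the embedding‑based comparison concrete, the following sketch computes FAD from two embedding matrices (one row per audio clip), using the standard Gaussian‑Fréchet formula; obtaining the embeddings from a pretrained audio model is assumed and not shown.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """FAD between two embedding sets, each modeled as a Gaussian:
    ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    s_a = np.cov(emb_a, rowvar=False)
    s_b = np.cov(emb_b, rowvar=False)
    sqrt_ab = linalg.sqrtm(s_a @ s_b).real   # principal matrix square root
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(s_a + s_b - 2 * sqrt_ab))
```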
Despite these tendencies, cross‑paper comparisons remain tentative because evaluation methods are inconsistent. Both listening‑based evaluations and computational metrics are often used, but their designs and implementations vary substantially across studies. In many cases, subjective and objective assessments are not applied together, making results difficult to compare. These fragmented practices highlight fidelity as a widely recognized criterion yet also reveal a pressing gap in methodological standardization.
In addition to differences between evaluation methods, the observed fidelity trends are also influenced by the type of instrument being simulated. As discussed in Section 3, string instruments vary substantially in the complexity of their excitation and sound‑production mechanisms. As a result, fidelity evaluations reported in the literature reflect not only methodological choices but also the intrinsic difficulty of the target instrument. This provides additional context for the frequent association of PMS with high‑fidelity plucked string synthesis and for the increasing use of NAS methods in the simulation of bowed and hammered instruments, where capturing continuous excitation and subtle timbral variation poses greater challenges.
4.3 Responsiveness
Responsiveness denotes the ability of a synthesis method to sustain interactive play. Specifically, it requires producing artifact‑free audio under low delay while supporting a target polyphony, defined as the number of simultaneous notes the system must render. Delay arises both from audio input and output (I/O) buffering and from control‑to‑audio processing within the method. For this reason, any ‘real‑time’ claim is only interpretable when quantified in terms of latency (ms), buffer size and sample rate, host/driver, and hardware. Moreover, in the NAS literature, real‑time performance is frequently quantified via the Real‑Time Factor (RTF)—the ratio of computation time to audio duration. While an RTF below 1.0 confirms that generation speed exceeds playback speed, it does not guarantee low latency. A system can achieve excellent RTF values yet require large input buffers or block‑based processing, rendering it unsuitable for interactive performance. Without these specifications, reported metrics represent isolated data points that cannot be meaningfully compared across studies.
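The distinction can be made concrete with a minimal sketch that reports both quantities; the one‑buffer‑in, one‑buffer‑out latency model and the `synthesize` callable are simplifying assumptions of ours.

```python
import time

def buffer_latency_ms(buffer_frames: int, sample_rate: int) -> float:
    """I/O latency under a one-buffer-in, one-buffer-out model,
    ignoring driver and processing overheads."""
    return 2.0 * buffer_frames / sample_rate * 1000.0

def real_time_factor(synthesize, seconds: float) -> float:
    """RTF = computation time / audio duration (< 1.0: faster than real time)."""
    start = time.perf_counter()
    synthesize(seconds)
    return (time.perf_counter() - start) / seconds

# A block-based model can reach RTF << 1 yet still buffer ~1 s of input,
# so RTF, latency, and buffer size must be reported together.
print(buffer_latency_ms(256, 48_000))   # ~10.7 ms, near the 10 ms ideal
```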
The literature provides little quantitative evidence of system responsiveness. Of the reviewed studies, only 20 report any timing‑related metrics, and none report both latency and buffer size. This fragmented reporting—limited to latency in six studies and CPU load in eight—prevents meaningful cross‑paper performance comparisons. When the application context is specified, it is typically limited to desktop or plugin environments, with mobile, web, or XR targets rarely mentioned. Given this lack of standardization, existing reports fall into three categories. The most direct reports give concrete hardware‑level measurements, including CPU load (Bank et al., 2010; Willemsen et al., 2019), polyphony counts (Florens, 2003; Passalenti et al., 2019), or measured control‑to‑audio latency (Bitton et al., 2020). A second group evaluates algorithmic efficiency, giving operations per sample (Bank et al., 2010) or comparing computational complexity to baselines (Shan et al., 2022). The third group infers suitability from the implementation environment, treating synthesis platforms such as Max/MSP, Pure Data, C++/JUCE, or dedicated DSP hardware as evidence of interactive viability (Selfridge et al., 2024).
Despite the heterogeneity in quantitative reporting, clear trends emerge when categorizing systems by their implementation status. As shown in Table 1, PMS dominates the fully interactive landscape (Tier 1), accounting for the majority of real‑time performance systems (18 out of 31 PMS studies). This shows that physical models remain the main choice for low‑latency, playable virtual instruments in interactive systems. While NAS is strongly growing (32 studies), it heavily concentrates on Tier 3 (27 studies), highlighting a gap: despite high fidelity, most NAS methods currently operate as offline generators or high‑latency systems. Notably, among NAS methods, DDSP is one of the few methods in Tier 1. Tier 2 represents a smaller collection of algorithms that have achieved computational efficiency for real time but await interactive integration. The fact that more than half of the reviewed studies fall into Tier 3 highlights a substantial gap between advances in string instrument synthesis and their translation into interactive musical systems. While sound quality and modeling accuracy have advanced, few works engage with the constraints imposed by real‑time interaction.
4.4 Controllability
In contrast to fidelity, which primarily concerns the resemblance between synthesized and target sounds, controllability addresses the way in which users shape synthesis output. We examine controllability in terms of the parameters of different synthesis methods, emphasizing their type, number, and interpretability.
To structure this comparison, we group controllable parameters into four distinct categories. Signal‑level parameters capture directly perceptible musical dimensions, such as frequency (pitch), amplitude, filter coefficients, or envelope shapes, which are intuitive and widely adopted. Physical‑level parameters correspond to acoustically grounded variables—bow position or force, string tension, damping, and hammer hardness—thus providing interfaces close to instrumental performance. Latent‑level parameters describe model‑internal or structural variables without direct acoustic correlates, such as latent coordinates or embedding dimensions, enabling powerful but less interpretable transformations. The remaining ‘other’ parameters resist consistent assignment and are often tied to task‑specific conditioning or implementation details.
Synthesis methods reveal clear differences with respect to this categorization (see Figure S2 in the supplement). ADSS methods provide signal‑level controls such as pitch, loudness, or filter settings. Their number is limited, both because these controls are designed for intuitive manipulation and because ADSS methods are comparatively rare in our corpus. PMS methods primarily exhibit physical‑level variables, allowing for control of sound production in ways that support expressive, performance‑like control, though often at the cost of requiring domain expertise. NAS methods display the most heterogeneous profile. Signal‑level and other parameters are most prevalent, while latent‑level parameters appear exclusively in NAS, reflecting the less‑standardized nature of NAS.
Controllability thus depends both on the number of exposed parameters and on their cognitive accessibility. This accessibility is determined by the mapping strategy, where the ideal design maximizes expressivity while minimizing cognitive demand (Levitin et al., 2002). Previous research suggests a tri‑partite division in how current methods approach this goal: signal‑level parameters prioritize accessibility, physical‑level parameters focus on mechanical authenticity, and latent‑level parameters enable creative exploration. The choice of method thus dictates not just the sound quality but also the interaction style available to the users.
4.5 Adaptability
Adaptability captures whether a synthesis method can remain effective when moving from a specific instrument to related ones. ADSS methods manipulate signal‑level parameters to synthesize specific instrument timbres. Since such controls lack correspondence to physical structure, migrating to a new instrument (e.g., from violin to cello) necessitates reconfiguring the synthesis architecture rather than simple scaling (Liu, 2024). PMS methods rely on a shared physical structure that theoretically supports geometric scaling. However, adaptation is often constrained by the complexity of parameter mapping. PMS employs explicit parameters to precisely model the physical interaction between excitation and the resonator. While this tight coupling guarantees high fidelity for the specific target, it requires recalibration when migrating to close relatives (e.g., violin to viola or cello) as tension, damping, friction, and body responses change. Without retuning parameter mapping, systems may satisfy the pitch range of the target instrument yet drift timbrally or break at transitions (Florens, 2003).
NAS methods achieve adaptability through data‑driven generalization. Training on multi‑instrument datasets and conditioning on instrument labels or learned timbre tokens enables these models to learn a shared latent space across instrument categories (Maman et al., 2024, 2025; Villa‑Renteria et al., 2025). However, reliance on shared weights introduces inherent risks—most notably, identity bleed, in which distinct instruments become perceptually similar or converge toward an averaged timbre when the latent representation is insufficiently disentangled.
In summary, methods deal with adaptability in distinct ways, manifesting as architectural rigidity in ADSS, calibration complexity in PMS, and latent entanglement in NAS.
5 Discussion
We now discuss the findings of our review by highlighting recurring tradeoffs, methodological tendencies, and structural limitations observed across synthesis methods and application contexts.
5.1 Cross‑dimension tradeoffs
Our four‑dimensional evaluation framework forms a coupled design space where advances in one dimension may introduce costs on another.
A central tension exists between fidelity and responsiveness. Higher fidelity typically arises from richer excitation and radiation models in PMS or from large models with long analysis frames in NAS. These configurations increase mapping complexity and runtime load, thereby compromising responsiveness. ADSS methods usually achieve high responsiveness by using simple components but often fall short in capturing transient behaviors of acoustic instruments.
Second, there is a tradeoff between controllability and adaptability. Methods relying on explicit parameters (both PMS and ADSS) achieve high controllability by mirroring the specific structural constraints of a single instrument. However, this limits their adaptability: since the synthesis architecture is tightly coupled to a specific synthesis mechanism, migrating to even a closely related instrument requires significant reengineering—whether recalibrating physical coefficients in PMS or redesigning signal characteristics in ADSS. NAS methods stand out by addressing adaptability through learned shared representations. While this allows a single model to span an entire category, it often comes at the cost of control transparency.
All these configurations are constrained by the available computing budget, as improvements along any dimension increase computational complexity, limiting real‑time interaction. From this perspective, advances in computational efficiency are not merely engineering optimizations (Caillon and Esling, 2021; Richardson et al., 2023) but are essential for rebalancing these tradeoffs and identifying feasible strategies for real‑time synthesis.
5.2 Method–instrument alignment
This section examines how different synthesis methods align with particular string instruments in sound synthesis tasks. The four introduced dimensions help determine the suitability of a method for a given instrument category, even though no synthesis method is inherently limited to a single category of instruments.
In practice, different excitation mechanisms shape the spectral and temporal characteristics of string instruments. These characteristics, in turn, interact meaningfully with synthesis strategies. From the 67 reviewed studies, we found co‑occurrence patterns between synthesis methods and instrument categories.
Plucked string instruments, such as guitars and harpsichords, produce short, impulsive excitations followed by a relatively stable modal decay. The resulting sound structure is often spectrally sparse and can be effectively modeled with linear resonant systems requiring relatively few parameters. PMS methods—particularly DWS and KSA—are well suited to this structure. These methods expose interpretable controls such as string tension, damping, and body resonance and support efficient, real‑time synthesis with stable polyphony. Furthermore, parameter mappings in such models can provide a useful starting point for closely related instruments through pitch‑range scaling, though full timbral adaptation typically requires targeted recalibration.
In contrast, bowed string instruments involve a continuous, nonlinear excitation resulting from the stick–slip interaction between bow and string. This process produces dense harmonic content and highly sensitive variations based on expressive gestures—such as bow speed, pressure, and position. While physical models are capable of reproducing these effects, they often require precise calibration and become difficult to control when extended to different instruments or playing techniques. NAS methods have shown promise in this context, particularly when trained on multi‑instrument datasets with structured conditioning (e.g., instrument labels or gesture encodings). In our review, although most NAS studies are limited to offline use (Section 4), bowed string synthesis is handled best by NAS methods that maintain interpretability of control signals and support real‑time rendering (Maman et al., 2025; Villa‑Renteria et al., 2025). These methods offer a scalable and adaptable solution for modeling expressive, gesture‑driven instruments.

Hammered string instruments exhibit forceful, broadband excitations followed by complex, long‑lasting resonances. These characteristics demand accurate modeling of both transient detail and extended decay behavior. Hybrid synthesis strategies are particularly effective in this context: physical models typically represent the resonant structure, while additive or subtractive components capture percussive attacks and transient colorization.
ADSS methods (including AS and SuS) also appear across all defined string instrument categories, typically not as standalone solutions but as complementary components. They are commonly used to shape specific spectral regions—such as high‑frequency partials or resonator coloration—that are difficult to capture with physical or data‑driven models. In many systems, these layers are essential to achieving perceptual realism, particularly when enhancing or correcting the output of other synthesis engines.
Importantly, the synthesis–instrument matchings described here are not restrictive. PMS can be extended to bowed or hammered instruments, and NAS can be effective for plucked strings—especially when control fidelity and real‑time requirements are properly addressed. Ultimately, the choice of synthesis method depends on how well its structure aligns with the instrument’s excitation characteristics and the broader constraints of the system, including latency, control affordances, and computational cost. The method–instrument matrix (as shown in Figure 4) derived from our literature review reinforces these patterns, showing consistent tendencies that can inform synthesis design choices under practical constraints. The following section addresses how these method choices interact with interactive system design factors.
5.3 Runtime constraints
The feasibility of interactive string synthesis depends not only on algorithmic design but also on runtime conditions. Across the reviewed literature, several deployment contexts stand out. In desktop and plug‑in environments, PMS methods (e.g., DWS, MS) are widely reported to achieve immediate responsiveness and scale effectively to polyphonic use. NAS methods are viable in this setting when inference is streamed in short chunks on accelerated backends, and AS and SuS methods remain feasible if partition sizes are carefully tuned.

In contrast, XR, mobile, and embedded platforms face strict power and scheduling limits. Reports of stable playability in these contexts are scarce, and, when present, they rely on compressed or quantized NAS models, short analysis frames, or partitioned convolution. However, the scarcity of explicit timing information and platform documentation makes reproducibility difficult.
In sum, responsiveness is not a property of algorithms alone but of their interaction with deployment settings and hardware constraints. This observation also reflects a broader pattern: across the four dimensions, tradeoffs and inconsistencies emerge that limit comparability of studies. This points to fundamental challenges in how interactive string synthesis is evaluated and reported, which we summarize in the next section.
5.4 Challenges
Building on our analyses, we see four recurrent obstacles that jointly limit comparability, replication, and long‑term progress.
The first challenge concerns fidelity. Perceptual quality is often optimized using spectral or waveform‑based criteria, yet improvements in such metrics do not consistently translate into musically convincing results. Moreover, increasingly complex models can compromise real‑time playability, and user‑centered perceptual evaluations are reported inconsistently, making it difficult to assess when technical gains yield audible benefits for musicians (Worrall et al., 2024). The second challenge is the insufficient and inconsistent reporting of responsiveness (Section 4). This lack of transparency hinders meaningful comparison across implementations and obscures the tradeoffs involved in deploying synthesis methods on different platforms, including desktop, mobile, and XR environments. The third challenge relates to controllability. Control interfaces and parameter mappings vary widely across systems and are rarely released or documented in sufficient detail. In NAS, in particular, latent control variables are often hard to interpret, which limits transparent authoring, reuse, and systematic comparison between methods (Engel et al., 2017, 2020). Regarding adaptability, the literature exhibits a bias toward a small set of instruments, most notably violin and piano, with limited evaluation of cross‑instrument generalization. As a result, claims of robustness at the instrument‑category level may conflate true adaptability with instrument‑specific optimization.
5.5 Recommendations and outlook
We now propose guidelines across the four‑dimensional framework, addressing evaluation and reporting practices for fidelity and responsiveness, and design priorities for controllability and adaptability.
For fidelity, subjective evaluations should adopt standardized protocols with hidden references and anchors, and informal evaluations should move beyond author‑centric listening by explicitly inviting professional musicians to assess the quality of synthesized sound. For objective evaluations, normalizing scores against a fixed reference dataset would allow direct cross‑study comparison. For responsiveness, we propose the lightweight reporting card in Table 2 for standardized documentation of system performance.
Table 2
Proposed reporting card template for string instrument synthesis systems.
| Field | Description / Example |
|---|---|
| Latency (ms) | 12.5 ms |
| Buffer/frame size | 256 samples @ 48 kHz |
| Host/driver and hardware | macOS 15.7.3, CoreAudio, Apple M4 Pro |
| Stability | 0 dropouts in 30‑min session |
| Sustainable polyphony | 8 voices at 48 kHz |
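Where systems ship as software, the same card could accompany releases in machine‑readable form; the field names below are our suggestion, mirroring the rows of Table 2.

```python
from dataclasses import dataclass

@dataclass
class ReportingCard:
    """Machine-readable counterpart of the Table 2 template."""
    latency_ms: float          # e.g., 12.5
    buffer_frames: int         # e.g., 256
    sample_rate_hz: int        # e.g., 48_000
    host_driver: str           # e.g., "macOS 15.7.3, CoreAudio"
    hardware: str              # e.g., "Apple M4 Pro"
    dropouts: int              # e.g., 0 over the test session
    session_minutes: int       # e.g., 30
    polyphony_voices: int      # e.g., 8
```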
For controllability and adaptability, parametric methods would benefit from abstraction layers or adaptive interfaces that translate parameters into musically meaningful controls, alongside automated or learning‑based parameter mapping to reduce recalibration when scaling across instrument families. Data‑driven methods require more semantically interpretable and musically grounded mappings for latent parameters, as well as more disentangled and instrument‑aware representations to support cross‑instrument transfer.
5.6 Limitations
This scoping review has several limitations. First, although the search strategy covered major databases and multiple keywords, it may have missed relevant studies, particularly unpublished reports, non‑English works, or research outside the scope of interactive systems. Second, unlike a systematic review, this study did not apply formal quality assessment or quantitative synthesis; our aim was to map the landscape rather than to rank methods. Third, the four‑dimensional evaluation framework reflects choices informed by the literature, but alternative frameworks may capture other aspects of interactive synthesis. Finally, the field of NAS is rapidly evolving, and findings—especially concerning real‑time deployment on mobile and XR platforms—are time‑sensitive. These limitations do not undermine the value of the review but highlight the need for updates and complementary studies.
6 Conclusion
In conclusion, our review revealed several overarching trends: PMS excels in reproducing the mechanics of plucked instruments, while NAS offers flexibility for bowed and hammered sounds. Yet, gaps in latency reporting and cross‑instrument applicability limit broader adoption. To advance the field, combining parametric and data‑driven methods, expanding instrument coverage beyond canonical examples, and establishing shared evaluation protocols are essential. By addressing these challenges, future research can foster more reliable and expressive interactive systems that better capture the richness of acoustic string sounds.
Acknowledgment
The authors thank the Zentrum für Philologie und Digitalität for institutional support.
Data Accessibility
The supplementary files accompanying this article include the following materials. Table S1 provides structured evaluation data for each study across the four‑dimensional evaluation framework. Figure S1 shows the distribution of five objective evaluation methods across reviewed studies. Figure S2 shows the distribution of control parameter categories across synthesis method categories.
Funding Information
This work is supported by the China Scholarship Council.
Competing Interests
The authors have no competing interests to declare.
Authors’ Contributions
Literature search and data extraction: YZ. Draft paper and corrections: YZ, SvM, and CW. All authors approved the final version for submission.
Notes
[1] All screening and eligibility assessments were conducted by the first author. To mitigate potential bias due to the absence of multiple independent reviewers, the predefined inclusion and exclusion criteria in Section 2.3 were strictly applied, and borderline cases were resolved by repeated full‑text inspection.
Additional File
The additional file for this article can be found as follows:
