PLD2flex: Establishing the Phonological Levenshtein Distance for Pairs or Groups of (Pseudo)Words

Helena Wedig; Felix Theodor; Joshua Wieler; Eva Belke

doi:10.5334/jors.510

(1) Overview

Introduction

PLD2flex is a software tool for establishing the phonological Levenshtein distances between sets of words. The types of comparisons users can make are highly flexible: overall, there are seven different comparison scenarios, ranging from one-to-one to a variety of one-to-many and many-to-many comparisons of orthographic word forms, including orthographic transcripts of pseudowords and spoken word forms. The tool was developed for psycholinguistic research purposes but can, of course, be used in other contexts, too. In this introduction, we will provide some examples of the role of measures of phonological similarity in psycholinguistic research and present an overview of tools that are related to PLD2flex in functionality but differ from it in important ways.

Measures of phonological similarity in psycholinguistic research

There is ample evidence from psycholinguistic research that word processing is influenced by a wide variety of stimulus properties. These include a word’s surface level characteristics, such as orthography or pronunciation, but also properties that are less readily available to the naked eye or ear, such as a word’s frequency of occurrence or its similarity to other words in a language’s dictionary. There is, for example, evidence that the perception of a word is influenced by phonological neighbourhood density, which is traditionally defined as the number of words that differ from a given word by addition, deletion or substitution of one phoneme [15]. Depending on factors such as the task at hand or the language it is performed with, an increase in neighbourhood density can render perception harder [1] or easier [10].

Technically speaking, phonological neighbours of words or pseudowords have a phonological Levenshtein distance (PLD) of 1 to the target stimulus. Of course, this similarity metric can be extended beyond a single operation to multiple operations. One such measure is PLD20, which is defined as the average number of phoneme additions, deletions, or substitutions that are necessary to create the 20 most similar words to any given word [11]. Compared to PLD1, PLD20 is a much more flexible measure. For instance, it allows for a characterization of the phonological neighbourhood of words that do not have neighbours with a PLD of 1. Such lexical hermits are typically polysyllabic words and, depending on the corpus, can make up a considerable proportion of a lexicon [13]. By definition, words with a low PLD20 need fewer transformations to reach their 20 most similar neighbours and are hence located in a denser neighbourhood than words with a higher PLD20. Measures such as PLD20 can be used to characterize the influence of words in the mental lexicon on processing a target word. For instance, recognizing otherwise comparable words is harder when they are located in dense as opposed to less dense neighbourhoods [11].
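
The definitions of PLD1 and PLD20 above can be illustrated with a minimal Python sketch. It is not taken from the PLD2flex code base: the function names, the toy lexicon and the choice to represent each form as a tuple of phoneme symbols are assumptions made for illustration, as is the decision to exclude forms identical to the target from the neighbourhood.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of additions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            cost = 0 if symbol_a == symbol_b else 1
            current.append(min(
                previous[j] + 1,         # delete symbol_a
                current[j - 1] + 1,      # add symbol_b
                previous[j - 1] + cost,  # substitute (or keep if identical)
            ))
        previous = current
    return previous[-1]


def mean_nearest_distance(target, lexicon, n=20):
    """Mean distance from target to its n closest lexicon entries (PLD20 for n=20).

    Assumption: entries identical to the target are left out.
    """
    distances = sorted(levenshtein(target, word) for word in lexicon if word != target)
    nearest = distances[:n]
    if not nearest:
        raise ValueError("lexicon contains no entries other than the target")
    return sum(nearest) / len(nearest)


# Toy lexicon: each entry is a tuple of phoneme symbols (illustrative only).
LEXICON = [
    ("h", "{", "t"), ("k", "{", "p"), ("k", "V", "t"),
    ("{", "t"), ("k", "{", "t", "s"), ("d", "Q", "g"),
]
TARGET = ("k", "{", "t")

# Phonological neighbours in the traditional sense: entries at distance 1.
neighbours = [word for word in LEXICON if levenshtein(TARGET, word) == 1]

# Mean distance to the 3 nearest entries and to all entries of the toy lexicon.
pld3 = mean_nearest_distance(TARGET, LEXICON, n=3)
pld_all = mean_nearest_distance(TARGET, LEXICON, n=len(LEXICON))
```

Operating on sequences of phoneme symbols rather than on raw strings keeps multi-character transcription symbols (such as diphthongs) from being counted as several edits; whether a given transcription needs this treatment depends on the symbol set in use.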

The characterization of neighbourhood density exemplifies how compiling material for experiments in psycholinguistics has evolved from technically restricted non-automatic comparisons of words of a language to larger-scale, automatized comparisons of word forms (see [2] and [17] for the case of orthographic neighbours and orthographic Levenshtein distances, respectively). In principle, PLD measures are highly flexible and can be applied to any pair of phonological sequences, even pseudoword sequences. In addition, they can be applied in establishing similarities between pairs or groups of sequences of individual choice. However, as it stands, most tools for establishing metrics of phonological similarity do not make use of this flexibility but are restricted to a handful of measures, chiefly PLD1 and PLD20 (see Table 1).

Table 1

Overview of key features of PLD2flex and their access in comparable tools.

	FEATURES
	LANGUAGE(S) SUPPORTED	FINDING PHON. NEIGHBOURS	CALCULATION OF MEAN PHON. DISTANCE	FLEXIBLE EDIT DISTANCE	PROCESSING OF MULTIPLE INPUT WORDS	PSEUDOWORD SUPPORT
PLD2flex	30+ (including some varieties)	yes	yes	yes	yes	yes
CLEARPOND [7]	English, German, Dutch, Spanish, French	yes	no	no	yes	no
Phonological CorpusTools (PCT) [4]	Unrestricted	yes	no	yes	no	yes
Irvine Phonotactic Online Dictionary (IPhOD) [12]	English	yes	no	no	yes	yes
Phonotactic Probability Calculator (PPC) [14]	Tool no longer available as of 07/2024.

PLD2flex is designed to provide a flexible framework for researchers to establish the phonological similarity of phonological forms to each other or to reference samples of words or pseudowords. Phonological forms can be compared pairwise, i. e., one-to-one but also one-to-many and many-to-many. One particular instance of a one-to-many comparison is to establish the phonological similarity of a phonological form to all entries in a database, which can serve as a basis for PLD20 or the PLD of the nearest 5, 10 or 100 words in the corpus. This flexibility sets PLD2flex apart from other tools that are more restricted in the measures they provide, as we will review in the next section.

Tools for establishing measures of phonological similarity

Psycholinguistic research is strongly dependent on characterizing linguistic material, be it for compiling material for experiments on the production and perception of language or for characterizing linguistic material in more naturalistic settings. While numerous tools and databases for such purposes have long been available in reading research, i. e., based on written language, corresponding tools and databases for spoken language have been developed only recently. The online database CLEARPOND [7], for example, provides information on orthographic as well as phonological features of words in five languages, namely English, German, Dutch, Spanish and French. Among other functionalities, CLEARPOND computes orthographic and phonological neighbourhood densities for words both within and across languages. In addition, it provides information such as word length, frequency of occurrence or the mean frequency of the neighbours in a word’s neighbourhood. When it comes to establishing the distance between phonological word forms, CLEARPOND’s neighbourhood metric is rather inflexible in that it is restricted to a maximum of one operation (substitution, addition, or deletion) and does not consider more distant neighbours differing from a target word by more than one grapheme or phoneme.

A more flexible computation of phonological distances is offered by Phonological CorpusTools (PCT, [4]), which enables researchers to browse predefined or custom corpora for individual words with predefined characteristics. Filtering criteria include phonotactic probability or neighbourhood density. Unlike CLEARPOND, PCT’s output takes into account more distant neighbours by computing Levenshtein distances between an input word and any other word of an underlying corpus. However, for users of the GUI, this procedure can only be carried out one word at a time, requiring individual workarounds for many-to-many comparisons or other non-standard applications. So while CLEARPOND and PCT are powerful tools in their own right that allow researchers to establish a wide range of properties of words, they are not geared towards establishing measures of phonological distances for flexible sets of stimuli.

Other resources allowing researchers to characterize similarities between words suffer from the same drawbacks when it comes to establishing phonological distances between flexible sets of phonological forms. These include the Irvine Phonotactic Online Dictionary (short: IPhOD; [12]) or the Phonotactic Probability Calculator (PPC; [14]). Given this gap, we developed PLD2flex. PLD2flex is restricted to establishing phonological similarity measures alone, but within this realm, it offers maximal flexibility. This renders it an ideal companion tool for working with tools with a wider scope, such as CLEARPOND or PCT, hence filling a rather specific but quite important gap.

Implementation and architecture

User Perspective

PLD2flex allows users to calculate the phonological Levenshtein distance between phonological forms by means of a custom-made user interface. It offers comparisons of a single target word with a file or a corpus-based database of words, of targets within one file, and of a list of targets with a database. In total, it provides seven different scenarios; details about how to handle them can be found in the User Manual.

Figure 1 presents a screenshot of PLD2flex’s user interface. To use the tool, users can either fill in the ‘target’ and ‘input’ text fields shown in the interface or choose a text file that includes the list of words to be compared with the target or within itself. The text file (.txt) must be UTF-8 encoded and be in either a one-column or a two-column format, containing either one word per line (one-column) or two words separated by a tab stop (two-column). It is, moreover, also possible to compare one target word with an individual corpus file. To do so, the UTF-8 encoded corpus needs to have one word and its SAMPA transcription per line, separated by a semicolon. This format is, amongst others, provided by G2P¹ ([8, 9]).
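
The input formats described above can be sketched as two small parsers. This is an illustrative reading of the format description, not the parsing code used in PLD2flex; the function names and the error handling are assumptions, and the sample words and transcriptions are made up for the example.

```python
def parse_word_list(text):
    """Parse a one- or two-column word list into a list of tuples.

    One-column files yield 1-tuples, two-column (tab-separated) files 2-tuples.
    """
    rows = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        columns = [column.strip() for column in line.split("\t")]
        if len(columns) > 2 or not all(columns):
            raise ValueError(
                f"line {number}: expected one word or two tab-separated words"
            )
        rows.append(tuple(columns))
    return rows


def parse_corpus(text):
    """Parse 'word;transcription' lines into (word, transcription) pairs.

    A list is returned rather than a dict so that homographs are not lost.
    """
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        word, separator, transcription = line.partition(";")
        if not separator or not word.strip() or not transcription.strip():
            raise ValueError(f"line {number}: expected 'word;transcription'")
        entries.append((word.strip(), transcription.strip()))
    return entries


one_column = parse_word_list("Haus\nMaus\n")
two_column = parse_word_list("Haus\tMaus\nLaus\tLaut\n")
corpus = parse_corpus("Haus;h aU s\nMaus;m aU s\n")
```

In practice, the text would be read from disk with an explicit encoding, for example `pathlib.Path(path).read_text(encoding="utf-8")`, before being passed to either function.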

Figure 1. User interface of the PLD2flex application.

While using the application, users are provided with a protocol. It gives a short summary of the finished processes and shows error messages. The application saves these logs and the calculated distances automatically in separate csv-files (see Figure 2).

Figure 2. Result file showing the calculated distances.

Developer Perspective

PLD2flex was programmed in Python. It relies on several third-party Python libraries as well as the BAS Web Services API [6]. The frontend framework used is PyQt (version 5), a Python library for creating GUIs on multiple operating systems. To speed up the calculation of the Levenshtein distance, we do not rely on a custom implementation but on the Damerau-Levenshtein method provided by the textdistance library. We use numpy (the core library for Python data analysis) to compute means and standard deviations of sets of Levenshtein distances, where applicable. Furthermore, we use the requests package to talk to the BAS Web Services API. The BAS Web Services are hosted by Ludwig-Maximilians-Universität München and provide a broad range of tools in the area of speech technologies [6]. We utilize the G2P API to translate the written input the tool receives to a phonological transcription. Finally, we use the playsound Python library to play a bell sound when calculations are finished. This is a fully redundant signal in that the completion of calculations is also shown in the protocol and the results sections of the GUI, but we implemented it so that users would be able to have PLD2flex run in the background and be alerted when it has finished processing.
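
A Damerau-Levenshtein distance differs from the plain Levenshtein distance in that swapping two adjacent symbols counts as a single operation rather than two. The following sketch illustrates this with a plain-Python stand-in; it does not call textdistance, it implements the restricted (optimal string alignment) variant, and which variant and settings PLD2flex actually uses is not specified here. The standard-library statistics module stands in for numpy, and `pstdev` is chosen because it matches numpy's default population standard deviation; the example forms are made up.

```python
from statistics import mean, pstdev


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i symbols of a
    for j in range(cols):
        d[0][j] = j  # add all j symbols of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # addition
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, counted as one operation.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


target = ["a", "s", "k"]
others = [
    ["a", "k", "s"],       # transposition of the last two symbols
    ["a", "s", "k", "s"],  # one addition
    ["t", "a", "s", "k"],  # one addition
    ["d", "o", "g"],       # three substitutions
]

distances = [osa_distance(target, other) for other in others]
summary = {"mean": mean(distances), "sd": pstdev(distances)}
```

For the first comparison, a plain Levenshtein distance would be 2 (two substitutions), whereas the transposition-aware distance is 1, so the choice of metric matters for stimuli that differ by swapped phonemes.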

Under the hood, the implementation follows standard object-oriented programming principles. Sections of code pertaining to specific parts of the system are separated by storing them in different files. This enables other users to import and reuse parts of the code in custom scripts or programs that extend the capabilities of the current implementation. Figure 3 presents an overview of how the different parts of the system relate to each other.

The application is started by the main.py file, which instantiates a ContentManager. The ContentManager (cf. file ContentManager.py in the repository) is the place where all other parts of the code are coordinated and managed. On initialization, it creates the User Interface (GUI.py), which is essentially a class that sets up all the PyQt objects and connects the QtButtons to the matching methods. It also reads in the config from the config file and stores it as a custom object (defined in Config.py; for an overview of the possible configuration values, please refer to the User Manual). Furthermore, the ContentManager sets up a BASConnector (BASConnector.py) and sends out a ping to the BAS Web Services API to establish that the server is reachable and to check the current server load of the G2P API. If the server load is high, it might take more time until calls get answered. Lastly, a result file and a log file are created by the ContentManager.

Every time the user clicks on one of the QtButtons, the connected method of the ContentManager is called. For every call, a CommandData object is instantiated, which is the basic data structure (for one command) that all parts of the program refer to. It stores the user data, the method used for the comparison, the returned phonological data and the results of all computations. In turn, the data for every single word is stored in a WordObject class, as defined in WordObject.py. It holds the orthographic as well as the phonographic representation of words and, if applicable, the Levenshtein distance that has been computed for it. Depending on the button clicked and the input provided, CommandData is initialized with a certain set of data, which is then sent via the BASConnector to the G2P API to receive the phonological transcripts. These transcripts are then compared to each other. All the code needed for these compare operations lies in the PLD2flex class, which is located in file PLD2flex.py. Once the calculations are complete, the results are written to the output files as well as the output field of the application and a bell sound is played.
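
The relationship between the per-command and per-word data structures described above might be sketched with two dataclasses. The class names `WordSketch` and `CommandSketch`, their fields and the stub transcription table are hypothetical and chosen for illustration only; they do not reproduce the actual CommandData and WordObject classes in the repository, and the stub dictionary merely stands in for a response from the G2P service.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordSketch:
    """One word: spelling, transcription and (once computed) a distance."""
    orthographic: str
    phonological: Optional[str] = None  # filled in after transcription
    distance: Optional[int] = None      # filled in after comparison


@dataclass
class CommandSketch:
    """All data belonging to one user command."""
    method: str  # hypothetical label for the comparison scenario
    targets: List[WordSketch] = field(default_factory=list)
    inputs: List[WordSketch] = field(default_factory=list)

    def pending_transcription(self):
        """Words that still lack a phonological form."""
        return [w for w in self.targets + self.inputs if w.phonological is None]


command = CommandSketch(
    method="one-to-many",
    targets=[WordSketch("Haus")],
    inputs=[WordSketch("Maus"), WordSketch("Laus")],
)
pending_before = len(command.pending_transcription())

# Stub standing in for a transcription service response (made-up values).
STUB_TRANSCRIPTIONS = {"Haus": "h aU s", "Maus": "m aU s", "Laus": "l aU s"}
for word in command.pending_transcription():
    word.phonological = STUB_TRANSCRIPTIONS[word.orthographic]
pending_after = len(command.pending_transcription())
```

Keeping the transcription and the distance as optional fields on the word record lets each processing stage fill in its own part of the same object, which is one way to read the flow from input through transcription to comparison.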

The class structure of the application code makes it easy to reuse. An example of this can be found in the analyze.py file, located in the same folder as the main.py file (for further information about this script, consult the User Manual for PLD2flex). It imports the BASConnector to translate orthographic to phonographic input and the PLD2flex class to compute Levenshtein distances between WordObjects, and it reads in the same config file used by the PLD2flex application to get access to the corpus files.

Quality control

Before publishing the tool, we had the application tested extensively by selected members of the target group (in Python 3.12). The testers were advised to put potentially problematic cases to PLD2flex, such as abbreviations as targets (see section 3.2 of the User Manual, [16]), as well as varying amounts of data. This evaluation period led us to create a list of possible error messages, which is included in the User Manual. In addition, we provide guidelines for troubleshooting installation problems.

We tailored the system installation guide in the User Manual to different levels of expertise with getting python-based tools to work on Windows, Mac or Linux (see Section 2 and Appendix A of the User Manual, [16]).

To test whether the tool is working, the application offers test files, which can be found in the examples folder. These files were also used, amongst others, in creating the User Manual, so the results can be compared to the screenshots in the User Manual when exploring the functionality of the application.

(2) Availability

Operating system

PLD2flex runs under Windows (10+), macOS (10.9+) and Linux.

Programming language

Python 3.12+

Additional system requirements

Qt5 needs to be installed.

Dependencies

The binaries come with all dependencies bundled.

To build from source, using the command ‘pip install -r requirements.txt’ will install all the required modules. If, however, users want to install the dependencies by themselves, the following modules are needed:

requests
numpy
python-Levenshtein
pydub
PyQt5

List of contributors

Helena Wedig (HW) implemented the first complete version of PLD2flex. Felix Theodor (FT) took over the original implementation of the tool from Helena Wedig, expanding and refactoring it in parts. Joshua Wieler (JW) served as independent user providing feedback on usability and functionality of the tool. Eva Belke (EB) initiated the programming project from a psycholinguist’s viewpoint, providing relevant use cases for the initial and later versions of the tool. JW, FT, HW, and EB wrote the paper. HW wrote the first full version of the User Manual that was later extended and revised by FT, JW, HW, and EB.

Affiliations: Department of Translators and Interpreters, University of Antwerp (HW); Sprachwissenschaftliches Institut, Ruhr-University Bochum (FT (alumnus), JW, EB).

Software location

Code repository

Name: PLD2flex

Identifier: https://github.com/FelixTheodor/PLD2flex

Licence: GPL 3.0

Date published: 26/01/2024

Language

Language of repository, software and supporting files: English.

(3) Reuse potential

The tool is useful for psycholinguists wishing to characterize the phonological similarity of words or pseudowords to other words or word forms. This is relevant for much of the experimental research on auditory language processing and spoken language production. Sometimes, researchers interested in written word processing draw on phonological measures, too (e. g. [5]). Apart from such use cases in experimental psycholinguistics, PLD2flex can be used to specify the similarity between erroneous utterances, as seen in speech errors or in learner languages, to a target word form or some other sequence of phonemes generated by way of comparison (see [3]). Beyond that, PLD2flex can be used to establish how similar the words involved in such errors are to all other words in a given sample. For instance, one might wonder whether children are more likely to acquire words that are similar to the words they already know or whether the opposite is the case. Scaling this notion up to the mental lexicon of adults, PLD2flex can be employed to characterize words with respect to their immediate as well as their wider phonological neighbourhood ([10, 11]). Finally, in educational contexts, it may be of interest to instructors to compile well-controlled material for language teaching.

An important goal in developing PLD2flex was to allow for a maximally flexible use for informed users. For instance, researchers may want to adapt its use with respect to homographs, i. e., words that are spelt identically. Many of these homographs are pronounced identically, too, as in (fish) tank and (military) tank in English; others are not, as in the noun minute vs. the adjective minute in English. G2P will provide only one transcription of homographic forms, so it may be worthwhile to decide how to deal with such homophonous forms. To facilitate this, PLD2flex generates a list of homographs and homophones that users can use as a basis for finding relevant cases for inspection (see [16] for details). Users of the executable files for Windows/macOS may want to use a spreadsheet-based tool to sort the database by phonological word forms and identify double entries in this way.
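
Finding candidate homographs and homophones in a database of (word, transcription) pairs amounts to grouping entries by spelling and by transcription. The sketch below shows one way to do this; it is not the routine PLD2flex uses to generate its list, the function name is an assumption, and the entries and their transcriptions are invented for illustration.

```python
from collections import defaultdict


def group_duplicates(entries):
    """Return (homographs, homophones) for a list of (word, transcription) pairs.

    homographs: spelling -> transcriptions, for spellings with several entries
    homophones: transcription -> spellings, for transcriptions with several entries
    """
    by_spelling = defaultdict(list)
    by_transcription = defaultdict(list)
    for word, transcription in entries:
        by_spelling[word].append(transcription)
        by_transcription[transcription].append(word)
    homographs = {w: t for w, t in by_spelling.items() if len(t) > 1}
    homophones = {t: w for t, w in by_transcription.items() if len(w) > 1}
    return homographs, homophones


# Illustrative entries only; the transcriptions are approximate.
ENTRIES = [
    ("tank", "t { N k"),
    ("tank", "t { N k"),
    ("minute", "m I n I t"),
    ("minute", "m aI n j u: t"),
    ("sea", "s i:"),
    ("see", "s i:"),
    ("dog", "d Q g"),
]

homographs, homophones = group_duplicates(ENTRIES)
```

Entries that appear in both groupings, such as the two tank entries, are both homographic and homophonous, whereas the two minute entries share only their spelling; separating the two cases makes it easier to decide which double entries to keep.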

The code for PLD2flex is open source and available on GitHub, allowing unrestricted access for anyone interested in reviewing the code. By creating a GitHub account, users can raise issues within the repository, which will be addressed by the authors of this paper. Additionally, contributors have the opportunity to resolve issues by submitting pull requests containing code changes and additions. Ongoing support for these contributions will be provided to the best extent possible.

Notes

[1] Available at https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/Grapheme2Phoneme.

Acknowledgements

We thank Jannik Sühling and Ilka Plesse for testing PLD2flex and Marc Brysbaert for his advice on how to best deal with homophonic entries in large databases.

Competing Interests

The authors have no competing interests to declare.