(1) Overview
Introduction
The increasing availability of online biodiversity data has placed natural history museums at the center of modern scientific research, providing primary knowledge in fields such as biogeography, ecology, and conservation. These repositories are crucial for understanding the complex geographical patterns of biodiversity, and offer valuable tools to address global challenges such as the delimitation of conservation areas, biodiversity loss, and the expansion of invasive species distributions [1, 2].
Areas of endemism are among the most important aspects of biogeographical and conservation studies [3]. Several methods have been developed for their identification, incorporating spatial species data or relationships between areas or grid cells [4, 5]. Optimization criteria to identify areas with the highest number of endemic species have been proposed [6, 7]. In general, records of a species homogeneously distributed within the evaluated area increase the endemicity index for that species, while records outside the area reduce its value.
This approach has been implemented in the endemism software NDM and vNDM (whose names emphasize the three letters of “eNDMism”), which use lists of species and their georeferenced occurrences as input files [8]. These programs require input data in the specific XYD format (x-longitude, y-latitude data), which represents species occurrence records using geographic coordinates and species identifiers, and is essential for generating presence/absence matrices for endemism analyses. Recently, areas of endemism have been identified from large sets of species and occurrences [9, 10], using programs developed in C++ that process comma-separated value (CSV) files and convert them to the XYD format. However, when working with large biodiversity repositories containing thousands of species and occurrences, new challenges arise, such as incomplete taxonomic information, duplicate coordinates, typographical errors, and geographic inconsistencies [11, 12].
In this context, Python [13] is an efficient solution, thanks to its diverse libraries, which permit processing large volumes of data, applying filtering operations, and managing computational resources efficiently. In this work, we introduce csv2xyd (CSV to XYD), a Python program that facilitates the conversion of CSV data into XYD files and simplifies the handling of large datasets for endemism analysis. The software includes a GUI with tools for detecting and correcting typographical errors, removing duplicate records, and identifying taxonomic inconsistencies. Additionally, it allows users to apply different filters to the data, e.g., the elimination of species with few occurrences or the selection of specific records.
Implementation and architecture
The csv2xyd software was developed in Python 3.12.4 and requires the installation of several libraries: Tkinter [14] for the graphical interface, Pandas [15] to facilitate data manipulation in CSV format, Dask [16] for efficient handling of large datasets and parallel operations, FuzzyWuzzy [17] for typographical error detection and duplicate names, Folium [18] for generating interactive occurrence maps, NumPy [19] for mathematical operations on occurrence data, GeoPandas [20] for geospatial analysis of polygons and points, and Shapely [21] for geometric operations such as polygon creation and distance calculation between points.
The software follows a sequential workflow (Figure 1) that includes the input files, the main process tools (open CSV files, pre-process large files, map occurrences, process files, and perform geospatial analysis), and the output files. These operations are shown in a graphical user interface organized in four menus (Figure 2):

Figure 1
Workflow of csv2xyd showing the input files, the main process, and the outputs.

Figure 2
Graphical user interface of csv2xyd showing the menu panels: A) File, B) Preprocessing, C) Processing, D) Geospatial Analysis, and E) Details of the options in Geospatial Analysis.
File: Provides options (Figure 2A) to open a CSV file with a minimum header of “species”, “latitude”, and “longitude”; import a comma- or tab-delimited CSV from biodiversity repositories; merge and sort several files containing the same minimum columns; and reorder the columns. A special option allows the user to open a file downloaded from a biodiversity repository and apply filters to extract specific records.
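The merge-and-sort operation can be sketched with Pandas, one of the libraries csv2xyd relies on. The file contents and species names below are hypothetical, and this is only an illustration of the operation, not csv2xyd’s actual code:

```python
import io

import pandas as pd

# Two hypothetical CSV fragments sharing the minimum header
csv_a = "species,latitude,longitude\nAedes aegypti,-0.18,-78.47\n"
csv_b = "species,latitude,longitude\nAedes albopictus,10.48,-66.90\n"

# Merge files with the same minimum columns, then sort records by species
frames = [pd.read_csv(io.StringIO(s)) for s in (csv_a, csv_b)]
merged = (
    pd.concat(frames, ignore_index=True)
    .sort_values("species")
    .reset_index(drop=True)
)
print(merged)
```

In csv2xyd the same idea is applied to files chosen through the GUI rather than in-memory strings.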
Preprocessing: Includes options (Figure 2B) to remove duplicates, set an occurrence filter (minimum number of records per species), establish a similarity threshold for typographical error detection, detect these errors, load a CSV file with corrections, visualize a map of occurrences with a random selection percentage of the data, and detect taxonomic errors.
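The three main preprocessing steps (duplicate removal, occurrence filtering, and similarity-based typo detection) can be sketched as follows. csv2xyd uses FuzzyWuzzy for string similarity; this illustrative sketch substitutes the standard-library difflib, and all records, column names, and thresholds are hypothetical:

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical occurrence table with one duplicate row and one likely typo
df = pd.DataFrame({
    "species": ["Aedes aegypti", "Aedes aegypti",
                "Aedes aegipti", "Culex quinquefasciatus"],
    "latitude": [-0.18, -0.18, -0.20, 10.48],
    "longitude": [-78.47, -78.47, -78.50, -66.90],
})

# 1) Remove exact duplicate records (same species and coordinates)
df = df.drop_duplicates(subset=["species", "latitude", "longitude"])

# 2) Occurrence filter: keep species with at least `min_occ` records
min_occ = 1
df = df.groupby("species").filter(lambda g: len(g) >= min_occ)

# 3) Flag likely typographical variants above a similarity threshold
def similar(a, b, threshold=0.9):
    return SequenceMatcher(None, a, b).ratio() >= threshold

names = sorted(df["species"].unique())
suspects = [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if similar(a, b)]
print(suspects)
```

The flagged pairs correspond to the candidates that csv2xyd presents to the user for correction (optionally via a corrections CSV).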
Processing: Allows selecting the columns (e.g., higher taxa [22]) to include in the output file, configuring the XYD file with parameters such as grid size and fill values, and processing the CSV file (Figure 2C).
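The core of the CSV-to-XYD conversion is discretizing coordinates into grid cells to build a presence/absence matrix. The following sketch illustrates only the binning logic, with a hypothetical 5° grid and made-up records; the actual XYD file layout is defined by NDM/vNDM and is more elaborate:

```python
import math

# Hypothetical occurrence records: (species, latitude, longitude)
records = [
    ("sp1", -0.18, -78.47),
    ("sp1", 4.60, -74.08),
    ("sp2", -0.18, -78.47),
]

grid = 5.0  # cell size in degrees (cf. the 5 x 5 degree example above)

def cell(lat, lon, size=grid):
    """Map a coordinate to the column/row indices of its grid cell."""
    return (math.floor(lon / size), math.floor(lat / size))

# Presence of each species per cell: the basis of a presence/absence matrix
presence = {}
for sp, lat, lon in records:
    presence.setdefault(sp, set()).add(cell(lat, lon))

print(presence)
```

Writing the XYD file then amounts to serializing these per-species cell sets with the configured fill and presence values.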
Geospatial Analysis: Offers exploratory tools for spatial analysis of biodiversity through polygon calculations (Figure 2D, E), including richness (number of species), abundance (number of observations), Hill’s diversity and evenness indices [23], and Chao1, Jack1, and Jack2 indices [24].
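As an illustration of the polygon statistics, the sketch below computes richness, abundance, and the classical Chao1 estimator (S_obs + f1²/(2·f2), with the usual correction when no doubletons are present) from a hypothetical list of observations; it does not reproduce csv2xyd’s implementation:

```python
from collections import Counter

# Hypothetical species observations inside a user-drawn polygon
observations = ["sp1"] * 5 + ["sp2"] * 2 + ["sp3"] + ["sp4"]

counts = Counter(observations)
richness = len(counts)            # number of species
abundance = sum(counts.values())  # number of observations

# Chao1: S_obs + f1^2 / (2 * f2), with f1 = singletons, f2 = doubletons
f1 = sum(1 for c in counts.values() if c == 1)
f2 = sum(1 for c in counts.values() if c == 2)
chao1 = (richness + (f1 * f1) / (2 * f2)) if f2 > 0 \
    else richness + f1 * (f1 - 1) / 2

print(richness, abundance, chao1)
```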
Figure 3A shows the occurrence viewer, where the user can choose the percentage of data to visualize through random selection. Figure 3B shows the processed file for endemism analysis in NDM/vNDM for the “World Arthropoda” dataset, with 114,963 species and 756,941 occurrences, a 5° × 5° cell, and assumed fill and presence values. Figure 3C details the XYD file format generated for endemism analysis in NDM/vNDM.

Figure 3
Visualization of 10% of occurrences for the “World Arthropoda” dataset (A), vNDM window showing 114,963 species and 756,941 occurrences for the “World Arthropoda” dataset (B), and XYD file displaying the input format for NDM/vNDM (C).
For a detailed tutorial on how to use csv2xyd, video guides are available on YouTube: one in English [https://youtu.be/xdn38tZJmIQ] and another in Spanish [https://youtu.be/DWjorMyIhC8].
Quality control
Biodiversity data from public repositories such as the Atlas of Living Australia, the Global Biodiversity Information Facility (GBIF), and previous biogeographical studies [9, 10] were used for testing. The data include georeferenced species occurrences covering various taxonomic groups (Cnidaria, Chordata, Aves, Arthropoda, Plantae, and Fungi). These datasets range in size, from thousands to millions of records, with geographic extents spanning from local (country) to global scales. The processing time of csv2xyd was evaluated for each type of data. All tests were performed on a laptop with an Intel(R) Core(TM) i5–10210U CPU @ 1.60GHz, 2112 MHz, and 16 GB of RAM.
In terms of performance, Table 1 shows processing times ranging from less than four seconds for small datasets (1,685 species and 124,982 occurrences) to 136.53 seconds for much larger datasets (2,439 species and 4,877,116 occurrences).
Table 1
Biodiversity data used for evaluating the processing time of csv2xyd; spatial scale, dimensions (species × occurrences), required memory (Megabytes), and reference are shown.
| DATA NAME | TAXONOMIC GROUP(S) | SPATIAL SCALE | DIMENSIONS (SPP × OCCURRENCES) | MEMORY REQUIRED (Mb)* | PROCESSING TIME (sec)** | REFERENCE |
|---|---|---|---|---|---|---|
| Australia | Cnidaria | Continent: Australia | 1,685 × 124,982 | 56.71 | 3.49 | [25] |
| Ecuador | Animalia, Plantae, Fungi | Country: Ecuador | 10,390 × 407,570 | 187.00 | 14.42 | [10] |
| South Chordata | Chordata | Continent: South America | 14,296 × 880,023 | 403.03 | 44.51 | [26] |
| World Arthropoda | Arthropoda (terrestrial) | Global | 114,963 × 810,634 | 361.42 | 95.24 | [9] |
| Asian Aves reduced*** | Aves | Region: Asia | 2,439 × 4,877,116 | 2276.59 | 136.53 | [27] |
[i] * Estimated memory required based on column and row dimensions, and cell size.
** Processing time (calculated in seconds) to export CSV to XYD file, after removing duplicates and applying a filter to retain a minimum of one occurrence per species.
*** For this dataset, occurrences were randomly reduced so as not to exceed 100,000 occurrences per species, in accordance with the current default limits of NDM/vNDM.
(2) Availability
The csv2xyd software is available for download and use from the following GitHub repository: https://github.com/jliria/csv2xyd. This repository includes the source code, brief manual, data used (input/output), and a tutorial video.
Operating system
Windows 10
Programming language
Python 3.12.4
Additional system requirements
No additional requirements.
Dependencies
Tkinter 0.1.0, Pandas 2.2.2, Dask 2024.7.0, FuzzyWuzzy 0.18.0, Folium 0.17.0, NumPy 2.0.0, GeoPandas 1.0.1, Shapely 2.0.5. Additionally, a requirements.txt file has been included in the GitHub repository to facilitate installation and deployment.
List of contributors
JL developed the software, proposed the methodology, obtained and validated the results, and wrote the main manuscript draft. ASV validated the software, obtained the experimental results, and contributed to the final manuscript.
Software location
Archive
Name: Zenodo
Persistent identifier: https://doi.org/10.5281/zenodo.13833611
Licence: MIT
Publisher: Jonathan Liria Salazar
Version published: 2.0.0
Date published: 24/09/2024
Code repository: GitHub
Language
The csv2xyd graphical interface is currently available in English. However, video guides are available on YouTube in English and Spanish.
(3) Reuse potential
The csv2xyd program is an efficient tool for processing large biodiversity datasets and creating the XYD files needed for endemism analyses with NDM/vNDM. It integrates a graphical interface with functionalities such as typographical error detection, duplicate removal, and the ability to handle files from different repositories. While working with large datasets can present challenges in preparing input files, csv2xyd simplifies this process, allowing researchers to focus on interpreting results and identifying areas of endemism. Future implementations will focus on adding tools that combine APIs for automatic georeferencing of localities without coordinates and for taxonomic validation using expert resources such as the Encyclopedia of Life (EOL). For instance, integration with services such as the GBIF Species API (https://techdocs.gbif.org/en/openapi/v1/species), the EOL Classic APIs (https://eol.org/docs/what-is-eol/classic-apis), and the GeoLocate Web Services (https://www.geo-locate.org/developers/default.html) will be considered to support taxonomic resolution, locality validation, and data enrichment. Another interesting direction will be to enhance the software’s data integration capabilities by allowing the merging of multiple CSV files based on field relationships, similar to database joins. Researchers interested in contributing to or extending this functionality are encouraged to engage through the project’s GitHub page.
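A database-style join of CSV files on a shared field, as envisioned for future versions, could look like the following Pandas sketch (table contents and column names are hypothetical):

```python
import pandas as pd

# Hypothetical tables: occurrences keyed by species, plus a taxonomy lookup
occurrences = pd.DataFrame({
    "species": ["Aedes aegypti", "Culex pipiens"],
    "latitude": [-0.18, 40.41],
    "longitude": [-78.47, -3.70],
})
taxonomy = pd.DataFrame({
    "species": ["Aedes aegypti", "Culex pipiens"],
    "family": ["Culicidae", "Culicidae"],
})

# A left join on the shared field, analogous to a SQL LEFT JOIN
joined = occurrences.merge(taxonomy, on="species", how="left")
print(joined)
```

The enriched table could then pass through the same preprocessing and XYD export pipeline as a single CSV.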
Acknowledgements
We thank Eng. Fabiana Liria Soto (Universidad de las Américas) for suggesting recommendations to improve the article, tutorials, and Python code. We also gratefully acknowledge the constructive comments and suggestions provided by the anonymous reviewers and the handling editor, which helped improve the clarity and quality of the manuscript.
Competing Interests
The authors have no competing interests to declare.
