EgoCor: An R Package to Facilitate the Use of Exponential Semi-Variograms for Modelling the Local Spatial Correlation Structure in Social Epidemiology

Julia Dyck; Jan-Ole Koslik; Odile Sauzet

doi:10.5334/jors.517

(1) Overview

Introduction

The last 20 years have seen an increase in interest in the study of associations between neighbourhood characteristics and health. Most quantitative studies are based on neighbourhood defined as disjoint administrative spatial units and non-measured spatial effects are estimated via multilevel models. However, there are limits to this strategy which have been discussed in the literature [4, 21].

An alternative approach is to conceptualize neighbourhood as ego-centred such that every individual has its own neighbourhood centred on the place of residence. Such a neighbourhood must have relevance for the health outcome of interest. Some studies mentioned the use of the spatial correlation structure of health outcomes as an alternative to estimating the correlation within administrative areas [3]. This idea has been advanced by Sauzet et al. [17] by proposing to use the parameters of an exponential model for the semi-variogram of health outcomes to assess this correlation structure quantitatively.

Semi-variogram models provide a measure of the presence of unmeasured spatial effects on health. They have the advantage of being a good fit for the empirical data as well as for the following concept: if the place of residence affects health, then health outcomes of neighbours are correlated and this correlation evanesces with increasing distance between neighbours. Another application of such an approach in social epidemiology presented by Finne and Sauzet [9] is to use semi-variogram models to obtain ego-centred indicators of neighbourhood using kriging. These in turn can be used as predictors of health outcomes.

The procedure of investigating the spatial correlation includes first estimating an empirical semi-variogram based on the data available and then fitting an exponential parametric model to this semi-variogram. Here, only small distances are considered to model the local correlation structure, i.e. the correlation of a health outcome between immediate neighbours.

The empirical semi-variogram is not unique as it is based on the choice of the maximal distance between observations (all pairs of observations which are further apart than the maximal distance are not used to estimate the semi-variogram) and the number of bins (distance intervals for pairs of observations for which one point of the semi-variogram will be estimated). If the number of observations is limited and only a small proportion of the variance is spatially structured, then the ability of fitting an exponential model to the semi-variogram may be strongly dependent on the mentioned metaparameters. Therefore, the fit of the estimated exponential model must be evaluated visually with a comparative evaluation of a range of maximal distances and numbers of bins and by applying a measure of uncertainty to the fitted models.

A variety of software implementations to fit semi-variogram models in R [14] is already available, but the range of modelling possibilities can be a deterrent to its wider use by social epidemiologists with limited experience with the analysis of spatial data. Moreover, it remains presently difficult to obtain measures of uncertainty for parametric semi-variogram models. As far as we know the only method that has been implemented in current R packages is the BRISC method [15] provided in the R package BRISC [16] which cannot be used given geo-referenced survey data due to the large number of data points, but limited number of pairs at small distances [7]. Dyck and Sauzet [7] have investigated how to modify an existing bootstrap approach by Olea and Pardo-Igúzquiza [11] to make it reliable in the context of health data surveys (sparse local data defined as a small number of pairs at small distances, and large overall sample size) and propose a filtered bootstrap estimate for the standard error of parameter estimates of an exponential semi-variogram model implementable in R.

The aim of the R package EgoCor [8] is to offer a user-friendly interface to obtain graphics and tables of parameters for a range of fitted semi-variogram models in one function in order to facilitate the decision making about which exponential semi-variogram model parameters fit the data best. In addition, it provides a function to calculate a measure of uncertainty not available until now. Our package is based on the functions of the R package gstat [13] and on the work of Dyck and Sauzet [7].

In the subsequent section, we summarize the basic concepts of semi-variogram modelling and the filtered bootstrap for semi-variogram model parameter standard error estimation. This is followed by an introduction to all functions provided in the package accompanied by an illustrative analysis demonstrating the practical application of EgoCor.

Statistical Methods

In this section, we recall the statistical models and algorithms implemented in the EgoCor package. For the interested reader, more detailed sources are mentioned.

Semi-variogram model

We provide some basic concepts on semi-variogram modelling. For a more detailed description we refer to [18]. In the sequel, we assume that the spatial process is isotropic and second-order stationary. Under those conditions, the correlation C(h) between health outcomes Z(s) and Z(s′) (or the residuals of a regression model) at locations s and s′ depends only on the (lag) distance h = ‖s–s′‖ between those observations [18]. Moreover, we assume that the variance of the health outcome is independent from the location s.

The semi-variogram γ(h) for a positive lag h is defined as

γ (h) = \frac{1}{2} Var Z (s) - Z (s^{'})] .

The covariance function C(h) under the second-order stationarity assumption and provided that it exists is given by

C (h) = c_{0} + σ_{0}^{2} - γ (h)

where $σ_{0}^{2} + c_{0} = Var Z (s)]$ is the variance of the health outcome and the nugget effect c₀ is the value of the semi-variogram when the distance between two observations tends to zero. The parameter $σ_{0}^{2}$ is called the partial sill.

Matheron’s estimator provides an unbiased estimator for the empirical semi-variogram at distance h between two observations [10].

\hat{γ} (h) = \frac{1}{2 N (h)|} \sum_{(s, s^{'}) \in N (h)} {Z (s) - Z (s^{'})\}}^{2}

where N(h) is the set of all observations lagging at distance h. This empirical semi-variogram can be estimated from the data for a finite number of lag bins.

To provide a smooth estimate, an exponential model of the form

\begin{array}{l} γ_{\exp} (h) = \begin{array}{l} c_{0} + σ_{0}^{2} (1 - exp (- \frac{h}{ϕ})) & for h > 0, \\ 0 & for h = 0 \end{array} \end{array}

can be fitted to the values of the empirical semi-variogram resulting in parameter estimates ${\hat{c}}_{0}, {\hat{σ}}_{0}^{2}$ and $\hat{ϕ}$ .

From the fitted model several statistics can be calculated: The practical range (distance above which the correlation between observations is less than 5% of the total variance) for this model is given by

H = \hat{ϕ} log (\frac{{\hat{σ}}_{0}^{2}}{0.05 ({\hat{c}}_{0} + {\hat{σ}}_{0}^{2})}) .

The relative structured variability (RSV) calculated as partial sill divided by total variance is a measure of the degree of spatial structure:

R S V = \frac{{\hat{σ}}_{0}^{2}}{{\hat{c}}_{0} + {\hat{σ}}_{0}^{2}}

The relative bias (RB) between the estimated variance according to the model and the sample variance of the health outcome is obtained as

RB = \frac{{\hat{c}}_{0} + {\hat{σ}}_{0}^{2}}{\hat{Var Z (s)]}} .

Filtered bootstrap standard errors

The filtered bootstrap standard errors provide a way of uncertainty quantification for the parameters of the exponential semi-variogram model. The algorithm used is based on the generalized bootstrap method explained in [11] and [12] and was adapted in [7] to the characteristics of population data, i.e. large sample sizes overall, but low number of pairs at small distances. We briefly recall the steps of the filtered bootstrap algorithm:

0. Exponential semi-variogram model: An empirical semi-variogram is fitted to the original spatial dataset to which an exponential model is fitted. This provides the parameter estimate $\hat{θ}$ for the true parameter vector θ with $(θ_{1}, θ_{2}, θ_{3}) = (c_{0}, σ_{0}^{2}, ϕ)$ .
1. Normal score transformation: The data vector $= {(z_{1}, \dots, z_{N})}^{t}$ is mapped into a Gaussian space by the empirical normal score transformation function ϕ. Consequently, 𝐲 = ϕ(𝐳) is a realization vector of a standard normal random variable [6].
2. Exponential semi-variogram model for transformed data: An empirical semi-variogram and exponential semi-variogram model ${\tilde{γ}}_{exp}$ are fitted to the transformed data 𝐲 combined with the raw data’s geo-coding providing the parameter estimate $\tilde{θ}$ .
3. Covariance estimation: Making use of the exponential semi-variogram model characterized by $\tilde{θ}$ , the covariance between two data points z_i and z_j is calculated based on the Euclidean distance d_ij between these two points:
$c_{i j} = \tilde{c_{0}} + {\tilde{σ}}_{0}^{2} - {\tilde{γ}}_{exp} (d_{i j}) .$
The covariance matrix with entries c_ij is denoted as 𝐂.
4. Decorrelation of the data: The decomposition of the covariance matrix 𝐂 into the product of a lower and triangular matrix 𝐋 and its transpose is obtained by the Cholesky decomposition algorithm as $C = {LL}^{t}$ .
This decomposition is used to remove the correlation structure within the sample 𝐲. The resulting vector 𝐱 = 𝐋^–1𝐲 contains independent and identically distributed, hence uncorrelated values [19].
5. Classical bootstrap: Sampling with replacement from 𝐱 leads to a bootstrap sample 𝐱* of the same size as the original spatial dataset.
6. Recorrelation: The resample 𝐱* reinherits the correlation structure by applying the inverse operation of step 4, that is $y* = Lx*$ .
7. Normal score back transformation: The back transformation of 𝐲* to the attribute space through the inverse normal score function is obtained as $z* = φ^{- 1} (y*)$ .
8. Analysis of the bootstrap sample: An exponential semi-variogram model is estimated based on 𝐳* combined with the original coordinates providing an estimate θ*.
9. Filtering: A check-filter-based test is applied to the bootstrap estimate θ* to indicate whether the exponential semi-variogram fitting algorithm did converge. If
$c_{0}^{*} + σ_{0}^{2 *} > τ \hat{V a r (Z)}$
,
i.e. if the variance indicated by the model exceeds the estimated sample variance times the threshold factor τ, the bootstrap estimate is discarded, otherwise it is saved. Within the EgoCor package the threshold is set to τ = 3 by default as in a simulation study it was found to provide the best results with respect to the standard error estimates [7].
10. Repetition: The steps 5 to 9 are repeated until a set of B bootstrap estimates ${θ_{b}^{*}\}}_{b = 1, \dots, B}$ has aggregated.
11. Parameter standard error estimation: Based on the set of repeatedly estimated parameters ${θ_{b}^{*}\}}_{b = 1, \dots, B}$ , estimates of the parameter standard errors are obtained by the empirical standard deviation
$\hat{se (θ_{j})} = sd (θ_{j}^{*}) = \sqrt{\frac{1}{B - 1} \sum_{b = 1}^{B} {θ_{bj}^{*} - \bar{θ_{j}^{*}}\}}^{2}}$
with j = 1,…,3 referring to the three parameters $c_{0}, σ_{0}^{2}$ and ϕ.

Implementation and architecture

In this section, we describe the functions provided by the package EgoCor and subsequently apply them to the simulated dataset birth to demonstrate their practical use.

Dataset

The simulated dataset birth is provided with the package EgoCor. The dataset is based on the spatial distribution of real birthweight data [20] and contains eight variables for 903 births:

x: x-coordinate in meters for a fictive Cartesian coordinate system,
y: y-coordinate in meters for a fictive Cartesian coordinate system,
birthweight: birthweight in grams,
primiparous: first pregnancy (1) or subsequent pregnancy (0),
datediff: number of days to due date,
bmi: BMI of the mother at first medical appointment,
weight: weight of the mother at first medical appointment,
inc: income quintile (0, 1, 2, 3, 4).

Functions

We use the birth dataset to illustrate the following functions:

coords.plot(): for graphical description of locations,
distance.info(): for descriptive information about distances between observations,
vario.reg.prep(): to model the spatial correlation structure of residuals of a regression model,
vario.mod(): to fit exponential models to semi-variograms with graphical presentation,
par.uncertainty(): to obtain bootstrap standard errors for the parameters of the exponential semi-variogram model.

Demonstration of the package functionalities

Package installation

The package can be installed and loaded by

install.packages(’EgoCor’)

library(EgoCor)

Data exploration

The data format required by the EgoCor functions is either a data frame or a matrix.

The first three columns of the data frame or matrix should be ordered the following way:

1st column: x-coordinate in meters for a Cartesian coordinate system,
2nd column: y-coordinate in meters for a Cartesian coordinate system,
3rd column: outcome of interest.

Other columns will be ignored. A message appears following the output of the function vario.mod() recalling the required order for the variables.

The function coords.plot() provides a simple visualization of the locations on a two-dimensional map and indicates whether the outcome is observed (by a black circle) or missing (by a red x) at a specific location. The provided legend can be removed or repositioned in case it covers up marked locations. The purpose of this function is to investigate the spatial distribution of observations and to uncover potential spatial patterns in the distribution of missing values in the outcome of interest or covariates. Figure 1 displays the location of the observations. For the outcome birthweight there are no missing values.

Coordinates plot for outcome `birthweight`.

To illustrate the display of missing values we create a new matrix with the coordinates and the variable inc and then insert 30 random missing values. In Figure 2 the missing values are marked with red crosses. As expected no spatial pattern is visible.

Coordinates plot for outcome `inc` with 30 random missing values.

Further information on the distribution of pairwise Euclidean distances is provided by the function distance.info() which calculates

the distance matrix containing all pairwise Euclidean distances,
the set of all pairwise Euclidean distances where duplicate values due to symmetry are deleted.

Moreover, distance.info() displays the following descriptive statistics:

a histogram of the Euclidean distances,
minimum, 1st quartile, median, mean, 3rd quartile and maximum of the Euclidean distances.

The output for the birth data is as follows and illustrated with the histogram shown in Figure 3: From the 407 253 pairwise distances total, 30 570 are of less than 2 000 meters and will be used for modelling of the local spatial correlation structure.

Histogram of pairwise distances for the birth data.

Semi-variogram model fitting

The function vario.mod() enables the simultaneous output of multiple exponential semi-variogram models fitted for a range of maximal distances and bin numbers. Thereby, the focus lies on the ability of the function to provide multiple estimation results depending on various specifications for the metaparameters max.dist and nbins.

It is advised to try out different values for both parameters and choose the model with the best fit. Commonly, the fit is evaluated by visual checks. An additional check can be performed by comparing the sample variance with the estimated variance according to the semi-variogram model $\hat{σ^{2}} = \hat{c_{0}} + \hat{σ_{0}^{2}}$ by looking at the RSV.

The chosen maximal distance value specifies the subset of data pairs that are used for the semi-variogram estimation. Only data pairs with an Euclidean distance ≤ max.dist are taken into account. For a first exploration, it might be useful to try a range of maximal distances to locate where the range might be situated.

The code above Figure 4 will save a PDF file showing all fitted semi-variograms and will produce the shiny [5] output shown in Figure 4.

Shiny output from `vario.mod()` for a range of `max.dist` choices.

Each row of the printed output table (see Figure 4) contains the estimated parameters of the exponential semi-variogram model with one of the stated maximal distances. More precisely, the table columns contain:

index: model number,
max.dist: maximal distance used in the estimation of the empirical variogram,
nbins: number of bins specified for the empirical variogram estimation,
nbins.used: number of bins used for the empirical semi-variogram estimation (can differ from nbins in case of colocated data points),
nugget: the estimated nugget effect ${\hat{c}}_{0}$ ,
partial.sill: the estimated partial sill ${\hat{σ}}_{0}^{2}$ ,
shape: the estimated shape parameter $\hat{ϕ}$ ,
prac.range: the practical range of the exponential model,
RSV: the relative structured variability,
rel.bias: the relative bias between the sum of the estimated partial sill and nugget and the sample variance (which theoretically are the same).

A maximal distance of 1000 meters seems to provide the best fit among the options tried and we can now refine the analysis by considering a grid of smaller maximal distances leading to the output presented in Figure 5. Because a maximal distance of 800 meters provides the best fit for the exponential model with a low RB and a good visual fit, we further investigate the effect of the number of bins for a fixed maximal distance of 800.

Shiny ouput from `vario.mod()` for a finer range of `max.dist` choices.

The nbins parameter specifies the number of lags of the empirical semi-variogram to be estimated. On the one hand, a high number of lags might lead to a small within-lag sample size and thus to an unstable estimate. On the other hand, choosing a number of bins that is too small leads to a model that does not detect a spatial correlation structure at all. To decide on one or multiple values for nbins, taking a look at the histogram plot of the pairwise distances (see Figure 3) obtained by distance.info() may help.

Trying out multiple nbins specifications as in the code above Figure 6, we obtain the output presented in Figure 6. All models provide similar results but the third model with max.dist = 800 and nbins = 13 provides a slightly better fit and could be selected as the final model to capture the spatial effect on the health outcome birthweight.

Shiny output from `vario.mod()` for a range of `nbins` choices.

Modelling the spatial correlation structure of residuals

We want to investigate whether some or all of the observed spatial correlation structure can be explained by adjusting for predictors of birthweight. Therefore, instead of modelling the correlation structure of a health outcome, the vario.mod() function can be used to model the spatial correlation structure of residuals from a (hierarchical) linear regression. To do so, the studentized residuals from a (hierarchical) linear regression model are extracted via the vario.reg.prep() function.

In the first step, we fit the following regression model and investigate the output:

All predictors are significant. Using the vario.reg.prep() function we now assign the studentized residuals of the regression model to the spatial coordinates and investigate which maximal distance provides the best exponential semi-variogram model fit for the studentized residuals saved in v.prep:

v.prep <- vario.reg.prep(res, data = birth)

In terms of model fitting, we start at similar maximal distances to the ones chosen for the raw data (see Figure 6) and obtain the output depicted in Figure 7. The results point towards a reduced spatial correlation structure when controlling for various predictors with a well-fitting maximal distance of only 600 meters and much less regularity in the empirical semi-variogram.

Shiny output from `vario.mod()` based on residuals for a range of `max.dist` choices.

Analogously to the unadjusted case, we try out multiple nbins values fixing the maximal distance at 600 meters.

Based on the resulting graphics and table (see Figure 8) we conclude that the models with max.dist = 600 and nbins = 12 (see Figure 8) or max.dist = 600 and nbins = 13 (see Figure 6) provide visually very similar results and fit the data best.

Shiny output from `vario.mod()` based on residuals for a range of `nbins` choices.

Filtered bootstrap standard errors

The function par.uncertainty() provides filtered bootstrap standard errors for all three exponential model parameters. Standard errors are important for conducting proper inference [1]. Moreover, they can be helpful to get an impression about the reliability of the estimated model and provide an objective tool to compare two or more models which seem to provide an equally good fit when evaluated visually as demonstrated in the last section.

Because the execution of the filtered bootstrap algorithm can be time consuming (depending on the sample size and number of bootstrap repetitions), the par.uncertainty() function is not called automatically within vario.mod() such that the bootstrap is not executed for all estimated models. The choice which model estimate should be completed by estimating standard errors, is left to the user by selecting the fitted model of interest in the vario.mod() outcome by its number in the par.uncertainty() option mod.nr and thereby saving execution time.

Due to the randomness in the bootstrap component within the standard error calculation, repeatedly estimating the standard errors for one fixed semi-variogram model will vary slightly. If the filtered bootstrap results are wished to be reproducible, a seed has to be set prior to the application of the par.uncertainty() command.

We save the vario.mod.output object containing the two best semi-variogram models of the residuals according to visual inspection:

models <- vario.mod(v.prep, max.dist = 600, nbins = c(12, 13),shinyresults = FALSE)

Based on that, we can estimate the parameter standard errors for model 1 (max.dist = 600, nbins = 12):

and model 2 (max.dist = 600, nbins = 13):

According to the above R output the standard error estimates of model 1 appear to be slightly smaller than the standard error results of model 2 confirming our impression that the models are very similar. The uncertainty quantification in terms of standard error values allows us to make an objective comparison and encourage us to select model 1 as the final model.

Quality control

All EgoCor functions were tested on synthetic and real data during development to ensure the desired functionality and preempt erroneous outputs. Moreover, we added a framework for documented unit testing using the R package testthat [22]. This enables repetition and complementation of tests performed in the past whenever the package is adjusted, further developed or changed in any way in the future.

The package passed the R CMD checks of the Comprehensive R Archive Network (CRAN) without errors, warnings or notes. Functions and data are documented in detail and examples are provided for each function as well as with regards to the overall workflow in form of a vignette coming with the package or in the demonstration section of this paper.

The availability of a development version of EgoCor on Github provides transparency regarding the source code, development steps (commits) and tests (in the testthat subfolder) and thereby the opportunity for the source code to undergo steady quality control by the R community. More specifically, users and developers can have a look at the source code, communicate issues, and suggest improvements in form of pull requests that might contain improvements in functions and tests.

(2) Availability

Operating system

R runs on Unix-like, Windows and Mac operating systems.

Programming language

R (≥3.5.0)

Additional system requirements

No special requirements.

Dependencies

R package dependencies: graphics, grDevices, gstat, Rdpack, shiny, sp, SpatialTools, stats, methods, utils.

R package suggests: knitr, rmarkdown, lme4, testthat (≥3.0.0).

List of contributors

Julia Dyck, Jan-Ole Koslik, Odile Sauzet.

Software location:

Code repository

Name: Github

Persistent identifier: https://github.com/julia-dyck/EgoCor

Licence: MIT + file LICENSE

Date published: 28/03/2024

Language

English

(3) Reuse potential

The R package EgoCor offers a range of functions to explore which empirical semi-variogram metaparameters (maximal distance and number of bins) result in the best-fitting exponential semi-variogram model. The graphical display and summary tables provided by the package facilitate the comparison of fitted semi-variogram models for multiple chosen metaparameters. Moreover, the package provides an implementation of the filtered bootstrap method proposed by [7] enabling standard error estimation for exponential semi-variogram model parameters that can be used for more formal model comparisons beyond the visual evaluation of the fitted semi-variogram.

The package has been developed with applications in spatial social epidemiology in mind. It has been applied in various publications [9, 17, 2]. The package can also be of use in any field where inference about a semi-variogram is useful but with only sparse data available.

Due to the user-friendly interface, practitioners can conveniently use the package to apply the concept of correlation neighbourhood suggested by Sauzet et al. [17] or use it to guide the choice of a suitable kriging model [9]. However, the modelling framework is not exclusive to social epidemiological research and can be useful for practitioners of other fields as well.

While the EgoCor package is mainly intended to be used by practitioners, developers are welcome to contribute to the package or extend it depending on their needs. Potential extensions could be

the inclusion of other semi-variogram models than the exponential,
the improvement of algorithm with respect to efficiency
and more.

Any code adjustment and changes or issue communication can be conducted on the Github repository.

Competing interests

The authors have no competing interests to declare.

EgoCor: An R Package to Facilitate the Use of Exponential Semi-Variograms for Modelling the Local Spatial Correlation Structure in Social Epidemiology

Full Article

(1) Overview

Introduction

Statistical Methods

Semi-variogram model

Filtered bootstrap standard errors

Implementation and architecture

Dataset

Functions

Demonstration of the package functionalities

Package installation

Data exploration

Figure 1

Figure 2

Figure 3

Semi-variogram model fitting

Figure 4

Figure 5

Figure 6

Modelling the spatial correlation structure of residuals

Figure 7

Figure 8

Filtered bootstrap standard errors

Quality control

(2) Availability

Operating system

Programming language

Additional system requirements

Dependencies

List of contributors

Software location:

Archive

Code repository

Language

(3) Reuse potential

Competing interests

Paradigm

My account