
Figure 1
Maximizing the apparent positive and negative effect of an explanatory variable on SWR accuracy. Each panel shows all 1081 words in SWR1081, plotting for each word w an explanatory variable against the response variable accuracy, the fraction of participants who correctly identified w when it was presented in a noisy environment. The explanatory variables are (A) the log frequency of w in a large corpus of natural text (Brysbaert & New, 2009), (B) the number of competitors of w in the ELP lexicon (Balota et al., 2007), and (C) the clustering coefficient of w in the ELP lexicon. (Every word in SWR1081 has at least three competitors, so the clustering coefficient is well defined.) Each panel identifies two pairs of 50-word subsets {A1, B1} (blue; positive effect) and {A2, B2} (red; negative effect). Each pair of color-matched subsets controls for all explanatory variables in previous panels: in (B), the average frequencies (in z-scores) of Ai and Bi differ by less than δ = 0.05, and likewise in (C) for both average frequency and average number of competitors. Among all such δ-balanced pairs of 50-word subsets, the displayed pairs achieve the largest possible positive and negative differences in mean y between the low-x and high-x subsets. (See Supplementary Materials for how these subsets are computed.)

Figure 2
A schematic of the selection process, with parameters k (size of the chosen subsets), ρ (fraction of the data considered “high” or “low”), and δ (tolerance in the control variables). (A) We must choose 2k of n given data points, forming two equal-sized sets A and B, where A is chosen from among the ρ · n points with the lowest explanatory-variable values and B from among the ρ · n points with the highest. In every control dimension ci, the average ci-values of A and B differ by at most δ. (B) A particular example of this input data in an SWR context, with data from the ELP lexicon (Balota et al., 2007; Brysbaert & New, 2009). The weights ai and bi are chosen uniformly at random from [0, 1]. The desired solution is the lightest-weight pair of sets A and B (with respect to these particular a and b weights) that satisfies the control-dimension constraints. (C) The integer linear program (ILP) used to compute the solution. We define variables qi ∈ {0, 1} and zi ∈ {0, 1} indicating whether point i is included in A or in B, respectively. Solving the ILP yields optimal values of qi and zi. Fresh random weights are chosen for each run of the algorithm.
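The selection problem in panels (A)–(C) can be illustrated in miniature without an ILP solver. The sketch below exhaustively searches a tiny hypothetical instance (n = 10, k = 2, ρ = 0.5, one control dimension; all data values are invented for illustration) for the lightest-weight δ-balanced pair of sets, mirroring the objective and constraints described above; the paper's actual analysis uses an ILP at k = 50 with several control dimensions.

```python
import itertools
import random

random.seed(0)

# Tiny hypothetical instance: n points, each with an explanatory value x
# and one control value c. (The real setting has k = 50 and several
# controls, and is solved with an ILP rather than exhaustive search.)
n, k, delta = 10, 2, 0.05
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
c = [0.2, 0.8, 0.5, 0.3, 0.7, 0.7, 0.3, 0.5, 0.8, 0.2]

# rho = 0.5: A is drawn from the n/2 lowest-x points, B from the n/2 highest.
order = sorted(range(n), key=lambda i: x[i])
low, high = order[:n // 2], order[n // 2:]

# Random objective weights a_i and b_i, as in panel (B).
a = {i: random.random() for i in low}
b = {i: random.random() for i in high}

def mean(vals):
    return sum(vals) / len(vals)

best, best_weight = None, float("inf")
for A in itertools.combinations(low, k):
    for B in itertools.combinations(high, k):
        # Control constraint: average c-values of A and B within delta.
        if abs(mean([c[i] for i in A]) - mean([c[i] for i in B])) > delta:
            continue
        weight = sum(a[i] for i in A) + sum(b[i] for i in B)
        if weight < best_weight:
            best, best_weight = (A, B), weight
```

Because the weights are drawn fresh each run, repeated runs return different δ-balanced pairs, which is what makes the repeated-sampling analysis of Figure 3 possible.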

Figure 3
The result of 5000 runs of our ILP, with k = 50 words per subset, δ = 0.05 tolerance for control variables, and ρ = 0.5 (dichotomizing on the median). Each point in each panel corresponds to a single run of the ILP to select sets A and B; the point plots the difference in mean recognition accuracy between A and B against the sum of the variances of the recognition accuracies within A and B. The parabolas correspond to significance levels in a t-test comparing A and B. (A) The effect of frequency on recognition; 72.6% of these runs show that higher frequency is associated (p < 0.05) with more accurate recognition. (B) The effect of number of competitors; 53.8% of these runs show that having more competitors is associated (p < 0.05) with less accurate recognition. (C) The effect of clustering coefficient; 2.1% of these runs show an effect of clustering coefficient on recognition (p < 0.05), split between positive and negative effects. All experiments were controlled as in Figure 1. (See Figure S1 for the variant of this analysis that tests the effect of each variable while controlling for the other two.)
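The parabolas can be reproduced from the t-test geometry: a point with mean difference d and summed variance V has t = d / √(V / k), so each significance level traces the curve d = t_crit · √(V / k). A stdlib-only sketch of this mapping follows; as an assumption for the sketch, it uses a normal approximation to the t distribution (with 2k − 2 = 98 degrees of freedom the two are very close).

```python
from math import sqrt
from statistics import NormalDist

k = 50  # words per subset

def two_sample_p(diff, var_sum):
    """Two-sided p-value for a two-sample t-test with equal group sizes k,
    given the difference in means and var(A) + var(B). Uses a normal
    approximation to the t distribution (close at 2k - 2 = 98 df)."""
    t = diff / sqrt(var_sum / k)
    return 2 * (1 - NormalDist().cdf(abs(t)))

def critical_diff(var_sum, alpha=0.05):
    """The significance 'parabola': for a given variance sum V, the
    smallest significant |difference in means| is z_{1-alpha/2} * sqrt(V/k)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt(var_sum / k)
```

For example, a run with a mean-accuracy difference of 0.10 and summed variance 0.05 lies well inside the p = 0.05 parabola, since the critical difference at that variance is about 0.062.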

Figure 4
Comparison of the power of various testing methodologies. We considered two explanatory variables with a significant effect on recognition accuracy in SWR1081, as determined by linear regression and LMEM applied to the entire dataset (p < 0.001 for both frequency and number of competitors). We measured how often five different testing methodologies failed to detect the correct effect (at the p = 0.05 level) in subpopulations of 100 total words, over 5000 trials. ILP selection follows Figure 2; uniform selection chooses the same number of elements by uniform random sampling. We tested for a relationship using a t-test comparing the low and high sets’ response-variable values, linear regression (in both cases controlling for the listed control variables), or LMEM. The first two rows show settings in which there is a true effect (as measured on the whole dataset); here, linear regression and LMEM correctly detect an effect more frequently than t-tests do. When used with linear regression or LMEM, ILP selection performed slightly better than uniform sampling. For contrast, the last three rows show a setting in which there is no apparent relationship on the whole dataset (p > 0.5 using both linear regression and LMEM); there, all three methodologies showed no effect >94% of the time. The last two of these rows perform the clustering-coefficient analysis while controlling for a much larger list of variables, following the methodology of Chan and Vitevitch (2009) and Altieri et al. (2010). Note that we did not directly attempt to correct for multicollinearity among variables. However, the analyses in the last three rows of the table are closely similar despite corresponding to very different sets of control variables, and all of the ILP analyses (which control for covariates via selection rather than only statistically) are consistent with the uniform analyses; multicollinearity is therefore unlikely to substantially affect the results.
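The qualitative pattern in the first two rows (regression-style tests detect a true effect more often than t-tests on dichotomized data) can be illustrated with a self-contained Monte Carlo sketch on synthetic data. This is not the SWR1081 analysis: it omits the ILP, control variables, and LMEMs, uses an invented effect size and trial count, and approximates t-distribution p-values with a normal.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

random.seed(1)
norm = NormalDist()

def p_from_t(t):
    # Normal approximation to the t distribution (reasonable at ~100 df).
    return 2 * (1 - norm.cdf(abs(t)))

def t_test_detects(x, y, alpha=0.05):
    # Median split on x, then a two-sample t-test on y.
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = len(x) // 2
    lo = [y[i] for i in order[:half]]
    hi = [y[i] for i in order[half:]]
    t = (mean(hi) - mean(lo)) / sqrt(variance(lo) / len(lo) + variance(hi) / len(hi))
    return p_from_t(t) < alpha

def regression_detects(x, y, alpha=0.05):
    # OLS slope with its standard error, tested against zero.
    n, mx, my = len(x), mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    resid = [yi - my - slope * (xi - mx) for xi, yi in zip(x, y)]
    se = sqrt(sum(r * r for r in resid) / (n - 2) / sxx)
    return p_from_t(slope / se) < alpha

# Synthetic population with a modest true linear effect;
# subpopulations of 100 drawn uniformly at random.
N, n_sub, trials = 1081, 100, 500
pop_x = [random.gauss(0, 1) for _ in range(N)]
pop_y = [0.3 * xi + random.gauss(0, 1) for xi in pop_x]

t_hits = reg_hits = 0
for _ in range(trials):
    idx = random.sample(range(N), n_sub)
    xs, ys = [pop_x[i] for i in idx], [pop_y[i] for i in idx]
    t_hits += t_test_detects(xs, ys)
    reg_hits += regression_detects(xs, ys)
```

Dichotomizing a continuous predictor discards information about within-half variation, which is why the regression test typically detects the effect in more trials than the median-split t-test in this sketch.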

Figure 5
Analysis of SWR1081 and ten synthetic datasets generated to have (approximately) the same covariance matrix as SWR1081. The ten synthetic datasets are sorted in decreasing order of the strength of the relationship between x and y.
