Does the Experience of Remembering Differentially Influence the Factual Accuracy of Recognition, and Confidence in Its Accuracy?

Phyllis Margaret Windsor; Benjamin R. Dering; David I. Donaldson

doi:10.5334/joc.477

Introduction

Despite the inherent fragility of memory, researchers have asserted that under normal circumstances there are no established grounds for considering memory unreliable (Brewin, Andrews & Mickes, 2020), and thus that we should have confidence in the accuracy of our memories. Consequently, in applied settings such as eyewitness testimony, recognition memory is generally considered reliable, reflected in the fact that remembering with high confidence is associated with high recognition accuracy (e.g., Brewer & Wells, 2006). Nonetheless, researchers also acknowledge that eyewitnesses may sometimes be highly confident but inaccurate, and the use of predictive variables (e.g., time to decision and responses based on familiarity) has been suggested (Grabman, Dobolyi, Berelovich & Dodson, 2019). Furthermore, a large body of work has demonstrated that confidence can be readily influenced and may sometimes provide a limited indication of accuracy (e.g., Leippe, 1995; Smith, Smalarz, Ditchfield & Ayala, 2021). Here, therefore, using two different memory paradigms, we re-assess the belief that remembering with high confidence is a dependable guide to memory’s accuracy.

The association between accuracy and confidence in recognition is classically tested using an old or new (item recognition) task. At test, participants are required either to recognise or remember stimuli previously shown at study. Additional information about the memory experience can also be provided, for example, by asking participants to subsequently report confidence in the accuracy of their decisions, or to provide judgments of ‘remembering’ or ‘knowing’ that reflect different subjective experiences during retrieval (Gardiner, 1988; Rajaram, 1993; Tulving, 1985). In a standard old or new recognition task, correct compared to incorrect recognition decisions (e.g., Sporer, Penrod, Read & Cutler, 1995), and remembering compared to knowing (Dunn, 2004; Gardiner, 1988; Tulving, 1985), are both associated with significantly higher ratings of confidence.

Here we consider whether the accuracy of memory or the experience of remembering (and its quality) is the more important for determining the confidence an individual has that their memory is correct. Based on data from a variety of experimental studies, we know that multiple factors may be responsible for judgments of memory confidence (Brewer & Sampaio, 2012), leading to dissociations of confidence and accuracy. Confidence can also be subject to individual differences (e.g., Ais, Zylberberg, Barttfeld & Sigman, 2016). Furthermore, participants may base their confidence on cues other than those that determine recognition accuracy (Busey, Tunnicliff, Loftus & Loftus, 2000; Zawadzka, Higham & Hanczakowski, 2017).

More broadly, metacognitive assessments of confidence are considered inferential. For example, they may be based on fluency (how quickly an answer is retrieved from memory), task difficulty, consensuality, or familiarity, as well as the quality of the recollective experience (Chua, Hannula & Ranganath, 2012; Kelley & Lindsay, 1993; Koriat, 2008; Kurdi, Diaz, Wilmuth, Friedman & Banaji, 2018; Olsson, 2000; Robinson, Johnson & Robertson, 2000). What studies to date have failed to reveal, however, is whether such cues affect confidence and accuracy in the same way, considering that other factors (such as the influence of prototypes, or sleep deprivation) have been shown to differentially affect confidence (Blagrove & Akehurst, 2000; Kerrén, Zhao & Griffiths, 2024). Of particular relevance here is the finding that the subjective experience of remembering occurring during a recognition task is considered to be a major factor in generating confidence judgments (Jersakova, Allen, Booth, Souchay & O’Conner, 2017), yet it remains unclear whether the experience of remembering equally affects the accuracy of recognition and our confidence in its accuracy.

Here we also ask whether the quality of retrieved memories is important for judgments of confidence. Briefly, it is accepted that if we consider a stimulus vivid, it is later recognised more accurately (Jacoby, Baker & Brooks, 1989; Spence, Wong, Rusan & Rasteger, 2006; Wichmann, Sharpe & Gegenfurtner, 2002). Additionally, memory quality can be suggested by how vividly a memory is perceived. If we refer to a memory as vivid, we are more likely to feel we are reliving the experience (Buchsbaum, Lemire-Rodger, Fang & Abdi, 2012; Rubin, Deffler & Umanath, 2019). Equally, the experience of recollection can be enhanced for emotional events, where vividness is defined as internal, because the event had personal significance which provoked an emotional reaction (Cooper, Kensinger & Ritchey, 2019; Ford & Kensinger, 2017; Habermas, Diel & Welzer, 2013; Kinugawa, Schumm, Pollina, Depre, Jungbluth… & Dere, 2013; Phelps & Sharot, 2008), or because the stimulus triggered spontaneous recollection of a personal memory (Gardiner, Ramponi & Richardson-Klavehn, 1998).

There is ample evidence that vividness is associated with confidence in a memory’s accuracy. To the lay audience, for example, personal memories deemed vivid are more likely to be confidently believed. In addition, studies involving flashbulb memories (i.e., memories of historically significant world events) often report that such memories are detailed, long-lasting, and vivid (Brown & Kulik, 1977; Edery-Halpern & Nachson, 2004; Luminet & Spijkerman, 2017; Pillemer, Goldsmith, Panter & White, 1988). Vivid memories may not always be accurate, however, and other researchers have shown the contextual details of flashbulb memories can sometimes be incorrect or lacking (Stone, Luminet & Takahashi, 2015; Talarico, Kraha, Self & Boals, 2019; Talarico & Rubin, 2003; 2007). Equally, when asking whether remembering and its quality (i.e., vividness) might be one of many factors which underlie dissociations between confidence and accuracy, the current literature shows varying degrees of association between memory vividness and memory accuracy (e.g., Habermas & Diel, 2013; Richter, Cooper, Bays & Simons, 2016). Consequently, in addition to studying the experience of remembering and its influence on both confidence and memory accuracy, here we also assess how these relationships may be affected by variability in memory quality (i.e., remembered vividness).

Study 1 – Recognition Accuracy, Remembering and Confidence

Introduction

In Study 1, we considered the effect of remembering and its quality on both recognition accuracy and confidence in the recognition decision. We asked participants to study a set of pictures of natural scenes and then complete a recognition test with recollection (i.e., remember), confidence, and vividness ratings taken after each test trial. To measure memory, we chose a two-alternative forced-choice (2-AFC) old or new picture similarity recognition task (Tulving, 1981). Although items are shown separately at study, the two alternatives (old and new, target and lure respectively) are presented simultaneously at test, with the participant having to discriminate between them whilst information from both stimuli is available. Apart from indicating which of the test pictures was old, in the original task participants also rated confidence in their decision on a three-point scale. Tulving’s picture similarity experiment manipulated stimuli at test to make targets and lures similar, either perceptually or mnemonically, to produce a dissociation between confidence and accuracy and to test whether recognition accuracy was inversely related to the similarity between old and new test items.

This paradigm offers good replicability in producing a confidence-accuracy inversion between picture pairs shown at test with perceptual and mnemonic similarity (Dobbins, Kroll & Liu, 1998; Fandakova, Johnson & Ghetti, 2021; Heathcote, Freeman, Etherington, Tonkin & Bora, 2009; Hembacher & Ghetti, 2017; Zawadzka et al., 2017), with perceptually similar test-pairs providing more accurate recognition than mnemonically similar test-pairs (which are dissimilar to the target, but perceptually similar to a non-target picture shown at study). If the lure and the target are highly similar, however, participants are more accurate but less confident. Tulving explained this dissociation as representing two types of similarity relations in forced-choice recognition memory. When pictures are perceptually similar, the retrieval process to determine which picture was old is more elaborate and results in higher accuracy than when the pictures are perceptually dissimilar – and memory of the picture that was similar to the lure was relied on to make the decision.

Tulving’s demonstration of the confidence-accuracy inversion is important because it highlights conditions in which confidence and accuracy can be dissociated – at least when the nature of lures is manipulated. Findings from a study by Busey et al. (2000) using memory for faces also suggest confidence need not always reflect accuracy. Using an old-new face recognition experiment, Busey and colleagues found that brightening face images at test (shown dim at study) increased subjective confidence without influencing accuracy, suggesting that the consequence of both study and test conditions may lead to confidence being affected over and above accuracy. Similarly, other researchers have demonstrated that confident false recognition of deceptive lures can lead to illusions of memory accuracy (Roediger & McDermott, 1995; Tulving, 1981; Weigl, Pham, Mecklinger & Rosburg, 2020).

The impact that varying the nature of lures has on the relationship between confidence and accuracy is of particular importance in relation to the eyewitness field, where recognition of a suspect is commonly assessed using either a task where pictures of faces are shown singly, or as one photograph in a set of photographs (i.e., representing an N-AFC task, where multiple images are shown together equivalent to an identity parade/lineup). The relative merits of these two tasks have been debated at length in the wider literature (e.g., see Wixted & Mickes, 2015). In a lineup, eyewitnesses are claimed to be less accurate (more prone to false recognition) if the suspect and fillers are similar. That is, the more similar lures or innocent people are to targets (suspects), the more confident false identification (false recognition) is likely to occur (cf. DeSoto & Roediger, 2014; Grabman et al., 2019; Roediger & DeSoto, 2014). In other research, however, correct recognition has been shown to be lower, and incorrect recognition higher, when a lure substituted for a target face picture was of high similarity (Brewer & Wells, 2006). In this case, therefore, results suggest that testing highly similar faces in a lineup is more likely to cause recognition errors.

Regardless, Colloff and Wixted (2020) make a case for “fair” simultaneous lineups, where all members match the description of the suspect to provide eyewitnesses with the opportunity to compare features to boost identification accuracy. Conversely, in an “unfair” simultaneous lineup the suspect stands out because they are more similar to the participant’s memory of the real perpetrator than are the other lineup members. From this perspective, the confidence-accuracy inversion effect largely occurs when memory is poor (e.g., not based on remembering) or testing procedures are biased (e.g., by inappropriate lures); when appropriate testing conditions are maintained, high confidence is strongly predictive of high accuracy. As we show here, however, confidence in the accuracy of recognition is also affected by the quality of memory and the experience of remembering – not just its factual accuracy.

Methods

Recruitment

Participants were psychology undergraduates attending the University of Stirling, Scotland, U.K., aged 18–37 years with normal or corrected-to-normal vision and no history of abnormal colour perception, who took part for mandatory course credits. The studies were approved by the University of Stirling’s General University Ethics Panel (GUEP). Full details of the online recruitment protocol are provided in Supplementary file 1: Appendix. Study 1, Recruitment and online instructions to participants. We recruited 70 participants of mean age 21 years (S.D. 5 years); 56 identified as female. A sample size of 60 participants provides sufficient statistical power (0.78) for an experiment designed to determine, for the same subjects, if three data sets (i.e., test picture-pair conditions) are significantly different from each other at 5% significance level (two-tailed), using one-way ANOVA.

Stimuli

Real-world colour natural scene photographs were used for targets and lures. We selected images from four distinct categories of natural scenes: flowers, green landscapes and trees, mountains, and water, because previous evidence suggests diverse types of natural scenes are associated with equivalent levels of recognition accuracy (Wichmann et al., 2002). Because the picture similarity task involves perceptually similar and dissimilar pictures, we required a large number of images. Suitable images were sourced and downloaded from two online psychology image databases, the McGill Calibrated Color Image Database (Olmos & Kingdom, 2004), and the U-Penn calibrated natural images of the Okavango Delta of Botswana (Tkačik, Garrigan, Ratliff, Milčinski, Klein, Seyfarth… & Balasubramanian, 2011). Because of the difficulty in downloading sufficient suitable images for the categories required, further photographs of natural scenes were selected from the corresponding author’s personal database of natural scene colour photographs from the UK and elsewhere. To ensure none of the images were obviously less visually vivid, contrast and colour saturation were adjusted if required. All photographs used in Study 1 can be viewed online as detailed in the data accessibility statement. To minimise content that would potentially bias recognition accuracy or image memorability, we avoided photographs including people or animals, and those featuring incongruent objects or man-made structures such as famous buildings (Evans & Baddeley, 2018; Hunt, 2013; Santangelo, Di Francesco, Mastroberardino & Macaluso, 2015; VanArsdall, Nairne, Pandeirada & Cogdill, 2017; Wichmann et al., 2002). Faces with positive and negative expressions of emotion and pictures depicting positive and negative events are considered to be emotionally-valenced (e.g., Keightley, Winocur, Graham, Mayberg, Hevenor & Grady, 2003). 
We deliberately did not include images with potential emotionally evocative features (i.e., including people with emotional expressions, or animals), as such scenes can influence recognition relative to neutral scenes (e.g., Keightley, Chiew, Anderson & Grady, 2011), and can influence both memory accuracy and confidence (Phelps & Sharot, 2008). Nonetheless, as would be the case for any images, we could not rule out idiosyncratic emotional interpretation of the images arising from participants liking or disliking the picture, or from images provoking a pleasant or unpleasant personal memory.

Procedures

Study sequence

The experiment was carried out online using the participant’s own tablet or PC (but not a smart phone). Duplicating Tulving’s (1981) original task, 196 different natural scene photographs were used: 160 photographs were arranged into three blocks, shown sequentially in a fixed order with no gap between the blocks. First, a buffer block of 56 photographs, which were not used for the test sequence; second, a target block of 48 photographs (not identified as such to the participants) consisting of 36 A targets and 12 B photographs (which were not shown again but were similar to the B’ lures shown at test); third, a second buffer block of 56 photographs, which were not used for the test sequence (Figure 1a). Following a fixation cross (200 ms), each picture was displayed for 2000 ms in a continuous sequence. The sequencing of photographs within each study block was randomised for each participant by the online platform. The additional 36 images that were not shown at study were used as lures, paired with the 36 A (old) targets shown in the test sequence. To ensure a timed break between study and test phases, after viewing all images in the study phase participants were asked to look at a “Where’s Wally” picture for two minutes before they were advanced to the test phase.
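The block structure described above can be sketched in a few lines of Python. This is only an illustration of the counts and ordering reported in the text, not the code run by the online platform; the function name `build_study_sequence` and all picture identifiers are hypothetical.

```python
import random

def build_study_sequence(buffer_1, target_block, buffer_2, rng):
    """Keep the three blocks in fixed order, shuffling pictures within each block."""
    sequence = []
    for block in (buffer_1, target_block, buffer_2):
        shuffled = list(block)
        rng.shuffle(shuffled)  # per-participant randomisation within the block
        sequence.extend(shuffled)
    return sequence

# Hypothetical identifiers standing in for the photographs.
buffer_1 = [f"buffer1_{i:02d}" for i in range(56)]
a_targets = [f"A_{i:02d}" for i in range(36)]
b_pictures = [f"B_{i:02d}" for i in range(12)]
buffer_2 = [f"buffer2_{i:02d}" for i in range(56)]
test_lures = [f"lure_{i:02d}" for i in range(36)]  # never shown at study

study = build_study_sequence(buffer_1, a_targets + b_pictures, buffer_2, random.Random(1))
total_photographs = len(study) + len(test_lures)
```

With these counts the study list holds 160 pictures (56 + 48 + 56), and adding the 36 lures gives the 196 photographs reported.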

Test sequence

At test, the 36 A targets were shown again, each with a lure (a new previously unseen picture) as a test picture-pair. The test picture-pairs were randomly displayed in each of three conditions: (A-A′), the lure was perceptually similar to the A target; (A-B′), the lure was dissimilar to the target but was perceptually (mnemonically) similar to one of the 12 B pictures shown at study within the target block; and (A-X), the lure was dissimilar to the target or to any other picture shown at study. All test picture-pairs were matched and counterbalanced for natural scenes subject category (three of the 12 test-pairs in each of the three conditions were taken from each of the four natural scenes subject categories; Figure 1b). All participants viewed the same 12 test picture-pairs displayed for 6000 ms in random sequence in each of the three conditions. The target picture was pseudo-randomized to appear either on the left- or right-hand side of the screen. Full details of the image selection for test picture-pairs used in Study 1 are provided in Supplementary file 2: Appendix. Study 1: Picture similarity criteria.
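As a rough sketch of how such a test list could be assembled, the Python below builds 36 fixed picture-pairs (three per category in each condition), shuffles their order, and assigns the target side. The identifiers are hypothetical, and balancing left and right equally is an assumption made for illustration; the text states only that the side was pseudo-randomized.

```python
import random
from collections import Counter

CATEGORIES = ("flowers", "green_landscapes", "mountains", "water")
CONDITIONS = ("A-A'", "A-B'", "A-X")

def build_test_pairs():
    """Fixed set of 36 pairs: 3 per scene category in each of the 3 conditions."""
    return [
        {"condition": cond, "category": cat, "pair_id": f"{cond}_{cat}_{k}"}
        for cond in CONDITIONS
        for cat in CATEGORIES
        for k in range(3)
    ]

def randomise_trials(pairs, rng):
    """Shuffle trial order and assign the target side (assumed balanced 18/18)."""
    trials = [dict(pair) for pair in pairs]
    rng.shuffle(trials)
    sides = ["left", "right"] * (len(trials) // 2)
    rng.shuffle(sides)
    for trial, side in zip(trials, sides):
        trial["target_side"] = side
    return trials

trials = randomise_trials(build_test_pairs(), random.Random(7))
per_condition = Counter(t["condition"] for t in trials)
per_cell = Counter((t["condition"], t["category"]) for t in trials)
per_side = Counter(t["target_side"] for t in trials)
```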

Reaction time

Output from the testing platform included time (ms) for the participant to make their recognition decision. In addition to enabling exclusion of trials that were timed out, reaction times could be evaluated for each test picture-pair condition.

Behavioural data

Participants’ behavioural responses during testing were recorded, starting with the time (ms) participants took to make their recognition decision (enabling exclusion of trials that took longer than the maximum time allowed of 6000 ms). On all test trials, once a decision had been made (i.e., the participant had selected one of the picture-pair as being old) behavioural data were collected from participants in a fixed order: vividness, confidence, and remembering. Vividness (‘how vividly did you remember the picture?’) and confidence (‘how confident are you that you remembered the picture correctly?’) ratings were both assessed using a 0–100 sliding scale and were therefore asked one after the other, with the ‘yes or no’ remember response asked last. The vividness judgment was presented first to ensure that vividness ratings related to the memory retrieval decision; equally, it was considered appropriate to rate memory vividness before assessing confidence, as accuracy is thought to influence confidence.

Based on previous reported studies (e.g., Gardiner, Ramponi & Richardson-Klavehn, 2002), participants were instructed to answer yes or no to whether they had remembered the picture recognised as old, so as not to force participants to answer Know if a stimulus was not remembered. Additionally, before leaving the experiment, participants completed a short online checkbox questionnaire based on how well they complied with the Remember yes or no instructions. Full details of the task instructions are provided in Supplementary file 1: Appendix. Study 1, Recruitment and online instructions to participants.

Data Analysis

Data were analysed using the Statistical Package for the Social Sciences (SPSS Statistics for Windows, Version 29.0, released 2022. Armonk, NY: IBM Corp.), and jamovi (The jamovi project, 2021; jamovi Version 1.6 [Computer software], retrieved from https://www.jamovi.org). Testing for moderation was carried out using the PROCESS macro for SPSS (Hayes & Scharkow, 2013; Hayes, Montoya & Rockwood, 2017).

Results

Participants

To ensure data quality, accuracy (proportion of correct responses) was analysed for each participant relative to the mean [M (S.E.) 0.62 (0.01)]. One outlier with a response accuracy of 0.28 was excluded from the subsequent data analysis: of the 36 trials carried out, 17 were timed out, and all trials were judged as remembered. Data from five other participants who judged all 36 trials as remembered were also excluded, as it was considered they had either misunderstood or not followed the instructions. The final data set comprised 64 participants and a total of 2194 analysed trials; 110 timed-out trials (4.8%) were excluded from the analysis (recognition decisions were not recorded on timed-out trials).
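The exclusion figures can be checked with simple bookkeeping. The sketch below uses only numbers reported in the text and assumes every retained participant was presented with all 36 test trials; the variable names are illustrative.

```python
recruited = 70
excluded_outlier = 1          # response accuracy of 0.28
excluded_all_remembered = 5   # judged all 36 trials as remembered
trials_per_participant = 36
timed_out_trials = 110

participants = recruited - excluded_outlier - excluded_all_remembered
presented_trials = participants * trials_per_participant
analysed_trials = presented_trials - timed_out_trials
timed_out_percent = 100 * timed_out_trials / presented_trials
```

Under that assumption, 64 participants were presented with 2304 trials, of which 2194 remain once the 110 timed-out trials (about 4.8%) are removed.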

Variables of interest

As can be seen in Figure 2a, confidence was significantly higher when the recognition decision was associated with the experience of remembering compared to the absence of remembering [median = 64 and 25 respectively; M (S.E.) = 62.9 (0.74) and 28.2 (0.76) respectively; n = 1179 and 1015 trials respectively]. Because of the substantial number of individual trials analysed, even tiny differences can produce significant p-values. Here, therefore, we report the unstandardised beta coefficient (B), produced by regression analysis when the analysis is performed on original, unstandardised variables. Interpretation of effect sizes was related to the R² change, taken as (0.01) 1% = weak, (0.09) 9% = moderate, and (0.25) 25% = strong (Cohen, 1992; Taylor, 1990). Trial level ANOVA revealed a significant difference [F (1, 2192) = 1071, p < 0.001], with a strong effect size (B = 34.74; R² change = 32.8%). Figure 2b illustrates that confidence was also significantly higher when the recognition decision was correct compared to incorrect [median = 51 and 44 respectively; M (S.E.) = 50.2 (0.84) and 41.3 (0.97) respectively; n = 1360 and 834 trials respectively]. Analysis again revealed a reliable effect [F (1, 2192) = 46.0, p < 0.001], but critically, in this case the effect size was weak (B = 8.93; R² change = 2.1%).

Study 1: illustrating the distribution of trial level confidence ratings **(a)** by the remember response (no or yes); **(b)** by recognition accuracy (incorrect or correct). Box plots show median values as horizontal black bands. The associated violin plots illustrate the distribution of responses, along with mean values, shown as black squares. A straight line between the mean values is equivalent to a point biserial plot.

Variables of interest were compared by test-pair condition, revealing a confidence-accuracy dissociation between perceptually and mnemonically similar conditions (see Table 1). Overall differences in confidence and accuracy were followed up using pair-wise independent samples t-tests (two-sided), revealing that confidence was higher for the mnemonically similar A–B’ condition compared to the perceptually similar A–A’ condition (p = 0.01; d = 0.14), whereas accuracy (proportion correct) was not significantly different (p = 0.12). Neither the proportion of trials remembered, nor reported memory vividness, differed significantly across conditions. Participants took significantly longer to decide which of the pictures was old in the A–A’ condition compared to both A–B’ (p = 0.002; d = 0.16) and A–X (p < 0.001; d = 0.2) conditions.
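The reported effect sizes can be approximately reconstructed from the summary statistics in Table 1. The sketch below assumes that the bracketed values in Table 1 are standard errors (as in the M (S.E.) notation used elsewhere in the text) and that Cohen’s d was computed with a pooled standard deviation; both are assumptions, and the function name is illustrative.

```python
from math import sqrt

def cohens_d_from_summary(mean_1, se_1, n_1, mean_2, se_2, n_2):
    """Cohen's d from means, standard errors and trial counts (pooled SD)."""
    sd_1 = se_1 * sqrt(n_1)  # recover SD from SE = SD / sqrt(n)
    sd_2 = se_2 * sqrt(n_2)
    pooled_var = ((n_1 - 1) * sd_1**2 + (n_2 - 1) * sd_2**2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / sqrt(pooled_var)

# Values taken from Table 1 (mean, assumed S.E., number of trials).
d_confidence = cohens_d_from_summary(48.65, 1.13, 731, 44.6, 1.09, 724)      # A-B' vs A-A'
d_rt_ab = cohens_d_from_summary(3178.3, 48.74, 724, 2974.9, 45.74, 731)      # A-A' vs A-B'
d_rt_ax = cohens_d_from_summary(3178.3, 48.74, 724, 2923.0, 44.28, 739)      # A-A' vs A-X
```

Under these assumptions the three values round to 0.14, 0.16 and 0.20, matching the effect sizes reported for the pairwise comparisons.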

Table 1

Study 1: variables of interest [confidence, recognition accuracy (proportion correct), proportion remembered, vividness, and reaction time] by test-pair condition.

TEST PICTURE-PAIR CONDITION (n)	CONFIDENCE	PROPORTION CORRECT	PROPORTION REMEMBERED	VIVIDNESS	REACTION TIME (ms)
A–A’ (724)	44.6 (1.09)	0.63 (0.02)	0.52 (0.02)	50.05 (1.07)	3178.3 (48.74)
A–B’ (731)	48.65 (1.13)	0.59 (0.02)	0.56 (0.02)	52.34 (1.08)	2974.9 (45.74)
A–X (739)	47.21 (0.65)	0.65 (0.02)	0.53 (0.02)	50.58 (1.08)	2923.0 (44.28)
ANOVA	F (2, 2191) = 3.36, p = 0.04, η² = 0.003	F (2, 2191) = 3.06, p = 0.05, η² = 0.003	F (2, 2191) = 1.82, p = 0.16	F (2, 2191) = 1.24, p = 0.29	F (2, 2191) = 8.48, p < 0.001, η² = 0.008

Note. Results are compared by one-way ANOVA. Effect sizes are represented by the partial η² statistic, where η² = 0.01 indicates a small, η² = 0.06 a moderate, and η² = 0.14 a large effect.

Confidence Accuracy Characteristic (CAC) Analysis

We next evaluated our data using a calibration approach, i.e., using Confidence Accuracy Characteristic (CAC) analysis. Metacognitive calibration measures the accuracy associated with individual levels of confidence (e.g., how accurate are responses given a 100% confidence rating). Recognition accuracy, dependent on recognition of the target item as old (Zawadzka et al., 2017), is represented using calibration curves, calculated by dividing metacognitive confidence judgments into bins, e.g., 0–10, 11–20… and determining the average accuracy (proportion correct) for each bin. Plotting the results is equivalent to plotting a calibration curve (Palmer, Brewer, Weber & Nagesh, 2013). For assessments where the outcome is binary (hit or miss, yes or no, correct or incorrect), this method of calibration equates with a confidence accuracy calculation. Proposed as an alternative to Receiver Operating Characteristic (ROC) curve analysis, CAC plots are considered to be more intuitive, for example as a way of relating identification accuracy to eyewitness confidence (Mickes, 2015), mapping participants’ average accuracy to their averaged or binned metacognitive judgments of confidence (assuming that participants can reliably rate their psychological experience on to the scale used; cf. Rhodes, 2019).

A perfect linear CAC relationship would be characterized by 100% averaged accuracy for those witnesses who were 100% confident, 90% accuracy for witnesses who were 90% confident, and so on. Eyewitness researchers (e.g., Brewer & Wells, 2006; Vuorre & Metcalfe, 2022; Wixted & Wells, 2017) have increasingly become more concerned with calibration than with measures of metacognitive resolution (indicating how individuals can judge their own accuracy using confidence ratings). In earlier research, for example, Bornstein and Zickafoose (1999) plotted average predicted confidence against accuracy for responses to both eyewitness and general knowledge questions.

More recently, Tekin, DeSoto, Wixted and Roediger (2021) demonstrated how a CAC plot can be adjusted for use in a binary old or new discrimination task, as a direct alternative to ROC curve analysis. These authors used two different methods of calculation to create the plot: either item-based accuracy, where proportion correct = hits ÷ (hits + misses), or response-based accuracy, where proportion correct = hits ÷ (hits + false alarms). In either case findings revealed that confidence was always highly related to accuracy, with both related and unrelated lures. Here, our 2-AFC paradigm does not produce data for hits and misses, and thus we use the response-based accuracy method to create CAC plots, where proportion correct = hits ÷ [hits + false alarms (i.e., all responses)]. In addition, to avoid artificially biasing outcomes by tailoring the choice of bins, CAC plots were created using trial level data with confidence binned into quartiles (defined in a data driven fashion for each set of data analysed, e.g., values of 25, 50 & 75 for a median value of confidence = 50, dividing the number of data points into quartiles of more-or-less equal size), allowing comparison of the averaged values of recognition accuracy (proportion correct) for each level of binned confidence.
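A minimal Python sketch of this kind of quartile-binned CAC calculation is given below. It is an illustration under stated assumptions rather than the analysis pipeline used here: the function name `cac_by_quartile` is hypothetical, the cut points come from `statistics.quantiles` with the inclusive method (the exact quartile definition used in the study is not specified), and the toy trials are invented. Accuracy in each bin is the number of correct 2-AFC choices divided by all responses in that bin.

```python
from statistics import mean, quantiles

def cac_by_quartile(confidence, correct):
    """Bin trials by data-driven confidence quartiles.

    Returns the three cut points, the proportion correct per bin, and the
    number of trials per bin (lowest to highest confidence).
    """
    edges = quantiles(confidence, n=4, method="inclusive")
    bins = [[], [], [], []]
    for rating, is_correct in zip(confidence, correct):
        # Bin index = number of cut points lying strictly below the rating.
        index = sum(rating > edge for edge in edges)
        bins[index].append(is_correct)
    accuracy = [mean(b) if b else None for b in bins]
    counts = [len(b) for b in bins]
    return edges, accuracy, counts

# Hypothetical toy trials: confidence rating (0-100) and correctness (1 or 0).
toy_confidence = [10, 20, 30, 40, 50, 60, 70, 80]
toy_correct = [0, 1, 0, 1, 1, 0, 1, 1]
edges, accuracy, counts = cac_by_quartile(toy_confidence, toy_correct)
```

Because ratings tied at a cut point all fall in the same bin, real data will generally give bins of roughly rather than exactly equal size, consistent with the description above.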

As illustrated in Figure 3a, when examined across all data, CAC plots clearly demonstrate that as accuracy increases, higher confidence is reported. Collapsed across remembered and not remembered responses, the overall proportion correct (i.e., accuracy) was significantly higher at high levels of confidence ≥ 76 (M = 0.78) than for all other bins (51–75, 26–50, and 0–25; M = 0.61, 0.57 and 0.57 respectively; independent two-sided samples t-tests all significant, p < 0.001; Cohen’s d = 0.29, 0.39 and 0.37 respectively). The CAC plot also allowed us to separately examine the effect of remembering on recognition accuracy, illustrating that for each level of binned confidence there was no significant difference in accuracy by remembering (Figure 3b). Notably, binned confidence was higher as the proportion of trials where remembering was reported increased. For responses associated with remembering, analysis confirmed accuracy was significantly higher at the highest bin of confidence ≥ 76 (M = 0.79) than for all other bins (51–75, 26–50 and 0–25; M = 0.61, 0.59 and 0.61 respectively; independent two-sided samples t-tests all highly significant, p < 0.001; Cohen’s d = 0.38, 0.46 and 0.42 respectively).

Study 1: CAC plots of trial level data across all test-pair conditions, illustrating recognition accuracy (mean proportion correct) by confidence (binned in quartiles): **(a)** overall data; **(b)** the overall data split by the experience of remembering. [(** p < 0.001), (* p < 0.01). Error bars represent 95% *C.I*.; for clarity, only positive or negative bars are shown in Figure 3b. Numbers represent trials assessed within each bin of confidence; bold figures represent remembered trials and italic figures represent trials associated with not remembering].

Furthermore, as accuracy increased, higher confidence was also reported for responses made in the absence of remembering. Analysis confirmed that accuracy was significantly higher at the highest bin of confidence ≥ 76 (M = 0.71) than for all other bins (51–75, 26–50 and 0–25; M = 0.60, 0.55 and 0.56 respectively; independent two-sided samples t-tests all significant, p = 0.003, p < 0.001 and p < 0.001; Cohen’s d = 0.23, 0.33 and 0.3 respectively).

We also examined CAC plots by test-pair condition, as shown in Figure 4, unpacking the overall data pattern shown in Figure 3b. Across all test-pair conditions, the proportion of trials where remembering was reported increased as binned confidence was higher. Remembering had the greatest impact on performance when the test pairs were perceptually similar (the A–A’ condition). The overall proportion correct (i.e., accuracy) was significantly higher for the highest bin of confidence ≥ 76 (M = 0.80) than for all other bins (51–75, 26–50, and 0–25; M = 0.68, 0.61, and 0.63 respectively). More importantly, when the recognition decision was made in the absence of remembering there was no relationship between recognition accuracy and confidence, i.e., performance on the task was flat, despite higher reported confidence, and with no significant differences in accuracy by bins of confidence (Figure 4a).

Study 1: CAC plots by test picture-pair condition illustrating recognition accuracy (mean proportion correct) by confidence (binned in quartiles) split by the experience of remembering, **(a)** when lures were perceptually similar (the A–A’ condition), **(b)** when lures were mnemonically similar (the A–B’ condition), **(c)** when lures were dissimilar to any picture shown at study (the A–X condition). Example picture-pairs are illustrated for each condition; all the A target pictures had been shown during the study phase. The B picture was shown during the target block in the study phase of the experiment. Neither A’, B’ nor X images had been shown previously in the study. [(** p < 0.001), (* p < 0.01). Error bars represent 95% *C.I*. – for clarity only positive or negative bars are shown; numbers represent trials assessed within each bin of confidence – bold figures represent remembered trials and italic figures represent trials associated with not remembering].

CAC plots in the A–A’ condition revealed remembering to be associated with higher recognition accuracy at the highest bin of confidence; the calculations differing significantly at the highest confidence bin ≥ 73, remembering vs. not remembering (M = 0.80, 0.58), independent two-sided samples t-tests (p = 0.02), Cohen’s d = 0.51.

Similarly, in the A–B’ condition (Figure 4b), when the target was remembered, the overall proportion of correct responses was significantly higher at the highest bin of confidence, ≥ 76 (M = 0.79), than for all other bins (51–75, 26–50, and 0–25; M = 0.55, 0.52, and 0.53; independent-samples two-sided t-tests, all p < 0.001; d = 0.53, 0.61, and 0.59). However, accuracy was also significantly higher in the absence of remembering at the highest bin of confidence, ≥ 76 (M = 0.90), than for all other bins (51–75, 26–50, and 0–25; M = 0.57, 0.49, and 0.53 respectively; independent-samples two-sided t-tests, all p < 0.001; Cohen’s d = 0.7, 0.84, and 0.75). Critically, higher accuracy was associated with the highest bin of confidence whether or not the participant reported the experience of remembering. Noting the discrepancy in the number of trials contributing to each data point, the observed difference in accuracy at high confidence (≥ 76) for remembering (M = 0.79) compared with not remembering (M = 0.9) was not significant (p = 0.4).
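For readers unfamiliar with the effect sizes reported above, the sketch below shows one common way of computing Cohen’s d for two independent groups of trial outcomes, using the pooled sample standard deviation. The text does not state which variant of d was used, so the pooled-SD formula is an assumption, and the function name and example outcomes are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical trial outcomes (1 = correct, 0 = incorrect), not study data.
remembered = [1] * 8 + [0] * 2      # proportion correct 0.8
not_remembered = [1] * 6 + [0] * 4  # proportion correct 0.6
d_example = cohens_d(remembered, not_remembered)
print(round(d_example, 2))
```

Because d scales the difference in proportion correct by the variability of the trial outcomes, the same difference in means can correspond to different effect sizes depending on how far each proportion is from 0.5.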

When test pairs were dissimilar (the A–X condition), performance in the task was essentially flat regardless of remembering, suggesting that confidence was based on factors other than accuracy.

For each condition, we conducted a 4 (confidence bins) × 2 (remembering) ANOVA. Examining data for recognition accuracy (proportion correct) by remembering (no = 0, or yes = 1) revealed a statistically reliable main effect of remembering on recognition accuracy in the A–A’ condition; however, the effect size was less than small (Table 2). There was no significant main effect of remembering on accuracy in the A–B’ condition; however, the analysis revealed that the relationship between confidence and accuracy was significant (Table 2) with a large effect size. There was no significant interaction between remembering and accuracy in this condition.

Table 2

Study 1. Results of 2 × 4 ANOVA by test-pair condition, examining trial level data for recognition accuracy (proportion correct) by remembering (no = 0, or yes = 1), across the four bins of confidence.

| Condition | Main effect | F (df) | p value | η² | Power |
|---|---|---|---|---|---|
| A–A’ | Remembering | 4.56 (1, 723) | 0.035 | 0.006 | 0.56 |
| A–B’ | Confidence bins | 5.2 (3, 730) | 0.001 | 0.021 | 0.93 |
| A–X | none | – | – | – | – |

Finally, in the A–X condition there were no significant main effects. The analysis confirmed that only when targets and lures were perceptually similar (the A–A’ condition) did remembering the target have a significant main effect on accuracy. In the A–X condition, although remembering was shown to have some effect on accuracy in the CAC plots, the pattern was inconsistent, and the ANOVA demonstrated no main effect of remembering. Critically, when mnemonically similar lures were presented (the A–B’ condition), higher confidence was seen at higher levels of accuracy regardless of whether the experience of remembering was associated with the recognition decision, with no main effect of remembering.

Moderation Analysis

To further scrutinise the three-way relationship, we next asked whether the association between accuracy and confidence was moderated by remembering or its quality (the remembered vividness of the target). A two-way ANOVA estimates how the mean of a quantitative variable changes as a function of two categorical variables, assessing the main effect of each independent variable and whether there is an interaction between them. By contrast, moderation analysis tests how the effect of an independent variable on the dependent variable depends on a moderator variable, which changes the strength of the relationship between the other two variables. We therefore carried out two additional analyses which allow the effect of recognition accuracy on confidence to differ across levels of each moderator (illustrated conceptually in Figure 5).
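The logic of moderation can be illustrated for the simplest case, in which both the predictor (accuracy) and the moderator (remembering) are dichotomous. A common formulation is a regression of confidence on accuracy, the moderator, and their product, with the product coefficient capturing the moderation effect. The sketch below assumes that formulation; the text does not specify the software or exact model used, and the function name and example trials are hypothetical. With two binary predictors the model is saturated, so its least-squares coefficients can be read directly from the four cell means.

```python
from statistics import mean


def moderation_coefficients(trials):
    """Coefficients of confidence = b0 + b1*acc + b2*rem + b3*acc*rem.

    `trials` is an iterable of (correct, remembered, confidence) tuples with
    binary `correct` and `remembered`. Every combination of the two must
    occur at least once, otherwise a cell mean is undefined.
    """
    cells = {(a, r): [] for a in (0, 1) for r in (0, 1)}
    for correct, remembered, confidence in trials:
        cells[(int(correct), int(remembered))].append(confidence)
    m = {key: mean(values) for key, values in cells.items()}

    b0 = m[(0, 0)]                    # incorrect, not remembered
    b1 = m[(1, 0)] - m[(0, 0)]        # accuracy effect without remembering
    b2 = m[(0, 1)] - m[(0, 0)]        # remembering effect when incorrect
    b3 = (m[(1, 1)] - m[(0, 1)]) - b1  # change in accuracy effect (moderation)
    return b0, b1, b2, b3


# Hypothetical trials for illustration only (not data from the study).
example_trials = [
    (0, 0, 30), (0, 0, 34),
    (1, 0, 31), (1, 0, 33),
    (0, 1, 70), (0, 1, 74),
    (1, 1, 78), (1, 1, 82),
]
b0, b1, b2, b3 = moderation_coefficients(example_trials)
print(b0, b1, b2, b3)
```

A non-zero b3 means that the effect of accuracy on confidence differs between remembered and not remembered trials. Testing whether b3 differs reliably from zero, and handling a continuous moderator such as vividness, requires a full regression fit rather than this cell-mean shortcut.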

We hypothesised that the strength of the positive effect of accuracy on a participant’s confidence would vary according to whether or not the picture selected as old was remembered, and the quality of the memory, i.e., its remembered vividness. The first analysis revealed that remembering had a significant moderation effect on the relationship between accuracy and confidence, i.e., there was a significant interaction between accuracy and remembering (p = 0.002), as shown in Figure 6a. The experience of remembering had a large effect on confidence levels, regardless of recognition accuracy. There was a further small but significant increase in confidence for accurate (correct) recognition compared to incorrect recognition when the experience of remembering informed the decision. Critically, however, in the absence of remembering, accuracy had no significant effect on confidence. The second moderation analysis revealed that remembered vividness had no significant moderation effect on the relationship between accuracy and confidence, with no significant interaction between accuracy and remembered vividness (p = 0.09), as shown in Figure 6b. The quality of remembering had a large effect on confidence levels, regardless of recognition accuracy, with no significant difference in confidence associated with remembered vividness for accurate (correct) recognition compared to incorrect recognition at any level. In sum, therefore, the moderation analysis confirmed that the association between recognition accuracy and confidence is moderated by the experience of remembering the target, but not by the quality of remembering (remembered vividness of the selected image).

Study 1. Column plots of results from the two moderation analyses (error bars represent 95% C.I.): **(a)** recognition accuracy (incorrect or correct) with confidence by remembering (no or yes); **(b)** recognition accuracy (incorrect or correct) with confidence by vividness at the mean ± 1 S.D., (low, mean, and high = 22, 51, and 80 respectively). The interaction (moderation) effect can be seen by comparing the columns for incorrect versus correct recognition in each case, i.e., representing the three-way interaction (a) between remembering, confidence, and recognition accuracy and (b) between remembered vividness, confidence, and recognition accuracy.

Correlational Analysis

Because our measures of confidence and vividness were both continuous, we were also able to evaluate the correlation between these variables using trial level data. As can be seen in Figure 7, there was a strong correlation between memory quality (remembered vividness) and confidence in memory’s accuracy (Pearson’s r = 0.82, p < 0.001). Note that the sliding scale used to make vividness ratings had a starting position of 50, resulting in some anchoring, as seen in Figure 7. We changed the slider starting position to zero halfway through the experiment to encourage participants to move the slider to provide a vividness score. Noting that some anchoring is also visible at 0 for the confidence responses, we point out that participants were instructed to leave the confidence rating at 0 if they were guessing. Despite the presence of anchoring, a strong correlation is clearly visible within the data, with two-thirds (67%) of the variance in confidence explained by vividness of memory for the recognised images. The same strong positive correlation was found for data averaged across trials for each participant (aggregated data, r = 0.75, p < 0.001), and for trial level data in each test-pair condition (A–A’ r = 0.8, A–B’ r = 0.82, A–X r = 0.81, all p < 0.001).
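The step from the correlation coefficient to the proportion of variance explained is the square of Pearson’s r. The short Python sketch below shows that arithmetic for the reported value, together with a minimal implementation of r itself; the function names and the small example vectors are hypothetical and are included only to make the calculation concrete.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def variance_explained(r):
    """Proportion of variance shared by two variables correlated at r."""
    return r ** 2


# Reported trial-level correlation between vividness and confidence.
shared = variance_explained(0.82)
print(round(shared, 2))

# Hypothetical ratings, only to show pearson_r in use (not study data).
r_example = pearson_r([10, 40, 60, 90], [20, 35, 70, 85])
```

For r = 0.82 this gives approximately 0.67, matching the two-thirds of variance in confidence attributed to vividness in the text.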

Study 1. Scatterplot of trial level data illustrating the correlation between the vividness of memory for the image recognised as old and confidence in the accuracy of the decision, overlaid is the linear regression (best fit) line.

Because our measures of accuracy and remembering were both dichotomous, equivalent trial level analyses could not be carried out for these variables. We therefore examined the relationship between these variables and our continuous measure of confidence, using aggregated data for each participant. Analysis revealed a moderate correlation between remembering and confidence (r = 0.25; p = 0.004), whereas the correlation between accuracy and confidence was not significant (r = 0.12; p = 0.18).

Post-Experiment Questionnaire

Participants were required to complete an online post-experiment questionnaire (detailed in Supplementary file 1: Appendix. Study 1 Online instructions to participants), which produced self-report data about what their decisions were based on. These data allowed us to confirm that participants had followed the instructions regarding the remember response, with the majority of remember responses being associated with conscious awareness relating to a personal memory. Bearing in mind that participants were instructed to check all responses that applied, the commonest reason cited for remembering the target was that it reminded the participant of something (69% of all responses). The second most common response was that the participant remembered thinking of an association or a personal memory (53%), followed by remembering what they were thinking when first viewing the image (47%). Liking the picture or thinking it was attractive was also common (42%, representing a positive emotional response), whereas only a few responses were associated with remembering something happening in the room at the time of viewing the picture (9%). In the absence of remembering, the commonest reason cited was not recognising the other picture in the test-pair (64%), with little difference between random guessing (48%) and thinking the picture was familiar (52%). When participants were not forced to answer remember or know, the reason for the decision that was reported least often was just ‘knowing the picture was old’ (30%).

Discussion

We asked whether confidence in the accuracy of recognition is affected more by the experience of remembering or by its factual accuracy. Across conditions, our data show that remembering had the stronger effect on confidence. Conversely, CAC plots for the overall data confirmed highly accurate recognition was associated with high confidence (corresponding with reports in the wider literature, e.g., Brewer & Wells, 2006), with no significant effect of remembering. Critically, however, by examining performance across the three test-pair conditions, our data reveal that the relationship between confidence and accuracy varies as a function of the context in which to-be-remembered stimuli are presented at test. CAC plots revealed that links between high accuracy and high confidence for remember responses were present when lures were perceptually or mnemonically similar. Analysis confirmed that although remembering only had a significant effect on accuracy in the perceptually similar A–A’ condition, in the mnemonically similar A–B’ condition confidence was also higher for highly accurate responses, but this was not reliant on the decision being made with the experience of remembering. Analysis also revealed that remembering had no significant effect on accuracy in the A–X condition (i.e., when lures were unrelated to targets). Apart from when mnemonically similar (A–B’) lures were presented at test, there was no relationship between accuracy and confidence in the absence of remembering.

Our findings replicate Tulving’s confidence-accuracy inversion of lower mean confidence and higher mean accuracy for perceptually similar (A–A’) pairs compared to higher mean confidence and lower mean accuracy for mnemonically similar (A–B’) pairs. Nonetheless, on the basis of the different CAC plot patterns seen across conditions, our data suggest a novel explanation of the confidence-accuracy inversion. For A–A’ pairs higher confidence was seen for remembered stimuli across all bins, with highest levels of confidence predominantly seen with the experience of remembering and with high accuracy. For mnemonically similar (A–B’) pairs, the highest bin of confidence linked to high accuracy regardless of remembering. We suggest, therefore, that in some cases high confidence can occur due to participants thinking they have remembered the target while mistakenly remembering a mnemonically similar lure (i.e., misremembering as old a B’ lure that is similar to a B target shown at study). Higher levels of confidence linked to inaccurate ‘remembered’ responses may be responsible for the inversion effect of lower accuracy and higher confidence compared to A–A’ pairs. This view is supported by wider evidence that overlap between features of stored information may explain autobiographical memory errors (Gonsalves & Paller, 2002; Newman & Lindsay, 2009).

Additionally, and reflecting the accessibility model of the feeling of knowing (Koriat, 1993), participants retrieve information from memory as the basis for their confidence ratings, using whatever they remember from the search process (regardless of accuracy) to inform their recognition decision. We also comment that we found mean accuracy to be highest in the A–X condition. When the lure was dissimilar to the target or to any other picture studied, neither remembering nor confidence was reliably associated with changes in accuracy, with self-report data suggesting that not recognising the lure could be relied on. The present results suggest, therefore, that a new stimulus can trigger recognition, e.g., if the stimulus elicits a personal memory. Note that similar effects are also likely to occur in eyewitness memory, for example, if an innocent party’s face in an eyewitness lineup looks similar to the face of a neighbour or a former acquaintance, meaning that the eyewitness believes they accurately recognised the face, particularly if detail, i.e., the source of the memory, is lacking.

Taken together, therefore, our data indicate that confidence in the accuracy of recognition is based largely on the experience of remembering, rather than on knowledge of its accuracy. Moderation analyses confirmed that the association between recognition accuracy and confidence was differentially affected by the experience of remembering. In addition, despite the presence of a strong correlation between memory vividness and confidence, memory vividness did not significantly moderate the association between recognition accuracy and confidence. The lack of a differential effect may, however, reflect little more than the fact that vivid images are recognised more accurately (i.e., due to the paucity of the dichotomous judgment required in old or new recognition tasks). Consequently, to triangulate our findings, we carried out a second study using a source memory task that provides a continuous measure of accuracy.