
Figure 1
Visual comparison of CXR datasets.
t‑SNE visualization showing how a public, internationally trained CAD model [36] encodes CXRs from local and international datasets. Each point represents a single CXR: blue triangles and red dots indicate local TB‑negative and TB‑positive samples, respectively; green triangles and pink dots indicate international TB‑negative and TB‑positive samples [34]. The local dataset is labelled ‘LOCAL’ and the international dataset ‘INTL’ (top‑left legend). The axes are abstract t‑SNE coordinates with no intrinsic meaning; what matters is relative position, as distances between markers reflect the model’s perceived similarity between CXRs.
In the international dataset, positive and negative cases appear well separated, showing the model’s ability to discriminate TB status in populations similar to those seen during training. In contrast, local CXRs form a single cluster with poor separation between positive and negative cases. Essentially, the model views local images as one group that differs substantially from both positive and negative international TB cases. A small subset of the international sample, all TB positive, lies near the local images. Local TB‑negative samples would therefore be interpreted as more similar to international TB‑positive CXRs than to international TB‑negative cases. The inference is that the model’s TB‑negative baseline, learned from international data, does not generalize appropriately to the local population.
In a truly generalizable CAD model, diagnostic groups would separate across populations and datasets: negative cases from all sources would cluster together, while positive cases would form a distinct group. This would indicate that the model has likely learned clinically relevant features rather than dataset‑specific artefacts.
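A visualization of this kind can be produced by projecting the model’s penultimate‑layer embeddings into two dimensions with t‑SNE. The sketch below is illustrative only: the embeddings are synthetic random vectors (the shifted mean mimics a domain gap between cohorts), and the dimensionality (64) and sample counts are assumptions, not details from the study.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical 64-d CNN embeddings: 30 local and 30 international CXRs.
# The mean shift stands in for the domain gap seen in Figure 1.
local_emb = rng.normal(0.0, 1.0, size=(30, 64))
intl_emb = rng.normal(3.0, 1.0, size=(30, 64))
emb = np.vstack([local_emb, intl_emb])

# Project to 2-D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(emb)
print(coords.shape)  # (60, 2) — one (x, y) point per CXR
```

Each row of `coords` is then plotted with a marker encoding cohort and TB status, as in Figure 1; only the relative distances between points are interpretable.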



Figure 2
Visual attention maps showing the CXR regions the model paid most attention to when making predictions. Areas with red markers are regions of maximum model attention, while regions marked in green are the true locations of nodules. (A) Model A (diagnostic accuracy 80.14%): In this example, maximum model attention is in a different lung from the true location, with very little model focus on the correct image region. (B) Model B (diagnostic accuracy 80.22%): In this example, maximum model attention and true location coincide. The entire true location is encompassed by the region the model relied on for its diagnostic prediction. (C) Model C (diagnostic accuracy 95.46%): In this example, maximum model attention is on the ‘J’ image marker in the top right corner. This situation exemplifies shortcut learning in practice, where high reported accuracy can obscure diagnostic failures.
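Attention maps like these are commonly generated with Grad‑CAM, which weights a convolutional layer’s feature maps by the pooled gradients of the class score. The figure does not specify the method used, so the following is a minimal, self‑contained numpy sketch of the Grad‑CAM computation with synthetic activations and gradients; the array shapes are illustrative assumptions.

```python
import numpy as np

def grad_cam_map(activations, gradients):
    """Compute a Grad-CAM heat map from one conv layer.

    activations: (C, H, W) feature maps for one image
    gradients:   (C, H, W) gradients of the class score w.r.t. those maps
    """
    # Channel weights: global-average-pool the gradients over space
    weights = gradients.mean(axis=(1, 2))  # shape (C,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0)
    # Normalize to [0, 1] so the map can be overlaid on the CXR
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Synthetic example: 8 channels over a 7x7 feature grid
rng = np.random.default_rng(0)
acts = rng.random((8, 7, 7))
grads = rng.standard_normal((8, 7, 7))
heat = grad_cam_map(acts, grads)  # (7, 7) map in [0, 1]
```

In practice the heat map is upsampled to the CXR’s resolution and thresholded; its maximum identifies the region of peak model attention, which panel (C) shows can land on an image artefact such as a laterality marker rather than on pathology.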
