Discrepancy between validation metrics and test metrics


First of all thanks for putting this together. It’s a brilliant database accompanied by a brilliant solution for benchmarking.

I have been working with the fastMRI dataset for quite a while now, and one thing has always bothered me: the discrepancy I observe between my validation metrics and the ones I get when submitting my results. It’s a bit unusual because my test results are consistently higher than my validation results. For a long time I thought this was my mistake, but I think I have found something suggesting it’s not.

Take for example a U-net I implemented a long time ago: it had an SSIM of 0.811 (it’s a small U-net with 16 base filters) on the validation set for single-coil knee PD data, which is comparable to what the fastMRI team obtained in their second paper (Table 8 in https://arxiv.org/abs/1811.08839). The only difference is that they used 32 base filters, but we can legitimately assume the difference is small, also accounting for training, normalization, scaling, etc. My other metrics are indeed similar.

However, when I look at my results on the leaderboard (unet_corrected by Nspin, still single-coil knee PD data), I find an SSIM of 0.8491, which is a huge improvement. I remember seeing the fastMRI U-net on the leaderboard but I can’t find it anymore, so I can’t say for sure, but it should be the same metric.

This might seem like a small issue, but it happens for every single one of my models, for both PSNR and SSIM, for all contrasts, and for both multi-coil and single-coil.

I wondered if anyone had noticed or experienced this discrepancy before, and whether they found it was due to the test set being “easier” to tackle or to a problem in the metric definitions (I initially had a problem where I was computing the metrics per slice instead of per volume, but I have fixed that).
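As a side note on the per-slice vs per-volume point: the two conventions can give noticeably different numbers even for a simple metric like PSNR, because the data range and the MSE are aggregated differently. Here is a minimal, self-contained sketch in plain NumPy — this is my own illustration, not the official fastMRI evaluation code, and the `psnr` helper and the choice of `data_range` are assumptions made for the example:

```python
import numpy as np

def psnr(gt, pred, data_range):
    """PSNR in dB between a ground-truth array and a reconstruction."""
    mse = np.mean((gt - pred) ** 2)
    return 10 * np.log10(data_range ** 2 / mse)

rng = np.random.default_rng(0)
gt = rng.random((10, 64, 64))                 # a "volume" of 10 slices
pred = gt + rng.normal(0, 0.05, gt.shape)     # a noisy "reconstruction"

# Per-volume convention: one PSNR over the whole volume,
# with the data range taken from the whole volume.
per_volume = psnr(gt, pred, data_range=gt.max())

# Per-slice convention: one PSNR per slice (each with its own
# data range), then averaged over slices.
per_slice = np.mean([psnr(gt[i], pred[i], data_range=gt[i].max())
                     for i in range(gt.shape[0])])

print(per_volume, per_slice)  # close, but not identical
```

On real data, where slice statistics vary much more than in this toy example, the gap between the two conventions can be substantial, which is why getting the aggregation level right matters when comparing against the leaderboard.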

Thanks in advance for your feedback,

P-S if you want to look directly at my code, you can check it here: https://github.com/zaccharieramzi/fastmri-reproducible-benchmark.

After discussing with some people on the team: it’s possible that in this case there are some outlier examples in the validation split. Their effect could be more pronounced with your small model. The result isn’t entirely surprising, particularly if you aren’t doing intensive tuning on the validation split.

Thank you for your reply.

I don’t think it’s due to my model being small, because even with larger models (like the one called updnet_v3 that I used for multi-coil knee data), I see the same discrepancy. For example, I have an SSIM of 0.9622 for PD at AF4 on the leaderboard, while my validation score for this contrast/AF combination is 0.9328 (and this holds for all metric/contrast/AF combinations).
It’s true that I haven’t done an extensive analysis of how my metrics are distributed across my validation examples; that could help determine whether there are outliers.

I am not sure, though, that I understand why intensive tuning would explain this discrepancy.

I also just noticed that in the end-to-end variational network paper for accelerated MRI reconstruction there is a discrepancy too, between the results of Table 2 (E2E-VN scoring 0.910 in SSIM) and the results of the leaderboard (also reported in Table 3), where the score is 0.930.

Maybe this difference is not that significant, at least for SSIM, but the PSNR difference for my updnet_v3 is huge (+3 dB).