Hi,
First of all, thanks for putting this together. It's a brilliant dataset, accompanied by an equally brilliant benchmarking setup.
I have been working with the fastMRI dataset for quite a while now, and one thing has always bothered me: the discrepancy I observe between my validation metrics and the ones I get when submitting my results. It's a bit unusual because my test results are consistently higher than my validation results. For a long time I thought this was my mistake, but I think I found something suggesting that it's not.
For example, take a U-net I implemented a long time ago (a small U-net with 16 base filters): it had an SSIM of 0.811 on the validation set for single-coil knee PD data, which is comparable to what the fastMRI team obtains in their second paper (Table 8 in https://arxiv.org/abs/1811.08839). The only difference is that they used 32 base filters, but we can reasonably assume that this accounts for only a small difference, along with training, normalization, scaling, etc. My other metrics are indeed similar.
However, when I look at my results on the leaderboard (unet_corrected by Nspin, still single-coil knee PD data), I find myself with an SSIM of 0.8491, which is a huge improvement. I remember seeing the fastMRI U-net on the leaderboard, but I can't find it anymore, so I can't say for sure; still, it should probably be the same metric.
This might seem like a small issue, but it happens for every single one of my models, for both PSNR and SSIM, for all contrasts, and for both single-coil and multi-coil.
I wondered if anyone had noticed or experienced this discrepancy before, and whether they found it was due to the test set being "easier" to tackle or to a problem in the metrics definition (I had a problem at the beginning where I was computing the metrics per slice instead of per volume, but I fixed that; see the sketch below).
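For concreteness, here is roughly what I mean by per-slice vs per-volume evaluation (a minimal sketch, assuming skimage's `structural_similarity`; the exact aggregation and `data_range` used by the official evaluation are an assumption on my part, so please correct me if it differs):

```python
import numpy as np
from skimage.metrics import structural_similarity


def ssim_per_slice(gt, pred):
    """Mean of 2D SSIMs, one per slice -- what I was doing at first.

    gt, pred: (n_slices, height, width) magnitude volumes.
    """
    return np.mean([
        structural_similarity(g, p, data_range=g.max())
        for g, p in zip(gt, pred)
    ])


def ssim_per_volume(gt, pred):
    """Single SSIM over the whole volume, slices treated as channels.

    This is what I now compute; I assume it is closer to what the
    leaderboard does, with the data range taken over the whole volume.
    """
    return structural_similarity(
        gt.transpose(1, 2, 0),      # move slices to the channel axis
        pred.transpose(1, 2, 0),
        multichannel=True,
        data_range=gt.max(),        # data range over the whole volume
    )
```

The two can differ noticeably because both the averaging and the `data_range` normalization change, so I initially suspected this was the source of my discrepancy, but fixing it did not close the gap.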
Thanks in advance for your feedback,
P.S. If you want to look directly at my code, you can check it here: https://github.com/zaccharieramzi/fastmri-reproducible-benchmark.