

Jul 09, 2023

Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates such ‘out of distribution’ performance problems and improves model robustness and training efficiency. The strategy, which we named REMEDIS (for ‘Robust and Efficient Medical Imaging with Self-supervision’), combines large-scale supervised transfer learning on natural images and intermediate contrastive self-supervised learning on medical images and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies by up to 11.5% with respect to strong supervised baseline models, and in out-of-distribution settings required only 1–33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging.

The datasets from Northwestern Medicine and Apollo Hospitals were used under a licence for the current study and are not publicly available. Applications for access to the Optimam database can be made using this web form. The de-identified teledermatology data used in this study are not publicly available owing to restrictions in the data-sharing agreement. The unlabelled dataset used for DME classification is de-identified data from EyePACS Inc. Interested researchers should contact [email protected] to enquire about access to EyePACS data and approach the Office of Research and Development to enquire about access to VA data. The rest of the annotated data for ID and OOD DME classification tasks were collected at the Rajavithi Hospital Thailand and at the Lions Eye Institute and are not publicly available owing to restrictions in the data-sharing agreement. Data used in the evaluation and pretraining of the chest-X-ray-condition classification, including MIMIC-CXR, CheXpert and ChestX-ray 14, are publicly available. Data used for the ID fine-tuning and evaluation of the detection of metastases are publicly available on the CAMELYON challenge website. The TCGA data used for pretraining for both the pathology-based metastases-detection and survival-prediction tasks are available via the NIH website. The rest of the data used in pathology tasks are not publicly available owing to restrictions in the data-sharing agreement. Moreover, ImageNet-1K (ILSVRC-2012)68 used for the pretraining of baseline supervised models, and ImageNet-21K used for the pretraining of BiT-M models are publicly available via the ImageNet website. BiT-L models trained on the JFT-300M54 dataset are not publicly available owing to restrictions in the data-sharing agreement.

Several major components of the work are available in open-source repositories, such as the T library. The code base and pretrained weights used for self-supervised pretraining are available at S. The code base and pretrained weights for the BiT models are available at B. All experiments and implementation details are described in sufficient detail in Methods and in Supplementary Information to support replication with non-proprietary libraries. The code base used for our comparison to ResNet-RS was based on R. A number of the checkpoints and models generated through REMEDIS are readily accessible to researchers via the P. Additionally, the Foundation Medical ML repositories on GitHub offer access to code that can be used to train REMEDIS-based models.

This project was an extensive collaboration between Google Brain and the Google Health AI Team. We thank Z. Ghahramani for valuable feedback and continuous support through the course of the project; M. Raghu, J. Krause, D. Eck and M. Howell for valuable feedback in improving the quality of the work; J. Uszkoreit, J. Deaton, V. Godbole, M. Sieniek, S. Prabhakara, D. Golden, D. Steiner, X. Zhai, A. Giurgiu, T. Duerig, C. Semturs, P. Bui, J. Hartford, S. Jansen, S. Shetty, T. Spitz, D. Tran, J. Luo, O. Wichrowska and A. Ward for support throughout this project; multiple contributors to this international project: Rajavithi Hospital Thailand, Lions Eye Institute and Derbarl Yerrigan Health Service, Western Australia, Stanford Center for Artificial Intelligence in Medicine and Imaging, MIT Laboratory for Computational Physiology and PhysioNet, and NIH Clinical Centre; our collaborators at DermPath AI, Apollo Hospitals and EyePACS for support of this work; collaborators at Northwestern Medicine and all members of the Etemadi Research Group for support of this work.

The images and data used in this publication were derived from the Optimam database, the creation of which was funded by Cancer Research UK. Part of the retinal image dataset was provided for the study by Sankara Nethralaya, Chennai, India. The results included in this paper are in whole or in part based on data generated by The Cancer Genome Atlas (TCGA) managed by the NCI and NHGRI. Information about TCGA can be found at the NIH website. This study also used archived and anonymized pathology slides, clinicopathologic variables, and outcomes from the Institute of Pathology and the Biobank at the Medical University of Graz. The study also used pathology slides from the CAMELYON challenge.

These authors contributed equally: Shekoofeh Azizi, Laura Culp, Jan Freyberg.

Google Research, Mountain View, CA, USA

Shekoofeh Azizi, Laura Culp, Jan Freyberg, Basil Mustafa, Sebastien Baur, Simon Kornblith, Ting Chen, Patricia Strachan, S. Sara Mahdavi, Ellery Wulczyn, Boris Babenko, Megan Walker, Aaron Loh, Po-Hsuan Cameron Chen, Yuan Liu, Pinal Bavishi, Scott Mayer McKinney, Jim Winkens, Abhijit Guha Roy, Zach Beaver, Justin Krogue, Umesh Telang, Yun Liu, Lily Peng, Greg S. Corrado, Dale R. Webster, David Fleet, Geoffrey Hinton, Neil Houlsby, Alan Karthikesalingam, Mohammad Norouzi & Vivek Natarajan

DeepMind, London, UK

Nenad Tomasev & Jovana Mitrović

Georgia Institute of Technology, Computer Science, Atlanta, GA, USA

Fiona Ryan

School of Medicine/School of Engineering, Northwestern University, Chicago, IL, USA

Mozziyar Etemadi

S.A., J.F., L.C., V.N., N.H., A.K., M.N., S.K., T.C., N.T., J.M., B.M., P.S., S.S.M., F.R., E.W., P.-H.C.C. and G.H. contributed to the conception and design of the work. S.A., L.C., J.F., V.N., A.K., B.B., P.B., E.W., P.-H.C.C., Yuan Liu, Yun Liu, S.M.M., A.L., J.W., M.W., Z.B., A.G.R., D.R.W., L.P., G.S.C., U.T. and J.K. contributed to data acquisition. S.A., L.C., J.F., S.B., B.M. and V.N. were major contributors to the evaluation of the work. S.A., L.C., J.F., V.N., N.H., A.K., M.N., S.B., S.K., T.C., B.B., D.R.W., D.F., G.S.C. and M.E. contributed to analysis and interpretation of the data. S.A., L.C., J.F., V.N., N.H., A.K., M.N., S.K., E.W., P.S., S.S.M. and M.E. contributed to drafting and revising the paper. N.H., A.K., M.N. and V.N. contributed equally as co-advisers.

Correspondence to Shekoofeh Azizi, Alan Karthikesalingam or Vivek Natarajan.

This study was funded by Google LLC and/or a subsidiary thereof (‘Google’). J.F., L.C., S.A., V.N., N.H., A.K., M.N., B.M., S.B., P.S., S.S.M., S.K., T.C., N.T., J.M., B.B., P.B., E.W., P.-H.C.C., Yuan Liu, Yun Liu, S.M.M., A.L., J.W., M.W., Z.B., A.G.R., U.T., D.R.W., D.F., L.P., G.S.C., J.K. and G.H. are employees of Google and may own stock as part of the standard compensation package. M.E. received funding from Google to support the research collaboration.

We observe that, under increasing severity of synthetic shifts, the performance of both REMEDIS and the supervised baseline drops. However, the drop is more gradual for REMEDIS.

The different stages in which unlabeled and labeled data (both ID and OOD) are used for model development and evaluation.

Variation between ID and OOD data can be visually subtle or pronounced. This variation includes (but is not limited to) changes in contrast, sharpness or tint, differences in non-linear effects of X-ray-sensor construction or in zoom levels. The underlying cause of the distribution shift can be associated with technology shift, demographic shift or behavioral shift45.
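To make the kinds of variation named above concrete, the following is a minimal sketch of how contrast and tint shifts of graded severity could be simulated on an image array. It is an illustration only and not the transformations used in the study: the function names `shift_contrast` and `shift_tint`, the linear severity schedule and the random stand-in image are all assumptions introduced here.

```python
import numpy as np


def shift_contrast(image: np.ndarray, severity: int) -> np.ndarray:
    """Compress contrast around the image mean; higher severity gives a flatter image."""
    # Hypothetical schedule: severity 0 leaves the image unchanged, 5 keeps 25% of the contrast.
    factor = 1.0 - 0.15 * severity
    mean = image.mean()
    return np.clip(mean + factor * (image - mean), 0.0, 1.0)


def shift_tint(image: np.ndarray, severity: int) -> np.ndarray:
    """Boost the red channel and attenuate the blue channel to mimic a colour cast."""
    # Hypothetical per-channel gains (R, G, B), broadcast over the last axis.
    gains = np.array([1.0 + 0.05 * severity, 1.0, 1.0 - 0.05 * severity])
    return np.clip(image * gains, 0.0, 1.0)


# Stand-in for an RGB image with values in [0, 1]; a real pipeline would load actual scans.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))

# Apply both shifts at increasing severity and report how the pixel spread changes.
for severity in range(6):
    shifted = shift_tint(shift_contrast(image, severity), severity)
    print(f"severity={severity} pixel std={shifted.std():.4f}")
```

Evaluating a fixed model on such progressively shifted copies of a test set is one way to trace how performance degrades with shift severity; real technology, demographic or behavioural shifts are generally harder to parameterize than this.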

Azizi, S., Culp, L., Freyberg, J. et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat. Biomed. Eng (2023).

Received: 22 July 2022

Accepted: 02 May 2023

Published: 08 June 2023


