Deep Learning Applications in Ophthalmology

Ehsan Rahimy


Curr Opin Ophthalmol. 2018;29(3):254-260. 

Diabetic Retinopathy

A number of programs have been developed for the automated detection of diabetic retinopathy, known as automated retinal image analysis systems (ARIAS).[11–14] Such systems have the potential to significantly improve current diabetic retinopathy screening programs by decreasing the reliance and burden on manual graders, which may in turn reduce the costs of running these programs and improve overall efficiency. In one study by Tufail et al.,[15] retinal images were manually graded by humans following a standard national protocol for diabetic retinopathy screening and then additionally analyzed by three commercially available ARIAS: iGradingM (Medalytix Group Ltd, Manchester, UK), Retmarker (Retmarker SA, Taveiro, Portugal), and EyeArt (Eyenuk, Woodland Hills, California). The investigators found that EyeArt and Retmarker achieved acceptable sensitivity for referable retinopathy compared with manual graders, while being more cost-effective options. Although numerous ARIAS are commercially available, demonstrating the superiority of one over another can be difficult, as each employs a different algorithm.

Recently, there have been several studies reporting on deep learning algorithms in development for the detection of diabetic retinopathy. In 2016, Abràmoff et al.[16,17] demonstrated that the integration of convolutional neural networks on top of an existing lesion-based diabetic retinopathy detection algorithm resulted in greatly improved performance for identification of referable diabetic retinopathy compared with the same algorithm without deep learning techniques. Referable diabetic retinopathy is defined as moderate or severe nonproliferative diabetic retinopathy (NPDR), proliferative diabetic retinopathy (PDR), and/or diabetic macular edema (DME). In their study using the Messidor-2 validation set (n = 1748 images), sensitivity of the deep learning-enhanced algorithm was 96.8%, which was equivalent to previously published results of the same algorithm without deep learning (96.8%). However, specificity of the deep learning-enhanced model was significantly greater, at 87.0% versus 59.4%. The area under the receiver-operating characteristic curve (AUC) was 0.980. Although the sensitivity was not statistically different from that of the previous version of the algorithm without deep learning, the higher specificity obtained through deep learning integration would be preferable for potential diabetic screening programs in order to minimize the number of false positive readings. For comparison, guidelines for diabetic retinopathy screening initiatives recommend at least 80% sensitivity and specificity.[18] This hybrid screening algorithm, known as IDx-DR, is being commercialized in partnership with IBM Watson.
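The performance figures quoted throughout this section (sensitivity, specificity, AUC) are all derived from an algorithm's per-image scores and the reference-standard labels. A minimal sketch of how they are computed, not taken from any of the cited systems, assuming binary labels (1 = referable) and continuous scores:

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity at a fixed decision threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(labels, scores):
    """AUC as the Mann-Whitney rank statistic: the probability that a
    randomly chosen positive image scores higher than a randomly chosen
    negative image (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The threshold-free AUC is why it is the preferred headline metric here: unlike a single sensitivity/specificity pair, it summarizes performance across every possible operating point.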

Soon afterwards, Gulshan et al.[19] from Google reported on the results of a deep learning algorithm for detecting diabetic retinopathy. Training of the algorithm was performed using 128 175 macula-centered fundus photographs obtained from EyePACS (Eye Picture Archive Communication System) in the United States and three eye hospitals in India (Aravind Eye Hospital, Sankara Nethralaya, and Narayana Nethralaya) amongst individuals presenting for diabetic retinopathy screening. Each of these images was then graded between three and seven times by a cohort of 54 ophthalmologists, and nearly 10% of images were randomly selected to be regraded by the same physicians in order to assess intragrader reliability. Images were assessed for the degree of diabetic retinopathy based on the International Clinical Diabetic Retinopathy scale: none, mild, moderate, severe, or proliferative,[20] and DME was defined as hard exudates within one disc diameter of the fovea, a proxy for macular edema whenever stereoscopic views are not available.[21] Once the human grading was completed, this development set was presented to the algorithm for training. For the second portion of the study, the investigators utilized two sets of new images (EyePACS-1 set = 9963 images, and Messidor-2 set = 1748 images) in order to test the algorithm against a reference standard of board-certified ophthalmologists (eight in the first set, and seven in the second set). In these validation sets, when the algorithm was programmed for high sensitivity as would be employed for a screening protocol, it achieved sensitivities of 97.5% and 96.1% and specificities of 93.4% and 93.9% in the two sets, respectively. The AUC was 0.991 for the EyePACS-1 and 0.990 for the Messidor-2 sets.
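The "programmed for high sensitivity" operating point reflects a general screening trade-off: a program fixes a target sensitivity and accepts whatever specificity the ROC curve yields at that threshold. A hypothetical sketch of selecting such a threshold (the function and data are illustrative, not from the paper):

```python
def high_sensitivity_threshold(labels, scores, target_sensitivity):
    """Return the highest score threshold whose sensitivity meets the
    target, along with the sensitivity and specificity obtained there.
    Candidate thresholds are the observed scores themselves; sensitivity
    only grows as the threshold is lowered."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        sens = tp / n_pos
        if sens >= target_sensitivity:
            tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
            return t, sens, tn / n_neg
    return None  # target unreachable even at the lowest threshold
```

In a screening deployment, the threshold would typically be tuned on a held-out tuning set and then frozen before evaluation on the validation sets, so that the reported specificity is not optimistically biased.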

Earlier in 2017, Gargeya and Leng[22] published on a separate deep learning algorithm to detect all stages of diabetic retinopathy, derived from a dataset of 75 137 color fundus images obtained from the EyePACS public dataset. In their study, the model achieved sensitivity and specificity of 94% and 98%, respectively, with an AUC of 0.97. Additional testing on the Messidor-2 and E-Ophtha databases for external validation was performed. On the entire Messidor-2 set, the algorithm achieved 93% sensitivity and 87% specificity, with an AUC of 0.94, comparable to previously published studies on diabetic retinopathy detection using the same dataset. Of note, the investigators' model was also evaluated on its ability to detect mild diabetic retinopathy, rather than just referable diabetic retinopathy. Specifically, they tested the ability of their deep learning model to discern healthy retinal images from those with only mild diabetic retinopathy (n = 1368 image subset from Messidor-2), and found that the algorithm struggled to differentiate between healthy and very early cases of diabetic retinopathy, failing to detect images that demonstrated only a few small microaneurysms (74% sensitivity and 80% specificity, with an AUC of 0.83). However, with the E-Ophtha images (n = 405 images), the algorithm was better able to distinguish between eyes with healthy versus mild diabetic retinopathy (90% sensitivity and 94% specificity, with an AUC of 0.95).

Most recently, in late 2017, Ting et al.[23] reported on a deep learning system applied to multiethnic cohorts of diabetic patients. Although the images constituting the training set were derived from the Singapore Diabetic Retinopathy Screening Program (SIDRP), further external validation was performed in 10 additional multiethnic datasets from different countries with diverse clinic-based populations with diabetes. This was unique given that Messidor-2 and other publicly available sets largely consist of homogeneous Caucasian individuals. The investigators stressed the importance of developing and testing deep learning applications in clinical scenarios that employ diverse retinal images of varying quality from different camera types and in representative diabetic retinopathy screening populations of varying ethnicities.

In addition to detecting referable diabetic retinopathy and vision-threatening diabetic retinopathy (defined as severe NPDR or PDR), the deep learning algorithm was also trained to identify referable glaucoma and age-related macular degeneration (AMD), as the investigators noted that screening for other vision-threatening conditions should be mandatory for any clinical diabetic screening program. Referable glaucoma was defined as a vertical cup-to-disc ratio of 0.8 or greater, focal thinning or notching of the neuroretinal rim, presence of disc hemorrhage, or localized retinal nerve fiber layer defects. Referable AMD was defined as the presence of intermediate AMD (numerous medium-sized drusen, one large druse ≥125 μm in greatest linear diameter, or noncentral geographic atrophy) and/or advanced AMD (central geographic atrophy or neovascular AMD) according to the Age-Related Eye Disease Study grading system.
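In code, the labeling definitions used by these screening studies reduce to simple rules over image-level grades. A hypothetical sketch using the International Clinical Diabetic Retinopathy severity scale and a DME flag (the function names and grade encoding are illustrative, not the authors' implementation):

```python
# ICDR severity grades: 0 = none, 1 = mild NPDR, 2 = moderate NPDR,
# 3 = severe NPDR, 4 = PDR

def referable_dr(icdr_grade, has_dme):
    """Referable diabetic retinopathy: moderate or worse NPDR, PDR,
    and/or diabetic macular edema."""
    return icdr_grade >= 2 or has_dme

def vision_threatening_dr(icdr_grade):
    """Vision-threatening diabetic retinopathy: severe NPDR or PDR."""
    return icdr_grade >= 3
```

Encoding the reference standard this way makes the category boundaries explicit, which matters when comparing studies: an algorithm tuned to flag grade ≥2 disease is being asked a different question from one tuned to detect any retinopathy at all.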

In the primary validation dataset (n = 71 896 images), the AUC of the algorithm for referable diabetic retinopathy was 0.936, with sensitivity of 90.5% and specificity of 91.6%. For vision-threatening diabetic retinopathy, AUC was 0.958, with sensitivity of 100% and specificity of 91.1%. For possible glaucoma, AUC was 0.942, with sensitivity of 96.4% and specificity of 87.2%. Finally, for AMD, AUC was 0.931, with sensitivity of 93.2% and specificity of 88.7%. Among the additional 10 datasets used for external validation (n = 40 752 images), the AUC for referable diabetic retinopathy ranged between 0.889 and 0.983.