Deep Learning and the Electrocardiogram: Review of the Current State-of-the-Art

Sulaiman Somani; Adam J. Russak; Felix Richter; Shan Zhao; Akhil Vaid; Fayzan Chaudhry; Jessica K. De Freitas; Nidhi Naik; Riccardo Miotto; Girish N. Nadkarni; Jagat Narula; Edgar Argulian; Benjamin S. Glicksberg


Europace. 2021;23(8):1179-1191. 


This review narrowed the literature to 31 original research papers addressing applications of DL to ECG interpretation, starting from a PubMed query for [('deep learning' OR 'machine learning' OR 'artificial intelligence') AND ('electrocardiogram' OR 'ECG' OR 'ecg' OR 'electrocardiograph')] between 1 April 2015 and 15 May 2020 (Figure 3). Since many of the original research articles performed beat classification using the open-source datasets and were exhaustively addressed in prior reviews, only papers utilizing >1000 unique ECGs (including both training and test data) were included.

Figure 3.

Paper selection process: CONSORT diagram demonstrating the selection criteria used in retrieving the literature evaluated in this review. The number of articles corresponding to each application category is also shown.


Conduction system abnormalities are the most natural cardiac disorders to tackle with ECGs. Motivated by a relatively high adult population prevalence of around 3%,[41] significant work has been devoted to diagnosing AF, the most common arrhythmia, with fewer ML works on diagnosing other aberrant waveforms (e.g. ventricular tachyarrhythmias). The problem of AF identification by ECG has been the subject of many research endeavours encompassing all stripes of AI, including signal processing, ML, and DL, the last of which is detailed in Table 1.

For what may be the most unique but clinically relevant application, Attia et al.[26,40] used DL to predict paroxysmal AF from a patient's first clinically benign (i.e. normal sinus rhythm) ECG, with the knowledge that these patients were ultimately diagnosed with AF at least 30 days after this benign ECG. Using a CNN architecture with residual blocks, which allow deeper models to be trained more efficiently, the authors used 454 789 ECGs from 126 526 patients for training and achieved promising performance. The study design may suffer from heavy selection bias in failing to address patients with ultimately undiagnosed AF, and no negative predictive value (NPV) is reported despite the suggestion that the model could serve as a screening test. Nevertheless, the true value of this work lies in its innovative use of ECG data and in entertaining a possible adjuvant role for DL alongside the CHA2DS2-VASc score in recommending anticoagulation for patients with cryptogenic stroke and, more generally, in assessing the risk of stroke secondary to underlying AF.
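The residual blocks mentioned above add a layer stack's input back to its output, which is what lets deeper models train efficiently. A minimal numpy sketch of a 1-D residual block for an ECG signal (the two-convolution layout, kernel sizes, and ReLU placement are illustrative assumptions, not Attia et al.'s published architecture):

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D convolution with zero padding so the output length equals the input length."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad, mode="constant")
    return np.convolve(xp, kernel, mode="valid")[: len(x)]

def residual_block(x, k1, k2):
    """y = x + conv(relu(conv(x))): the identity shortcut lets gradients
    bypass the convolutions, so very deep stacks remain trainable."""
    h = np.maximum(conv1d_same(x, k1), 0.0)  # convolution followed by ReLU
    h = conv1d_same(h, k2)                   # second convolution
    return x + h                             # identity (skip) connection
```

In a trained network the kernels `k1` and `k2` would be learned; here they are simply passed in to show the data flow.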

DL models on ECGs have also been shown to perform at the level of medical professionals. Using only a single ECG lead, Hannun et al.[42] curated a dataset composed of 91 232 ECGs from 53 549 patients in an ambulatory setting. At the cost of having a small testing set, the authors benchmarked the model's encouraging performance by having expert cardiologists manually annotate all 328 test set ECGs. In this case, these experts performed worse than the model in detecting all arrhythmias except junctional rhythm and ventricular tachycardia. At a larger scale, Ribeiro et al.[42,43] demonstrated end-to-end training of a CNN with residual connections on the largest ECG database found in this review, comprising 1 558 415 ECGs from a tele-ECG service in southeast Brazil, to diagnose various arrhythmias, such as AVB Type I, RBBB, LBBB, sinus tachycardia and bradycardia, and AF. Somewhat similar to the case with Hannun et al., the performance of this model, as judged by its PPV, sensitivity, specificity, and AUC, was marginally better than that of a cohort of medical trainees (residents and medical students).

Extending this multi-class approach further, Smith et al.[44] refined the ECG classification problem to triaging ECGs in the ED as normal, abnormal, or emergent, subtyped by etiology (e.g. ventricular rhythm emergency vs. significant AV conduction disease), in a single-centre study in Minnesota, USA. They investigated the performance of a pre-trained DL model from an industrial partner (Cardiologs Technologies) against the conventional, on-board algorithms that detect these abnormalities on the ECG machines themselves (Mortara/Veritas). For a cohort of 1500 randomly sampled ECGs from that year, the DL model showed greater specificity and accuracy in triaging these ECGs and, despite a marginal loss in sensitivity, demonstrated potential for reducing false ECG alarms by ~50%. Recently, van de Leur et al.[45] also developed a model to triage ECGs, but using a dataset orders of magnitude larger, and additionally incorporated a gradient-based 'saliency feature mapping', which leverages how the output of a model changes with small changes to different regions of the input signal,[46] to identify the important features examined by the model for different types of presentations. Similar to the models developed by Smith et al., these models retain high specificity (0.88 to 0.98 across classes) despite low sensitivity, highlighting their use for rapid escalation of care for those flagged by the model.
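Gradient-based saliency mapping, as used by van de Leur et al., scores each region of the input by how much a small change there moves the model's output. Without automatic differentiation at hand, the same idea can be sketched with finite differences (the scalar-output `model` interface below is an assumption for illustration):

```python
import numpy as np

def saliency_map(model, ecg, eps=1e-3):
    """Finite-difference approximation of a gradient saliency map:
    nudge each sample of the signal by eps and record how much the
    model's output score changes in response."""
    base = model(ecg)
    sal = np.zeros_like(ecg)
    for i in range(len(ecg)):
        x = ecg.copy()
        x[i] += eps
        sal[i] = abs(model(x) - base) / eps  # |d(output)/d(input_i)|, approximately
    return sal
```

High values in the returned map mark the regions of the tracing (e.g. a particular ST segment) that most influence the prediction.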

Beyond these private datasets, there were three open datasets that met the inclusion criteria for database size: Computing in Cardiology (CINC) 2017, CINC 2015, and CPSC2018 (later merged into the CINC 2020).[39] In the CINC 2017 competition, which provided contestants with a training set of 8 528 single-lead ECGs for diagnosis of AF vs. NSR, other arrhythmias, and noise, the winner used an LSTM stacked with an XGBoost classifier (a tree-based ML algorithm). Oster et al. helped externally validate the second-place winner[47] of this competition on 450 four-lead ECGs from the UK Biobank. As expected, the ML algorithm did not generalize well to this novel dataset (F1-score 58.9%); however, a DL model (CNN + LSTM) reported after the challenge concluded demonstrated close to a 30% improvement (F1-score 74.1%).[48] In another unique application, a deep CNN trained on AliveCor ECG data, which was the source of the CINC 2017 challenge dataset, was deployed on a single-lead recorder system (KardiaBand, Apple Watch) to continuously monitor for AF in 24 patients.[48,49] When compared with annotated reports from an insertable cardiac monitor (ICM), the model achieved encouraging performance (episode sensitivity 97.5% and duration sensitivity 97.7%), highlighting the utility of DL in creating an inexpensive, non-invasive approach to AF surveillance and management.
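The CINC 2017 winner's two-stage design (a recurrent network feeding a tree-based classifier) can be caricatured as a feature-extraction step followed by a simple decision rule. The RR-interval features and the threshold below are illustrative stand-ins for the LSTM and XGBoost stages, not the winning entry:

```python
import numpy as np

def sequence_features(ecg):
    """Stand-in for the LSTM stage: summarize the beat-to-beat signal
    into a fixed-length vector (mean RR interval and its variability,
    the classic markers of the irregularly irregular rhythm of AF)."""
    # crude peak detection: samples strictly greater than both neighbours
    peaks = np.flatnonzero((ecg[1:-1] > ecg[:-2]) & (ecg[1:-1] > ecg[2:])) + 1
    rr = np.diff(peaks)  # inter-beat (RR) intervals in samples
    return np.array([rr.mean(), rr.std()])

def stump_classify(features, threshold=0.5):
    """Stand-in for the boosted-tree stage: flag AF when RR variability
    relative to the mean RR exceeds a threshold."""
    mean_rr, std_rr = features
    return std_rr / mean_rr > threshold
```

A real entry would learn both stages from data; the point here is only the pipeline shape: sequence model in, fixed-length features out, classifier on top.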

For the CPSC2018 challenge, Cai et al.[48–50] added data from additional sources (hospital, ambulatory ECG monitoring device) and trained a DenseNet-inspired CNN to reach state-of-the-art performance on this multi-centre test set, with an AUC of 0.994 and a sensitivity of 99.1% for the three-label classification task (AF, normal, other arrhythmias). Furthermore, the authors explored the parameter weights of the first convolutional layer of their DNN and found that the model learns, as expected from the premise of DL, low-level features like peaks, troughs, and upward/downward slopes in the signal, suggesting that the model works to remove baseline shifts and identify key landmarks (e.g. P-waves) for diagnosis.
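Inspecting first-layer convolutional weights, as Cai et al. did, amounts to asking which low-level shape each learned kernel responds to. A hedged sketch that labels a kernel by its correlation with peak, trough, and ramp templates (the templates and the nearest-match rule are illustrative, not the authors' analysis):

```python
import numpy as np

def kernel_character(kernel):
    """Classify a first-layer convolutional kernel by its strongest
    correlation with simple low-level templates: a peak (upward bump),
    a trough (downward bump), or a ramp (upward slope)."""
    n = len(kernel)
    templates = {
        "peak": -np.abs(np.linspace(-1, 1, n)),  # triangular bump pointing up
        "trough": np.abs(np.linspace(-1, 1, n)),
        "ramp": np.linspace(-1, 1, n),
    }
    k = kernel - kernel.mean()
    k = k / (np.linalg.norm(k) + 1e-12)  # zero-mean, unit-norm for fair comparison
    scores = {}
    for name, t in templates.items():
        t = t - t.mean()
        t = t / np.linalg.norm(t)
        scores[name] = float(np.dot(k, t))  # cosine similarity
    return max(scores, key=scores.get)
```

Applied to a trained model, a tally of these labels over all first-layer kernels would show how many filters act as peak, trough, or slope detectors.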

Ultimately, tackling arrhythmias is the most classical of pattern recognition problems around the ECG. While their diagnosis has been addressed heavily, few works have investigated the direct role of these models in patient management, and to our knowledge only a few have assessed which characteristics of the ECG are significant for diagnosis. Further work may be undertaken to integrate and assess the role of these DL solutions in direct clinical care, in application towards screening and diagnosis of less prevalent disease states (e.g. congenital long QT syndrome), in more accurately diagnosing arrhythmias that may be difficult to discern clinically, like complex atrioventricular block and wide-complex QRS tachyarrhythmia, and in providing insights for predicting outcomes after interventional procedures (e.g. AF ablation).


While the ECG lacks sensitivity for diagnosing valve disease within traditional clinical frameworks,[51] subtle structural changes in response to long-standing valvular disease may be discovered by a DL model to diagnose these pathologies. Indeed, Kwon et al.[52] demonstrated the use of an ensemble model, which combines a CNN classifier operating on raw, 12-lead ECG signals and a fully connected network that incorporates demographic information and numeric ECG-derived features (HR, QT interval, QRS duration, QTc, etc.), for classification of severe aortic stenosis (AS) (valve area <1.5 cm2 or mean pressure gradient ≥20 mm Hg, as confirmed by echocardiography). Notably, the authors validated this model on 10 865 patients from a secondary hospital centre, with an encouraging AUC of 0.884. The authors also performed a saliency analysis to identify the ECG features most heavily used for AS prediction, identifying the model's focus on the T-wave in V1–V4, which has been linked with delayed repolarization from AS-related ventricular hypertrophy. However, the specificity of diagnosing AS relative to other cardiomyopathies was not evaluated in this article, an important drawback given that the model may instead be learning to distinguish non-specific structural changes secondary to AS, rather than AS itself.
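Architecturally, such an ensemble runs two branches, one over the raw signal and one over tabular features, and fuses their scores. A schematic forward pass, with linear scorers standing in for the CNN and fully connected branches and a 50/50 averaging rule assumed purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ensemble_predict(ecg, tabular, w_sig, w_tab, combine=0.5):
    """Two-branch ensemble: one scorer over the raw ECG signal (standing
    in for the CNN branch) and one over tabular features such as age, HR,
    and QT interval (standing in for the fully connected branch). The
    final probability is a weighted average of the two branch outputs."""
    p_signal = sigmoid(np.dot(w_sig, ecg))      # signal branch probability
    p_tabular = sigmoid(np.dot(w_tab, tabular)) # tabular branch probability
    return combine * p_signal + (1 - combine) * p_tabular
```

In practice all weights (and the fusion rule itself) would be learned jointly; the sketch only shows how heterogeneous inputs meet in one prediction.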

With the same motivation, Kwon et al.[53] replicated the above study in patients with significant MR (valve regurgitant orifice area ≥ 0.2 cm2, regurgitation volume ≥ 30 mL, regurgitation fraction ≥ 30%, and MR grade II–IV). In this architecture, they instead opted for a CNN-type network with only raw ECG data as the input and trained on 56 670 ECGs from 24 202 patients in one hospital system. The external validation test set comprised 10 865 ECGs from another hospital, on which the model had high sensitivity and NPV at the expense of low specificity and PPV, suggesting its applicability as a screening tool for ruling out MR. A final saliency analysis was notable for the model's focus on P-wave flattening, which can be explained physiologically as secondary to a slower, more dispersed atrial depolarization resulting from atrial stretching in long-standing MR, as well as on T-wave abnormalities, which could be prioritized in patients with AF (and thus an absent P-wave) secondary to MR. For patients without MR, the algorithm weighed heavily on the QRS complex, suggesting that the absence of QRS widening is sensitive for eliminating MR.


With respect to cardiomyopathies, both HCM and LV systolic dysfunction have been the focus of multiple research groups. In a unique study combining elements of DL and ML, Tison et al.[54] trained a modified CNN architecture (U-Net) on publicly available and institutional data to automate ECG segment classification (e.g. P wave, PR segment, QRS complex). Rather than opting for an end-to-end DL architecture, the authors generated a feature vector from the DL model and fed it into a more classical ML algorithm on a set of 35 466 ECGs to predict the presence of pulmonary hypertension, HCM, cardiac amyloidosis, and mitral valve prolapse, achieving encouraging AUROCs ranging from 0.78 for mitral valve prolapse to a notable 0.91 for HCM detection.

For HCM, Ko et al. at the Mayo Clinic[55] trained a CNN on 12-lead ECGs from ~47 000 patients to diagnose HCM. Remarkably, their model achieved an extremely high AUC of 0.96 on the test set, and though it suffers from a relatively low PPV of 31%, concomitantly strong NPV and sensitivity suggest its use as a screening tool in clinically suspected patients. A secondary analysis showed that the model responded to a patient who underwent septal myectomy by lowering its diagnostic probability of HCM from 72% before the operation to 2.5% after. Furthermore, the model retained its high AUC in a subgroup of patients with left ventricular hypertrophy (LVH), demonstrating its ability to distinguish true HCM (disease) from non-HCM LVH (physiological).

Further demonstrating the adaptability of DL architectures to different problems, Kwon et al.[56] extended their architecture for AS classification and applied it to detecting LVH. Training their ensemble classifier, which leverages both raw ECG waveforms in a CNN and structured patient data, on 35 694 ECGs from 12 648 patients, their model achieved a respectable AUC of 0.87 on a test set from another hospital centre. The model was benchmarked against cardiologists assessing for LVH using the Sokolow–Lyon criteria and outperformed them on sensitivity by 177% while operating at the same specificity. A saliency analysis revealed that the model focused particularly heavily on the QRS complex during an 'easy' LVH diagnosis, in line with clinical criteria, but concentrated on P-wave morphology in V1–V3 and the T-wave in I and aVR during more difficult cases, for which clinical criteria are generally absent.

On a different use case, Attia et al.[57] were the first to report the use of DL to predict low EF (<35%), training a simple CNN on a cohort of 35 970 patients and achieving an AUC of 0.93 on a test set of 52 870 patients. Of significance, the model's performance remained agnostic to age and sex, unlike BNP, which is sensitive to these patient factors and has been proposed as a marker for low EF despite its lower AUC (0.60).[58] A follow-up study[59] included an additional 6 008 patients who had ECGs for non-cardiac clinical indications but had echocardiograms within a year of the ECG indicative of systolic dysfunction. With a high AUC on this external validation set (0.918), these results are encouraging and suggest that the model, in combination with a BNP level > 150, could be an excellent candidate for systolic dysfunction screening. Noseworthy et al.[60] further assessed the model's robustness by investigating its performance across race and ethnic groups. Notwithstanding the challenges of binning patients into a social construct such as race, the authors demonstrated the model's invariance in predicting LVEF across races and ethnicities, retaining AUCs >0.93 for each group. Additionally, the model demonstrated some inherent ability to predict race from an ECG (AUCs 0.76–0.84), though this may be inflated by severe class imbalance (overrepresentation of non-Hispanic whites) in the training set.

Kwon et al.[60,61] greatly extended this demonstration for prediction of reduced EF (EF < 40% and EF < 50% as the primary and secondary study outcomes, respectively) by adding to their CNN a fully connected neural network trained on both patient-level demographic and ECG-derived data from 13 486 patients. The authors report encouraging performance (AUC = 0.889 and 0.850 for the primary and secondary outcomes on the external validation set) on internal and external validation sets of ~10 000 ECGs each. It is worth noting that logistic regression and random forest (RF), two fundamental ML techniques, both performed only marginally worse than the DL model (AUC = 0.853 and 0.847 for LR and RF, respectively, P < 0.001), which may highlight the limited advantage of DL over statistical or classical ML techniques on tabular data. By perturbing input values for different features and analysing the impact on the model's AUC, the authors found that the most salient features for the DL model were, surprisingly, in agreement with those identified by logistic regression (e.g. HR, T-wave axis, QRS duration, sex, age), while the DL architecture can additionally represent more complex, non-linear interplay among these variables than a simple linear weighting. Future directions include utilizing DL with the ECG for early identification and differentiation of other, clinically less well understood cardiomyopathies, such as heart failure with preserved EF (HFpEF) or cardiac amyloidosis.
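The perturbation analysis described here is essentially permutation importance: shuffle one feature's column, severing its link to the label, and measure the resulting drop in AUC. A self-contained numpy sketch (the rank-based AUC and the toy model interface are illustrative):

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: the probability that a random positive case
    outscores a random negative case (ties count as half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def permutation_importance(model, X, y, rng):
    """Importance of each feature = drop in AUC after shuffling that
    feature's column across samples."""
    base = auc(model(X), y)
    drops = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # break feature j's association with y
        drops.append(base - auc(model(Xp), y))
    return np.array(drops)
```

Features whose shuffling barely moves the AUC contribute little; large drops mark the salient inputs (e.g. HR or QRS duration in the study above).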


Though myocardial ischaemia is one of the most classical areas of cardiovascular research, the literature search revealed only one paper investigating this domain of cardiovascular disease using ECGs and DL. Tadesse et al.[62] used a popular framework known as transfer learning, in which a model trained on one task (e.g. classifying everyday objects in photographs)[34] is partially re-trained on a completely new, but structurally similar, dataset to solve another task. By transforming the ECGs into Fourier space (which changes the representation of the signal from intensity vs. time to intensity vs. frequency) and spatially stacking all 12 leads together (to form a 2D image), they trained a pre-existing, state-of-the-art image classification model, GoogLeNet,[31] on an openly available Chinese ECG Challenge dataset[63] and a private curated dataset of ~17 000 ECGs from patients in Southern China with MI (STEMI and NSTEMI), attaining a respectable accuracy of 86% on the private dataset. However, the model performed notably worse, with an accuracy of 49%, on the Challenge dataset. Furthermore, despite highlighting an interesting technical method for performing DL on the ECG, the authors fail to disclose sensitivity, specificity, and AUC analyses, leaving room for another research effort to establish a precedent for the use of DL on ECGs in patients with ischaemic cardiac disease. Future directions may involve detection of subclinical CAD along, or prior to, the ischaemic heart disease spectrum (e.g. stable angina, unstable angina).
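The preprocessing described, converting each lead to the frequency domain and stacking the leads into a 2-D array, can be sketched in a few lines of numpy (resizing and channel replication to match GoogLeNet's expected input are omitted):

```python
import numpy as np

def ecg_to_image(leads):
    """Turn a multi-lead ECG into a 2-D 'image' for an image-classification
    network: take the magnitude spectrum of each lead (signal intensity
    vs. frequency rather than vs. time) and stack the leads as rows."""
    rows = [np.abs(np.fft.rfft(lead)) for lead in leads]
    return np.stack(rows)  # shape: (n_leads, n_frequency_bins)
```

For a real-valued signal of length N, `np.fft.rfft` returns N//2 + 1 frequency bins, so a 12-lead, 256-sample ECG becomes a 12 x 129 image.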


Outside the immediate realm of cardiological disease, though certainly not without an impact on the heart, DL has been applied to ECGs in two major areas: identifying electrolyte abnormalities and prognosticating health status. Physiologically, both electrolyte derangements and mental illness (e.g. anxiety) have been reported to exert short-term and long-term effects on cardiac structure and function, further motivating the use of ECGs to identify the underlying disease state.

The sensitivity for diagnosing hyperkalaemia from the ECG, though the condition is classically characterized by peaked T-waves, PR prolongation, and QRS widening, remains low (34–43%).[64] With this in mind, Galloway et al.[64,65] conducted a multi-centre study on patients from various Mayo Clinic sites in the US to identify hyperkalaemia in chronic kidney disease patients using 2- and 4-lead ECGs. Despite low specificity for hyperkalaemia, their model achieved respectable accuracies and sensitivities on these external validation sets, supporting a role for the ECG in hyperkalaemia screening. Lin et al.[64–66] extended this study to predict either hypo- or hyperkalaemia with a single-centre database of 66 321 ECGs from all patients (irrespective of kidney disease) and attained better sensitivity, specificity, and accuracy on their test set when benchmarked against emergency physicians and cardiologists. Unlike the Mayo Clinic model, this model retained high specificity (0.92) at the expense of low sensitivity (0.67), better suiting its application as a diagnostic tool than a screening one. Notably, the authors additionally performed a saliency analysis, which showed a greater focus on the ST segment in those cases of hyperkalaemia that were more difficult to identify clinically (i.e. low sensitivity and high inter-rater variability). Beyond potassium, other electrolytes such as magnesium and calcium could be assessed here, notably to predict, in real time, the likelihood of impending arrhythmias like torsades de pointes.

Beyond prediction of clinical disease and of lab values reflective of disease severity, ECGs, as biometric data points over time, have the potential to capture measures of overall health as well. For example, an elderly individual in a prime state of health may be said to have a 'young heart'. The notion of an 'ECG age' distinct from chronological age thus arises and is addressed in another piece by Attia et al.,[67] which sought to predict patient age from the ECG. Subgroup analysis revealed that the cases with the largest prediction error had significantly more instances of systolic dysfunction, hypertension, and CAD, whereas individuals predicted more accurately (i.e. with less error) had fewer cardiovascular incidents at follow-up. Though care must be taken not to overinterpret this finding, since the error could capture both the severity of cardiac disease (e.g. an 'older' heart) and random error in model training, these results encourage the belief that the ECG may be used as a composite biomarker to track general health over time.

Further corroborating this prognostic role, Raghunath et al.[68] report prediction of 1-year mortality from age, sex, and baseline ECGs using a convolutional framework, with a hazard ratio of 9.5 between the two predicted dead/alive groups. The authors also employed gradient-based class activation mapping to assess feature importance and note that the model discerned ST-elevations in certain patients as notable contributors to predicted 1-year mortality. However, given that these ECGs were retrieved from a hospital setting, care must be taken not to apply this model, which is prone to heavy selection bias, to the general population.