Deep Learning and the Electrocardiogram: Review of the Current State-of-the-Art

Sulaiman Somani; Adam J. Russak; Felix Richter; Shan Zhao; Akhil Vaid; Fayzan Chaudhry; Jessica K. De Freitas; Nidhi Naik; Riccardo Miotto; Girish N. Nadkarni; Jagat Narula; Edgar Argulian; Benjamin S. Glicksberg


Europace. 2021;23(8):1179-1191. 


Abstract and Introduction


In the recent decade, deep learning, a subset of artificial intelligence and machine learning, has been used to identify patterns in big healthcare datasets for disease phenotyping, event predictions, and complex decision making. Public datasets for electrocardiograms (ECGs) have existed since the 1980s and have been used for very specific tasks in cardiology, such as arrhythmia, ischemia, and cardiomyopathy detection. Recently, private institutions have begun curating large ECG databases that are orders of magnitude larger than the public databases for ingestion by deep learning models. These efforts have demonstrated not only improved performance and generalizability in these aforementioned tasks but also application to novel clinical scenarios. This review focuses on orienting the clinician towards fundamental tenets of deep learning, state-of-the-art prior to its use for ECG analysis, and current applications of deep learning on ECGs, as well as their limitations and future areas of improvement.


The field of deep learning (DL), which has seen a dramatic rise in the past decade, is a form of data-driven modelling that serves to identify patterns in data and/or make predictions. It has made substantial impacts in multiple aspects of modern life, from allowing the human voice to execute commands on smartphones to hyperpersonalizing advertisements.[1] In the healthcare space, DL has been leveraged to predict diabetic retinopathy from fundoscopic images,[2] diagnose melanoma from pictures of skin lesions,[3] and segment the ventricle from a cardiac MRI,[4] the last of which was recently approved by the FDA, among countless other examples.[5–7]

Given the vast array of imaging modalities (e.g., CT, MRI, echocardiogram) present in cardiology, DL has also been utilized extensively on cardiovascular data to address key clinical issues.[8–10] Though not formally an imaging modality, electrocardiograms (ECGs) may be considered different channels (i.e. leads) of one-dimensional images (i.e. signal intensity in volts over time). While other reviews[11–16,84] have extensively reported the technical details of various applications of DL or focused on machine learning (ML) applications for ECG analysis, a focus on developing an intuitive understanding for the clinician, as well as a clinical perspective on the impact of these advances, remains lacking. Additionally, the original research articles showcased in these publications generally over-represent small open-source datasets, which are marred by concerns about external validity. Moreover, many recent publications have used DL on ECGs in large, privately curated datasets to solve novel problems, and these remain unaddressed by a review.

This review will first aim to establish a foundation of knowledge for DL, with an emphasis on explaining why it is best suited for many ECG-related analyses. Subsequently, we will provide an overview of how ECGs can be represented as a data form for DL, with brief coverage on openly available and private datasets. The Application section will build on this knowledge base and explore original DL research on ECGs that focuses on tasks in five domains: arrhythmias, cardiomyopathies, myocardial ischaemia, valvulopathy, and non-cardiac areas of use. This review will conclude with a recapitulation of the current state, limitations, promising endeavours, and recommendations for future clinical and research practice.

On Artificial Intelligence, Machine Learning, and Deep Learning

While a thorough discussion of the details of artificial intelligence (AI) is beyond the scope of this paper, the field and its recent advances will be briefly reviewed for the reader's benefit. Interested readers are encouraged to explore other seminal articles that more exhaustively cover the knowledge essential for appraising and undertaking original research.

Simplistically, AI refers to the idea of a computer model that makes decisions using a priori information and improves its performance with experience (i.e. more data). Such clinically related tasks may involve detecting cancerous nodules from CT scans,[17] identifying clusters of disease phenotypes,[10] or optimizing treatment regimens in patients over time.[10,18] Given its broad definition, AI is necessarily classified into multiple subsets, notably ML and, more recently, DL, which is a subset of ML. Briefly, both ML and DL seek to use data, rather than a fully empirical set of human-generated rules, to solve a problem. Take, for example, the simple task of converting a temperature from Celsius to Fahrenheit. The empirical approach to solving this problem is to explicitly write a program that takes, as an input, a temperature in °C and converts it into an output, its equivalent temperature in °F, by multiplying the input temperature by 1.8 and adding 32. If we suppose that this conversion equation was not known, one can use linear regression, which is common to both statistics and ML as a simple linear model, to offer the computer an initial guess of a representative equation Temp (F) = m×Temp (C) + b. To do so, one offers a starting guess for the unknown parameters (in this case m and b) of this equation (also called a 'model'), supplies a table of temperatures in °C (called 'features') and corresponding °F (referred to as 'labels'), provides another set of instructions to fit these data to the underlying equation (i.e. 'optimization') by minimizing the prediction error (i.e. 'loss' or 'cost function'), and finally executes this instruction set to continually update the parameters until the equation fits the data (i.e. 'training'). Though simplistically represented, each parenthetical reference above identifies one of the most integral and defining components of an AI algorithm that, when tuned appropriately, create novel techniques and entire subspecialties in data-driven AI.
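The temperature example above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the choice of gradient descent as the optimizer, the mean squared error as the loss, and all names and hyperparameters (train, mse_loss, learning_rate, steps) are assumptions made here for illustration.

```python
# Illustrative sketch: learn Temp(F) = m * Temp(C) + b from data alone.
# The optimizer (gradient descent), the loss (mean squared error), and all
# hyperparameters below are assumptions chosen for this example.

celsius = [-40.0, -10.0, 0.0, 10.0, 25.0, 37.0, 100.0]  # 'features'
fahrenheit = [1.8 * c + 32.0 for c in celsius]           # 'labels'


def mse_loss(m, b, xs, ys):
    """Mean squared prediction error (the 'loss' or 'cost function')."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=2e-4, steps=100_000):
    """Repeatedly nudge m and b in the direction that lowers the loss ('training')."""
    m, b = 0.0, 0.0  # starting guess for the unknown parameters (the 'model')
    n = len(xs)
    for _ in range(steps):
        errors = [m * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to m and b.
        grad_m = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # 'Optimization' step: move each parameter against its gradient.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


m, b = train(celsius, fahrenheit)
print(f"learned m = {m:.4f}, learned b = {b:.4f}")
```

With these settings the learned parameters should approach the known values of 1.8 and 32, even though the conversion rule was never written into the program.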

Additionally, while much of probability and statistics is used to mathematically derive and establish the basis for many ML and DL models,[19] the priority of statistical models tends to lie in inference and understanding of the dataset's features and their impact on the outcome of interest with generally parametric models. These models tend to be simpler and do not capture non-linearity as well as ML or DL models. In equivalent supervised tasks, however, even the simplest AI models instead prioritize outcome prediction by engendering more complex model representations.[20] The main drawback is that interpretation of the model's learned parameters becomes significantly harder than that of its counterparts from more statistical frameworks.

Nonetheless, there are nuances between ML and DL that set them apart and are worth discussing. Predominantly, DL separates itself from its parent and predecessor, ML, by the difference in its underlying architecture (which certainly also impacts other facets of the pipeline). Deep learning models are composed of many simple linear models ('nodes') arranged in series (each series termed 'layers', the number and depth of which contribute eponymously to these models being referred to as 'deep') with intervening non-linearities to encourage more complex information representation (Figure 1). This sort of hierarchical structure encourages learning simple representations at each layer that build up to learning complex concepts. In the most intuitive example in image recognition tasks, as work by Olshausen et al. and others has shown,[10,18,21] this amounts to each layer (e.g. convolutional, discussed below) in the series learning simple entities (e.g. lines, circles) that build up into more sophisticated representations (e.g. beaks, feathers, eyes).[19]

Figure 1.

Understanding important layer types. Two common layer types used in deep learning pipelines for image processing are fully connected layers (top), which function simply as many linear regression models with a non-linear activation function that increases the informational capacity of the model. Convolutional layers (bottom) are composed of many 'kernels' that learn particular patterns to pick up (small gradient boxes) and scan across an input signal where that pattern may be present. In this example, the kernels from the top to below represent the shape of a R-S wave, a P-wave, and T-P wave segment, and their relative strengths of detection (high: yellow, low: blue) are shown for the input ECG signal (magenta). The resulting signals demonstrate localization of these key kernel patterns that helps the deep learning model learn both the presence and relationship of such features in the input signal. ECGs, electrocardiograms.
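To make the idea of a fully connected layer concrete, the following sketch stacks two such layers in plain Python. The weights are set by hand purely for illustration (a trained network would learn them), and the names dense_layer, forward, and abs_network are hypothetical. The example shows how an intervening non-linearity lets two layers of simple linear nodes represent the absolute-value function, which no single linear model can.

```python
# Illustrative sketch of fully connected layers with a non-linear activation.
# Weights are hand-set assumptions for this example, not learned values.

def relu(z):
    """A common non-linear activation: pass positives, zero out negatives."""
    return max(0.0, z)


def identity(z):
    return z


def dense_layer(inputs, weights, biases, activation):
    """Each node is a linear model (weighted sum plus bias) followed by an activation."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the input through the layers in series."""
    out = inputs
    for weights, biases, activation in layers:
        out = dense_layer(out, weights, biases, activation)
    return out


# Layer 1 has two nodes computing relu(x) and relu(-x); layer 2 adds them,
# so the network as a whole computes |x|.
abs_network = [
    ([[1.0], [-1.0]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [0.0], identity),
]

print(forward([-3.0], abs_network))
print(forward([2.5], abs_network))
```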

By designing models with increased capacity, DL inherently reduces the need for extensive, manual feature engineering on certain datasets that are not as natively compatible (e.g. raw ECG waveforms, variable-length sequences) with typical ML models. For example, Narula et al.[22] demonstrate the use of an ML algorithm to distinguish physiologic hypertrophy from hypertrophic cardiomyopathy (HCM) using information such as LV volume and wall strain derived from speckle-tracking echocardiogram data. Simplistically speaking, however, DL, by virtue of its greater capacity to perform cohesive tasks like vision and computer knowledge representation, may obviate the need for such manual labelling by its ability to process raw echocardiogram video data and automatically learn important features (which may or may not include or be derived from the aforementioned features) in order to perform the classification step. It is worth noting that these engineered features may also be used for training DL models, but that DL models operating on such and other structured, tabular data (e.g. patient demographics, lab values) have largely been unable to demonstrate an improvement over comparable statistical or ML frameworks, where data complexity is not high enough to provide deep models with an advantage over well-performing shallow models.[23–25]

Of critical importance, the need to relinquish a priori feature establishment may not be apparent to the reader. For example, with respect to the ECG, frameworks for its interpretation (e.g. rate, rhythm, axis, intervals, ventricles) already exist to classify and localize various cardiac diseases. However, despite the relative robustness of these systems, it would be naïve to discount the possible existence of other morphologies indiscernible to the human eye, either locally or as relationships between beats, given the complexity of the cardiac conduction system. In signal processing and imaging, there are many underived features in the raw waveforms and pixels, respectively, which the high-fidelity automatic feature engineering that DL offers may take advantage of. Such indescribable patterns may well exist and, though this is not fully proven, may explain the encouraging results of Attia et al.[26] in predicting paroxysmal atrial fibrillation (AF) in patients from a benign, normal sinus rhythm ECG.

However, the cost of this luxury in capturing complex data representations and improving prediction performance is often the aforementioned loss of model interpretability, which has branded the technique as a 'black box', though methods have been developed to gain more insight into the parameters learned by these models. Another notable side effect is overfitting, which is typically caused by a model having more capacity than is warranted by the relevant information present in the data and required to perform well on the task. This permits the model to learn inappropriate aspects of the data, giving the false impression of performing well and causing poor generalizability to other datasets.[27] Typically, this issue arises when large, high-capacity models are used to perform prediction on small datasets, a slippery slope that is easy to slide down when trying to improve a model's performance. Overfitting may also occur in response to biases present in the dataset, notably when limiting data acquisition to a single site or manufacturer or when restricting to a subset of the general population.[19]

To avoid such pitfalls, it is essential to consider the quality of the dataset, which, if poor enough, may never be compensated for by any degree of model adjustment.[28] Best practices dictate use of a training set (usually 60–80% of a given dataset, but this will vary based on data availability and outcome prevalence) for the model to learn the parameters for a given network configuration, a validation set (usually 10–20% of the dataset) to learn the best configuration for the model (i.e. the size and number of layers, type of non-linear activations in the models, etc.), and a test set (usually 10–20% of the dataset) to report the final model's performance. Commonly reported metrics to assess model performance include precision or positive predictive value (PPV), recall (sensitivity), specificity, area under the receiver operating characteristic curve, i.e. AUC-ROC (which reflects the model's ability to distinguish between different task outcomes), and the F1-statistic (which measures model performance especially in the setting of class imbalance, when one outcome or characteristic is significantly overrepresented in the dataset). While the AUC-ROC, also known as the c-statistic, tends to be the most heavily reported and investigated value, it is important to consider all metrics during appraisal since these metrics are sensitive to the system's inherent limitations (e.g. class imbalance).[29]
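The splitting scheme and the threshold-based metrics described above can be sketched as follows. The 70/15/15 split, the function names (split_dataset, classification_metrics), and the toy labels are assumptions chosen for illustration. When a patient contributes several ECGs, splitting by patient rather than by recording is a common precaution; this sketch does not do so.

```python
# Illustrative sketch: a train/validation/test split and common metrics
# computed from binary labels (1 = disease present, 0 = absent).
import random


def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def classification_metrics(y_true, y_pred):
    """Precision (PPV), recall (sensitivity), specificity, and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


train_set, val_set, test_set = split_dataset(list(range(100)))
metrics = classification_metrics(
    y_true=[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    y_pred=[1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
)
print(len(train_set), len(val_set), len(test_set))
print(metrics)
```

The AUC-ROC is omitted here because it requires predicted probabilities rather than hard labels.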

Finally, we conclude with an overview and intuitive description of the most common DL architectures encountered during the literature retrieval process. By far, convolutional neural networks (CNNs) are the most common architecture used for analysing ECGs. At the heart of these networks is the use of the convolution operation, which is a classical technique in signal processing for localizing key features and reducing noise. Convolution refers to the act of taking a small pattern (a so-called 'kernel') and identifying where in the input that pattern arises (Figure 1), akin to a sliding window. The resulting 'heat map' of activity helps to identify where such patterns exist in the input, which can then be used to localize important features, retain global information through successive layers, and remove artefacts deemed unnecessary by the neural network during training. For example, one of the simplest convolutional kernels functions as an edge detector by detecting horizontal or vertical changes in a signal. Combining these simple edge detectors in parallel and in series can allow the CNN to learn how edges combine to form more complex shapes, like the number 9. This generic operation allows sophisticated architectures to be built (e.g. AlexNet,[30] GoogLeNet,[31] DenseNet,[32] ResNet[33]) that achieve state-of-the-art performance on standard image competition datasets (e.g. ImageNet[34]) and serve as inspiration for the development of other models.
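The sliding-window behaviour of a convolutional kernel can be illustrated with a one-dimensional signal. The sketch below is a simplified assumption-laden example: the signal is a synthetic step rather than an ECG, the two-sample edge kernel is hand-chosen rather than learned, and convolve1d is a hypothetical helper name. As in most deep learning libraries, the kernel is not flipped, so the operation is strictly a cross-correlation.

```python
# Illustrative sketch: slide a small kernel across a 1D signal and record
# how strongly the pattern matches at each position.

def convolve1d(signal, kernel):
    """'Valid' sliding-window product of kernel and signal (kernel not flipped)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# A hand-set edge detector: responds to a change between neighbouring samples.
edge_kernel = [-1.0, 1.0]

# Synthetic signal with one upward and one downward step.
signal = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]

response = convolve1d(signal, edge_kernel)
print(response)
```

The response is non-zero only where the signal changes, positive at the rising edge and negative at the falling edge, which is the one-dimensional analogue of the 'heat map' described above.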

While CNNs are well suited to fixed-length spatial data, recurrent neural networks (RNNs) address problems that are represented as fixed- or variable-length sequences (e.g. sentences, signals) and characterize the temporal and spatial relationships in the data. The core node in this architecture operates in a loop: for each element in the sequence, it transforms that element into an output and a hidden representation, the latter of which serves as an additional input for the next element in the sequence. In this way, this architecture maintains a memory of the important parts of the sequence and updates the output with that information. Further improvements on this basic design include bi-directional RNNs, gated recurrent units (GRUs), long short-term memory (LSTM) networks, and attention-based transformer networks, which help address the shortcomings of naïve RNNs and achieve state-of-the-art performance in speech recognition, neural (language) translation, and music generation.[19]
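The recurrent loop described above can be sketched with a single node and a scalar hidden state. This is a deliberately minimal, naïve RNN with hand-set weights; the tanh activation, the weight values, and the name rnn_forward are assumptions for illustration, and practical networks use vector-valued states with learned weights.

```python
# Illustrative sketch of a naive recurrent node with a scalar hidden state.
import math


def rnn_forward(sequence, w_in=1.0, w_hidden=0.5, w_out=2.0, bias=0.0):
    """Process a sequence of any length, carrying a hidden state between elements."""
    hidden = 0.0  # the 'memory', empty before the sequence starts
    outputs = []
    for x in sequence:
        # The new hidden state depends on the current element and the previous state.
        hidden = math.tanh(w_in * x + w_hidden * hidden + bias)
        outputs.append(w_out * hidden)
    return outputs, hidden


# The same node handles sequences of different lengths.
outputs_short, _ = rnn_forward([1.0, 0.0])
outputs_long, _ = rnn_forward([1.0, 0.0, 0.0, 0.0])

# The final input is 0.0 in both cases below, yet the outputs differ because
# the hidden state remembers the earlier element.
_, hidden_after_pulse = rnn_forward([1.0, 0.0])
_, hidden_after_flat = rnn_forward([0.0, 0.0])
print(hidden_after_pulse, hidden_after_flat)
```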

As is evident, the classical tasks from which these networks are derived do not readily seem amenable to ECG analysis, given the ECG's cyclic format (i.e. heartbeats) and its spatial and temporal duality. Therefore, it is worthwhile to discuss the ECG from a data perspective and how it remains highly compatible with DL and can be served to different types of architectures.

Electrocardiograms as Data

Historically, the heartbeat classification and segment identification of the P-QRS-T were the first data analysis tasks to be performed, and they were achieved from a signal processing approach. These ECGs, originally a time series with a signal intensity, were decomposed into wavelike components with Fourier transformation, Hermite techniques, and wavelet transformations. This may be considered a form of feature extraction since these transformations make important features, such as irregularity in rhythm or rhythm frequency, more discernible for downstream models. Such wavelet-based convolutional techniques have achieved a 93% accuracy on the MIT-BIH arrhythmia database.[35] However, ML and DL models have generally achieved better performance with a promise of better generalization and have been favoured since.[36,37]
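As a simple illustration of how a Fourier transformation makes rhythm frequency more discernible, the sketch below computes a discrete Fourier transform directly from its definition and reads off the dominant frequency. The input is a synthetic sinusoid standing in for a periodic signal, not a real ECG; the sampling rate, duration, and function names (dft_magnitudes, dominant_frequency) are assumptions made for this example, and a production pipeline would normally use an optimized FFT routine.

```python
# Illustrative sketch: find the dominant frequency of a periodic signal with
# a discrete Fourier transform written out from its definition.
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each frequency component up to the Nyquist frequency."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coefficient) / n)
    return magnitudes


def dominant_frequency(signal, sampling_rate_hz):
    """Frequency (in Hz) of the strongest component, ignoring the mean (bin 0)."""
    magnitudes = dft_magnitudes(signal)
    peak_bin = max(range(1, len(magnitudes)), key=lambda k: magnitudes[k])
    return peak_bin * sampling_rate_hz / len(signal)


# Synthetic 4-second signal sampled at 50 Hz, repeating 1.5 times per second
# (equivalent to 90 cycles per minute).
sampling_rate_hz = 50
signal = [math.sin(2 * math.pi * 1.5 * i / sampling_rate_hz) for i in range(200)]

rate_hz = dominant_frequency(signal, sampling_rate_hz)
print(rate_hz, "Hz, i.e.", rate_hz * 60, "cycles per minute")
```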

In that light, for data-driven model development, it becomes important to identify the best way to represent this signal for the task being solved (Figure 2). The ECG signal may be represented in a variety of fashions, each of which may be amenable to a DL pipeline. First, the ECG itself may be subsampled into individual heartbeats of fixed length, which can generate hundreds to thousands of samples per ECG from which features may be derived and used in a more traditional DL network, such as a fully connected neural network. Additionally, it can be represented as a 2D boolean (zeros or ones) image instead of a 1D signal, which is amenable to diagnosing conditions from a fixed-length ECG strip and is highly compatible with more traditional image-based CNN architectures. The signal may be one-dimensional or multi-dimensional, depending on the number of leads used, allowing more information to be captured. Finally, the ECG may be represented as a sequence of beats, each linked to the other in time, and treated as a time series that may be analysed by an RNN-type framework.
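Two of the representations above, fixed-length beats and a 2D boolean image, can be sketched as follows. The example assumes that R-peak positions have already been detected by some upstream step, uses a toy signal rather than a real ECG, and introduces hypothetical helper names (extract_beats, to_binary_image); it is one plausible way to build these representations, not a description of any particular study's method.

```python
# Illustrative sketch: two ways to re-represent a single-lead signal.

def extract_beats(signal, r_peak_indices, half_width):
    """Cut a fixed-length window centred on each (already detected) R peak."""
    beats = []
    for peak in r_peak_indices:
        start, stop = peak - half_width, peak + half_width + 1
        # Skip beats whose window would run off either end of the recording.
        if start >= 0 and stop <= len(signal):
            beats.append(signal[start:stop])
    return beats


def to_binary_image(signal, height):
    """Rasterize a 1D signal into a height x len(signal) grid of zeros and ones."""
    lowest, highest = min(signal), max(signal)
    span = (highest - lowest) or 1.0  # avoid division by zero for flat signals
    image = [[0] * len(signal) for _ in range(height)]
    for column, value in enumerate(signal):
        # Row 0 is the top of the image, so larger values map to smaller rows.
        row = round((highest - value) / span * (height - 1))
        image[row][column] = 1
    return image


# Toy signal with two 'beats'; the peak positions are assumed to be known.
signal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
beats = extract_beats(signal, r_peak_indices=[2, 7], half_width=2)
image = to_binary_image(beats[0], height=3)

for row in image:
    print(row)
```

Each beat has the same length and could be fed to a fully connected network, while the image form has exactly one active pixel per time step and could be fed to an image-based CNN.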

Figure 2.

Supervised deep learning pipeline: this figure shows what a simple deep learning pipeline for ECG analysis may look like. First, ECGs recorded from patients may be stored in an electronic health record (EHR) system that can be queried for their retrieval (Panel 1). While user-readable formats may be generated when clinicians query the EHR to view a patient ECG, these ECGs will be stored as a sequence of numbers with accompanying header information (e.g. patient medical record number, date of ECG acquisition) in an easily queryable data structure. Next, at the time of analysis, all stored patient ECGs may be queried selectively to construct a dataset that is relevant to the application of interest and in a format more amenable to training and evaluating a DL model (i.e. matrix format) (Panel 2). Third, ECGs must be pre-processed to remove noise and baseline variation. These may then be further re-represented as one-dimensional signals, as pixelated images, in the Fourier space, or as wavelets (Panel 3). Finally, the dataset may be split into training, validation, and testing sets and used to help a deep neural network learn to predict a particular outcome of interest (Panel 4). ECGs, electrocardiograms.

The type of representation chosen for ECG analysis will ultimately depend on the dataset available. A list of the most common freely available datasets encountered in the literature search is shown in Table 1. The MIT-BIH AF database was the earliest to be released, containing 25 two-lead ECGs, each of which was ~10 h long. As other databases followed from the same institution (MIT-BIH), the low number of unique patient ECGs was compensated for by their length: the recordings were subsampled to generate thousands of shorter ECGs centred around each beat, which motivated early research endeavours attempting to perfect beat classification.[38] The Computing in Cardiology Challenge datasets, by introducing much larger datasets, set the stage for novel task definitions (including AF classification, detection of ECG abnormalities, ECG quality assessment, and sleep arousal classification).[39] Additionally, though less clean and lacking the extensive annotations needed for many ML or DL tasks, the MIMIC database[40] gained popularity as well, offering >67 000 ECGs for ICU patients. The past half-decade, however, has also seen a growth in institutional datasets (Table 2), which have surpassed the number of annotated ECGs in these open databases by orders of magnitude. While few institutions have published evidence of such databases, the retrospective collection of ECG data has allowed more cohort-based questions to be asked, many of which are discussed in the sections below.