Deep Learning and the Electrocardiogram: Review of the Current State-of-the-Art

Sulaiman Somani; Adam J. Russak; Felix Richter; Shan Zhao; Akhil Vaid; Fayzan Chaudhry; Jessica K. De Freitas; Nidhi Naik; Riccardo Miotto; Girish N. Nadkarni; Jagat Narula; Edgar Argulian; Benjamin S. Glicksberg


Europace. 2021;23(8):1179-1191. 


When applied to large datasets that contain hidden but valuable relationships, DL has delivered groundbreaking performance. ECGs, laden with information-rich spatial and/or temporal views of the cardiac conduction system, have proven amenable to having their hidden associations with cardiovascular pathologies (arrhythmias, cardiomyopathies, valvulopathies, and ischaemia) unravelled, as demonstrated by the original research articles contained within this review. Their role in future endeavours is already apparent: multiple clinical trials[69–73] have been created to prospectively collect ECG data, not only to understand more about their respective heart disease of interest but also to validate existing DL models on these newly collected datasets in the form of randomized controlled trials. Nevertheless, difficulties in data access and model sharing, as well as the limited flexibility of pre-existing IT infrastructures, are barriers that must be addressed before these algorithms can be deployed to other hospital systems.

Despite its promise, the shortcomings of these endeavours are readily apparent in the incongruence between model design, model validation, and model interpretation. For example, utilizing DL for feature extraction and then performing ML on those features in series is an interesting idea in concept,[62] but it carries the peril of not abiding by the fundamental hierarchical tenets of DL. Similarly, rigorous practices to ensure appropriate validation of the model are of crucial importance.[74] Because most datasets thus far have been curated from a single centre, models trained on them run the risk of overfitting and generalizing poorly to other hospital systems and other datasets, which may use different machines with slight variations in the underlying noise that the model may not readily filter out.[75] By extension, adversarial (i.e. simulated noise) training would take advantage of generative adversarial networks (GANs), which are DL models trained to discriminate randomly generated inputs from true dataset inputs and subsequently to generate new samples that make models more resilient to noise. GANs have made great strides in improving model performance when models are additionally trained with subtle but key noisy artefacts. Additionally, no central framework exists for comparing the performance of these various models from one institution with another. Building an open framework to permit such an exchange of ideas, datasets, and pre-trained model weights is not a trivial task, but it can foster an environment for collaboration between what are apparent institutional silos of development.
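To make the idea of training with simulated noise concrete, the following is a minimal sketch of noise augmentation for a single-lead ECG trace. It is not a GAN and is not taken from any of the reviewed studies; it is a much simpler stand-in that adds Gaussian noise and a slow sinusoidal baseline wander. The function name `augment_ecg`, its parameters, and all default values are illustrative assumptions.

```python
import math
import random


def augment_ecg(signal, fs_hz, noise_std=0.02, wander_amp=0.05,
                wander_hz=0.3, seed=None):
    """Return a noisy copy of a single-lead ECG trace (list of floats).

    Hypothetical helper: adds zero-mean Gaussian noise plus a sinusoidal
    baseline wander. Defaults are illustrative, not clinically calibrated.
    """
    rng = random.Random(seed)
    # Random starting phase so each augmented copy drifts differently.
    phase = rng.uniform(0.0, 2.0 * math.pi)
    noisy = []
    for i, sample in enumerate(signal):
        t = i / fs_hz  # time of this sample in seconds
        wander = wander_amp * math.sin(2.0 * math.pi * wander_hz * t + phase)
        noisy.append(sample + wander + rng.gauss(0.0, noise_std))
    return noisy


# Toy trace: flat baseline with one R-wave-like spike, sampled at 250 Hz.
clean = [0.0] * 250
clean[125] = 1.0
noisy = augment_ecg(clean, fs_hz=250, seed=0)
```

In such a scheme, the model would be trained on both `clean` and several differently seeded noisy copies, in the hope that it learns features that survive machine-specific noise. Whether this transfers to real inter-hospital variation would still need to be checked on external data.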

While every original research article covered in this paper offers encouraging results for the value of DL in interpreting ECGs, only a handful offer insight into the model's learned representation of the ECG for the respective task.[52,53,56,61] Without explaining, in an interpretable way, what these DL models are sensing on the ECG to perform their specific task, developers of these tools run a strong risk of souring clinicians on adopting them, since clinicians need to understand how these models work before entrusting them to augment their practice. Methods to open the 'black box' of DL have been elucidated in detail elsewhere, offering more than a handful of techniques to evaluate both input feature importance and layer-wise information retention.[76] Such techniques may not only make the reduction of these algorithms to clinical practice more palatable but may also offer hypotheses on the pathophysiology of disease that improve its understanding. Additionally, the trials and tribulations of model selection are not apparent in the methodologies of many papers, which does not instil confidence in the rigour of the model development that is otherwise heavily and rightfully emphasized by the computer science community. The question to be asked is not whether DL can solve a task, but which DL method can best tackle the task, and why.
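As one illustration of evaluating input feature importance, the sketch below uses occlusion: each segment of the trace is blanked in turn and the resulting drop in the model's output is recorded. This is one common technique, not necessarily one used by the cited studies. The names `occlusion_importance` and `toy_score` are hypothetical, and `toy_score` is a trivial stand-in for a trained model.

```python
def occlusion_importance(score_fn, signal, window, baseline=0.0):
    """Score drop when each non-overlapping window is replaced by `baseline`.

    Returns a list of (window_start_index, score_drop) pairs. A larger
    drop suggests the model relied more heavily on that segment.
    """
    reference = score_fn(signal)
    drops = []
    for start in range(0, len(signal), window):
        stop = min(start + window, len(signal))
        occluded = list(signal)  # copy so the input is left untouched
        occluded[start:stop] = [baseline] * (stop - start)
        drops.append((start, reference - score_fn(occluded)))
    return drops


def toy_score(signal):
    # Hypothetical stand-in for a trained model: responds to peak amplitude.
    return max(signal)


trace = [0.0] * 100
trace[42] = 1.0  # single R-wave-like spike
drops = occlusion_importance(toy_score, trace, window=10)
most_important = max(drops, key=lambda d: d[1])
```

Here the window starting at sample 40, which contains the spike, produces the only non-zero drop. With a real model, plotting such drops over the ECG would show a clinician which segments drove a prediction, although occlusion results depend on the chosen window size and baseline value.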

Adherence to these suggested principles of research reporting may create cohesion in the research field by making models and datasets more compatible with one another, which could in turn foster improved collaboration between research groups. For example, in diagnosing valvulopathies, it is difficult to know, given the current findings in this space, how much the model depends on the effect of the continued altered flow mechanics that create subclinical perturbations in the ECG signal vs. long-standing changes to the heart, which may or may not be specific for that pathology. Evaluating classifiers that predict the relevant physiological cardiomyopathies, or augmenting the original dataset with data from patients with non-valvular cardiomyopathy, could help improve the robustness of these original seminal works in DL.

In conclusion, the emerging literature evaluating the role of DL in ECG analysis has shown great promise and potential. With continued improvement, generalization, refinement, and standardization of methods and data to address the short-term drawbacks in reduction to clinical practice, DL offers a novel way to improve how heart disease is diagnosed and managed. The concurrent development of wearable technologies and accessible platforms for deploying pre-trained DL models offers a unique and scalable opportunity to screen for and intervene early in different cardiovascular disease states.