Artificial Intelligence Using Electrocardiography

Joon-myoung Kwon; Yong-Yeon Jo; Soo Youn Lee; Kyung-Hee Kim


Eur Heart J. 2021;42(30):2896-2898. 

Graphical abstract: Recent studies related to artificial intelligence using ECG.

Artificial intelligence (AI) is being applied in various fields of cardiology. In particular, deep learning (DL), a subset of machine learning (ML) within AI, enables the diagnosis and prediction of cardiac diseases using neural networks with large numbers of neurons in their layers and dense interconnections between them. The primary advantage of DL is its ability to discover features in data that humans cannot readily identify.[1] Conventional ML models require meticulous feature engineering, guided by domain expertise, to derive input features from images or signals. DL, by contrast, automatically learns representations and extracts features from raw data. Therefore, DL requires minimal manual engineering during development, and its feature extraction is not restricted by human preconceptions.

Cohen-Shelly et al. developed and validated a DL model for detecting aortic stenosis (AS) using electrocardiography (ECG), and their results are published in this issue of the European Heart Journal.[2] The authors have shown that AI using ECG can identify patients with moderate or severe AS and, based on a subgroup analysis comparing the false-positive and true-negative groups, might be able to predict the development of AS. By learning an implicit representation, the DL model can discover diverse features based on subtle changes in the ECG and build an algorithm from complex, non-linear ECG data. Since 2019, AI using ECG has been investigated as a means of diagnosing diseases that cannot be diagnosed through conventional ECG interpretation (Graphical Abstract). Recent studies have shown that AI-enabled ECG can be used to detect heart failure, pulmonary hypertension, hyperkalaemia, and anaemia, as well as to predict the development of atrial fibrillation and cardiac arrest.[3–8] Various DL-based technologies, such as the generation of precordial six-lead ECGs from limb six-lead ECGs, are also being introduced to detect myocardial infarction.[9]

DL enables a model to be created using only data, i.e. without the restrictions of human ideas. Furthermore, new insights can be acquired by comparing the findings that DL derives from data alone with existing medical knowledge. Using a saliency map derived from their AI model, Cohen-Shelly et al. showed that the TP interval and U waves in the right precordial leads were weighted the most heavily for determining the presence of AS.[2] Typical ECG findings for left ventricular hypertrophy were not weighted in the developed AI. Although these findings and the methodology involved will inspire many researchers, further research is needed to understand the exact meaning of the findings.
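A saliency map of this kind assigns each part of the ECG a weight reflecting its influence on the model output. As a rough illustration only, the sketch below uses occlusion sensitivity, a simple stand-in that is not necessarily the method used by Cohen-Shelly et al.: each window of the signal is blanked in turn and the resulting change in the model score is recorded. The names `occlusion_saliency` and `toy_model`, the window length, and the signal itself are hypothetical.

```python
def occlusion_saliency(signal, score_fn, window=10, baseline=0.0):
    """Per-sample saliency: change in score when the window containing
    that sample is replaced by a baseline value."""
    reference = score_fn(signal)
    saliency = [0.0] * len(signal)
    for start in range(0, len(signal), window):
        stop = min(start + window, len(signal))
        occluded = list(signal)
        for i in range(start, stop):
            occluded[i] = baseline            # blank out this window
        change = abs(reference - score_fn(occluded))
        for i in range(start, stop):
            saliency[i] = change              # same weight for the whole window
    return saliency


def toy_model(signal):
    """Hypothetical stand-in for a trained network: it only 'looks at'
    samples 60-79 of the signal (an assumption made for illustration)."""
    segment = signal[60:80]
    return sum(segment) / len(segment)


signal = [1.0] * 100                          # synthetic signal, not real ECG data
saliency = occlusion_saliency(signal, toy_model)
most_important = max(range(len(saliency)), key=saliency.__getitem__)
print("most heavily weighted sample index:", most_important)
```

Such a map indicates where the model looks, but not why that region matters, which is why the highlighted TP interval still requires further study.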

One limitation of DL is overfitting. Using only data, without human engineering, is both an advantage and a disadvantage of DL. DL merely produces the algorithm with the best accuracy on the data it is given, so the risk of overfitting always exists. For example, if a DL model that identifies cats and dogs is developed on an island where all cats are white and all dogs are black, then the model will distinguish cats from dogs using only colour, i.e. black versus white. Consequently, the model will demonstrate poor accuracy in environments other than the island on which it was developed. In another example, because suspicious skin lesions are often routinely marked with gentian violet surgical skin markers, Winkler et al. demonstrated that skin marking at the periphery of dermoscopic images was significantly associated with the DL model detection of skin cancer.[10] Therefore, to guarantee real-world performance, an external validation with independent data from a different environment is required in all DL research studies.
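The island example can be reproduced with a toy experiment. The sketch below is a minimal illustration under invented assumptions: each animal has two hypothetical features (coat brightness and ear pointiness), the data are synthetic, and a one-rule classifier (a decision stump) stands in for a DL model. On the island, brightness separates the classes perfectly, so the learner ignores the weaker but genuine cue; away from the island, where colour is unrelated to species, accuracy falls to roughly chance.

```python
import random


def make_animals(n_per_class, on_island, seed):
    """Synthetic data: each animal is (brightness, ear_pointiness).
    Label 1 = cat, 0 = dog. All values are invented for illustration."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (1, 0):
        for _ in range(n_per_class):
            if on_island:
                brightness = 1.0 if label == 1 else 0.0   # white cats, black dogs
            else:
                brightness = rng.random()                  # colour unrelated to species
            ear = rng.gauss(0.7 if label == 1 else 0.3, 0.15)  # weak but genuine cue
            X.append((brightness, ear))
            y.append(label)
    return X, y


def accuracy(X, y, rule):
    """Fraction of animals classified correctly by a (feature, threshold, polarity) rule."""
    feature, threshold, cat_if_above = rule
    hits = 0
    for x, label in zip(X, y):
        pred = int((x[feature] > threshold) == cat_if_above)
        hits += pred == label
    return hits / len(y)


def fit_stump(X, y):
    """Pick the single feature and threshold with the best training accuracy."""
    best_rule, best_acc = None, -1.0
    for feature in range(len(X[0])):
        values = sorted({x[feature] for x in X})
        for lo, hi in zip(values, values[1:]):
            for cat_if_above in (True, False):
                rule = (feature, (lo + hi) / 2, cat_if_above)
                acc = accuracy(X, y, rule)
                if acc > best_acc:
                    best_rule, best_acc = rule, acc
    return best_rule


island_X, island_y = make_animals(200, on_island=True, seed=0)
mainland_X, mainland_y = make_animals(1000, on_island=False, seed=1)
rule = fit_stump(island_X, island_y)
print("feature chosen (0 = brightness, 1 = ears):", rule[0])
print("island accuracy:", accuracy(island_X, island_y, rule))
print("mainland accuracy:", accuracy(mainland_X, mainland_y, rule))
```

The island test set cannot reveal the problem because it shares the same shortcut; only data from a different environment exposes it.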

External validation means testing the model on data that are entirely separate from the data used to develop and internally validate the AI model. In most DL-based AI models, the number of parameters is large and occasionally exceeds the number of study subjects. For example, ResNet-152, a popular DL model with outstanding performance for image classification, comprises 60 million parameters.[11] Hence, the DL model might overfit the training data during internal validation. If data extracted from the same patient belong to both the training and the test data for the internal validation, then the developed DL model may learn to identify the patient rather than detect the target disease, resulting in an overestimated performance that is not guaranteed in real-world applications.
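One common safeguard against this kind of leakage is to split the data by patient rather than by recording, so that every ECG from a given patient falls on one side of the split. The following is a minimal sketch under assumed conventions: records are hypothetical `(patient_id, ecg)` tuples, and the function names `split_by_patient` and `shared_patients` are illustrative rather than taken from any study.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, ecg) records so that no patient appears in both sets."""
    by_patient = defaultdict(list)
    for patient_id, ecg in records:
        by_patient[patient_id].append((patient_id, ecg))
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)      # shuffle patients, not recordings
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test = [r for pid in patient_ids[:n_test] for r in by_patient[pid]]
    train = [r for pid in patient_ids[n_test:] for r in by_patient[pid]]
    return train, test


def shared_patients(train, test):
    """Patients present in both sets; should be empty for a leak-free split."""
    return {pid for pid, _ in train} & {pid for pid, _ in test}


# Synthetic example: 10 patients with 3 recordings each (placeholder strings, not real ECGs).
records = [(f"patient_{p}", f"ecg_{p}_{k}") for p in range(10) for k in range(3)]
train, test = split_by_patient(records)
print(len(train), len(test), shared_patients(train, test))
```

Shuffling individual recordings instead would almost certainly place ECGs from the same patient in both sets, which is the situation described above.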

Conducting an external validation implies not only keeping the test data separate from those used for the internal validation, but also confirming the performance on data from a different environment. Wolpert and Macready explained the 'no free lunch' theorem: if AI is optimized for a specific situation, then it is not guaranteed to yield favourable results in a different situation.[12] For an accurate validation, the data should be split by hospital or region. Although the populations investigated by Cohen-Shelly et al. were from Minnesota, Arizona, and Florida, the data were mixed before they were assigned to training and validation data.[2] The absence of external validation might result in an overestimated performance because the training and test data were not distinctly different. Hence, further studies with external validation are needed so that the developed AI model can be applied across regions and hospitals.
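Splitting by hospital or region can be sketched as a leave-one-site-out scheme: the model is developed on all sites but one and then tested on the site it has never seen. The example below is illustrative only; the record layout, the `site` field, and the function name `leave_one_site_out` are assumptions, and the site labels simply reuse the three states mentioned above with placeholder data.

```python
def leave_one_site_out(records, site_of):
    """Yield (held_out_site, development, external) once per site.
    `development` holds all other sites; `external` holds only the held-out site."""
    sites = sorted({site_of(r) for r in records})
    for site in sites:
        development = [r for r in records if site_of(r) != site]
        external = [r for r in records if site_of(r) == site]
        yield site, development, external


# Placeholder records; a real study would carry ECG signals and labels here.
records = [
    {"site": site, "ecg_id": f"{site}_{k}"}
    for site in ("Minnesota", "Arizona", "Florida")
    for k in range(4)
]

for site, development, external in leave_one_site_out(records, lambda r: r["site"]):
    # Train and tune on `development` only; report performance on `external`.
    print(site, len(development), len(external))
```

The essential point is that the split is made before any mixing of sites, so the held-out site contributes nothing to training or model selection.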

The other disadvantage of DL is that its decision process currently cannot be unveiled, i.e. it is a black box. In other words, although a DL model can be developed by fitting each coefficient, we cannot specifically interpret the decision process of the model. In the study by Cohen-Shelly et al., although we can infer from the saliency map that the TP interval is important, the characteristics of the TP interval that are related to AS cannot be identified.[2] Moreover, we cannot determine why the DL model did not use the ECG features of left ventricular hypertrophy for detecting AS. As the DL model might make an unreasonable decision, its lack of interpretability significantly hinders its clinical use. Because the process and reason behind a wrong decision of the DL model cannot be determined, we cannot monitor or rectify model risks that might cause medical errors. For this reason, a safety net is required when using DL in clinical applications. To detect critical errors of the DL model, conventional methods and DL models must be used together. For example, when AI-enabled ECG is used to screen for AS, conventional methods, such as detailed history taking, careful auscultation, and cardiologist consultation, are still needed. Recently, several studies have been conducted to understand the decision-making process of DL, and explainable AI in the field of medicine will continue to evolve.[13]

AI was introduced in the 1950s; since then, two AI 'winter' periods of reduced funding and interest in AI research have occurred.[14] These winter periods were due to disappointment with unsatisfactory real-world performance following extravagant endorsements of the idea that AI can solve all problems. Furthermore, unreasonable and unexplainable AI decisions contributed to the recurrence of these winter periods. It is clear that AI exhibits significant potential in the field of medicine; it can improve diagnostic accuracy and support clinical decisions for many diseases. However, the disadvantages of AI should be identified, and efforts should be made to overcome its limitations. This will enable us to continue developing AI technology for medical applications, e.g. AI based on DL for improving the early diagnosis and prevention of irreversible cardiovascular disease progression.