
QRS detection in single-lead, telehealth electrocardiogram signals: Benchmarking open-source algorithms

  • Florian Kristof,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Writing – original draft, Writing – review & editing

    Affiliation TUM School of Computation, Information, and Technology, Technical University of Munich, Garching bei München, Germany

  • Maximilian Kapsecker,

    Roles Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliations TUM School of Computation, Information, and Technology, Technical University of Munich, Garching bei München, Germany, Institute for Digital Medicine, University Hospital Bonn, Bonn, Germany

  • Leon Nissen,

    Roles Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliation Institute for Digital Medicine, University Hospital Bonn, Bonn, Germany

  • James Brimicombe,

    Roles Data curation, Investigation, Methodology, Writing – review & editing

    Affiliation Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom

  • Martin R. Cowie,

    Roles Funding acquisition, Investigation, Writing – review & editing

    Affiliation School of Cardiovascular Medicine & Sciences, Faculty of Lifesciences & Medicine, King’s College London, London, United Kingdom

  • Zixuan Ding,

    Roles Software, Writing – review & editing

    Affiliation Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom

  • Andrew Dymond,

    Roles Data curation, Investigation, Methodology, Writing – review & editing

    Affiliation Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom

  • Stephan M. Jonas,

    Roles Writing – review & editing

    Affiliation Institute for Digital Medicine, University Hospital Bonn, Bonn, Germany

  • Hannah Clair Lindén,

    Roles Data curation, Investigation, Methodology, Writing – review & editing

    Affiliation Zenicor Medical Systems AB, Stockholm, Sweden

  • Gregory Y. H. Lip,

    Roles Funding acquisition, Investigation, Methodology, Writing – review & editing

    Affiliations Liverpool Centre for Cardiovascular Science at University of Liverpool, Liverpool John Moores University and Liverpool Heart & Chest Hospital, Liverpool, United Kingdom, Danish Center for Health Services Research, Department of Clinical Medicine, Aalborg University, Aalborg, Denmark

  • Kate Williams,

    Roles Conceptualization, Data curation, Investigation, Methodology, Project administration, Writing – review & editing

    Affiliation Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom

  • Jonathan Mant,

    Roles Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – review & editing

    Affiliation Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom

  • Peter H. Charlton ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Supervision, Writing – original draft, Writing – review & editing

    pc657@medschl.cam.ac.uk

    Affiliation Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom

  • on behalf of the SAFER Investigators

Abstract

Background and objectives

A key step in electrocardiogram (ECG) analysis is the detection of QRS complexes, particularly for arrhythmia detection. Telehealth ECGs present a new challenge for automated analysis as they are noisier than traditional clinical ECGs. The aim of this study was to identify the best-performing open-source QRS detector for use with telehealth ECGs.

Methods

The performance of 18 open-source QRS detectors was assessed on six datasets. These included four datasets of ECGs collected under supervision, and two datasets of telehealth ECGs collected without clinical supervision. The telehealth ECGs, consisting of single-lead ECGs recorded between the hands, included a novel dataset of 479 ECGs collected in the SAFER study of screening for atrial fibrillation (AF). Performance was assessed against manual annotations.

Results

A total of 12 QRS detectors performed well on ECGs collected under clinical supervision (F1 score ≥0.96). However, fewer performed well on telehealth ECGs: five performed well on the TELE ECG Database; six performed well on high-quality SAFER data; and performance was poorer on low-quality SAFER data (three QRS detectors achieved F1 scores of 0.78–0.84). The presence of AF had little impact on performance.

Conclusions

The Neurokit and University of New South Wales QRS detectors performed best in this study. These performed sufficiently well on high-quality telehealth ECGs, but not on low-quality ECGs. This demonstrates the need to handle low-quality ECGs appropriately to ensure only ECGs which can be accurately analysed are used for clinical decision making.

Author summary

The electrocardiogram (ECG) is a vital tool for assessing heart health. Traditionally, ECGs are recorded in clinical settings, but with advances in technology, mobile devices and smartwatches can now be used to record ECGs in daily life. However, ECG recordings from these devices often contain more noise, posing challenges for accurate analysis. In this study, we evaluated 18 different algorithms for detecting heartbeats in ECGs. Our aim was to identify the best-performing algorithm for use with ECGs recorded using mobile devices. We tested each algorithm on 995 ECG recordings and compared their performance against manually-annotated heartbeats. From our analysis, we identified the two best-performing algorithms. These algorithms performed well when analysing high-quality ECGs obtained under clinical supervision and from mobile devices. However, their performance degraded significantly when analysing noisy ECGs from mobile devices. These findings highlight the importance of selecting robust algorithms for ECG analysis, particularly for data collected outside clinical environments. Furthermore, the study demonstrates the need to ensure that only ECGs which can be accurately analysed are used for clinical decision making.

Introduction

The electrocardiogram (ECG) is one of the most widely used physiological measurement techniques, providing detailed information on heart function. Traditionally ECG measurements have been confined to clinical settings. However, recently it has become possible to measure the ECG in telehealth settings using handheld devices or smartwatches [1, 2]. This presents the opportunity to conduct health assessment beyond the clinical setting, with potential applications including remote health monitoring, personalized diagnosis, rehabilitation, and screening for atrial fibrillation (AF). Indeed, the recent COVID-19 pandemic has acted as a strong catalyst for innovation in this area [3]. However, the increasing use of wearable and telehealth technologies also presents new challenges.

A key challenge is that telehealth ECGs can be of lower quality than those collected in clinical settings, and so ECG analysis algorithms must be able to handle increased noise levels. Telehealth ECGs can be of lower quality for several reasons [4]: the ECG is often measured further away from the heart (such as at the hands rather than the chest); devices typically use dry electrodes rather than the more conductive adhesive electrodes; and there is less quality control since measurements are taken by a non-expert user without clinical supervision. Therefore, there is a need to understand how well ECG analysis algorithms perform in the telehealth setting.

QRS detection is a fundamental task in ECG analysis. QRS complexes indicate ventricular depolarisation, i.e. the electrical impulse which causes the heart to pump blood into the circulation. QRS detection is widely used for heart rate and rhythm monitoring, and heart rate variability analysis. Furthermore, QRS detection is frequently the first step towards extraction of more detailed ECG features such as QT intervals and P-waves. A range of QRS detection algorithms have been proposed [5, 6], most of which were developed using ECGs collected in clinical settings ([4] being a notable exception). Therefore, there is a need to assess their performance with telehealth ECGs. QRS detectors should firstly be accurate, correctly identifying QRS complexes. They should ideally remain accurate in the presence of pathologies such as AF (which results in an irregular heart rhythm), and in the presence of noise. In addition, QRS detection algorithms should also be stable and have low execution times to ensure they are suitable for rapid and long-term analyses.

Previous studies have compared the performance of QRS detection algorithms across databases recorded in different settings. Liu et al. assessed ten QRS detectors across five datasets including one telehealth dataset [5]. The algorithms, chosen for their computational efficiency, achieved F1 scores of >99% on high-quality signals, ≤80% for low-quality signals, and ≥94% during pacing and in the presence of arrhythmias. The study concluded that an optimized knowledge-based algorithm [7] performed best. Llamedo and Martinez assessed six QRS detectors on 12 databases covering five categories: normal sinus rhythm, arrhythmia, ST and T morphology changes, stress, and long-term monitoring [6]. The study concluded that the gqrs algorithm performed best. Research in [8] assessed 12 QRS detectors across five publicly available datasets. The study concluded that the neurokit (nk) algorithm performed best when considering both accuracy and execution time. Previous work in this area addresses known algorithms and benchmark ECG databases, but there is a lack of knowledge about the latest algorithms and their application to new telehealth databases, especially their performance on self-recorded ECGs.

The aim of this study was to identify the best-performing open-source QRS detector for use with telehealth ECGs. The performance of 18 algorithms was assessed on multiple datasets including a novel dataset collected using handheld devices during screening for AF. Performance was assessed primarily in terms of the accuracy of QRS detection (quantified using the F1 score), and also in terms of the execution time and error rate of algorithms. The findings address the gap in knowledge about how well QRS detection algorithms perform in telehealth settings. They are particularly relevant given the rapid introduction of single-lead ECG technology in consumer devices such as smartwatches, and clinical devices such as handheld ECG recorders.

Methods

QRS detection algorithms

The 18 QRS detectors assessed in this study are summarised in Table 1 (with source links provided in Table A in S1 Text). The QRS detectors were identified through a search for open-source algorithms. The majority of algorithms were found in either the ‘NeuroKit’ [8] or ‘ecgdetector’ [9] Python package. Some algorithms were available in both packages with slightly different implementations, in which case the faster implementation was used. Python implementations were used where available to provide a fair comparison of algorithm execution times. In four cases Python implementations were not available and Matlab implementations were used instead (jqrs, rdeco, rpeak, and unsw). Six additional algorithms were identified but not used in this study due to one of the following reasons: (i) no Python or Matlab implementation was available; (ii) the available implementation only accepted particular sampling frequencies; (iii) the available implementation predominantly led to errors; or (iv) the execution time was substantially longer than that of other algorithms. Further details are provided in Table B in S1 Text.
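Benchmarking implementations from different packages requires each detector to be callable through a common interface. The sketch below shows one way to do this; the registry, decorator, and toy detector are hypothetical illustrations and are not part of the study's code or any of the 18 assessed algorithms.

```python
# Hypothetical registry mapping detector names to callables with a common
# signature: (ecg_samples, sampling_rate) -> list of detected R-peak indices.
DETECTORS = {}

def register(name):
    """Decorator adding a detector implementation to the registry."""
    def wrap(func):
        DETECTORS[name] = func
        return func
    return wrap

@register("toy-threshold")
def toy_threshold_detector(ecg, fs):
    """Illustrative placeholder only, not one of the assessed algorithms:
    flags the first sample of each run exceeding 50% of the signal maximum.
    fs is unused here but kept for the common signature."""
    thresh = 0.5 * max(ecg)
    peaks, above = [], False
    for i, x in enumerate(ecg):
        if x >= thresh and not above:
            peaks.append(i)
            above = True
        elif x < thresh:
            above = False
    return peaks
```

With such a registry, every algorithm (whether backed by Python or a Matlab call) can be scored and timed by the same harness.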

Datasets

The performance of QRS detectors was assessed using six datasets, including datasets collected in inpatient, outpatient, and home settings. The datasets are summarised in Table 2 and described in the following paragraphs. Full source links for the datasets are provided in Table C in S1 Text.

MIT-BIH Normal Sinus Rhythm Database (SIN).

The MIT-BIH Normal Sinus Rhythm Database (SIN) contains 18 24-hour ECG recordings from patients referred to the Arrhythmia Laboratory at Boston’s Beth Israel Hospital, who were found not to have significant arrhythmias [10]. The first two hours of each recording were used in this study. The subjects consisted of 13 women and 5 men, aged 20 to 50. Each recording contains two ECG channels of unknown leads, the first of which was used in this analysis.

MIT-BIH Arrhythmia Database (ARR).

The MIT-BIH Arrhythmia Database (ARR) contains 48 30-minute ECG recordings from 47 patients referred to the same Arrhythmia Laboratory [10, 11]. This dataset consists of 23 recordings which were selected at random from a larger dataset and a further 25 recordings which were manually selected to include examples of significant but uncommon arrhythmias. The subjects included 22 women and 25 men aged 23 to 89. The first ECG channel in each recording was analysed, which was the modified limb lead II in most cases.

PhysioNet/Computing in Cardiology Challenge 2014 Datasets (HIGH and LOW).

The PhysioNet/Computing in Cardiology Challenge 2014 datasets consist of 10-minute ECG recordings from patients and healthy volunteers [10, 12]. The two publicly available datasets were used in this study: (i) the Training Set (HIGH), which contains 100 recordings which are generally of high quality; and (ii) the Augmented Training Set (LOW), which contains 100 recordings that are generally of low quality. Each record in these datasets contains a single ECG lead. The LOW dataset contains the following leads: lead II (78 records); lead III (5); lead AVF (3); lead AVL (1); and no lead label (13). No lead labels are provided in the HIGH dataset.

TELE ECG Database (TELE).

The TELE ECG Database contains 250 30-second lead-I ECG recordings from home-dwelling patients suffering from chronic obstructive pulmonary disease and/or congestive heart failure [4, 32]. Recordings were acquired without clinical supervision using the TeleMedCare Health Monitor (TeleMedCare Pty. Ltd. Sydney, Australia). The device records an ECG from the hands using dry metal electrodes. This dataset contains 221 ECGs randomly selected from 120 patients, and an additional 29 ECGs specifically selected to represent poor-quality data. The dataset contains manual annotations of QRS complexes. One ECG in the dataset lasted longer than 30s, and was truncated to 30s for this study.

SAFER ECG dataset (SAFER).

The SAFER ECG Dataset contains 479 30-second lead-I ECG recordings from home-dwelling subjects aged 65 and over, collected in an AF screening study (the SAFER Feasibility Study, ISRCTN 16939438) [13].

ECG recordings were acquired without clinical supervision using the Zenicor EKG-2 device shown in Fig 1 (Zenicor Medical Systems AB, Sweden). The device records an ECG from the thumbs using dry metal electrodes. This dataset contains: 183 high-quality ECGs exhibiting AF (denoted SAFER-AF-HIGH) collected from 48 subjects (13 female and 35 male); 199 high-quality ECGs from subjects without AF (SAFER-nonAF-HIGH) collected from 199 participants (100 female and 99 male); and 97 low-quality ECGs from subjects without AF (SAFER-nonAF-LOW) collected from 97 subjects (49 female and 48 male). ECG quality was assessed using the Cardiolund ECG Parser algorithm (Cardiolund AB). R-peaks were manually annotated specifically for this study. The presence of AF was determined as described in [13]: (i) using the Cardiolund algorithm to identify ECGs with potential abnormalities; and (ii) expert reviewers manually reviewing ECGs to identify AF (as described in [13, 33]). In more detail, ECGs were classified as AF or non-AF based on ad hoc review by two cardiologists. An ECG was classified as AF if either: (i) both cardiologists agreed that the ECG contained AF; or (ii) one cardiologist made an AF diagnosis and the other provided no diagnosis. An ECG was classified as non-AF if either: (i) the Cardiolund algorithm did not identify abnormalities in the ECG, the cardiologists did not identify an arrhythmia, and the participant was not diagnosed with AF; or (ii) both cardiologists agreed that the ECG did not contain an arrhythmia.
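For illustration, the labelling rules above can be written as a small decision function. This is a sketch with hypothetical argument names, not the study's actual code; each cardiologist verdict is "AF", "non-AF", or None (no diagnosis provided).

```python
def classify_ecg(cardiologist_1, cardiologist_2, parser_flagged, participant_has_af):
    """Sketch of the AF/non-AF labelling rules described in the text.

    Returns "AF", "non-AF", or "unclassified".
    """
    verdicts = (cardiologist_1, cardiologist_2)
    # AF: both cardiologists agree on AF, or one diagnoses AF and the
    # other provides no diagnosis.
    if verdicts == ("AF", "AF"):
        return "AF"
    if "AF" in verdicts and None in verdicts:
        return "AF"
    # non-AF: both cardiologists agree there is no arrhythmia...
    if verdicts == ("non-AF", "non-AF"):
        return "non-AF"
    # ...or the Cardiolund parser found no abnormality, no cardiologist
    # identified an arrhythmia, and the participant had no AF diagnosis.
    if not parser_flagged and "AF" not in verdicts and not participant_has_af:
        return "non-AF"
    return "unclassified"
```

ECGs meeting neither rule set (e.g. the two cardiologists disagree) are left unclassified in this sketch.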

Fig 1. Zenicor-EKG device.

The handheld Zenicor-EKG device used to record 30-second ECGs in the SAFER ECG Dataset.

https://doi.org/10.1371/journal.pdig.0000538.g001

Ethics statement

The SAFER Feasibility Study in which the SAFER ECG dataset was acquired was approved by the London Central NHS Research Ethics Committee (18/LO/2066). All participants gave written informed consent to participate in the study. The study was conducted in accordance with the Declaration of Helsinki. Ethical approval was not required for the use of the remaining datasets as these were pre-existing, anonymised datasets.

Statistical analysis

The performance of QRS detectors was primarily assessed using the F1 score (following a precedent in [5, 34]). The F1 score is the harmonic mean of the sensitivity (SEN) and positive predictive value (PPV). These three statistics were calculated from: the number of reference QRS complex annotations (nref, corresponding to the number of actual positives, P); the number of QRS complexes identified by an algorithm (nalg, corresponding to the number of predicted positives, i.e. true positives + false positives, TP+FP); and the number of QRS complexes which were correctly identified (ncorrect, corresponding to the number of true positives, TP):

SEN = ncorrect / nref (1)

PPV = ncorrect / nalg (2)

F1 = 2 × SEN × PPV / (SEN + PPV) (3)

ncorrect was calculated as the number of reference QRS complex annotations for which at least one QRS complex was identified by an algorithm within ± 75ms of the reference QRS annotation, as shown in Fig 2.
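The tolerance-based matching and the three statistics described above can be sketched as follows. This is a simplified illustration, not the study's released code; the function and argument names are chosen here, and annotations are assumed to be sample indices.

```python
def qrs_metrics(ref_samples, det_samples, fs, tol_s=0.075):
    """Score detected QRS locations against reference annotations.

    A reference annotation counts as correctly identified (a true
    positive) if at least one detection lies within +/- tol_s seconds
    of it. Returns (SEN, PPV, F1).
    """
    tol = tol_s * fs  # tolerance in samples
    n_ref, n_alg = len(ref_samples), len(det_samples)
    # ncorrect: reference annotations with at least one nearby detection
    n_correct = sum(
        any(abs(d - r) <= tol for d in det_samples) for r in ref_samples
    )
    sen = n_correct / n_ref if n_ref else float("nan")
    ppv = n_correct / n_alg if n_alg else float("nan")
    f1 = 2 * sen * ppv / (sen + ppv) if (sen + ppv) else 0.0
    return sen, ppv, f1
```

For example, at 500 Hz the ± 75ms tolerance corresponds to ± 37.5 samples around each reference annotation.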

Fig 2. Assessing whether QRS complexes were correctly identified.

An ECG signal is shown with dotted red lines marking reference R-peak annotations, grey areas showing the tolerance of ± 75ms around these annotations within which QRS complexes are deemed to be correctly identified, and markers for the R-peaks identified by the 18 QRS detectors used in this study.

https://doi.org/10.1371/journal.pdig.0000538.g002

A threshold of ± 75ms was chosen to classify QRS detections as correct or not for the following reasons. QRS complexes typically last <120ms in health [35], although they can last longer in disease [36]. ± 75ms was identified as a conservative threshold which would classify any QRS detections lying on a QRS complex as correct, whilst classifying any detections on other ECG waves (such as P- or T-waves) as incorrect. This was based on the assumptions that: a QRS complex lasts up to approximately 150ms; the R-wave is located approximately in the centre of a QRS complex; and reference QRS annotations are at the locations of R-waves. To investigate the suitability of this threshold, we assessed the performance of the QRS detectors on the telehealth (TELE and SAFER) datasets for thresholds ranging from ± 1 to 140ms. The results (shown in Fig A in S1 Text) show that for all QRS detectors performance was poorer at low thresholds, with performance generally approaching a maximum between 20 and 100ms (such as ≈ 20ms for nk and unsw, and ≈ 60–80ms for pan-tomp and two-avg). Therefore, a threshold of ± 75ms appeared to be a reasonable choice. In comparison, previous work in this area has used tolerances of 50 ms [5] and ± 150 ms [22].

F1 scores are reported using the median and inter-quartile range of the F1 score for each ECG window.

Two additional performance measures were used: algorithm error rate and execution time. Error rates were defined as the percentage of 30s ECG segments in which an algorithm encountered an error and did not return identified QRS complexes. Execution times were assessed as the median time taken for an algorithm to process each 30s ECG segment. The analysis was performed on a MacBook Air (M1, 2020, 16 GB RAM, 8 cores) without parallelization. The assessment was run in Visual Studio Code 1.73.0, using Python 3.9, and calling MATLAB R2022a for QRS detectors written in MATLAB code.
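The execution-time measurement can be sketched as below. This is an illustration of the approach, not the study's actual harness; the function name and arguments are chosen here, and `detector` stands for any callable taking one 30 s ECG segment.

```python
import statistics
import time

def median_execution_time(detector, segments, repeats=1):
    """Return the median per-segment runtime (in seconds) of a detector.

    Each segment is processed `repeats` times and the mean per-call
    time recorded, then the median is taken across segments.
    """
    times = []
    for seg in segments:
        start = time.perf_counter()
        for _ in range(repeats):
            detector(seg)
        times.append((time.perf_counter() - start) / repeats)
    return statistics.median(times)
```

Dividing the result by the 30 s segment duration (and multiplying by 100) gives execution time as a percentage of signal duration, as reported in the Results.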

The two-sided Mann-Whitney U test was used to test for statistically significant differences between F1 scores at the 5% significance level. A Bonferroni correction was used to account for the multiple comparisons (one comparison for each QRS detector). This test was chosen because the F1 scores were not normally distributed and the compared groups were independent of each other. Comparisons were made between: (i) supervised and telehealth ECGs; (ii) high- and low-quality ECGs; (iii) AF and non-AF ECGs; and (iv) female and male subjects. Comparisons between female and male subjects were made on the SIN, ARR and SAFER datasets, but not on the HIGH, LOW and TELE datasets as, to the best of our knowledge, these do not contain information on gender.
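The Bonferroni correction step can be sketched as follows. The p-values in the example are made-up placeholders, not values from the study, and the function name is chosen here; each p-value would come from a two-sided Mann-Whitney U test comparing one detector's per-recording F1 scores between two conditions.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which comparisons remain significant after Bonferroni correction.

    p_values maps a detector name to the raw p-value of its comparison;
    the significance threshold is divided by the number of comparisons.
    """
    threshold = alpha / len(p_values)  # corrected per-test threshold
    return {name: p < threshold for name, p in p_values.items()}
```

With 18 detectors compared, the corrected threshold is 0.05 / 18 ≈ 0.0028, so only quite small raw p-values count as significant.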

Results

Algorithm performance

The performance of the algorithms is presented in Fig 3 using the F1 score. When using an F1 score of ≥ 0.96 to identify good performance, a total of 12 out of 18 algorithms performed well on ECGs collected under clinical supervision (ARR, HIGH and LOW, and SIN). The exceptions were engz, gamb, jqrs, mart, nab and rpeak. Fewer algorithms performed well on telehealth ECGs: five algorithms performed well on the TELE dataset (gqrs, nk, rdeco, two-avg, and unsw); six algorithms performed well on high-quality SAFER data (fnvg, fwhvg, nk, rdeco, two-avg, and unsw); and performance was considerably poorer on low-quality SAFER data, with only three algorithms scoring ≥ 0.78 (fnvg, nk, and unsw), and none scoring higher than 0.84.

Fig 3. The performance of QRS detectors, expressed as the F1 score.

Results are shown for the 18 QRS detectors (on the y-axis) and the six datasets (x-axis). Dataset definitions: ARR—MIT-BIH Arrhythmia Database; HIGH—PhysioNet/Computing in Cardiology Challenge 2014 training set; LOW—PhysioNet/Computing in Cardiology Challenge 2014 augmented training set; SAFER-AF-HIGH—SAFER ECG Dataset subset of high-quality ECGs exhibiting AF; SAFER-nonAF-HIGH—SAFER ECG Dataset subset of high-quality ECGs not exhibiting AF; SAFER-nonAF-LOW—SAFER ECG Dataset subset of low-quality ECGs not exhibiting AF; SIN—MIT-BIH Normal Sinus Rhythm Database; TELE—TELE ECG Database.

https://doi.org/10.1371/journal.pdig.0000538.g003

Therefore, overall the nk and unsw algorithms performed best, with consistently high F1 scores on datasets of supervised ECG recordings, and the highest F1 scores on self-recorded ECGs (TELE and SAFER datasets).

Additional results for the positive predictive value (PPV) and sensitivity (SEN) are provided in Figs B and C in S1 Text. These metrics show that: gamb performed poorly because of a low PPV, indicating that it falsely detected additional QRS complexes; and mart and engz had a low SEN, indicating that they frequently missed QRS complexes.

Fig 4 shows the error rates of each QRS detector (indicating the proportion of ECG windows for which the QRS detector algorithms failed to execute, i.e. encountered an error). Most QRS detectors had no or very few errors. The best-performing algorithms (nk and unsw) had 0.0% errors on all datasets. The engz and gamb algorithm implementations frequently produced errors, and some errors were encountered for the gqrs, jqrs, rdeco and rpeak algorithms. Of particular note, the gamb algorithm exhibited error rates of ≥99% on SAFER data (in keeping with a previous study [8]). This was due to the algorithm’s use of a fixed amplitude threshold which was often not met for SAFER ECGs.

Fig 4. The error rates for each QRS detector (expressed as percentages).

These indicate the proportion of ECG windows for which the QRS detector algorithms failed to execute (i.e. encountered an error).

https://doi.org/10.1371/journal.pdig.0000538.g004

Fig 5 shows the median execution time of each QRS detector. The fastest QRS detector, rpeak, had an execution time of 1.1 ms (i.e. 0.004% of the signal duration). Of the best-performing QRS detectors (nk and unsw), nk had a short execution time of 2.7 ms (0.009% of the signal duration), whereas unsw was slower at 37.1 ms (0.124% of the signal duration). Four QRS detectors had much longer execution times (christ, engz, gqrs, and wqrs), although we note that C code implementations are available for some of these that would have led to shorter execution times. Most QRS detectors had similar median and mean execution times: the mean execution time was between 87 and 124% of the median for all QRS detectors except christ, whose mean execution time was substantially longer (373% of the median), primarily due to exceptionally high runtimes on the SAFER-nonAF-LOW dataset.

Fig 5. QRS detector execution times.

The median execution time of each QRS detector was calculated across all datasets, where QRS detectors were implemented in either Python (blue) or Matlab (red).

https://doi.org/10.1371/journal.pdig.0000538.g005

Comparison between supervised and telehealth ECGs

For most QRS detectors, performance was higher on supervised ECG recordings than on unsupervised, telehealth ECGs. A total of 17 (out of 18) QRS detectors had a significantly higher F1 score on the supervised SIN dataset than on the unsupervised SAFER-nonAF-HIGH dataset (mart showed no significant difference). Similarly, 16 QRS detectors had a significantly higher F1 score on the supervised ARR dataset than on the unsupervised SAFER-AF-HIGH dataset (hamilt and nk showed no significant difference). Referring to Fig 3, some QRS detectors performed below average on unsupervised ECGs despite having performed well on supervised ECGs: gqrs achieved F1 scores of ≥0.98 on supervised ECGs (SIN, ARR, HIGH, LOW), but ≤0.70 on SAFER; and rpeak achieved ≥0.81 on supervised ECGs, but ≤0.53 on the TELE and SAFER datasets.

The results for positive predictive value (PPV) and sensitivity (SEN) (in Figs B and C in S1 Text) show that most QRS detectors which performed poorly on self-recorded ECGs had a low PPV, indicating false positive QRS detections. In addition, some QRS detectors had low sensitivities, indicating unrecognized QRS complexes (e.g. engz, gamb, jqrs, mart, nab, and wqrs).

Algorithm errors predominantly occurred in unsupervised telehealth ECGs (see Fig 4).

The impact of signal quality

Low signal quality was associated with poorer performance of QRS detectors in the telehealth setting. The F1 scores for all QRS detectors except gamb were significantly lower on low-quality unsupervised ECGs (SAFER-nonAF-LOW) than high-quality unsupervised ECGs (SAFER-nonAF-HIGH). For instance, the best-performing QRS detectors (nk and unsw) performed well on high-quality unsupervised ECGs (TELE, SAFER-nonAF-HIGH, SAFER-AF-HIGH) with F1 scores of ≥0.97, but performed less well on low-quality unsupervised ECGs (SAFER-nonAF-LOW) with F1 scores of ≤0.84. Indeed, all remaining QRS detectors showed F1 scores of ≤0.78 on low-quality ECGs (SAFER-nonAF-LOW) in the unsupervised telehealth environment.

Signal quality had a smaller but nonetheless significant impact on QRS detectors when using supervised ECGs. Almost all algorithms performed well on high-quality ECGs (the SIN, ARR, and HIGH datasets) with F1 scores of ≥0.97 (except gamb and mart), and most of these algorithms continued to perform relatively well on low-quality supervised ECGs (the LOW dataset) with F1 scores of ≥0.97 (except engz, gamb, jqrs, mart, nab and rpeak). The small differences in F1 scores between HIGH and LOW were significant for all QRS detectors except wqrs and hamilt.

Other influencing factors

The presence of arrhythmia did not have a large effect on F1 scores for either supervised ECGs (comparing ARR and SIN) or unsupervised ECGs (comparing SAFER-AF-HIGH and SAFER-nonAF-HIGH) (see Fig 3). Whilst the differences were mostly small, F1 scores were significantly lower during arrhythmias in ARR compared to SIN for 6 out of 18 QRS detectors, and in SAFER-AF-HIGH compared to SAFER-nonAF-HIGH for 8 QRS detectors. Amongst the best-performing QRS detectors (nk and unsw), the only significant difference was for nk in the comparison of SAFER-AF-HIGH and SAFER-nonAF-HIGH, although this difference was small, with median F1 scores of 0.99 and 0.98 respectively.

Sex had little impact on performance when using unsupervised ECGs as demonstrated by there being no significant differences in performance between female and male subjects on high-quality, non-AF SAFER signals (see Fig 6A), and significant differences for only two QRS detectors on high-quality, AF SAFER signals (see Fig 6B). There were no significant differences in performance between sexes on the SIN and ARR datasets (see Fig D in S1 Text), although we note the small numbers of subjects in each group in the SIN dataset (13 female and 5 male).

Fig 6. Comparison of the performance of QRS detectors between female (F) and male (M) SAFER participants.

A: SAFER-nonAF-HIGH: High-quality, non-AF ECGs (including 100 female and 99 male subjects). B: SAFER-AF-HIGH: High-quality, AF ECGs (including 92 female and 91 male subjects). Definitions: ns—no significant difference.

https://doi.org/10.1371/journal.pdig.0000538.g006

p-values for all statistical comparisons are provided in Tables D and E in S1 Text.

Discussion

Summary of findings

This study assessed the performance of open-source QRS detectors on single-lead, telehealth ECGs. The neurokit (nk) and UNSW (unsw) QRS detectors were identified as the best-performing out of 18 QRS detectors. They performed well on telehealth ECGs recorded without clinical supervision, and also on ECGs recorded in clinical settings. They achieved F1 scores of ≥0.98 on high-quality telehealth ECGs and ≥0.97 on ECGs recorded in clinical settings. Performance was lower (F1 scores of 0.78–0.84) when analysing low-quality telehealth ECGs. Performance was not substantially affected by heart rhythm or gender. nk had one of the fastest execution times (at 0.009% of the signal duration), whereas unsw was over ten times slower (0.124%).

Comparison with literature

Several studies have compared the performance of multiple QRS detection algorithms across databases of different quality [5, 6, 8, 22]. Previous studies assessed 6–12 algorithms, compared to 18 in the current study. Several of the high-performing algorithms included in the current study were not widely assessed in previous comparison studies: nk and two-avg were only included in [8]; unsw was only included in [5]; and rdeco was not included in these studies. In addition, previous studies had mostly focused on assessing performance on supervised ECG recordings rather than the telehealth setting. Telehealth data was only included in [5]: the current study included analyses of both this dataset and also data from the SAFER AF screening study, containing the additional challenge of QRS detection during AF.

The current study adds to our understanding of how best to detect QRS complexes in telehealth ECGs, and demonstrates the need to develop techniques to handle low-quality ECGs appropriately. Previously, QRS detectors had been found to perform worse on telehealth data, and in particular the TELE dataset [5]. We also observed worse performance on telehealth data, although we found that the best QRS detectors performed adequately well on high-quality telehealth data, and that performance was only substantially worse on low-quality telehealth data. This provides two complementary directions for future work: (i) QRS detectors could be developed to perform well even in the presence of noise (e.g. through denoising [37] or improved algorithm design [4]); and (ii) ECG signal quality algorithms could be developed to identify low-quality recordings in which QRS complexes cannot be accurately identified [22, 38].

The current study also has implications for future research. We observed that the performance of QRS detectors on supervised or high-quality ECG recordings is not necessarily indicative of their performance on unsupervised recordings, in keeping with [6]. This highlights the importance of assessing performance in the target setting, such as AF screening as performed in this study. We also observed quite different performances on the TELE dataset from those reported previously: whereas the highest-performing algorithm achieved an F1 score of 0.80 on TELE in [5], six of the algorithms included in the present study achieved F1 scores of 0.90–1.00. Whilst in many cases this may be explained by the inclusion of additional algorithms in this study, it is notable that the jqrs algorithm’s performance was substantially higher on this dataset in the present study (0.93) than in the previous study (0.79). This may also be explained by the use of different tolerance windows. Nonetheless, this demonstrates the need to share open-source algorithm implementations and the code used to perform algorithm assessments. To address this, we have provided a repository of open-source algorithms and assessment code to accompany this article: https://github.com/floriankri/ecg_detector_assessment.
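The tolerance window directly determines which detections count as true positives, which is why different windows can yield different F1 scores for the same algorithm. The matching step can be sketched as follows (a simple greedy matcher; the 150 ms default tolerance is an assumption for illustration, not necessarily the value used in any cited study):

```python
def match_beats(detected, reference, fs, tol_s=0.15):
    """Match detected beats to reference annotations within a
    tolerance window; returns (TP, FP, FN). Beat positions are
    sample indices; fs is the sampling frequency in Hz."""
    tol = tol_s * fs
    ref = sorted(reference)
    used = [False] * len(ref)
    tp = 0
    for d in sorted(detected):
        for i, r in enumerate(ref):
            if not used[i] and abs(d - r) <= tol:
                used[i] = True  # each reference beat matches at most once
                tp += 1
                break
    fp = len(detected) - tp  # detections with no matching reference beat
    fn = len(ref) - tp       # reference beats that were missed
    return tp, fp, fn
```

Widening the tolerance converts some false positive/false negative pairs into true positives, so reported scores are only comparable when the window is held fixed.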

Strengths and limitations

The key strengths of this study are the assessment of QRS detectors in a real-world AF screening setting, and the inclusion of recently developed, high-performance QRS detectors. The key limitation is that algorithms were run retrospectively on a computer, rather than in real-time on a telehealth device. Some algorithms were implemented in Python, and others in Matlab. Therefore, the comparison of algorithm execution times reported in this study may not be truly representative of the relative execution times which would be observed on devices: the comparison of Python and Matlab execution times may not be fair; different algorithms may have been optimised to different extents; and some algorithms may be more amenable to further optimisation for use on devices than others (such as through implementation in C, as is already the case for parts of unsw). We note that in this study we did not investigate the potential benefit of additional ECG filtering beyond that already incorporated into each of the QRS detector algorithms: potentially performance could be improved further by including additional linear or non-linear filtering steps [39, 40]. Furthermore, we did not investigate the accuracy of RR-intervals derived from QRS detections, nor their suitability for heart rate variability analysis or arrhythmia detection. We note that additional processing steps may be required to accurately derive RR-intervals, such as locating the R-wave on each detected QRS complex.
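As an illustration of the derivation step mentioned above, RR-intervals follow directly from successive R-peak sample indices (a minimal sketch; no correction for missed, spurious or ectopic beats is applied, which is part of why additional processing may be required in practice):

```python
def rr_intervals(r_peaks, fs):
    """RR-intervals in seconds from R-peak sample indices
    (fs is the sampling frequency in Hz)."""
    return [(b - a) / fs for a, b in zip(r_peaks, r_peaks[1:])]
```

A single missed beat produces one spuriously doubled interval, which can substantially distort downstream heart rate variability metrics.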

Implications

This study identified leading QRS detector algorithms for use with telehealth ECGs. The best-performing algorithms were able to detect QRS complexes with a very high degree of accuracy on high-quality telehealth ECG data, demonstrating the potential utility of telehealth devices for assessments based on RR-intervals (such as arrhythmia detection). Furthermore, the study demonstrates the importance of selecting a high-performance QRS detector, since performance can vary greatly on telehealth ECGs, even between well-established algorithms. The study also demonstrates the difficulty of analysing low-quality telehealth ECGs; their particularly low quality may be due to increased artifact, the use of dry electrodes, self-recording without clinical supervision, and acquisition at the hands rather than the chest [4].
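As one example of an RR-interval-based assessment, a crude irregularity measure such as the coefficient of variation of RR-intervals can flag candidate AF episodes, since AF produces an irregularly irregular rhythm (illustrative only; validated AF detectors use more sophisticated features and thresholds, and the function name is ours):

```python
import statistics

def rr_irregularity(rr):
    """Coefficient of variation of RR-intervals (in seconds):
    a crude measure of rhythm irregularity."""
    return statistics.stdev(rr) / statistics.mean(rr)
```

Because such measures operate entirely on RR-intervals, their reliability depends directly on the accuracy of the upstream QRS detector.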

The findings are particularly relevant to telehealth settings where ECG signals are recorded without clinical supervision. Several such settings arise in the detection and management of atrial fibrillation at home, including: (i) virtual wards to reduce hospitalisation for atrial fibrillation [41]; (ii) screening for paroxysmal atrial fibrillation [42]; and (iii) detecting recurrent atrial fibrillation after ablation or cardioversion [43]. In each of these examples an accurate QRS detector is a key step in processing the intermittent ECGs acquired by patients at home, where signal quality may be lower than in the clinical setting.

Conclusion

This study identified two leading QRS detectors for use with single-lead, telehealth ECGs: the nk and unsw algorithms. These algorithms provided accurate QRS detection with fast execution times. Whilst most other algorithms performed well on data collected under clinical supervision, many did not perform as well on telehealth data, demonstrating the importance of selecting a high-performance algorithm for use in clinical analysis. The performance of even the leading algorithms was substantially lower on low-quality telehealth ECGs, highlighting the need to handle low-quality ECGs appropriately in an analysis pipeline. All the QRS detection algorithms used in this study are openly available, ensuring that they can be quickly used in future research. Furthermore, the code used to assess algorithm performance is also available to facilitate future research, at: https://github.com/floriankri/ecg_detector_assessment.

Supporting information

S1 Text. Supplementary Material.

The Supplementary Material provides additional results, details of the study methodology, and links to algorithms and datasets.

https://doi.org/10.1371/journal.pdig.0000538.s001

(PDF)

Acknowledgments

The study by Liu et al. [5] provided the foundations for the selection of datasets and their presentation in Table 2. ChatGPT (OpenAI, San Francisco, CA, USA) was used for language editing.

The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care.

References

  1. Lopez Perales CR, Van Spall HGC, Maeda S, Jimenez A, Laţcu DG, Milman A, Kirakoya-Samadoulougou F, et al. Mobile health applications for the detection of atrial fibrillation: a systematic review. EP Europace. 2021;23(1):11–28. pmid:33043358
  2. Lu L, Zhang J, Xie Y, Gao F, Xu S, Wu X, et al. Wearable health devices in health care: narrative systematic review. JMIR mHealth and uHealth. 2020;8(11):e18907. pmid:33164904
  3. Lee SM, Lee D. Opportunities and challenges for contactless healthcare services in the post-COVID-19 Era. Technol Forecast Soc Change. 2021;167:120712. pmid:33654330
  4. Khamis H, Weiss R, Xie Y, Chang CW, Lovell NH, Redmond SJ. QRS detection algorithm for telehealth electrocardiogram recordings. IEEE Trans Biomed Eng. 2016;63(7):1377–1388. pmid:27046889
  5. Liu F, Liu C, Jiang X, Zhang Z, Zhang Y, Li J, et al. Performance analysis of ten common QRS detectors on different ECG application cases. J Healthcare Eng. 2018;2018:e9050812. pmid:29854370
  6. Llamedo M, Martínez JP. QRS detectors performance comparison in public databases. Computing in Cardiology. 2014;357–360.
  7. Elgendi M. Fast QRS Detection with an Optimized Knowledge-Based Method: Evaluation on 11 Standard ECG Databases. PLoS ONE. 2013;8(9):e73557. pmid:24066054
  8. Makowski D, Pham T, Lau ZJ, Brammer JC, Lespinasse F, Pham H, et al. NeuroKit2: A Python toolbox for neurophysiological signal processing. Behav Res. 2021 Aug;53(4):1689–1696. pmid:33528817
  9. Porr B, Howell L. py-ecg-detectors: Seven ECG heartbeat detection algorithms and heartrate variability analysis. Version 1.3.2. Available from: https://github.com/berndporr/py-ecg-qrs-detectors.
  10. Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation. 2000 Jun 13;101(23):E215–220. pmid:10851218
  11. Moody GB, Mark RG. The impact of the MIT-BIH Arrhythmia Database. IEEE Eng Med Biol Mag. 2001 May;20(3):45–50. pmid:11446209
  12. Moody G, Moody B, Silva I. Robust detection of heart beats in multimodal data: The PhysioNet/Computing in Cardiology Challenge 2014. In: Computing in Cardiology 2014. 2014 Sep; p. 549–552.
  13. Pandiaraja M, Brimicombe J, Cowie M, Dymond A, Lindén HC, Lip GYH, et al. Screening for atrial fibrillation: improving efficiency of manual review of handheld electrocardiograms. Eng Proc. 2020;2(1):78. pmid:33778802
  14. Christov II. Real time electrocardiogram QRS detection using combined adaptive threshold. BioMedical Engineering OnLine. 2004 Aug 27;3(1):28. pmid:15333132
  15. Engelse WAH, Zeelenberg C. A single scan algorithm for QRS-detection and feature extraction. Computers in Cardiology. 1979;6:37–42.
  16. Lourenco A, Silva H, Leite P, Lourenco R, Fred A. Real time electrocardiogram segmentation for finger based ECG biometrics. In: Proceedings of the International Conference on Bio-inspired Systems and Signal Processing. 2012; p. 49–54.
  17. Emrich J, Taulant K, Wirth S, Muma M. Accelerated Sample-Accurate R-Peak Detectors Based on Visibility Graphs. In: Proceedings of the European Signal Processing Conference. 2023; p. 1090–1094.
  18. Koka T, Muma M. Fast and Sample Accurate R-Peak Detection for Noisy ECG Using Visibility Graphs. In: Proceedings of the 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society. 2022; p. 121–126.
  19. Gamboa H. Multi-modal Behavioral Biometrics Based on HCI and Electrophysiology. PhD Thesis, Universidade Técnica de Lisboa. 2008.
  20. Hamilton P. Open Source ECG Analysis. In: Computers in Cardiology 2002. Volume 29. Memphis, TN, USA: IEEE; 2002. p. 101–104.
  21. Hamilton PS, Tompkins WJ. Quantitative Investigation of QRS Detection Rules Using the MIT/BIH Arrhythmia Database. IEEE Transactions on Biomedical Engineering. 1986 Dec;BME-33(12):1157–1165. pmid:3817849
  22. Johnson AEW, Behar J, Andreotti F, Clifford GD, Oster J. Multimodal heart beat detection using signal quality indices. Physiol Meas. 2015 Jul;36(8):1665–1677. pmid:26218060
  23. Behar J, Oster J, Clifford GD. Non-invasive FECG extraction from a set of abdominal sensors. Computing in Cardiology 2013. 2013 Sep;297–300.
  24. Behar J, Oster J, Clifford GD. Combining and benchmarking methods of foetal ECG extraction without maternal or scalp electrode data. Physiol Meas. 2014 Jul;35(8):1569–1589. pmid:25069410
  25. Kalidas V, Tamil L. Real-time QRS detector using Stationary Wavelet Transform for Automated ECG Analysis. In: Proceedings of the 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE); 2017 Oct; p. 457–461.
  26. Martinez JP, Almeida R, Olmos S, Rocha AP, Laguna P. A wavelet-based ECG delineator: evaluation on standard databases. IEEE Trans Biomed Eng. 2004 Apr;51(4):570–581. pmid:15072211
  27. Nabian M, Yin Y, Wormwood J, Quigley KS, Barrett LF, Ostadabbas S. An open-source feature extraction tool for the analysis of peripheral physiological data. IEEE J Transl Eng Health Med. 2018;6:2800711. pmid:30443441
  28. Pan J, Tompkins WJ. A real-time QRS detection algorithm. IEEE Trans Biomed Eng. 1985 Mar;32(3):230–236. pmid:3997178
  29. Moeyersons J, Amoni M, Van Huffel S, Willems R, Varon C. R-DECO: an open-source Matlab based graphical user interface for the detection and correction of R-peaks. PeerJ Comput Sci. 2019 Oct 21;5:e226. pmid:33816879
  30. Elgendi M, Jonkman M, DeBoer F. Frequency bands effects on QRS detection. In: International Conference on Bio-inspired Systems and Signal Processing. Valencia, Spain: SciTePress; 2010. p. 428–431.
  31. Zong W, Moody GB, Jiang D. A robust open-source algorithm to detect onset and duration of QRS complexes. In: Computers in Cardiology. 2003 Sep; p. 737–740.
  32. Redmond SJ, Xie Y, Chang D, Basilakis J, Lovell NH. Electrocardiogram signal quality measures for unsupervised telehealth environments. Physiol Meas. 2012;33(9):1517–1533. pmid:22903004
  33. Adeniji M, Brimicombe J, Cowie M, Dymond A, Lindén HC, Lip GYH, et al. Prioritising electrocardiograms for manual review to improve the efficiency of atrial fibrillation screening. In: Proc IEEE EMBS. 2022; p. 3239–3242. pmid:36086145
  34. Laguna P, Jané R, Caminal P. Automatic detection of wave boundaries in multilead ECG signals: validation with the CSE database. Comput Biomed Res. 1994 Feb;27(1):45–60. pmid:8004942
  35. Hnatkova K, Smetana P, Toman O, Schmidt G, Malik M. Sex and race differences in QRS duration. EP Europace. 2016;18(12):1842–1849. pmid:27142220
  36. Wang NC, Maggioni AP, Konstam MA, Zannad F, Krasa HB, Burnett JC, et al. Clinical implications of QRS duration in patients hospitalized with worsening heart failure and reduced left ventricular ejection fraction. JAMA. 2008;299(22):2656–2666. pmid:18544725
  37. Beni NH, Jiang N. Heartbeat detection from single-lead ECG contaminated with simulated EMG at different intensity levels: A comparative study. Biomed Signal Process Control. 2023;83:104612.
  38. Liu F, Liu C, Zhao L, Jiang X, Zhang Z, Li J, et al. Dynamic ECG Signal Quality Evaluation Based on the Generalized bSQI Index. IEEE Access. 2018;6:41892–41902.
  39. Clifford GD. Linear Filtering Methods. In: Advanced Methods and Tools For ECG Data Analysis. Artech; 2006. p. 135–170.
  40. McSharry PE, Clifford GD. Nonlinear Filtering Methods. In: Advanced Methods and Tools For ECG Data Analysis. Artech; 2006. p. 171–196.
  41. Kotb A, Armstrong S, Koev I, Antoun I, Vali Z, Panchal G, et al. Digitally enabled acute care for atrial fibrillation: conception, feasibility and early outcomes of an AF virtual ward. Open Heart. 2023;10(1):e002272. pmid:37385729
  42. Svennberg E, Engdahl J, Al-Khalili F, Friberg L, Frykman V, Rosenqvist M. Mass screening for untreated atrial fibrillation: the STROKESTOP study. Circulation. 2015;131(25):2176–2184. pmid:25910800
  43. Goldenthal I, Sciacca RR, Riga T, Bakken S, Baumeister M, Biviano AB, et al. Recurrent atrial fibrillation/flutter detection after ablation or cardioversion using the AliveCor KardiaMobile device: iHEART results. Journal of Cardiovascular Electrophysiology. 2019;30(11):2220–2228. pmid:31507001