
Evaluating the diagnostic and triage performance of digital and online symptom checkers for the presentation of myocardial infarction: A retrospective cross-sectional study

  • William Wallace,

    Roles Data curation, Formal analysis, Investigation, Methodology, Writing – original draft

    Affiliation Department of Surgery & Cancer, Imperial College London, St. Mary’s Hospital, London, United Kingdom

  • Calvin Chan,

    Roles Data curation, Formal analysis, Project administration, Validation, Writing – original draft, Writing – review & editing

    calvin.chan16@imperial.ac.uk

    Affiliations Department of Surgery & Cancer, Imperial College London, St. Mary’s Hospital, London, United Kingdom, Institute of Global Health Innovation, Imperial College London, South Kensington Campus, London, United Kingdom

  • Swathikan Chidambaram,

    Roles Formal analysis, Investigation, Methodology, Writing – review & editing

    Affiliation Department of Surgery & Cancer, Imperial College London, St. Mary’s Hospital, London, United Kingdom

  • Lydia Hanna,

    Roles Data curation, Supervision, Validation, Writing – review & editing

    Affiliation Department of Surgery & Cancer, Imperial College London, St. Mary’s Hospital, London, United Kingdom

  • Amish Acharya,

    Roles Data curation, Formal analysis, Investigation, Project administration, Resources, Writing – review & editing

    Affiliations Department of Surgery & Cancer, Imperial College London, St. Mary’s Hospital, London, United Kingdom, Institute of Global Health Innovation, Imperial College London, South Kensington Campus, London, United Kingdom

  • Elisabeth Daniels,

    Roles Data curation, Writing – review & editing

    Affiliation Department of Surgery & Cancer, Imperial College London, St. Mary’s Hospital, London, United Kingdom

  • Pasha Normahani,

    Roles Project administration, Supervision, Writing – review & editing

    Affiliation Department of Surgery & Cancer, Imperial College London, St. Mary’s Hospital, London, United Kingdom

  • Rubeta N. Matin,

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliation Department of Dermatology, Oxford University Hospitals NHS Foundation Trust, Oxford, United Kingdom

  • Sheraz R. Markar,

    Roles Supervision

    Affiliation Surgical Intervention Trials Unit, Nuffield Department of Surgery, University of Oxford, United Kingdom

  • Viknesh Sounderajah,

    Roles Conceptualization, Data curation, Investigation, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Institute of Global Health Innovation, Imperial College London, South Kensington Campus, London, United Kingdom, Google Health UK, London, United Kingdom

  • Xiaoxuan Liu,

    Roles Conceptualization, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliations Institute of Inflammation and Ageing, University of Birmingham, Birmingham, United Kingdom, University Hospitals Birmingham NHS Foundation Trust, Birmingham, United Kingdom

  • Ara Darzi

    Roles Supervision

    Affiliations Department of Surgery & Cancer, Imperial College London, St. Mary’s Hospital, London, United Kingdom, Institute of Global Health Innovation, Imperial College London, South Kensington Campus, London, United Kingdom

Abstract

Online symptom checkers are increasingly popular health technologies that enable patients to input their symptoms to produce diagnoses and triage advice. However, there is concern regarding the performance and safety of symptom checkers in diagnosing and triaging patients with life-threatening conditions. This retrospective cross-sectional study aimed to evaluate and compare commercially available symptom checkers for performance in diagnosing and triaging myocardial infarction (MI). Symptoms and biodata of MI patients were inputted into eight symptom checkers identified through a systematic search. Anonymised clinical data of 100 consecutive MI patients were collected from a tertiary coronary intervention centre between 1 January 2020 and 31 December 2020. Outcomes included (1) diagnostic sensitivity, defined as symptom checkers outputting MI as the primary diagnosis (D1), or within the top three (D3) or top five (D5) diagnoses; and (2) triage sensitivity, defined as symptom checkers outputting urgent treatment recommendations. Overall D1 sensitivity was 48±31% and varied between symptom checkers (range: 6–85%). Overall D3 and D5 sensitivity were 73±20% (range: 34–92%) and 79±14% (range: 63–94%), respectively. Overall triage sensitivity was 83±13% (range: 55–91%). Only 24±16% of atypical cases had a correct D1, and for female atypical cases D1 sensitivity was just 10%. Atypical MI D3 and D5 sensitivity were 44±21% and 48±24% respectively, both significantly lower than for typical MI cases (p<0.01). Atypical MI triage sensitivity was significantly lower than for typical cases (53±20% versus 84±15%, p<0.01). Female atypical cases had significantly lower diagnostic and triage sensitivity than typical female MI cases (p<0.01). Given the severity of the pathology, the diagnostic performance of symptom checkers for correctly diagnosing an MI is concerningly low. Moreover, there is considerable inter-symptom-checker performance variation. Patients presenting with atypical symptoms were under-diagnosed and under-triaged, especially if female. This study highlights the need for improved clinical performance, equity and transparency associated with these technologies.

Author summary

Online symptom checkers are increasingly popular tools that patients turn to in order to understand their symptoms, self-diagnose and ultimately decide whether to seek further medical attention. We wanted to evaluate how accurately different commercially available symptom checkers diagnose a severe medical condition (myocardial infarction) and give appropriate medical advice (i.e., seek immediate medical assistance). In this study, we collected anonymised clinical data of 100 people who had confirmed myocardial infarctions. Their presenting symptoms and biodata (e.g., age, sex, co-morbidities) were inputted into the eight most popular commercially available symptom checkers found through a Google and an App Store search. We found that the performance of the online symptom checkers for correctly diagnosing a myocardial infarction was low, especially given the severity of the disease. Additionally, there was considerable variability in performance between different symptom checkers. Finally, we found that patients with atypical presenting symptoms were less likely to be correctly diagnosed and given correct medical advice, especially if female. Our study highlights the need for better clinical performance, equality and transparency of online symptom checkers.

Introduction

Symptom checkers are applications or web-based tools that enable patients to input symptoms and personalised biodata to produce potential diagnoses and relevant clinical information, aiding self-diagnosis and triage. Symptom checkers are becoming an increasingly prominent feature of the modern healthcare landscape due to both increased access to internet connectivity and a cultural shift towards more involved self-care engagement. Approximately 70% of internet users search for health-related information, while over a third of adults use the internet for self-diagnosis [1–3]. Symptom checkers are especially pertinent given the unprecedented burden exerted on emergency care services, particularly in light of the COVID-19 pandemic [3–5]. There are over 23 million visits to A&E annually in the UK, of which an estimated 11.7% (approximately 2.7 million visits) would be better managed by other services [6,7]. Thus, symptom checkers could reduce the financial and resource burden placed upon hospitals and help focus resources towards those who are truly in medical need [8]. As such, governments have incorporated symptom checkers into their formal health and social care pathways in order to alleviate the increasing burden placed upon both primary care and emergency services [3–5]. In 2017, the NHS 111 triage service, backed by Babylon, reported over 15,500 app downloads and was responsible for over 9,700 triages [9]. In 2021, Babylon reportedly covered 4.3 million people worldwide, performing over 1.2 million digital consultations with 4,000 clinical consultations each day, or one patient interaction every 10 seconds.

Despite the increased use of these digital health technologies, there has been a lack of proper evaluation and assessment of public readiness, as is apparent in recent media criticism of symptom checkers. The NHS-backed application Babylon was found to miss myocardial infarctions (MIs) in female users and in users with atypical presentations of chest pain [10,11]. Myocardial infarctions usually present as central, substernal chest pain that is crushing, heavy or tight in character, with possible radiation to the arm, shoulder, back, epigastrium, jaw or neck [12–14]. However, 20–30% of patients present atypically, especially if diabetic, elderly or female [13–16]. Delays in seeking medical attention and receiving treatment increase the risk of morbidity and mortality in patients with MI [17–19]. During the COVID-19 pandemic, a 30–50% decrease in MI hospital admissions was noted [20–22]. Concordantly, during this period, the incidence of out-of-hospital cardiac arrests in England due to MI increased by 56% [20]. This could demonstrate a reduced willingness to seek medical attention and highlights a potential use case for symptom checkers, which can provide urgent triage advice and enable patients to receive prompt medical attention. However, the utility of symptom checkers wholly depends on the accuracy of the diagnostic and triage advice given. If a patient with a non-urgent ailment is over-triaged by a symptom checker, healthcare services may be unnecessarily utilised. Conversely, symptom checkers that miss cases of life-threatening and serious conditions could lead people to avoid seeking help, which may increase morbidity and mortality [8,23].

There are a number of freely available online and app-based symptom checkers that patients can utilise. However, there is currently no guidance on, or endorsement of, any particular symptom checker by health agencies such as NICE or the NHS. Furthermore, there is often a lack of clear guidance governing the use of symptom checkers, and where guidance exists, it is unclear how well it is followed. For example, under International Medical Device Regulators Forum (IMDRF) guidance [24], software that has a medical purpose (as defined in the guidance) qualifies as a medical device and is regulated as such. Despite this apparent international harmonisation, there is still heterogeneity between IMDRF members as to what software qualifies from jurisdiction to jurisdiction. Although there are some NICE-recommended apps such as Sleepio, the majority are not registered as medical devices (Class I, II, etc. under CE/UKCA/FDA), nor do they state the classification where they do qualify as a medical device. There is therefore a lack of regulatory requirements relating to safety, such as post-market surveillance. Thus, the primary aim of this study is to assess and compare the performance of different, publicly available symptom checkers for diagnosing and triaging myocardial infarction. The secondary aim is to identify potential inequities in symptom checker performance based upon symptomatology (typical versus atypical MI) and pertinent patient demographics (age, sex and ethnicity).

Results

Identification of symptom checkers

In total, 60 symptom checkers were identified. Eleven duplicates were removed, and 39 symptom checkers did not meet the inclusion criteria: 23 of the apps were not free of charge; 9 were not in the English language; and 3 provided a limited number of diagnoses. A further two symptom checkers were removed due to having duplicate logic tools or backend software [8]. Eight symptom checkers were included in the study: seven that could both diagnose and triage patients, and one (SC4) that provided only diagnostic information. All the included symptom checkers had disclaimers in their terms of use stating that they were “not designed to provide medical diagnosis, advice, or treatment”. At the time of data collection, five of the symptom checkers (SC1, SC2, SC3, SC6 and SC7) were registered with the MHRA by the manufacturer as Class I medical devices; the classification of the remaining symptom checkers was unclear. Other symptom checker characteristics are summarised in Table 1.

Table 1. Symptom checker characteristics regarding biodata, case specific questions and possible results.

https://doi.org/10.1371/journal.pdig.0000558.t001

Participants

Of the 100 cases collected, 97 were confirmed to have an acute ST-elevation myocardial infarction (STEMI), whilst the remaining 3 were found to have an acute non-ST-elevation myocardial infarction (NSTEMI). Females made up 24% of the 100 cases collected [25]. The female group had a significantly higher mean age than the male group (66.8±16.6 versus 60.7±14.0 years, p < .05). The frequency of hypercholesterolaemia was also significantly higher in the female group (62.5% versus 31.6%, p < .01). No other significant differences were noted between groups. Demographic data are summarised in Table 2.

Table 2. Baseline participant demographics stratified by sex.

https://doi.org/10.1371/journal.pdig.0000558.t002

Diagnostic sensitivity

The overall mean D1 sensitivity (±standard deviation) of symptom checkers was 48.0±31.4% (Fig 1). D1 sensitivity ranged from 6.0% to 85.0% between symptom checkers. The overall mean D3 sensitivity was 72.6±20.2% (range: 34.0–92.0%). The overall mean D5 sensitivity was 78.5±13.5% (range: 63.0–94.0%).

Fig 1. Symptom checker D1, D3, D5 and Triage sensitivity for all cases (n = 100).

Note. Mean sensitivity for D1, D3, D5 and Triage has been included. SC4 was removed from the Triage bar chart due to its lack of a triage function.

https://doi.org/10.1371/journal.pdig.0000558.g001

Subgroup analysis: Sex.

There were no significant differences in mean D1 (48.7±33.7% versus 45.8±24.1%, p = .848) and D3 (75.7±21.7% versus 63.0±16.6%, p = .213) sensitivity between male and female patients across all symptom checkers (Fig 2A and 2B). For specific symptom checkers, we note that (1) D1 sensitivity for males was significantly higher than for females in SC8 (89.5% versus 70.8%, p < .05), and (2) D3 sensitivity for males was significantly higher than for females in three of the symptom checkers assessed. Pooled mean D5 sensitivity for males was significantly higher than for females (82.1±14.5% versus 67.2±11.7%, p < .05) (Fig 2C). D5 sensitivity was significantly higher for males than for females in four of the symptom checkers examined.

Fig 2. Symptom checker D1, D3, D5 and Triage sensitivity for both male (blue, n = 76) and female (orange, n = 24) cases.

Note. Mean sensitivity for D1, D3, D5 and Triage has been included; error bars represent standard deviation. Overall differences in sensitivity between male and female cases were assessed using a t-test for each sensitivity measure. Significance within each symptom checker was assessed using Pearson’s chi-squared test for each sensitivity measure. Levels of significance are shown with asterisks (* p < .05, ** p < .01 and *** p < .001). SC4 was removed from the Triage bar chart due to its lack of a triage function.

https://doi.org/10.1371/journal.pdig.0000558.g002

Subgroup analysis: Symptomology (atypical versus typical MI presentation).

Mean D1 sensitivity for atypical MI cases was not significantly lower than that for typical presentations (24.0±16.2% versus 52.5±34.9%, p = .064) (Fig 3A). However, the D1 sensitivity for atypical MI cases was significantly lower in five individual symptom checkers.

Fig 3. Symptom checker D1, D3, D5 and Triage sensitivity for both typical (blue, n = 84) and atypical (orange, n = 16) MI cases.

Note. Mean sensitivity for D1, D3, D5 and Triage has been included; error bars represent standard deviation. Overall differences in sensitivity between typical and atypical MI cases were assessed using a t-test for each sensitivity measure. Significance within each symptom checker was assessed using Pearson’s chi-squared test for each sensitivity measure. Levels of significance are shown with asterisks (* p < .05, ** p < .01 and *** p < .001). SC4 was removed from the Triage bar chart due to its lack of a triage function.

https://doi.org/10.1371/journal.pdig.0000558.g003

Mean D3 sensitivity for atypical MI cases was significantly lower than for typical MI cases (43.8±20.6% versus 78.1±23.1%, p < .01) (Fig 3B). The D3 sensitivity for atypical MI cases was significantly lower in five symptom checkers.

Finally, mean D5 sensitivity for atypical cases was also significantly lower than for typical cases (48.4±23.6% versus 84.2±14.7%, p < .01) (Fig 3C). The D5 sensitivity for atypical MI cases was significantly lower in five symptom checkers.

Subgroup analysis: Symptomology stratified by sex.

There were no significant differences in mean D1 (30.7±21.7% versus 51.7±36.4%, p = .187) and D3 (54.6±24.8% versus 79.3±24.1%, p = .063) sensitivity between male patients with atypical and typical MIs (Fig 4A and 4B). One symptom checker was unable to diagnose any male atypical MI cases as the primary (D1) diagnosis. Mean D5 sensitivity for males with atypical MI was significantly lower than for males with typical MI (59.1±27.0% versus 86.0±15.1%, p < .05) (Fig 4C); this difference was evident in four symptom checkers.

Fig 4. Symptom checker D1, D3, D5 and Triage sensitivity for typical and atypical cases in both males (blue, n = 65 and pale blue, n = 11, respectively) and females (orange, n = 19 and pale orange, n = 5, respectively).

Note. Data labels have been included. Mean sensitivity for D1, D3, D5 and Triage has been included; error bars represent standard deviation. Overall differences in sensitivity between typical and atypical cases were assessed using a t-test for each sensitivity measure. Significance within each symptom checker was assessed using Pearson’s chi-squared test for each sensitivity measure. Levels of significance are shown with asterisks (* p < .05, ** p < .01 and *** p < .001). One symptom checker (SC4) was removed from the Triage bar chart due to its lack of a triage function.

https://doi.org/10.1371/journal.pdig.0000558.g004

Mean D1 sensitivity for females with atypical MI was significantly lower than for females with typical MI (10.0±10.7% versus 55.3±29.8%, p < .01). Four symptom checkers were unable to diagnose any female atypical MI cases as the primary diagnosis. Mean D3 sensitivity for females with atypical MI was significantly lower than for females with typical MI (20.0±15.1% versus 74.3±21.5%, p < .001). Mean D5 sensitivity for females with atypical MI was also significantly lower (25.0±20.7% versus 78.3±15.5%, p < .001); this difference was seen in five symptom checkers. Two symptom checkers were unable to diagnose any female atypical MI cases within the top three or top five diagnoses.

Subgroup analysis: Age.

Stratified by decade of age, there were statistically significant differences within symptom checkers for D1, D3 and D5 sensitivity (p < .05, S2 Fig). Mean D1 sensitivity was highest for users aged 50–70 years (50.6–52.1%), while sensitivity was lowest for users aged 20–30 years and over 90 years (37.5% and 0%, respectively). In four symptom checkers, age was a statistically significant factor for D1 sensitivity, with poorer performance at the extremes of age. Mean D3 sensitivity was highest for those aged 51–60 years (78.9%), although this decreased at older ages and reached a nadir of 0% for those aged 90 and above. At the younger extreme, D3 sensitivity was lower for those aged 30 and below (37.5%). Age was a significant factor for D3 sensitivity in six of the eight symptom checkers tested (p < .05). Mean D5 sensitivity was highest for those aged 41–50 and 51–60 years (82.2% and 84.4%, respectively). Most notably, D1, D3 and D5 sensitivity was 0% across all symptom checkers for those aged 90 and above, while a similar pattern was noted for those aged 30 and below in five of the eight symptom checkers.

Triage sensitivity

The overall mean triage sensitivity was 82.6±12.6%, ranging from 55.0% to 91.0% between symptom checkers.

Subgroup analysis: Sex.

Mean triage sensitivity did not differ significantly between males and females (84.6±13.0% versus 76.2±13.1%, p = .25) (Fig 2D). One symptom checker had significantly higher triage sensitivity for males than for females (96.1% versus 75.0%, p < .01).

Subgroup analysis: Symptomology (atypical versus typical MI presentation).

Mean triage sensitivity for atypical cases was significantly lower than for typical cases (52.7±20.0% versus 88.3±13.4%, p < .01) (Fig 3D). The triage sensitivity for atypical MI cases was significantly lower in four symptom checkers.

Subgroup analysis: Symptomology stratified by sex.

Mean triage sensitivity was significantly lower for males with atypical MI than for males with typical MI (61.0±21.5% versus 88.6±14.2%, p < .05) (Fig 4D); this difference was significant in five symptom checkers. Mean triage sensitivity was also significantly lower for females with atypical MI than for females with typical MI (34.3±36.0% versus 87.2±10.9%, p < .01), again significant in five symptom checkers. Two symptom checkers were unable to provide appropriate triage advice for any female cases with atypical MI.

Discussion

This study evaluated the performance of symptom checkers for correctly diagnosing and appropriately triaging myocardial infarction. The mean overall sensitivity across symptom checkers for providing MI as the first diagnosis was poor, with significant variation between symptom checkers. Variability in providing appropriate triage advice was also noted. The main factors that affected the performance of symptom checkers were an atypical pattern of presenting symptoms and sex. MI cases with atypical presenting symptoms had significantly lower triage sensitivity than typical MIs. Furthermore, females with atypical MI presentations were significantly under-diagnosed and under-triaged compared with those presenting with a pattern more typically seen in MIs.

In this study, a primary diagnosis of MI was missed in approximately one in every two cases, a strikingly low sensitivity for such a life-threatening condition. Given the relative lack of transparency in how these symptom checkers are constructed and validated prior to public rollout, it is difficult to pinpoint the particular factors which adversely impact performance. Several included symptom checkers claim to house “AI algorithms” as part of their diagnostic processes. AI-related factors that can negatively impact performance include (1) training data that are limited (lacking the ‘4 Vs’ of big data: volume, variety, velocity and veracity), (2) algorithms with limited accuracy (due to poor choice of predictive factors and poor external validity), (3) a poor user interface that prevents users from navigating the software to best utilise its features, and (4) inappropriate intended use of the software. These performance factors represent potential avenues of development for health technologists and developers (discussed further in this paper’s Recommendations). However, it is currently unclear in what capacity AI is used in the decision-making process of the included symptom checkers, if at all. Indeed, symptom checkers have been reported to use hand-coded decision trees, using AI only for natural language processing to help interpret patient prompts [26,27]. Full transparency is needed in identifying the exact role of AI, if any, in these patient-facing applications.

The observed variation in performance between different public-facing symptom checkers suggests a need for strict evaluation and robust regulation, given their potentially vast audience and the life-threatening consequences of misdiagnosis [5,23]. It is important that the evidence produced by healthcare researchers and used by regulators and policymakers is transparent and complete, in order to allow fair health technology assessments. Key factors that need to be accounted for include clinical context, reporting of reference standards, and case coverage (i.e., which conditions and patient populations are accounted for). These reporting and evidential tenets are reinforced by guideline initiatives produced by the EQUATOR Network, which direct the reporting of such key study characteristics for AI-centred diagnostic accuracy studies and clinical trials respectively [28,29].

The observed primary diagnostic sensitivity for detecting MI in our study was higher than in previous vignette studies examining symptom checkers for other common medical conditions [8,30,31]. This could be partly attributed to the heterogeneous nature of chest pain as a presenting complaint, which generates a wide list of differential diagnoses. While some of these could be MI, previous work has reported that up to 50% of patients evaluated for chest pain were diagnosed with non-specific chest pain, of whom 56% had persistent non-cardiac pain [32]. Given such a high prevalence, differentiating between imminently life-threatening cardiac pathology, less urgent non-cardiac conditions, and non-urgent non-specific chest pain is difficult without further clinical investigation. This links back to the data that can be inputted into symptom checkers. The traditional Osler’s model of the diagnostic process consists of history and examination before further investigations. Symptom checkers currently do not incorporate equivalent informatic inputs for chest pain. However, the recent increase in the use of smart watches and phones that can collect vital signs such as heart rate, blood pressure and oxygen saturation is an avenue for symptom checkers to expand into, and may strengthen their diagnostic and triage accuracy, although this will rely on extremely robust data collection and require regulatory oversight.

Our study highlighted significant differences in diagnostic and triage sensitivity attributable to patient demographics, in particular lower accuracy for females regardless of symptomology. Symptom checkers significantly under-diagnosed females in at least one diagnostic performance measure. Although previous epidemiological studies have shown that the overall incidence of MI is lower in women than in men [33], MI in women may also present with less pathognomonic symptomology [13,25]. Our findings suggest that the current algorithms utilised by the symptom checkers in our study are not sufficiently robust to discern between these presentations, which may intrinsically generate a bias against women who use these symptom checkers. Regarding ethnicity, although differences in MI incidence by ethnicity have previously been noted [34,35], this study did not assess this variable. Ethnicity is likely to be a highly relevant risk factor which should be considered, especially in the context of MI, where atypical presentations of chest pain are more prevalent in non-Caucasian populations [36].

Our study demonstrated that age affected the diagnostic performance of symptom checkers. Age is an important determinant in the risk stratification of patients presenting with symptoms of MI. Age is frequently incorporated in (1) scoring the likelihood of MI as the diagnosis, (2) predicting the outcome of treatment, (3) determining patient eligibility for or benefit from interventions, and (4) estimating the risk of short- and long-term complications. Age was a well-established risk factor for MI even before the often-cited Framingham Study. Surprisingly, in our work, diagnostic performance was lower for ages above 60 in some symptom checkers. In scoring systems such as the TIMI and GRACE scores, age above 65 receives a higher score, reflecting a higher likelihood of MI and a greater risk of mortality and complications. This is in concordance with previous work, which has highlighted a similar age-related bias, albeit in different medical conditions. Given that older patients are already predisposed to poorer health outcomes with digital technology for various reasons, including lower digital health literacy, intrinsic age-related algorithmic biases could exacerbate these outcomes [37].

Similar biases may also be present across other socioeconomic and cultural demographic groups. Particularly in light of the COVID-19 pandemic, which has already disproportionately affected disadvantaged communities, digital health technologies, including those incorporating AI, risk exacerbating existing health inequalities [38]. A more inclusive strategy is needed so that these disparities are not further exacerbated [39]. One source of imbalance is ‘algorithmic bias’, the application of an algorithm that exacerbates existing inequalities (e.g., in socioeconomic status, ethnic background, sex and disability). Furthermore, this is compounded by questions regarding the clinical translatability and reproducibility of AI healthcare research, which together raise concerns about clinical safety [40]. Policy and decision makers, health technologists, and health officials need to be aware of these potential pitfalls.

Recommendations

Policy makers and regulators.

Our study emphasises the need for an urgent reassessment of how symptom checkers are regulated, one that reflects the clinical function and associated risk these systems serve in the real world as a public-facing source of medical advice [23,41]. Currently, regulatory authorities such as the Medicines and Healthcare products Regulatory Agency (MHRA), the body responsible for medicines and medical devices in the UK, classify Software as a Medical Device (including apps) into one of four groups (Class I, IIa, IIb and III) based on a risk classification system. The significance of the classification rules is that Class IIa and above require conformity assessment with a notified body (CE) or approved body (UKCA), whereas Class I medical devices are self-registered by their manufacturer and are therefore only subject to self-certification prior to roll-out. In this study, most symptom checkers had a Class I MHRA classification and were thus only subject to self-certification prior to roll-out. All of the symptom checkers had disclaimers in their terms of use that they are “not designed to provide medical diagnosis, advice, or treatment”, and that patients should “discuss all information…with (their) physicians before making any medical decisions, including starting, stopping or modifying any medication or other treatment or care plan”. This shifts the responsibility for appropriately using the apps to a lay public audience who do not necessarily have the expertise for this. Despite the disclaimers, these systems clearly provide users with information relevant to managing disease or injury, as well as behavioural recommendations, and can be viewed as providing an “indicative diagnosis” in the context of lay usage. Software that provides “decisive information for making or allowing diagnosis” is sufficient for the device to be at least Class IIa under the UK Medical Device Regulations 2002, according to MHRA guidance [42]. Regulatory scrutiny should encompass the entire development pathway, including (1) the datasets used to train and validate the systems, (2) the clinical context in which evaluations were undertaken, (3) reporting of diagnostic and triage performance in the intended clinical setting of use, and (4) evaluation of public understanding of the intended purpose of these systems (as this will inform how lay users utilise symptom checkers). This could be aided by independent expert review of these systems, which will occur where the software qualifies as Class IIa and above (in the EU or UK) and where certain FDA pathways are used [43]. Finally, ensuring that symptom checkers qualify as medical devices and are appropriately classified would allow greater post roll-out scrutiny, superseding the current status quo of anecdotal reports through media sources.

Health technologists.

Robust development, validation and testing of symptom checkers are needed to account for potential biases and inequities at every stage of the health technology development process. First, development of symptom checkers that utilise an AI model in their decision-making processes should ideally incorporate real-world patient data. Moreover, larger and more diverse patient datasets should be used, and algorithms should subsequently be validated on external datasets to assure their generalisability [44,45]. The importance of using real-world data has been emphasised by regulatory bodies, as seen in the introduction of the “Real-World Evidence” initiative by the FDA [46]. Second, ensuring transparent and complete reporting of the datasets used will be crucial in increasing accountability, as evidenced by the recently proposed “Healthsheet” initiative, which sets healthcare-specific standards for dataset reporting [47]. Efforts such as these to improve reporting and transparency will aid further scrutiny from regulators, policymakers, hosts of apps (app stores such as the Apple App Store and Google Play) and health data institutions. Third, the roles that app stores should play in the future, including whether they should assume responsibility for validating claims as a medical device or otherwise, are as yet unclear. However, app stores should actively engage in discussions about their role in the medical ecosystem and how they can usefully contribute to protecting patient and public safety [39]. Fourth, where diagnostic dilemmas and uncertainties are common, this is not always indicated in the apps’ output, and developers could be more open about this. In the same vein, integrating these applications into the care pathway would mean patients can be safely redirected to the appropriate components of the clinical pathway where diagnostic uncertainty is present. Finally, after deployment of a symptom checker, continuous audit should be performed. This will help identify potential inequitable impacts of the software on different population groups and highlight areas of bias and other negative effects in a timely manner so that they can be addressed [39].

Patients and the public.

Conversely, increased patient and public awareness, through education on the appropriate use and limitations of these increasingly prevalent technologies, is crucial. However, patient education is hampered by the digital divide, which is driven by factors such as age, ethnicity, sex, state of health and socioeconomic status. A recent YouGov survey conducted by the British Association of Dermatologists showed that although 41% of the UK public would trust a diagnosis from a smartphone-based app to diagnose skin cancer, over 50% were not confident in their ability to appraise the quality of such apps (i.e., to determine whether an app can do what it claims) [48]. The inability to independently and confidently recognise and understand the quality of these apps may expose patients to using poor-quality apps, which may worsen their health outcomes. Therefore, additional drivers such as health and technology literacy must also be addressed. Initiatives that have previously targeted this include NHS England’s Widening Digital Participation programme, which has trained over 220,000 people, including those most vulnerable to digital exclusion, to use digital health resources [49]. Patient-driven evaluation of symptom checkers and patient-reported outcomes are also important to better understand patient acceptability and the usability of the software, and could be achieved by engaging diverse patients and public members in the development process.

Limitations

First, the present study utilised a relatively small sample size of 100 consecutive patients from a single centre, which may limit the generalisability of the findings. Additionally, power was limited for cohort subgroups (such as female atypical cases). We also found ethnicity to be poorly coded, with large proportions of ‘unknown’ ethnicities and underrepresentation of some ethnic groups. Second, data in the present study were retrospectively collected from existing routine health records. This represents a substantial limitation, as data quality relies on the accuracy and completeness of clinical notes and may not wholly represent patients’ first-hand symptoms. Additionally, answers given to questions asked by clinicians may differ from the answers patients would have given to a symptom checker. Third, the data inputted included only patients with a confirmed diagnosis of MI, which represents a selection bias. This also meant that true negatives (i.e., patients with chest pain symptomology but no MI) were not included. Fourth, two of the included symptom checkers were unable to produce a diagnosis of MI, so their most relevant diagnosis was accepted, resulting in the inclusion of a broader range of diagnoses such as unstable angina. This may have exaggerated the true sensitivity of these two symptom checkers. Fifth, data collection and statistical analysis were conducted in mid-2021. Given that this area of medical technology is ever-developing, it is likely that these symptom checkers have received updates since the initial data collection, and the MHRA classification of the symptom checkers may also have changed; our results might therefore not represent the current performance of these devices. Finally, performance was determined based only on cases with an established ground truth (angiography), so while we were able to evaluate the sensitivity of symptom checkers, our analysis did not capture their specificity. Comparison against first port-of-call healthcare professionals (e.g., a GP or A&E practitioner) would improve understanding of the utility of symptom checkers.

Conclusions

This study evaluated the performance of eight symptom checkers for diagnosing and triaging myocardial infarction. Symptom checkers generally provided low sensitivity for diagnosing MI. Triage sensitivity was higher than diagnostic sensitivity, although approximately 20% of cases were under-triaged. Patients who presented with atypical symptoms were under-diagnosed and under-triaged, especially those who were female. Our findings emphasise the need for urgent reassessment of the regulation of symptom checkers by policy makers and regulators to ensure good clinical performance whilst maintaining patient equity and technology transparency.

Materials and methods

Study design

This retrospective diagnostic accuracy study was conducted using anonymised “real world” patient vignettes to assess symptom checker diagnostic and triage sensitivity for confirmed cases of MI. This study has been reported in accordance with the Standards for Reporting Diagnostic Accuracy Studies (STARD) 2015 guidelines [50]. Ethical approval was granted by the Imperial College London Research Ethics Committee (reference: 22/HRA/1824).

Participants

One hundred consecutive anonymised vignettes of patients treated for a confirmed MI at a tertiary primary percutaneous coronary intervention centre in London were collected between 1 January 2020 and 31 December 2020. Patients were retrospectively identified by the clinical team on the coronary care unit and cardiology ward. All consecutive adult patients with a confirmed diagnosis of MI within the time period were included. Ground truth was established using coronary angiography findings. Patient demographics, relevant past medical history and presenting symptomatology were extracted from electronic health records and verified by two independent investigators. Patients were coded as male or female in the electronic health system, which was assumed to be biological sex. The type of MI presentation (typical or atypical) for each case was also noted. Atypical MI was defined, as per traditional teaching, by burning chest pain or a lack of chest pain.

Intervention

A Google internet and Apple App Store search was conducted (15 March 2021) to identify symptom checkers for inclusion in this study. The first 100 results on Google were recorded, resulting in 49 potential symptom checkers being identified (S1 Fig). The top 20 most commonly downloaded results on the Apple App Store yielded 11 potential symptom checkers. Eligible symptom checkers had to be free at the point of use, publicly available for use in the United Kingdom and in the English language. Symptom checkers were excluded if diagnoses were not provided, if only a limited number of conditions could be diagnosed, or if they were not in the English language.

Test methods

Biodata (age and sex) and presenting symptoms of the 100 consecutive cases were entered into each symptom checker website or app by two independent study investigators. This simulated the results each patient would have received had they used the symptom checker to query their own symptoms at the onset of MI. The top five diagnoses given by each symptom checker for each case were recorded. The position of MI in the output was used to measure diagnostic sensitivity: whether it was first in the list of possible conditions (D1), within the first three (D3), or within the first five (D5) results. If the symptom checker failed to provide a final diagnosis, this was recorded as an incorrect response. Two of the symptom checkers (SC2 and SC5) were unable to diagnose MI as a stand-alone condition; the former gave the diagnosis of a serious heart problem, while the latter diagnosed ischaemic heart disease (Table 1). Both results were deemed correct, as each was the most serious cardiovascular condition the checker could output and would include MI. Advice was considered triage in nature if the symptom checker provided an assessment of the urgency of the underlying diagnosis and/or advice on next steps, including self-care; re-evaluation of symptoms after an interval; seeking medical help from a GP; or attending the ED. Triage advice was deemed correct if the symptom checker recommended contacting emergency services, or gave the most urgent triage advice it could produce. All other advice, such as seeking GP advice or self-care, was deemed inappropriate and recorded as incorrect. If a symptom checker reached an incorrect diagnosis but still correctly triaged the case, the triage advice was recorded as correct. Data collection was performed in April 2021 using the most up-to-date version of each symptom checker website or application and was collated in Excel (Microsoft Corporation).
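As a minimal sketch of the scoring rules described above (the study's data were collated in Excel, not scored programmatically), the logic could be expressed as follows. The field names and label sets (`top5`, `triage`, `MI_LABELS`, `URGENT_ADVICE`) are hypothetical illustrations, not the study's actual coding scheme.

```python
# Hypothetical sketch of the D1/D3/D5 and triage scoring rules (Python 3.10+).
from dataclasses import dataclass

# Labels accepted as a correct MI diagnosis; SC2's "serious heart problem" and
# SC5's "ischaemic heart disease" were accepted as the closest available outputs.
MI_LABELS = {"myocardial infarction", "heart attack",
             "serious heart problem", "ischaemic heart disease"}
# Only the most urgent triage advice counts as correct; GP/self-care does not.
URGENT_ADVICE = {"call emergency services", "call 999", "go to a&e now"}

@dataclass
class CheckerOutput:
    top5: list[str]     # up to five ranked diagnoses, most likely first
    triage: str | None  # None for a checker with no triage function (e.g. SC4)

def score_case(out: CheckerOutput) -> dict[str, bool]:
    """Score one case; an empty diagnosis list is scored as incorrect."""
    hits = [d.lower() in MI_LABELS for d in out.top5]
    return {
        "D1": any(hits[:1]),  # MI given as the primary diagnosis
        "D3": any(hits[:3]),  # MI within the top three diagnoses
        "D5": any(hits[:5]),  # MI within the top five diagnoses
        # Triage is correct if the most urgent advice was produced,
        # even when the diagnosis itself was wrong.
        "triage": out.triage is not None and out.triage.lower() in URGENT_ADVICE,
    }

def sensitivity(cases: list[CheckerOutput], measure: str) -> float:
    """Per-checker sensitivity (%) for one measure across all cases."""
    return 100 * sum(score_case(c)[measure] for c in cases) / len(cases)
```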

Analysis

Statistical tests were performed using SPSS Statistics 27 (IBM Corporation). All data were confirmed to be normally distributed using Kolmogorov-Smirnov tests. Two-tailed t-tests were performed to assess overall between-sex differences in sensitivity. Pearson's chi-squared tests were used for all other subgroup analyses. Statistical significance was defined as p < .05.
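The analysis was run in SPSS; purely as an illustration, the equivalent workflow can be sketched in Python with SciPy. All numbers below are hypothetical placeholders, not study data.

```python
# Illustrative sketch of the reported statistical workflow using SciPy.
import numpy as np
from scipy import stats

# Hypothetical per-symptom-checker D5 sensitivities (%), one value per checker
male_sens = np.array([94.0, 88.0, 85.0, 79.0, 76.0, 72.0, 70.0, 63.0])
female_sens = np.array([83.0, 75.0, 71.0, 68.0, 65.0, 60.0, 58.0, 54.0])

# Normality check: Kolmogorov-Smirnov test against a fitted normal distribution
for label, x in (("male", male_sens), ("female", female_sens)):
    _, p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
    print(f"{label}: KS p = {p:.3f}")  # p > .05 -> consistent with normality

# Overall between-sex comparison: two-tailed independent-samples t-test
t, p = stats.ttest_ind(male_sens, female_sens)
print(f"t = {t:.2f}, p = {p:.3f}")

# Within a single symptom checker: Pearson's chi-squared test on a 2x2 table
# of correct/incorrect counts (males n = 76, females n = 24, hypothetical).
# correction=False gives the uncorrected Pearson statistic.
table = np.array([[68, 8],    # males: correct, incorrect
                  [17, 7]])   # females: correct, incorrect
chi2, p, dof, _ = stats.chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```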

Supporting information

S1 Fig. Flow chart showing symptom checker selection process.

https://doi.org/10.1371/journal.pdig.0000558.s001

(TIFF)

S2 Fig. Symptom checker D1 (A), D3 (B) and D5 (C) performance for decade-wise age categories.

https://doi.org/10.1371/journal.pdig.0000558.s002

(TIFF)

Acknowledgments

Infrastructure support for this research was provided by the NIHR Imperial Biomedical Research Centre.

References

  1. Mueller J, Jay C, Harper S, Davies A, Vega J, Todd C. Web use for symptom appraisal of physical health conditions: A systematic review. J Med Internet Res. 2017;19: e202. pmid:28611017
  2. Hochberg I, Allon R, Yom-Tov E. Assessment of the frequency of online searches for symptoms before diagnosis: Analysis of archival data. J Med Internet Res. 2020;22: e15065. pmid:32141835
  3. Berry AC. Online symptom checker applications: Syndromic surveillance for international health. Ochsner Journal. 2018;18: 298–299. pmid:30559611
  4. McIntyre D, Chow CK. Waiting Time as an Indicator for Health Services Under Strain: A Narrative Review. Inquiry. 2020;57: 004695802091030. pmid:32349581
  5. Gottliebsen K, Petersson G. Limited evidence of benefits of patient operated intelligent primary care triage tools: Findings of a literature review. BMJ Health Care Inform. 2020;27: e100114. pmid:32385041
  6. McHale P, Wood S, Hughes K, Bellis MA, Demnitz U, Wyke S. Who uses emergency departments inappropriately and when—a national cross-sectional study using a monitoring data system. BMC Med. 2013;11: 258. pmid:24330758
  7. Missed GP appointments costing NHS millions. In: NHS England [Internet]. 2019 [cited 5 Aug 2021]. Available: https://www.england.nhs.uk/2019/01/missed-gp-appointments-costing-nhs-millions/
  8. Semigran HL, Linder JA, Gidengil C, Mehrotra A. Evaluation of symptom checkers for self diagnosis and triage: Audit study. BMJ. 2015;351: h3480. pmid:26157077
  9. NHS 111 powered by Babylon: Outcomes evaluation. 2017 [cited 5 Aug 2021]. Available: https://assets.babylonhealth.com/nhs/NHS-111-Evaluation-of-outcomes.pdf
  10. Das S. It’s hysteria, not a heart attack, GP app Babylon tells women. In: The Sunday Times [Internet]. 2019 [cited 10 Mar 2021]. Available: https://www.thetimes.co.uk/article/its-hysteria-not-a-heart-attack-gp-app-tells-women-gm2vxbrqk#
  11. Adegoke Y. ’Calm down dear, it’s only an aneurysm’–why doctors need to take women’s pain seriously. In: The Guardian [Internet]. 2019 [cited 15 Mar 2021]. Available: https://www.theguardian.com/commentisfree/2019/oct/16/doctors-women-pain-heart-attack-hysteria
  12. Greenslade JH, Cullen L, Parsonage W, Reid CM, Body R, Richards M, et al. Examining the signs and symptoms experienced by individuals with suspected acute coronary syndrome in the Asia-Pacific region: A prospective observational study. Ann Emerg Med. 2012;60: 777–785.e3. pmid:22738683
  13. DeVon HA, Mirzaei S, Zègre-Hemsey J. Typical and Atypical Symptoms of Acute Coronary Syndrome: Time to Retire the Terms? J Am Heart Assoc. 2020;9: e015539. pmid:32208828
  14. Boateng S, Sanborn T. Acute myocardial infarction. Disease-a-Month. 2013;59: 83–96. pmid:23410669
  15. Canto JG. Prevalence, Clinical Characteristics, and Mortality Among Patients With Myocardial Infarction Presenting Without Chest Pain. JAMA. 2000;283: 3223. pmid:10866870
  16. Canto JG, Rogers WJ, Goldberg RJ, Peterson ED, Wenger NK, Vaccarino V, et al. Association of Age and Sex With Myocardial Infarction Symptom Presentation and In-Hospital Mortality. JAMA. 2012;307: 813–822. pmid:22357832
  17. Saczynski JS, Yarzebski J, Lessard D, Spencer FA, Gurwitz JH, Gore JM, et al. Trends in Prehospital Delay in Patients With Acute Myocardial Infarction (from the Worcester Heart Attack Study). American Journal of Cardiology. 2008;102: 1589–1594. pmid:19064010
  18. De Luca G, Suryapranata H, Marino P. Reperfusion Strategies in Acute ST-Elevation Myocardial Infarction: An Overview of Current Status. Prog Cardiovasc Dis. 2008;50: 352–382. pmid:18313480
  19. Nallamothu BK, Bradley EH, Krumholz HM. Time to Treatment in Primary Percutaneous Coronary Intervention. N Engl J Med. 2007;357: 1631–1638. pmid:17942875
  20. Rashid M, Gale CP, Curzen N, Ludman P, De Belder M, Timmis A, et al. Impact of Coronavirus Disease 2019 Pandemic on the Incidence and Management of Out-of-Hospital Cardiac Arrest in Patients Presenting With Acute Myocardial Infarction in England. J Am Heart Assoc. 2020;9: e018379. pmid:33023348
  21. Mesnier J, Cottin Y, Coste P, Ferrari E, Schiele F, Lemesle G, et al. Hospital admissions for acute myocardial infarction before and after lockdown according to regional prevalence of COVID-19 and patient profile in France: a registry study. Lancet Public Health. 2020;5: e536–e542. pmid:32950075
  22. De Rosa S, Spaccarotella C, Basso C, Calabrò MP, Curcio A, Filardi PP, et al. Reduction of hospitalizations for myocardial infarction in Italy in the COVID-19 era. Eur Heart J. 2020;41: 2083–2088. pmid:32412631
  23. Fraser H, Coiera E, Wong D. Safety of patient-facing digital symptom checkers. The Lancet. 2018;392: 2263–2264. pmid:30413281
  24. IMDRF SaMD Working Group. Software as a Medical Device (SaMD): Key Definitions. 9 Dec 2013 [cited 26 Aug 2021]. Available: https://www.imdrf.org/sites/default/files/docs/imdrf/final/technical/imdrf-tech-131209-samd-key-definitions-140901.pdf
  25. Millett ERC, Peters SAE, Woodward M. Sex differences in risk factors for myocardial infarction: cohort study of UK Biobank participants. BMJ. 2018;363: k4247. pmid:30404896
  26. Nicol-Schwarz K. The rise—and fall—of Babylon. In: Sifted [Internet]. 18 Oct 2023 [cited 26 Jan 2024]. Available: https://sifted.eu/articles/the-rise-and-fall-of-babylon
  27. Middleton K, Butt M, Hammerla N, Hamblin S, Mehta K, Parsa A. Sorting out symptoms: design and evaluation of the “babylon check” automated triage system. 2016 Jun. Available: https://arxiv.org/abs/1606.02041
  28. Sounderajah V, Ashrafian H, Aggarwal R, De Fauw J, Denniston AK, Greaves F, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group. Nature Medicine. 2020;26: 807–808. https://doi.org/10.1038/s41591-020-0941-1 pmid:32514173
  29. Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK, Chan AW, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nature Medicine. 2020;26: 1364–1374. pmid:32908283
  30. Gilbert S, Mehl A, Baluch A, Cawley C, Challiner J, Fraser H, et al. How accurate are digital symptom assessment apps for suggesting conditions and urgency advice? A clinical vignettes comparison to GPs. BMJ Open. 2020;10: e040269. pmid:33328258
  31. Hill MG, Sim M, Mills B. The quality of diagnosis and triage advice provided by free online symptom checkers and apps in Australia. Med J Aust. 2020;212: 514–519. pmid:32391611
  32. Glombiewski JA. The Course of Nonspecific Chest Pain in Primary Care. Arch Intern Med. 2010;170: 251. pmid:20142569
  33. Kytö V, Sipilä J, Rautava P. Gender, age and risk of ST segment elevation myocardial infarction. Eur J Clin Invest. 2014;44: 902–909. pmid:25175007
  34. Chi GC, Kanter MH, Li BH, Qian L, Reading SR, Harrison TN, et al. Trends in acute myocardial infarction by race and ethnicity. J Am Heart Assoc. 2020;9. pmid:32114888
  35. Bansal N, Fischbacher CM, Bhopal RS, Brown H, Steiner MFC, Capewell S. Myocardial infarction incidence and survival by ethnic group: Scottish Health and Ethnicity Linkage retrospective cohort study. BMJ Open. 2013;3: e003415. pmid:24038009
  36. Teoh M, Lalondrelle S, Roughton M, Grocott-Mason R, Dubrey SW. Acute coronary syndromes and their presentation in Asian and Caucasian patients in Britain. Heart. 2007;93: 183. pmid:16914486
  37. Morse KE, Ostberg NP, Jones VG, Chan AS, et al. Use Characteristics and Triage Acuity of a Digital Symptom Checker in a Large Integrated Health System: Population-Based Descriptive Study. J Med Internet Res. 2020;22: e20549. pmid:33170799
  38. Leslie D, Mazumder A, Peppin A, Wolters MK, Hagerty A. Does “AI” stand for augmenting inequality in the era of covid-19 healthcare? BMJ. 2021;372. pmid:33722847
  39. O’Brien N, Van Dael J, Clarke J, Gardner C, O’Shaughnessy J, Darzi A, et al. Addressing racial and ethnic inequities in data-driven health technologies. London; 2022. https://doi.org/10.25561/94902
  40. Challen R, Denny J, Pitt M, Gompels L, Edwards T, Tsaneva-Atanasova K. Artificial intelligence, bias and clinical safety. BMJ Qual Saf. 2019;28: 231–237. pmid:30636200
  41. Ceney A, Tolond S, Glowinski A, Marks B, Swift S, Palser T. Accuracy of online symptom checkers and the potential impact on service utilisation. PLoS One. 2021;16: e0254088. pmid:34265845
  42. Medical devices: software applications (apps). In: GOV.UK [Internet]. 8 Aug 2014 [cited 21 Apr 2024]. Available: https://www.gov.uk/government/publications/medical-devices-software-applications-apps
  43. Software as a Medical Device (SaMD). In: US Food & Drug Administration [Internet]. 2018 [cited 5 Jun 2022]. Available: https://www.fda.gov/medical-devices/digital-health-center-excellence/software-medical-device-samd
  44. Silverio A, Cavallo P, De Rosa R, Galasso G. Big health data and cardiovascular diseases: A challenge for research, an opportunity for clinical care. Front Med (Lausanne). 2019;6: 36. pmid:30873409
  45. Weintraub WS. Role of Big Data in Cardiovascular Research. J Am Heart Assoc. 2019;8. pmid:31293194
  46. Real-World Evidence. In: US Food & Drug Administration [Internet]. 20 May 2022 [cited 3 Jun 2022]. Available: https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence
  47. Rostamzadeh N, Mincu D, Roy S, Smart A, Wilcox L, Pushkarna M, et al. Healthsheet: Development of a Transparency Artifact for Health Datasets. J ACM. 2022;37: 29.
  48. Doctors issue warning about dangerous AI-based diagnostic skin cancer apps. In: British Association of Dermatologists [Internet]. 2022 [cited 3 Aug 2022]. Available: https://www.bad.org.uk/doctors-issue-warning-about-dangerous-ai-based-skin-cancer-apps/
  49. Estacio EV, Whittle R, Protheroe J. The digital divide: Examining socio-demographic factors associated with health literacy, access and use of internet to seek health information. J Health Psychol. 2019;24: 1668–1675. pmid:28810415
  50. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, et al. STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351: h5527. pmid:26511519