A population-based study exploring phenotypic clusters and clinical outcomes in stroke using unsupervised machine learning approach

Ralph K. Akyea; George Ntaios; Evangelos Kontopantelis; Georgios Georgiopoulos; Daniele Soria; Folkert W. Asselbergs; Joe Kai; Stephen F. Weng; Nadeem Qureshi

doi:10.1371/journal.pdig.0000334

Abstract

Individuals developing stroke have varying clinical characteristics, demographic, and biochemical profiles. This heterogeneity in phenotypic characteristics can impact on cardiovascular disease (CVD) morbidity and mortality outcomes. This study uses a novel clustering approach to stratify individuals with incident stroke into phenotypic clusters and evaluates the differential burden of recurrent stroke and other cardiovascular outcomes. We used linked clinical data from primary care, hospitalisations, and death records in the UK. A data-driven clustering analysis (kamila algorithm) was used in 48,114 patients aged ≥ 18 years with incident stroke, from 1-Jan-1998 to 31-Dec-2017 and no prior history of serious vascular events. Cox proportional hazards regression was used to estimate hazard ratios (HRs) for subsequent adverse outcomes, for each of the generated clusters. Adverse outcomes included coronary heart disease (CHD), recurrent stroke, peripheral vascular disease (PVD), heart failure, CVD-related and all-cause mortality. Four distinct phenotypes with varying underlying clinical characteristics were identified in patients with incident stroke. Compared with cluster 1 (n = 5,201, 10.8%), the risk of composite recurrent stroke and CVD-related mortality was higher in the other 3 clusters (cluster 2 [n = 18,655, 38.8%]: hazard ratio [HR], 1.07; 95% CI, 1.02–1.12; cluster 3 [n = 10,244, 21.3%]: HR, 1.20; 95% CI, 1.14–1.26; and cluster 4 [n = 14,014, 29.1%]: HR, 1.44; 95% CI: 1.37–1.50). Similar trends in risk were observed for composite recurrent stroke and all-cause mortality outcome, and subsequent recurrent stroke outcome. However, results were not consistent for subsequent risk in CHD, PVD, heart failure, CVD-related mortality, and all-cause mortality. 
In this proof-of-principle study, we demonstrated how a heterogeneous population of patients with incident stroke can be stratified into four relatively homogeneous phenotypes with differential risk of recurrent and major cardiovascular outcomes. This offers an opportunity to revisit the stratification of care for patients with incident stroke to improve patient outcomes.

Author summary

Using an unsupervised machine learning cluster analysis approach, adult patients with incident stroke were grouped into four clinically meaningful phenotypic clusters based on their demographic, biochemical, comorbidities, and prescribed medication profiles at the time of incident stroke. The findings of this study highlight the significant heterogeneity that exists within patients with incident stroke with respect to subsequent cardiovascular morbidity and mortality outcomes. This offers an opportunity to revisit the stratification of care for patients with incident stroke to improve patient outcomes and highlights the potential to target modifiable characteristics in clusters for more targeted preventive intervention.

Citation: Akyea RK, Ntaios G, Kontopantelis E, Georgiopoulos G, Soria D, Asselbergs FW, et al. (2023) A population-based study exploring phenotypic clusters and clinical outcomes in stroke using unsupervised machine learning approach. PLOS Digit Health 2(9): e0000334. https://doi.org/10.1371/journal.pdig.0000334

Editor: Gilles Guillot, CSL Behring / Swiss Institute for Translational and Entrepreneurial Medicine (SITEM), SWITZERLAND

Received: March 12, 2023; Accepted: July 19, 2023; Published: September 13, 2023

Copyright: © 2023 Akyea et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. The data that support the findings of this study are available from Clinical Practice Research Datalink (CPRD) through a data request application process (https://cprd.com/data-access). Researchers can contact enquiries@cprd.com for more information.

Funding: RKA was funded by a National Institute for Health Research School for Primary Care Research (NIHR SPCR) PhD Studentship award, supervised by NQ, FWA, and JK. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: SFW has received independent research grant funding from AMGEN. NQ and SFW have previously received honorarium from AMGEN. RKA currently holds an NIHR-SPCR funded studentship (2018-2021). SFW is currently an employee of GSK. FWA is supported by UCL Hospitals NIHR Biomedical Research Centre. The remaining authors have no competing interests.

Introduction

Stroke is a leading cause of death and disability globally with a substantial economic cost due to treatment and post-stroke care [1]. Patients at time of incident stroke have varied clinical characteristics, demographics, and biochemical profiles. This heterogeneity in characteristics at time of incident stroke impacts on cardiovascular morbidity and mortality outcomes [2]. Phenotyping (subgrouping) people after incident stroke, in terms of the risk of various cardiovascular outcomes, could help direct better care to the individuals with the poorest prognosis. Such care could involve intensive secondary prevention strategies, including the use of novel medications such as proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors and colchicine, in patients at very high risk of adverse cardiovascular morbidity and mortality outcomes.

Cluster analysis, a hypothesis-free unsupervised machine learning data-driven approach, has been widely used to analyse clinical data to identify new phenotypic subgroups of complex and heterogeneous diseases including obstructive sleep apnoea [3], asthma [4,5], chronic obstructive pulmonary disease, chronic heart failure [6], dilated cardiomyopathy [7], sepsis [8], Parkinson’s disease [9], breast cancer [10], and diabetes [11]. This approach does not include outcome data, and may be less biased in its results, especially when using retrospectively collected data. Clustering of clinical data may, therefore, be helpful in identifying subgroups of patients with incident stroke and generating new hypotheses. Efforts to determine such phenotypic groups in patients with incident stroke remain limited.

Using a large population-based cohort of adult patients with incident stroke, the objectives of this study are: (i) to identify patterns in linked primary and secondary clinical data and cluster patients based on phenotypic similarities; (ii) to assess the association between phenotypic clusters and subsequent recurrent stroke or CVD-related mortality, recurrent stroke or all-cause mortality, coronary heart disease (CHD), recurrent stroke, peripheral vascular disease (PVD), heart failure, CVD-related mortality, and all-cause mortality.

Methods

Study design and data source

This prospective population-based cohort study used the UK Clinical Practice Research Datalink (CPRD) GOLD database of anonymised longitudinal primary care electronic health records [12], linked to secondary care hospitalisation data (Hospital Episode Statistics [HES]) [13], national mortality data (Office for National Statistics [ONS]) [14], and social deprivation data (Index of Multiple Deprivation (IMD) 2015) [15]. Patients included in the CPRD GOLD database, from a network of general practices across the UK, are representative of the UK general population in terms of sex, age, and ethnicity [12].

Study population

We identified a cohort of patients with incident non-fatal stroke in either primary care (CPRD GOLD) or secondary care (HES) between 1 January 1998 and 31 December 2017. Details about this cohort were previously reported [16]. Patients with a record of coronary heart disease (CHD), peripheral vascular disease (PVD), or heart failure before the incident stroke event were excluded. Patients were followed from the date of incident stroke diagnosis until they developed a major adverse cardiovascular event (MACE), died, ceased contributing data, or reached the last data collection date of the practice. The study flow diagram is shown in Fig 1.

Fig 1. Study flow diagram.

https://doi.org/10.1371/journal.pdig.0000334.g001

Outcomes

The primary outcome was a composite of recurrent stroke or CVD-related mortality event recorded after incident stroke from across the linked data sources (CPRD, HES or ONS registry). The secondary outcomes included: CHD, recurrent stroke, PVD, heart failure, CVD-related mortality, all-cause mortality, and the composite of recurrent stroke or all-cause mortality.

Subsequent outcomes within 30 days were considered to be representing or relating to the incident stroke event [16]. Analyses were, therefore, restricted to patients with subsequent outcomes occurring after 30 days of incident stroke.
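
The 30-day restriction described above can be sketched as a simple filter. This is a minimal illustration rather than the study's own code: the record layout, the function name `keep_for_analysis`, and the example patients are hypothetical. It also assumes that patients with no recorded outcome are retained and that an outcome on day 30 exactly counts as falling within the window, neither of which is stated in the text.

```python
from datetime import date, timedelta
from typing import Optional

LANDMARK = timedelta(days=30)  # window length taken from the text


def keep_for_analysis(stroke_date: date, first_outcome_date: Optional[date]) -> bool:
    """Return True if the patient stays in the analysis cohort.

    Assumptions (not stated in the source): patients without any recorded
    outcome are kept, and an outcome exactly 30 days after the incident
    stroke is treated as "within 30 days" and therefore excluded.
    """
    if first_outcome_date is None:
        return True
    return first_outcome_date - stroke_date > LANDMARK


# Hypothetical records, for illustration only.
patients = [
    {"id": "A", "stroke_date": date(2005, 3, 1), "first_outcome_date": date(2005, 3, 20)},
    {"id": "B", "stroke_date": date(2005, 3, 1), "first_outcome_date": date(2006, 1, 15)},
    {"id": "C", "stroke_date": date(2005, 3, 1), "first_outcome_date": None},
]

cohort = [
    p["id"]
    for p in patients
    if keep_for_analysis(p["stroke_date"], p["first_outcome_date"])
]
```

Here patient A, whose outcome falls 19 days after the incident stroke, is dropped, while B and C remain.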

Potential candidate variables for phenotyping

Based on availability in the electronic health records and established association with CVD, 336 candidate variables were selected. These included demographic data, vital signs, biochemical parameters, comorbid conditions, and prescribed medications (S1 Table). For vital signs and biochemical test results, the most recent values/records within 24 months before incident stroke were extracted. A prescription within 12 months before incident stroke was considered as a medication prescribed. All comorbid conditions were defined based on the latest record of a comorbid condition any time before incident stroke. All code lists used have been published and are available for download [17,18].
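
The look-back rules above can be illustrated with a short sketch. It is not the extraction code used in the study: the helper `latest_within`, the record layout, and the example values are hypothetical. The windows are approximated in days (730 for 24 months, 365 for 12 months), and a record dated on the day of the incident stroke is assumed not to count as "before" it.

```python
from datetime import date, timedelta


def latest_within(records, index_date, window_days):
    """Return the most recent (date, value) pair recorded in the
    `window_days` before `index_date`, or None if there is none.

    Assumption: the index date itself is excluded from the window.
    """
    start = index_date - timedelta(days=window_days)
    eligible = [r for r in records if start <= r[0] < index_date]
    return max(eligible, key=lambda r: r[0]) if eligible else None


# Hypothetical patient history, for illustration only.
index_date = date(2010, 6, 1)  # date of incident stroke
sbp_records = [
    (date(2006, 5, 1), 160.0),
    (date(2008, 1, 10), 150.0),
    (date(2009, 11, 2), 142.0),
]
statin_rx = [(date(2008, 2, 1), "simvastatin 40 mg")]

# Vital signs and biochemistry: most recent value within about 24 months.
baseline_sbp = latest_within(sbp_records, index_date, 730)

# Medication: counted as prescribed if any prescription within about 12 months.
on_statin = latest_within(statin_rx, index_date, 365) is not None
```

A variable with no record in its window comes back as `None`, which corresponds to the missing values handled in the data processing step.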

Data processing

The variable distributions and missingness were first assessed. Multiple imputation by chained equations was used to account for missing data (S1 Fig, S2 Table). Ten imputed datasets were generated, using all available covariates and all the outcomes, although outcomes were not imputed [19,20]. The imputed datasets were pooled into a single dataset using Rubin’s rules [21]. A high number of dimensions from a dataset with many variables/features is associated with a loss of meaningful differentiation between similar and dissimilar individuals–the ‘curse of dimensionality’ [22]. To improve the cluster analysis process and performance, feature selection was carried out to reduce collinearity, conditional dependence, and noise that increases variance. Feature selection was based on two widely used data-driven feature selection methods (Boruta [23] and Least Absolute Shrinkage and Selection Operator (Lasso) regression [24]–S2 Fig) and clinical expert consensus. An expert group of clinicians from both primary care (Consultant General Practitioners–NQ, JK) and secondary care (Stroke Medicine Consultants/Specialists–GN, GG) was independently consulted to attain consensus on which variables to select for the cluster analysis. Clinical expert consensus was defined as 75% (3 out of 4) agreement among the clinical experts on each variable. Forty-nine variables were rated important by the clinical experts and by at least 1 of the 2 data-driven methods–S1 Table. After evaluating correlation among the 49 selected variables using the mixedCor and Lares functions in R for mixed-type data (S3 Fig & S4 Fig), we excluded 10 highly correlated variables based on clinical judgement/importance. The remaining 39 variables, listed in Box 1, were used for the cluster analysis.
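
The selection rule described above (at least 3 of 4 experts, plus at least 1 of the 2 data-driven methods) can be written down compactly. The sketch below only illustrates that rule; the function name, the votes, and the Boruta and Lasso selections are hypothetical and do not reproduce the study's results.

```python
def select_variables(expert_votes, boruta_selected, lasso_selected, min_experts=3):
    """Keep a variable when at least `min_experts` experts rated it important
    AND at least one of the two data-driven methods selected it."""
    selected = []
    for variable, votes in expert_votes.items():
        expert_consensus = sum(votes) >= min_experts
        data_driven = variable in boruta_selected or variable in lasso_selected
        if expert_consensus and data_driven:
            selected.append(variable)
    return selected


# Hypothetical inputs, for illustration only: one boolean vote per expert.
expert_votes = {
    "sbp": [True, True, True, True],
    "ldl": [True, True, True, False],
    "hb": [True, False, False, True],   # only 2 of 4 experts agree
    "tg": [True, True, True, True],     # consensus, but no data-driven support
}
boruta_selected = {"sbp", "hb"}
lasso_selected = {"sbp", "ldl"}

kept = select_variables(expert_votes, boruta_selected, lasso_selected)
```

In this toy input only `sbp` and `ldl` satisfy both conditions.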

Box 1. Phenotypic domains and phenotypic variables used for cluster analysis

https://doi.org/10.1371/journal.pdig.0000334.t001

Phenotypic clustering

The prediction strength method by Tibshirani and Walther, 2005 [25] in the kamila function and the Elbow method were used to select the optimal number of clusters–S5 Fig. The kamila algorithm for mixed data clustering (S1 Text) was implemented to identify distinct patient phenotypic clusters. To ensure robustness of the clusters identified, 1,000 initialisations (that is, random starting points) were carried out. A plot of the clusters against the principal component analysis (PCA) dimensions was generated (S6 Fig).
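
To make the prediction strength idea concrete, the sketch below computes the statistic for one candidate number of clusters from two label vectors over the same held-out patients. It illustrates the statistic only and is not the kamila implementation, which handles the data splitting and clustering internally. The function name and the example labels are hypothetical.

```python
from itertools import combinations


def prediction_strength(test_labels, train_based_labels):
    """Prediction strength for one candidate number of clusters.

    test_labels:        cluster labels from clustering the held-out data directly.
    train_based_labels: labels for the same held-out points, obtained by assigning
                        them to the clusters learned on the training data.

    For each held-out cluster, compute the share of its point pairs that the
    training-based assignment also places together, then return the minimum
    of these shares across clusters.
    """
    members_by_cluster = {}
    for index, label in enumerate(test_labels):
        members_by_cluster.setdefault(label, []).append(index)

    shares = []
    for members in members_by_cluster.values():
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # a single-point cluster has no pairs to check
        together = sum(
            train_based_labels[i] == train_based_labels[j] for i, j in pairs
        )
        shares.append(together / len(pairs))

    if not shares:
        raise ValueError("no cluster contains at least two points")
    return min(shares)


# Hypothetical labels for six held-out patients, for illustration only.
test_labels = [0, 0, 0, 1, 1, 1]
train_based_labels = [0, 0, 1, 1, 1, 1]
strength = prediction_strength(test_labels, train_based_labels)
```

Label names do not need to match between the two vectors, because only co-membership of pairs is compared. Repeating the calculation for each candidate number of clusters gives a curve from which the number of clusters can be chosen.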

Using the h2o package (http://www.h2o.ai), a gradient boosting model was applied to identify and rank the key covariates (candidate variables) that predict each of the identified phenotypic clusters. For each model, cluster membership was coded as 1 (belonging to the cluster) or 0 (belonging to any other cluster). SHAP (SHapley Additive exPlanations) was used to assess the discriminative influence of the variables for each of the identified clusters [26].
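
The one-versus-rest coding described above can be sketched as follows. This shows only how the binary targets are derived from the cluster labels; it does not call h2o or fit any model, and the function name and example labels are hypothetical.

```python
def one_vs_rest(cluster_labels, target_cluster):
    """Code membership as 1 for the target cluster and 0 for all other clusters."""
    return [1 if label == target_cluster else 0 for label in cluster_labels]


# Hypothetical cluster assignments for six patients, for illustration only.
cluster_labels = [1, 2, 2, 4, 3, 1]

# One binary target per cluster; each would feed a separate classifier.
binary_targets = {
    cluster: one_vs_rest(cluster_labels, cluster)
    for cluster in sorted(set(cluster_labels))
}
```

Each binary vector then serves as the outcome of its own supervised model, so that variable importance can be read per cluster rather than across all four at once.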

Statistical analysis

For each cluster, descriptive characteristics were provided, reporting proportion (%) for categorical variables and mean (SD) or median (IQR) for continuous variables. Kruskal-Wallis and chi-squared tests were used to compare across clusters, for continuous and categorical data, respectively.

The association between phenotypic clusters and adverse cardiovascular morbidity and mortality outcomes was assessed using Cox proportional hazards regression models. The hazard ratio (HR) for each phenotypic group is presented with 95% confidence intervals (CI) and corresponding p-values. Cumulative incidence plots were derived and differences between phenotypic groups were assessed by the log-rank test. All statistical analyses were performed using Stata SE version 17 (StataCorp LP) and R version 4.1.0. An alpha level of 0.05 was used.
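
As a worked illustration of how a hazard ratio and its confidence interval relate to a Cox model coefficient, the sketch below assumes a Wald-type interval on the log scale (the text does not state which interval type was used). It takes the reported estimate for cluster 2 versus cluster 1 on the primary outcome (HR 1.07; 95% CI, 1.02–1.12), back-calculates an approximate standard error from the interval, and rebuilds the interval from it. The function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def hr_with_ci(beta, se, z=Z_95):
    """Hazard ratio and Wald-type interval from a log hazard ratio and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def se_from_ci(lower, upper, z=Z_95):
    """Approximate SE of the log hazard ratio implied by a reported interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Reported estimate for cluster 2 vs cluster 1 (primary outcome).
reported_hr, reported_lower, reported_upper = 1.07, 1.02, 1.12

# Approximate only: the reported bounds are rounded to two decimals.
approx_se = se_from_ci(reported_lower, reported_upper)
hr, lower, upper = hr_with_ci(math.log(reported_hr), approx_se)
```

Because the interval is symmetric on the log scale rather than the ratio scale, the reported bounds sit slightly unevenly around the point estimate, and rounding means the rebuilt bounds agree with the reported ones only to two decimal places.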

Ethics approval and consent to participate

Ethical approval for this study was obtained from the Independent Scientific Advisory Committee (ISAC)–study protocol number 19_023R. De-identified (anonymised) patient data was obtained from the CPRD hence this study was exempt from obtaining informed consent from patients.

Results

Clinical characteristics among phenotypic clusters

We identified 68,642 patients aged ≥18 years with any incident non-fatal stroke event between 1998 and 2017. A total of 20,528 (29.9%) patients with subsequent clinical outcomes occurring within 30 days of the incident stroke event were excluded, as these outcomes were considered to be related to the incident stroke event [16]. Cluster analysis was performed in the remaining 48,114 patients. Four phenotypic clusters with significant differences in clinical characteristics were identified. The identified clusters were numbered from 1 to 4 in ascending order of overall incidence of the subsequent composite outcome of recurrent stroke or CVD-related mortality, the primary outcome. Table 1 describes and compares the clinical characteristics among the phenotypic clusters.

Table 1. Characteristics of study population at time of incident stroke according to cluster membership (n = 48,114).

https://doi.org/10.1371/journal.pdig.0000334.t002

The plots of the clusters are shown with the principal component analysis (PCA) dimensions in S6 Fig. The cluster profiles are summarised in Box 2.

Box 2. Summary of cluster profiles

https://doi.org/10.1371/journal.pdig.0000334.t003

Variable importance for clusters

The supervised gradient boosting models used to identify key covariates (candidate variables) that predict the respective phenotypic clusters had excellent prediction accuracy–area under the receiver operating characteristic curve (AUC) of 0.985, 0.982, 0.974, and 0.970 for clusters 1, 2, 3 and 4, respectively. The most common variables for predicting the respective phenotypic clusters were age at incident stroke, blood pressure, hypertension, LDL cholesterol, and potency of prescribed statin–Fig 2.

Fig 2. Plot showing the clinical parameters which are the core of each phenotypic cluster.

aki: acute kidney injury; dbp: diastolic blood pressure; dm_eye_comp: diabetic ophthalmic complications; sbp: systolic blood pressure; gfr: glomerular filtration rate; hb: haemoglobin; hdl: high-density lipoprotein cholesterol; ldl: low-density lipoprotein cholesterol; hba1c: glycated haemoglobin; nonRH_aortic: non-rheumatic aortic valve disorder; smi: severe mental illness; tg: triglyceride; tia: transient ischaemic attack. SHAP summary plot combines feature/variable importance with feature effects. Each point on the summary plot is a Shapley value for an individual. The position on the y-axis is determined by the feature and on the x-axis by the Shapley value. The colour represents the value from low to high. The features are ordered according to importance.

https://doi.org/10.1371/journal.pdig.0000334.g002

Association with subsequent clinical outcomes

During the median follow-up time of 12.60 years (IQR, 7.60–16.97 years), there was a total of 24,588 (51.1%) composite recurrent stroke or CVD-related mortality outcome events. The occurrence of recurrent stroke + CVD-related mortality differed across the 4 phenotypic clusters–cluster 1 had the lowest incidence rate (15.13 per 100 person-years; 95% CI, 14.54–15.74), while cluster 4 had the highest incidence rate (23.17 per 100 person-years; 95% CI, 22.67–23.69). The risk of subsequent recurrent stroke + CVD-related mortality was significantly increased in cluster 2 (hazard ratio (HR), 1.07; 95% CI, 1.02–1.12), cluster 3 (HR, 1.20; 95% CI, 1.14–1.26), and cluster 4 (HR, 1.44; 95% CI, 1.37–1.50), when compared with cluster 1. Similar incidence rate and hazard ratio trends were observed for the subsequent recurrent stroke + all-cause mortality outcome (cluster 2: HR, 1.07; 95% CI, 1.03–1.12; cluster 3: HR, 1.32; 95% CI, 1.26–1.37; cluster 4: HR, 1.54; 95% CI, 1.48–1.60) and the recurrent stroke outcome (cluster 2: HR, 1.10; 95% CI, 1.05–1.16; cluster 3: HR, 1.12; 95% CI, 1.06–1.18; cluster 4: HR, 1.25; 95% CI, 1.19–1.32).
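
The incidence rates above are expressed per 100 person-years. A minimal sketch of that calculation follows, with an approximate 95% interval computed on the log scale. The interval method is an assumption, since the text does not say how the intervals were derived, and the event count and person-time used here are hypothetical inputs chosen only to show the arithmetic, not study data.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def incidence_rate_per_100py(events, person_years, z=Z_95):
    """Incidence rate per 100 person-years with an approximate log-scale CI.

    Assumption: the event count is treated as Poisson, so the standard error
    of the log rate is 1 / sqrt(events).
    """
    rate = 100.0 * events / person_years
    se_log = 1.0 / math.sqrt(events)
    lower = rate * math.exp(-z * se_log)
    upper = rate * math.exp(z * se_log)
    return rate, lower, upper


# Hypothetical inputs, for illustration only.
rate, lower, upper = incidence_rate_per_100py(events=200, person_years=1000.0)
```

The interval narrows as the number of events grows, which is why the large clusters in this cohort have tight intervals around their rates.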

Different trends in incidence rate and hazard ratios were observed, however, for subsequent CHD, PVD, heart failure, CVD-related and all-cause mortality outcomes–Fig 3 and Table 2. When compared with cluster 1, the risk of subsequent CHD events was significantly decreased in the other 3 clusters (cluster 2: HR, 0.49; 95% CI: 0.44–0.55; cluster 3: HR, 0.64; 95% CI, 0.56–0.73; cluster 4: HR, 0.55; 95% CI, 0.49–0.63). A similar decreased risk in the other 3 clusters when compared to cluster 1 was observed for risk of subsequent PVD.

Fig 3. Incidence rate for the subsequent adverse outcomes by the identified phenotypic clusters.

https://doi.org/10.1371/journal.pdig.0000334.g003

Table 2. Subsequent major adverse outcomes after incident stroke by phenotypic clusters.

https://doi.org/10.1371/journal.pdig.0000334.t004

For risk of subsequent heart failure, CVD-related mortality and all-cause mortality, cluster 2 had a significantly decreased risk when compared to cluster 1, while clusters 3 and 4 had a significantly increased risk–Table 2. The occurrence of subsequent cardiovascular morbidity and mortality outcomes across the different phenotypic clusters is presented as Kaplan-Meier plots in Fig 4.

Fig 4. Kaplan-Meier plots for subsequent clinical outcomes stratified by phenotypic clusters.

A: Recurrent stroke and CVD-related mortality (log-rank p<0.0001); B: Recurrent stroke and all-cause mortality (log-rank p<0.0001); C: Recurrent stroke (log-rank p<0.0001); D: Coronary heart disease (log-rank p<0.0001); E: Peripheral vascular disease (log-rank p<0.0001); F: Heart failure (log-rank p<0.0001); G: Cardiovascular-related mortality (log-rank p<0.0001); H: All-cause mortality (log-rank p<0.0001).

https://doi.org/10.1371/journal.pdig.0000334.g004

Discussion

This population-based study exploring phenotypic characteristics of patients with incident stroke using a data-driven-cluster analysis approach identified four clinically meaningful patient clusters based on the phenotypic characteristics at time of incident stroke. There was a varied relationship between the identified phenotypic clusters and subsequent risk of adverse cardiovascular morbidity and mortality outcomes.

In our study, four distinct and clinically meaningful phenotypic clusters were identified. Smoking, a strong independent modifiable risk factor for cardiovascular morbidity and mortality outcomes [27], was most prevalent in clusters 1 and 2. A preventative strategy that communicates the risks of smoking and the benefits of quitting to patients in these clusters could be an effective means to promote smoking cessation and reduce risk for subsequent adverse events [28]. With the exception of cluster 2, the clusters had a high prevalence of multiple long-term conditions as well as CVD risk factors at time of incident stroke. Patients with incident stroke have been shown to commonly have pre-existing long-term conditions [29]. To optimally manage the possible atherogenic effect of these comorbid conditions and reduce risk of subsequent cardiovascular morbidity and mortality outcomes, both non-pharmacological (that is, lifestyle modification [30,31]) and pharmacological (antihypertensives for blood pressure management [32]; lipid-lowering medications such as statins for cholesterol management [33]; antidiabetics for blood sugar control [30]; and antiplatelets/anticoagulants to manage arrhythmia [34]) strategies need to be prioritised in line with clinical guidelines [35]. Frequent monitoring/reviews to ensure treatment targets are being met are important [36]. Age, a non-modifiable risk factor, was a key factor for patient cluster membership. Among older adults (typical of cluster 4), the incidence of aortic disease, PVD and venous thromboembolism increases as age-related alterations in vascular structure and function are compounded by longer exposure to CVD risk factors [37].

Clustering is a common approach used to analyse large datasets, to identify both the number of subgroups in the data and the attributes of each subgroup, as has been done in this study. Data analysed in real applications including healthcare (from electronic health records) are mostly characterised by a mix of continuous and categorical variables. More common approaches that have been applied to mixed data include converting the variables to a single data type, by either coding the categorical variables as numbers or dummy coding them, and then applying standard distance-based methods designed for continuous variables, such as k-means, to the transformed data to achieve the clustering objective(s) [38,39]. Continuous variables have also been converted to categorical variables using interval-based bucketing [40,41]. Similarities that may have been observed in the original data may be lost when the data are transformed in such ways [40]. The Kamila clustering algorithm has, however, been shown to handle high imbalance between continuous and categorical data better than any other method [40,42]. From a computational perspective, when compared with other algorithms, the Kamila algorithm offers the best performance and is the most time-efficient when dealing with large datasets (in relation to both observations and variables) in the setting of heterogeneous data, as was the situation in our study [40,42].

Strengths and limitations

To our knowledge, this is the first time that a data-driven cluster analysis has been used to identify stroke phenotypes in a well-characterised, large population-based cohort of adults with any incident stroke. This allows us to cover a large range of stroke phenotypes. Most importantly, we had a comprehensive linked database with a broad spectrum of clinical data, with many of these variables being explored in cluster analysis for the first time.

There are, however, limitations of this study worth considering. First and foremost, the study was not meant to propose a new classification for stroke, because the clusters are likely to vary according to patient characteristics and available data. These results serve to underscore the need for novel multidimensional stroke classification approaches for improving patient care. Furthermore, they aim to generate hypotheses for future studies that will integrate clinical and biological data in patients, with the goal of improving the care of patients with stroke. With immense advancement in machine learning, cluster analysis can be performed in a large number of ways [42,43]. However, the knowledge and experience of the relevant experts remain the best judge in the interpretation of findings from cluster analysis, hence the involvement of a diverse group of clinical specialists, clinical researchers, and data experts in our study. The presence of missing data is a common occurrence in clinical research using electronic health records collected as part of routine care. For example, laboratory tests are typically requested only when considered necessary for a patient’s health condition. Similarly, information on BMI or smoking status may not be consistently recorded, leading to potential bias in patterns of data completeness. To address this issue, multiple imputation by chained equations, as outlined in the methods section, was used to handle missing data in our study, which is the preferred option under any missingness mechanism [19,20].

Implications

Cluster analysis is well suited to address the multidimensional complexity of disease conditions with considerable heterogeneity such as stroke. Population-based cluster analysis could provide further understanding of disease patterns. Additionally, patients could be phenotyped and allocated to specific clusters that could be associated with different risks for various outcomes. Different treatment strategies or interventions could be targeted at specific phenotypic clusters, based on available evidence on risk and possible response. Future clinical trial design could also focus on high-risk clusters or on specific aspects within a cluster.

Conclusions

Using an unsupervised learning data-driven cluster analysis on a broad spectrum of baseline clinical data of patients with incident stroke, we identified four phenotypic and clinically meaningful clusters with respect to risk of subsequent major adverse outcomes. These findings highlight the significant heterogeneity that exists within patients with incident stroke with respect to subsequent adverse outcomes. This offers an opportunity to revisit the stratification of care for patients with incident stroke to improve patient outcomes. Further exploration in different patient cohorts and populations is needed.

Supporting information

S1 Text. Additional Methods.

https://doi.org/10.1371/journal.pdig.0000334.s001

(DOCX)

S1 Fig. All clinical variables with missing values.

https://doi.org/10.1371/journal.pdig.0000334.s002

(DOCX)

S2 Fig. Feature selection.

https://doi.org/10.1371/journal.pdig.0000334.s003

(DOCX)

S3 Fig. Plot of correlation matrix of 49 selected variables.

https://doi.org/10.1371/journal.pdig.0000334.s004

(DOCX)

S4 Fig. Ranked cross-correlation plot of 49 selected variables.

https://doi.org/10.1371/journal.pdig.0000334.s005

(DOCX)

S5 Fig. Optimal number of clusters.

https://doi.org/10.1371/journal.pdig.0000334.s006

(DOCX)

S6 Fig. Principal component analysis (PCA) plots.

https://doi.org/10.1371/journal.pdig.0000334.s007

(DOCX)

S1 Table. Overview of all variables and the in- or exclusion at the various data processing steps.

https://doi.org/10.1371/journal.pdig.0000334.s008

(DOCX)

S2 Table. Observed versus imputed values after multiple imputation for all clinical variables with missing data.

https://doi.org/10.1371/journal.pdig.0000334.s009

(DOCX)

Acknowledgments

We thank the practices that contributed to the CPRD GOLD.

References

1. Rajsic S, Gothe H, Borba HH, Sroczynski G, Vujicic J, Toell T, et al. Economic burden of stroke: a systematic review on post-stroke care. Eur J Heal Econ. 2019;20: 107–134. pmid:29909569
2. Prosser J, MacGregor L, Lees KR, Diener HC, Hacke W, Davis S. Predictors of early cardiac morbidity and mortality after ischemic stroke. Stroke. 2007;38: 2295–2302. pmid:17569877
3. Joosten SA, Hamza K, Sands S, Turton A, Berger P, Hamilton G. Phenotypes of patients with mild to moderate obstructive sleep apnoea as confirmed by cluster analysis. Respirology. 2012;17: 99–107. pmid:21848707
4. Haldar P, Pavord ID, Shaw DE, Berry MA, Thomas M, Brightling CE, et al. Cluster analysis and clinical asthma phenotypes. Am J Respir Crit Care Med. 2008;178: 218–224. pmid:18480428
5. Siroux V, Basagaña X, Boudier A, Pin I, Garcia-Aymerich J, Vesin A, et al. Identifying adult asthma phenotypes using a clustering approach. Eur Respir J. 2011;38: 310–317. pmid:21233270
6. Ahmad T, Pencina MJ, Schulte PJ, O’Brien E, Whellan DJ, Piña IL, et al. Clinical implications of chronic heart failure phenotypes defined by cluster analysis. J Am Coll Cardiol. 2014;64: 1765–1774. pmid:25443696
7. Verdonschot JAJ, Merlo M, Dominguez F, Wang P, Henkens MTHM, Adriaens ME, et al. Phenotypic clustering of dilated cardiomyopathy patients highlights important pathophysiological differences. Eur Heart J. 2021;42: 162–174. pmid:33156912
8. Seymour CW, Kennedy JN, Wang S, Chang CCH, Elliott CF, Xu Z, et al. Derivation, Validation, and Potential Treatment Implications of Novel Clinical Phenotypes for Sepsis. J Am Med Assoc. 2019;321: 2003–2017. pmid:31104070
9. Fereshtehnejad SM, Romenets SR, Anang JBM, Latreille V, Gagnon JF, Postuma RB. New clinical subtypes of Parkinson disease and their longitudinal progression a prospective cohort comparison with other phenotypes. JAMA Neurol. 2015;72: 863–873. pmid:26076039
10. Soria D, Garibaldi JM, Ambrogi F, Green AR, Powe D, Rakha E, et al. A methodology to identify consensus classes from clustering algorithms applied to immunohistochemical data from breast cancer patients. Comput Biol Med. 2010;40: 318–330. pmid:20106472
11. Ahlqvist E, Storm P, Käräjämäki A, Martinell M, Dorkhan M, Carlsson A, et al. Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol. 2018;6: 361–369. pmid:29503172
12. Herrett E, Gallagher AM, Bhaskaran K, Forbes H, Mathur R, van Staa T, et al. Data Resource Profile: Clinical Practice Research Datalink (CPRD). Int J Epidemiol. 2015;44: 827–836. pmid:26050254
13. NHS Digital. Hospital Episode Statistics (HES). In: NHS Digital [Internet]. 2019 [cited 21 Jun 2019]. Available: https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics
14. Office for National Statistics. Deaths Registration Data. In: ONS [Internet]. 2018 [cited 21 Jun 2019]. Available: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths
15. Department of Communities and Local Government. English Indices of Deprivation 2015. 2015 [cited 10 Jul 2016] pp. 1–11. Available: https://www.gov.uk/government/statistics/english-indices-of-deprivation-2015
16. Akyea RK, Vinogradova Y, Qureshi N, Patel RS, Kontopantelis E, Ntaios G, et al. Sex, Age, and Socioeconomic Differences in Nonfatal Stroke Incidence and Subsequent Major Adverse Outcomes. Stroke. 2021;52: 396–405. pmid:33493066
17. Kuan V, Denaxas S, Gonzalez-Izquierdo A, Direk K, Bhatti O, Husain S, et al. A chronological map of 308 physical and mental health conditions from 4 million individuals in the English National Health Service. Lancet Digit Health. 2019;1: e63–e77. pmid:31650125
18. CPRD @ Cambridge. Codes Lists (GOLD). [cited 6 Mar 2021]. Available: https://www.phpc.cam.ac.uk/pcu/research/research-groups/crmh/cprd_cam/codelists/v11/
19. Royston P. Multiple imputation of missing values: Update of ice. Stata J. 2005;5: 527–536.
20. Kontopantelis E, White IR, Sperrin M, Buchan I. Outcome-sensitive multiple imputation: A simulation study. BMC Med Res Methodol. 2017;17: 1–13. pmid:28068910
21. Rubin DB. Multiple imputation for nonresponse in surveys. Wiley; 1987. https://doi.org/10.1002/9780470316696
22. Altman N, Krzywinski M. The curse(s) of dimensionality. Nat Methods. 2018;15: 399–400. pmid:29855577
23. Kursa MB, Rudnicki WR. Feature selection with the Boruta package. J Stat Softw. 2010;36: 1–13.
24. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B. 1996;58: 267–288.
25. Foss AH, Markatou M. kamila: Clustering mixed-type data in R and hadoop. J Stat Softw. 2018;83: 1–44.
26. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2: 56–67. pmid:32607472
27. Mons U, Müezzinler A, Gellert C, Schöttker B, Abnet CC, Bobak M, et al. Impact of smoking and smoking cessation on cardiovascular events and mortality among older adults: Meta-analysis of Individual participant data from prospective cohort studies of the CHANCES consortium. BMJ. 2015;350: 18. pmid:25896935
28. Duncan MS, Freiberg MS, Greevy RA, Kundu S, Vasan RS, Tindle HA. Association of Smoking Cessation with Subsequent Risk of Cardiovascular Disease. JAMA. 2019;322: 642–650. pmid:31429895
29. Gallacher KI, Batty GD, McLean G, Mercer SW, Guthrie B, May CR, et al. Stroke, multimorbidity and polypharmacy in a nationally representative sample of 1,424,378 patients in Scotland: Implications for treatment burden. BMC Med. 2014;12: 1–9. pmid:25280748
30. Kernan WN, Ovbiagele B, Black HR, Bravata DM, Chimowitz MI, Ezekowitz MD, et al. Guidelines for the prevention of stroke in patients with stroke and transient ischemic attack: A guideline for healthcare professionals from the American Heart Association/American Stroke Association. Stroke. 2014;45: 2160–2236. pmid:24788967
31. Billinger SA, Arena R, Bernhardt J, Eng JJ, Franklin BA, Johnson CM, et al. Physical activity and exercise recommendations for stroke survivors: A statement for healthcare professionals from the American Heart Association/American Stroke Association. Stroke. 2014;45: 2532–2553. pmid:24846875
32. Arima H, Chalmers J, Woodward M, Anderson C, Rodgers A, Davis S, et al. Lower target blood pressures are safe and effective for the prevention of recurrent stroke: The PROGRESS trial. J Hypertens. 2006;24: 1201–1208. pmid:16685221
33. Fulcher J, O’Connell R, Voysey M, Emberson J, Blackwell L, Mihaylova B, et al. Efficacy and safety of LDL-lowering therapy among men and women: Meta-analysis of individual data from 174 000 participants in 27 randomised trials. Lancet. 2015;385: 1397–1405. pmid:25579834
34. Gent M. A randomised, blinded, trial of clopidogrel versus aspirin in patients at risk of ischaemic events (CAPRIE). Lancet. 1996;348: 1329–1339. pmid:8918275
35. Kleindorfer DO, Towfighi A, Chaturvedi S, Cockroft KM, Gutierrez J, Lombardi-Hill D, et al. 2021 Guideline for the prevention of stroke in patients with stroke and transient ischemic attack: A guideline from the American Heart Association/American Stroke Association. Stroke. 2021;52: E364–E467. pmid:34024117
36. National Institute for Health and Care Excellence. Multimorbidity: clinical assessment and management. NICE; 2016 [cited 1 Oct 2021]. Available: https://www.nice.org.uk/guidance/ng56
37. Miller AP, Huff CM, Roubin GS. Vascular disease in the older adult. J Geriatr Cardiol. 2016;13: 727–732. pmid:27899936
38. Dougherty J, Kohavi R, Sahami M. Supervised and Unsupervised Discretization of Continuous Features. Mach Learn Proc 1995. 1995; 194–202.
39. Hennig C, Liao TF. How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification. J R Stat Soc Ser C Appl Stat. 2013;62: 309–369.
40. Foss A, Markatou M, Ray B, Heching A. A semiparametric method for clustering mixed data. Mach Learn. 2016;105: 419–458.
41. Ichino M, Yaguchi H. Generalized Minkowski Metrics for Mixed Feature-Type Data Analysis. IEEE Trans Syst Man Cybern. 1994;24: 698–708.
42. Preud’homme G, Duarte K, Dalleau K, Lacomblez C, Bresso E, Smaïl-Tabbone M, et al. Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark. Sci Rep. 2021;11: 1–14. pmid:33603019
43. McLachlan GJ. Cluster analysis and related techniques in medical research. Stat Methods Med Res. 1992;1: 27–48. pmid:1341650

