Figures
Abstract
Since the emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), large-scale social contact surveys are now longitudinally measuring the fundamental changes in human interactions in the face of the pandemic and non-pharmaceutical interventions. Here, we present a model-based Bayesian approach that can reconstruct contact patterns at 1-year resolution even when the age of the contacts is reported coarsely by 5 or 10-year age bands. This innovation is rooted in population-level consistency constraints in how contacts between groups must add up, which prompts us to call the approach presented here the Bayesian rate consistency model. The model can also quantify time trends and adjust for reporting fatigue emerging in longitudinal surveys through the use of computationally efficient Hilbert Space Gaussian process priors. We illustrate estimation accuracy on simulated data as well as social contact data from Europe and Africa for which the exact age of contacts is reported, and then apply the model to social contact data with coarse information on the age of contacts that were collected in Germany during the COVID-19 pandemic from April to June 2020 across five longitudinal survey waves. We estimate the fine age structure in social contacts during the early stages of the pandemic and demonstrate that social contact intensities rebounded in an age-structured, non-homogeneous manner. The Bayesian rate consistency model provides a model-based, non-parametric, computationally tractable approach for estimating the fine structure and longitudinal trends in social contacts and is applicable to contemporary survey data with coarsely reported age of contacts as long as the exact age of survey participants is reported.
Author summary
The transmission of respiratory infectious diseases occurs during close social contacts. Hence, measuring the intensity and patterns in social contacts leads to a better understanding of disease spread and provides essential data to estimate central quantities such as the reproduction number in real-time. Unlike pre-pandemic surveys, which largely recorded contacts’ age in one-year age intervals, most COVID-era studies only recorded the age of contacts in broad age categories to facilitate reporting. Some studies allowed participants to report an estimate for the total number of contacts for which they could not remember age and gender information. Many studies were partially longitudinal, which introduced the issue of reporting fatigue. Thus, directly applying existing statistical methods for estimating social contact matrices may result in losing age detail and confounded estimates. To this end, we develop a model-based approach which estimates fine-age contact patterns from coarse-age data by exploiting particular constraints that must hold mathematically in closed populations. The model can also adjust for the confounding effects of aggregate contact reporting and reporting fatigue and estimate the time trends in social contact dynamics. We hope this statistical model is a useful addition to the global pandemic preparedness toolkit to reconstruct the fine structure of social contact patterns and measure real-time effective reproduction numbers with greater precision.
Citation: Dan S, Chen Y, Chen Y, Monod M, Jaeger VK, Bhatt S, et al. (2023) Estimating fine age structure and time trends in human contact patterns from coarse contact data: The Bayesian rate consistency model. PLoS Comput Biol 19(6): e1011191. https://doi.org/10.1371/journal.pcbi.1011191
Editor: Virginia E. Pitzer, Yale School of Public Health, UNITED STATES
Received: October 20, 2022; Accepted: May 17, 2023; Published: June 5, 2023
Copyright: © 2023 Dan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data and code required to reproduce all the results presented in this scientific paper are publicly available from the GitHub repository at the following URL: https://github.com/MLGlobalHealth/bayes-rate-consistency.
Funding: This work was funded by the Imperial President’s PhD Scholarship program and the EPSRC Centre for Doctoral Training in Modern Statistics and Statistical Machine Learning at Imperial and Oxford (EP/S023151/1) to YC and MM. The Bill & Melinda Gates Foundation (OPP1175094) to MM and OR. The Medical Research Council (MR/V038109/1) to OR. The MRC Centre for Global Infectious Disease Analysis (MR/R015600/1), jointly funded by the UK Medical Research Council (MRC) and the UK Foreign, Commonwealth & Development Office (FCDO), under the MRC/FCDO Concordat agreement, and also part of the EDCTP2 programme supported by the European Union and the Novo Nordisk Foundation via The Novo Nordisk Young Investigator Award (NNF20OC0059309) and the Danish National Research Foundation via a chair position and the Eric and Wendy Schmidt Fund For Strategic Innovation via the Schmidt Polymath Award (G-22-63345) and the National Institute for Health Research (NIHR) via the Health Protection Research Unit in Modelling and Health Economics to SB. COVIMOD is funded by intramural funds of the Institute of Epidemiology and Social Medicine, University of Munster, and of the Institute of Medical Epidemiology, Biometry and Informatics, Martin Luther University Halle-Wittenberg, as well as by funds provided by the Robert Koch Institute, Berlin, the Helmholtz-Gemeinschaft Deutscher Forschungszentren e.V. via the HZEpiAdHoc "The Helmholtz Epidemiologic Response against the COVID-19 Pandemic" project, the Saxonian COVID-19 Research Consortium SaxoCOV (co-financed with tax funds on the basis of the budget passed by the Saxon state parliament), the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, via the project SpaceImpact project number 458526380) and the Federal Ministry of Education and Research (BMBF) via the projects Respinow (project number 031L0298F) and OptimAgent (project number 031L0299J) and as part of the Network University Medicine (NUM) via the egePan Unimed project (funding code: 01KX2021). The funders had no role in study design, data collection and analysis, the decision to publish, or the preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
This is a PLOS Computational Biology Methods paper.
Introduction
The transmission of human respiratory diseases such as influenza, tuberculosis, and COVID-19 is directly driven by the rate of close social contact between individuals. Social contact studies such as the pivotal POLYMOD study [1] have been widely acknowledged as an effective method of obtaining social contact estimates to assess infection risk and to parameterise mathematical infectious disease models [2–4]. Consequently, they provide critical epidemiological insights which inform the implementation and evaluation of non-pharmaceutical interventions [5] as well as public health policies such as vaccination schedules [6].
Since the outbreak of COVID-19, longitudinal social contact surveys have been conducted in Europe and around the world, providing indispensable information on the evolving patterns of human mixing behaviour during the pandemic [7, 8]. In Germany, the COVIMOD study collected social contact data for nearly two years, and initial analyses [9] focused on data from April to June 2020 during the first partial lockdown in Germany to quantify the scale of social contact reductions relative to pre-pandemic contact patterns. It is important to estimate high-resolution age- and gender-specific human contact patterns and their trends over time, but this has remained an unsolved challenge due to changes in survey methodology from pre-pandemic studies [1, 9] and limitations in available inference methods [2, 10, 11].
First, COVIMOD and other COVID-era social contact studies record the age of contacts by large age bands of 5 to 10 years, reflecting that often it is difficult for study participants to know the exact age of their contacts. In contrast, the pre-pandemic POLYMOD surveys primarily collected data on the exact age of contacts. Subsequently developed methods, including thin-plate regressions [2] and Gaussian Markov Random Field model approaches [10] relied on such data to estimate high-resolution contact patterns, and these methods are no longer directly applicable. Most studies have thus resorted to the convenient and speedy bootstrap approach implemented in the socialmixr library [11], although it can only provide contact estimates in the same coarse age bands that the data are reported in, which mask the full structure of the actual contact patterns [12].
Second, most COVID-era studies adopted retrospective web-based survey protocols and conducted longitudinal repeat surveys [7, 13–15]. The survey waves are typically inter-dependent because a number of participants were surveyed in multiple waves, and additional participants were recruited to replenish the cohort size. While this approach provides valuable longitudinal data, it also introduces the issue of reporting fatigue, where participants tend to report fewer contacts in subsequent participation due to becoming tired of filling out the survey. It follows that directly applying existing methods, which do not incorporate adjustments to counter the confounding reporting fatigue effects, is bound to lead to incorrect estimates. Additionally, participants sometimes found it difficult to recall specific age and gender information for all of their contacts. Instead, they were allowed to report an estimate for the total number of contacts on that occasion [9], which again may result in under-ascertainment of contact intensities if these data are not accounted for in inference approaches of contact patterns.
This work presents a non-parametric Bayesian model to infer age- and gender-specific contact patterns and time trends at high 1-year resolution from longitudinal survey data. The primary innovation of the model is the ability to infer contact patterns by 1-year age bands even when the age contacts are reported in broad age bands. We call this model-based approach the Bayesian rate consistency model for reasons that will be clear soon. In addition, we use recently developed Hilbert Space Gaussian Process approximations [16] to gain substantial advances in computational efficiency, enabling us to make full Bayesian inferences over time and uncover the time trends in social contact structure. We demonstrate that it is crucial to model contact patterns over time to account for reporting fatigue effects in inter-dependent longitudinal survey waves. The primary purpose behind developing the Bayesian rate consistency model is its application to contemporary COVID-19-era survey data, which we present for data spanning the first five survey waves of the COVIMOD study in Germany. We present high-resolution estimates of age- and gender-specific social contacts for each survey wave and describe their time evolution. We also place the inferred contact dynamics into a pre-pandemic context and quantify the differences in contact intensity change by the age of contacts.
Methods
Ethics statement
COVIMOD was approved by the ethics committee of the Medical Board Westfalen-Lippe and the University of Münster, reference number 2020–473-fs. Written consent was obtained from all participants. The POLYMOD data collection was approved by national institutional review boards [1]. As only anonymised COVIMOD and POLYMOD data were used in this study, an institutional review was not required for reanalysis.
The COVIMOD study
The COVIMOD study was launched in April 2020 and continued until December 2021, constituting 33 survey waves. Participants were recruited through email invitations to existing panel members of the online market research platform IPSOS i-say [17]. To ensure the sample’s broad representativeness of the German population, quota sampling was conducted based on age, gender, and region. Participants were invited to participate in multiple waves to track changes in social behaviour and attitudes toward COVID-19. When the participant size did not meet the sampling quota due to study withdrawals, new participants were recruited into the study. This approach enabled the COVIMOD study to obtain longitudinal samples, but it also introduced the issue of response fatigue, where the number of detailed contacts reported decreased compared to previous participation, irrespective of the survey wave. To procure information on children, a subgroup of adult participants living with children under the age of 18 were selected to be proxies. This procedure meant that middle-aged adults were under-sampled as they completed the survey on behalf of their children.
The COVIMOD questionnaire was based on the CoMix study and includes questions on demographics, the presence of a household member belonging to a high-risk group, attitudes towards COVID-19 as well as related government measures, and current preventative behaviors [9, 18]. Participants were also asked to provide information about their social contacts between 5 a.m. the preceding day to 5 a.m. the day of answering the survey. Following the pre-pandemic POLYMOD study, a contact is defined as either a skin-to-skin contact such as a kiss or a handshake (physical contact) or an exchange of words in the presence of another person (non-physical contacts) [1]. Participants were asked to report the age group, gender, relation, the contact setting (e.g. home, school, workplace, place of entertainment, etc.), and whether the contact was a household member. For survey waves 1 and 2, participants were asked to provide each contact’s information separately. However, some participants reported contacts to groups of individuals (e.g., customers, clients) for which a specific number of contacts was assumed (Additional file 2 of [9]). From wave 3 onward, in addition to being asked to report contacts individually, participants were allowed to report a total number of contacts that were either physical or non-physical contacts in the sense described above and which they did not report individually, which we refer to as aggregate contacts. Additionally, some participants could not recall or preferred not to answer the age or gender information of some of the individual contacts that they reported. We treat these three types of entries with missing age or gender equally and refer to them as missing & aggregate contact reports. A copy of the COVIMOD questionnaire may be found in Additional file 1 of [9].
This current work concerns the first five survey waves of the COVIMOD study. In Fig 1A and 1B, we show the sampling periods with the number of daily COVID-19 cases, cumulative COVID-19-related deaths, and the OxCGRT Stringency Index which is a measure that captures the degree of containment and closure policies in place for COVID-19 on a scale of 0% to 100% [19]. The following COVID-19 policy timeline is obtained from the ACAPS COVID-19 Government Measures dataset [20]. The first COVIMOD survey was administered from April 30th to May 6th in the year 2020, towards the end of the first partial lockdown and the first wave of cases. Before the beginning of the first survey (April 20th), small stores, auto dealers, and bookstores were allowed to reopen under strict hygiene regulations. During the final few days of the survey period (May 4th to 6th), phase-out measures were announced by the government, including the step-wise uptake of schools, the reopening of hairdressers under strict hygiene regulations, lifting of the ban on public gatherings of 30 people indoors and 50 outdoors, resumption of religious services, and reopening of public services such as museums, botanical gardens, zoos, and playgrounds. The second wave of the COVIMOD survey was administered from May 14th to May 21st. During this period, additional phase-out measures were announced, including the resumption of all cross-country transport and the reopening of hotels and restaurants. International travel to neighbouring countries was also slightly relaxed during this period. The third, fourth, and fifth waves of COVIMOD surveys were taken from May 28th to July 4th, June 11th-22nd, and June 26th to July 1st, respectively. There was no notable introduction or reduction of social contact restriction measures during this time, but international travel restrictions were relaxed primarily for Schengen and EU countries. COVID-19 cases and deaths remained stable during this period (Fig 1A and 1B).
A. Daily COVID-19 case counts in Germany (red bars), the OxCGRT Stringency Index (blue line), and COVIMOD survey administration periods (grey ribbons). B. Cumulative COVID-19-related deaths in Germany (red line), the OxCGRT Stringency Index (blue line), and COVIMOD survey administration periods (grey ribbons). C. Sample sizes and the proportion of people repeatedly sampled in the COVIMOD survey for which zero repeats indicate first-time participants.
After excluding participants who prefer not to provide age or gender information and 25 participants above the age of 84, there were 1549, 1345, 1076, 1881, and 1603 participants for waves 1 to 5. We observed 3244, 4852, 6344, 13471, and 8353 total contacts for each wave. In Fig 1C, we show the proportion of participants who consented to the survey multiple times. Most participants in waves 2 and 3 had participated in wave 1, with only 6.8% and 16% of participants being new to the survey. The proportion dropped sharply in wave 4, where only 35.1% of initial participants remained. Hence the majority (57.7%) of wave 4 participants were first-time participants. On the contrary, no new participants were enrolled for wave 5, and individuals who participated for the second, third, and fifth time took up approximately 45%, 10.7%, 9.6%, and 34.6% of the sample.
Data processing
In this study, we excluded reports from 20 participants (0.3% of the total) who did not report their own age or gender, as these data are essential for our modelling framework. Further, and following ethical guidelines, the participant age information for children was reported in discrete age bands, i.e., 0–4, 5–9, 10–14, 15–18 years. To obtain fine-age information for participants under 18, we imputed their age by drawing from a discrete uniform distribution with bounds set as the minimum and maximum age of the participant’s age category. Finally, regarding the three types of missing & aggregate contacts reported, i.e. contacts reported to groups of individuals, additional contacts reported for who participants did not provide individual information, and individual contacts reported for who participants did not provide age and gender information, we truncated these at 60 (90th percentile) to remove the effects of extreme outliers. We show the distribution of observed contacts in S1 and S2 Figs.
Estimating contact patterns
Throughout, we focus on estimating contact patterns and dynamics between men and women (superscripts g, h ∈ {M, F}) and high resolution one year age groups (subscripts ). We note that the cutoff at age 84 is a specific choice on account of the age aggregation scheme and the age-specific sample size of COVIMOD. One may change this cutoff for different surveys in an appropriate fashion. To introduce basic notations, let us momentarily consider male-to-female contacts at some fixed time t, suppress the time index, and assume that the exact age of reported contacts is known. We start by considering the total number of contacts
that are reported by men of age a who participate in the survey to all women of age b in the population. The number of male survey participants of age a is
, and the number of women of age b in the population is
. When
for age group a, the contact counts
are defined for all
, and are either zero or positive. When there are no participants for some age group a, there is no corresponding contact data, and we denote the participant age groups in the survey by
. Finally, we denote the number of age groups in
by AM and the number of age groups in
by B. We also consider analogous notations for all other gender combinations using the superscripts g and h.
From the observed data, we seek to estimate the contact intensity , the average number of contacts from one male participant of age a to all women aged b in the population within a 24-hour time window. We follow [1, 2, 8, 10, 21–23] and model the count data with an overdispersion adjusting Negative Binomial observation model, in shape-scale form,
(1a)
(1b)
(1c
where g, h ∈ {M, F},
,
. Here,
are the expected contacts, which are expressed in terms of the target quantity of interest,
, and the known
. The overdispersion parameter ν is strictly greater than zero (ν > 0), such that the variance of the contact counts
is greater than the mean.
The contact rate is defined as the probability of contact between one male aged a and one female aged b within a 24-hour time window, i.e.
(2)
Crucially, in a closed population, the number of total contacts between the same population strata must be self-consistent and add up to the same, i.e.
(3)
for all
. From this, we find contact rates are symmetric because
for all a, b, and similarly
and
for all a < b. Property (3) implies that data on age group a informs contact rates in both age dimensions, which we will exploit heavily below. If the survey captures participants for all possible age groups, i. e.
, the estimation problem reduces to B × B + B × (B + 1)/2 × 2 = B(2B + 1) free contact rate parameters rather than 4B2 free parameters, an almost 50% reduction. To take advantage of these self-consistency constraints, we follow [10] and expand Eq (1c) to
(4a)
(4b)
(4c)
(4d)
where
is a real-valued baseline parameter, and fMF, fMM, fFF are three real-valued, random functions of two-dimensional continuous inputs on the compact domain [0, 84] × [0, 84]. Specifically, we model the fgh through computationally efficient Gaussian process approximations as described below, and for ease of notation, write fgh(a, b). The random functions act as age-age-specific offsets to the baseline parameter and thus capture the age structure in human contact intensities. Using random functions, we can estimate arbitrary age-specific contact patterns. This is important because human contact patterns have changed substantially since the COVID-19 pandemic with school closures and other non-pharmaceutical interventions.
Recovering fine age structure from coarse data
For COVIMOD and similar contact surveys, participants were asked to report their contacts in the coarse age groups
(5)
Importantly, the exact age of the participant is known, and we can leverage this information through the symmetry property in Eq (3) to estimate contact intensities at a much finer resolution. Because of this fundamental property, we call our resulting model the “Bayesian rate consistency model”. Using the shape-scale parameterisation in Eq (1), it follows that
(6a)
(6b)
(6c)
where g, h ∈ {M, F},
,
, and
. We will demonstrate below that the high-resolution contact intensities
are identifiable from coarse contact data.
Estimating dynamics in contact patterns
It is in principle, straightforward to extend model (1–6) to capture time trends in contact patterns, but a particular challenge arises in the context of repeat surveillance. Across COVIMOD survey waves, many participants agreed to report data in multiple rounds, and analyses indicate that participants tend to repeat fewer contacts in subsequent surveys, a phenomenon called reporting fatigue. The time trends in the primary data are thus confounded by longitudinal reporting behaviour. To control for reporting fatigue, we denote by the number of contacts to individuals of age group c and gender h that are reported in survey wave t by all participants of age a and gender g who have participated r time(s). All other notation extends analogously.
In the simplest case, we introduce age-homogeneous reporting fatigue effects at repeat response times r = 0, 1, 2, …, and jointly model the longitudinal data with
(7a)
(7b)
(7c)
where t = 1, 2, … indicates the survey waves, r = 0, 1, 2, … repeat surveillance, and ρ0 = 0 for ease of notation. Here, reporting fatigue is captured by negative ρr (that decreases with r), which in turn will adjust the contact intensities
in follow-up survey rounds to higher estimates than the primary data suggest. New participants entered the COVIMOD survey in each survey wave, so we have data to provide independent information on the contact dynamics and reporting fatigue, with the model borrowing strength across all the data available.
A second challenge is that the number of missing & aggregate contact reports also fluctuated over time. As denominator in our sample, we now consider all participants of age a and gender g regardless of whether they reported individual contacts or missing & aggregate contacts and denote their total in wave t with repeat participation r by as in (7c). We further denote the number of missing & aggregate contact reports by participants of age a and gender g in survey wave t by
. For instance, if a participant in wave t of age a and gender g had 2 contacts with missing age or information and reported 18 aggregate contacts, then we added 20 to
. In addition, consider the number of total contacts with detailed age information by participants of age a and gender g in survey wave t by
. Thus, we can calculate the proportion of contacts that are reported with detailed age information,
(8)
Assuming that the age of contacts is missing at random within each participant group of gender g, age a and wave t, we then use (8) as an additional offset term in the linear predictor,
(9)
In practice, if increasingly many participants only provide aggregated contact reports, we have that , and in turn, this will adjust the contact intensities
in later survey waves to higher estimates than the data with full age-specific details suggest.
Non-parametric modelling of contact dynamics
We regularise our inferences in high-dimensional parameter space by associating the random functions ,
,
in Eq (4) with computationally efficient, zero-mean, two-dimensional Hilbert Space Gaussian Process priors [24, 25]. In what follows, we will drop the time and gender sub- and superscripts to ease notation and present our modelling of a generic random function f that represents age structure in contact patterns. Zero-mean two-dimensional Gaussian Processes (GPs) are powerful prior models for random functions. For any finite collection of two-dimensional inputs, the function values are multivariate normal with mean zero. We always have AB count observations on the grid x1 = (a1, b1), …, xAB = (aA, bB) defined by A male and female participant age groups in
and all possible B population age groups in
. The multivariate normal has then a covariance matrix
whose i, jth entries are specified by a covariance kernel function k(xi, xj). Here, we decompose the 2D kernel function for computational efficiency and model each component through squared exponential or Matérn class kernels. Specifically, using the squared exponential kernels as an example, we have
(10a)
(10b)
(10c)
where the scaling parameters αa, αb control the magnitude of the random function in the corresponding dimension, and the lengthscale parameters la, lb control the bandwidth. The product in Eq (10a) is also known as Kronecker decomposition because the covariance matrix K equals the Kronecker product of the covariance matrices of the kernels with one-dimensional inputs, K = K2 ⊗ K1, where the i, jth entry in
is given by k1(ai, aj) and the i, jth entry in
is given by k2(bi, bj). For computing purposes, we exploit that K1, K2 are positive semi-definite and decompose the covariance matrices as K1 = L1L1⊤, K2 = L2L2⊤, where the superscript ⊤ denotes transposition. Using the mixed product property of Kronecker operations, we obtain
(11)
This shows that the zero-mean two-dimensional GP prior attached to the random function f on the AB inputs x = (x1, …, xAB) can be obtained by linear transformation of AB i. i. d. standard Gaussian random variables ,
(12)
In Eq (12), the left-hand side denotes the AB-dimensional column vector of the random function evaluated at the inputs, and the right-hand side shows how the Kronecker product is calculated by a series of basic arithmetic operations. The reshape operation transforms the AB dimensional column vector z column-wise into a A × B dimensional matrix, and the vec operation flattens A × B dimensional matrices column-wise into an AB dimensional column vector.
Eq (12) also shows that the computational cost of two-dimensional GPs is entirely determined by calculating, first, L1, L2 for each new set of GPs hyperparameters, and then, second, performing the arithmetic operations associated with (L2 ⊗ L1)z. We use Hilbert Space Gaussian Process (HSGP) approximations [16] to each of the kernels k1 and k2 in Eq (10) to reduce the computational cost associated with the first step. For brevity, we refer readers to the excellent introductions to HSGPs in [24, 25], and here merely note that the stationary isotropic kernels can be expressed as an infinite sum that involves the spectral density S, eigenfunctions ϕi and eigenvalues λi, i = 1, …, ∞, associated with a certain Laplacian eigenvalue problem on a compact domain Ω that is strictly larger than . For convenience, the input domain
is shifted with the midpoint at zero, and then Ω is written as [−L, L] for some L > 0. To ease notation, we continue to write the shifted inputs in what follows as ai, and bi. The HSGP approximation
to k1 on the domain [−L1, L1] is then obtained by truncating the infinite sum to the first M1 terms,
(13a)
(13b)
(13c)
(13d)
Crucially, the GP hyperparameters αa, la enter only in Eq (13b), and the eigenvalues and eigenfunctions are the same regardless of the GP hyperparameters and depend only on the domain boundary value L1 together with the observed inputs a, a′. This speeds up Bayesian computations significantly because the eigenvalues in Eq (13c) and eigenfunctions in Eq (13d) can be precomputed once and for all. For Matérn and other kernels, the corresponding spectral densities may be found in [26]. Rewriting Eq (13a) in matrix notation, we see that L1 is approximated by
(14)
where the A × M1 matrix Φ1 has the i, j entries
, and the M1 × M1 matrix Δ1 is diagonal with j, j entries
. Again, Φ1 does not depend on the GP hyperparameters and can be precomputed. The arithmetic operations in Eq (14) can harness computationally efficient diagonal-post-multiply functions in many linear algebra libraries. The HSGP approximation to the k2 kernel is analogous. The tuning parameters of the HSGP approximations are the integers M1, M2 and the boundary values L1, L2, and we determine these using established diagnostics [25]. The zero-mean Kronecker-decomposed HSGP prior associated with our random functions f on the input grid x is then
(15)
where
is a M1M2 dimensional column vector of i. i. d. standard normal random variables, and the non-negative hyperparameters are θ = (αa, la, αb, lb).
Difference-in-age parameterisation
Human contact patterns tend to concentrate among individuals of similar age and individuals with similar age gaps (parent-child, grandparent-child and grandparent-parent). To capture this diagonal structure in the simple Kronecker decomposed priors in Eq (10) for our 2D random functions f, we follow [22] and define f on an age by difference-in-age space rather than an age by age space. This amounts to rotating the age by age space by 45 degrees so that the peer-peer, parent-child, grandparent-child, and grandparent-parent contacts correspond to horizontal lines in the re-parameterised space and match the structure of our Kronecker decomposed priors (10) (see also Fig 1 of Vandendijck et al. [22]). Specifically, we consider age differences , and re-parameterise the points
to
. We are only interested in the random functions evaluated on the original points, which we write as f(d(a, b)) for all
. The number of age differences D = 169 in
is larger than the number of one-year age groups B = 85 and so the difference-in-age parameterisation entails higher computational cost in the calculations that underpin Eq (15). We will show in the results section that this parameterisation is crucial to obtaining accurate estimates of typical fine-age diagonal and off-diagonal human contact patterns from coarsely reported age data, and so recommend its use despite the higher computational cost.
Full Bayesian model and numerical inference
To complete our model for inferring contact dynamics from longitudinal survey data, we specified commonly used priors on all model parameters. The full model is specified for survey waves t = 1, …, 5, reporting repeats r = 0, …, 4, gender g, h ∈ {M, F}, participant age groups and population age groups
by
(16a)
(16b)
(16c)
(16d)
(16e)
(16f)
(16g)
and
(17a)
(17b)
(17c)
(17d)
(17e)
(17f)
(17g)
Monte Carlo draws from the joint posterior distribution of all parameters was obtained with the probabilistic computing language Stan [27] via the cmdstanr interface version 0.5.2. Eight chains were run in parallel for 500 warmup iterations and 1000 iterations thereafter. Initial sampling was facilitated by adding the nugget 10−13 to . We typically observed a small number of divergences in the NUTS algorithm, but these accounted for less than 0.005% of samples and were considered to be of no concern. The smallest effective sample size was 1892, and the
convergence diagnostics were below 1.01, indicating that the Markov chains converged and mixed well [28, 29]. Trace plots for the parameters with the smallest effective sample size and maximum
are shown in S8 Fig. The R and Stan codes for our models, along with other analysis codes are available at https://github.com/MLGlobalHealth/bayes-rate-consistency.
Simulated social contact data
To validate the Bayesian rate consistency model, we created synthetic datasets that mimic social contact patterns before the COVID-19 pandemic (pre-COVID-19) and during the pandemic (in-COVID-19). To reduce experiment run time, we limited participants and contact ages from 6 to 49 years and assumed that contact intensity patterns do not vary by gender.
Contact intensities were set to be highest among individuals of similar age, mimicking age-assortative contact behaviour. To simulate parent-child contact dynamics, we defined individuals between 6–18 as children and individuals between 30–39 as parents and set the contact intensities between parent-child groups to higher values. Similarly, individuals between 19–29 were defined as children of individuals from 40 to 49, and contact intensities between these groups were also set to a higher value. The resulting patterns are shown in the top left panel of Figs 2 and 3. For full details, we refer readers to S1 Text. From the stylised contact intensity scenarios, we next randomly generated age- and gender-specific contact counts for five different participant size configurations, N = 250, 500, 1000, 2000, 5000, by sampling from a Poisson distribution such that where
. We set
such that the age-gender counts of the participants were representative of the 2011 German census population [30]. To mimic the age reporting scheme in the COVIMOD surveys, we aggregated the simulated contact counts by
where c ∈ {6-9, 10-14, 15-19, 20-24, 25-34, 35-44, 45-49} as illustrated in the top right panels of Figs 2 and 3. In this fashion, we generated 10 replicate datasets for each experiment configuration (pre-COVID-19/in-COVID-19 and sample size) to obtain representative accuracy and runtime estimates.
(Top left) Simulated social contact intensities for Male-Male contacts. (Top right) Simulated social contact counts for Male-Male contacts with a COVIMOD-like age aggregation scheme. (Bottom left) Posterior median estimates of social contact intensities from the age-age parameterised HSGP model. (Bottom right) Posterior median estimates of social contact intensities from the difference-in-age parameterised HSGP model.
(Top left) Simulated social contact intensities for Male-Male contacts. (Top right) Simulated social contact counts for Male-Male contacts with a COVIMOD-like age aggregation scheme. (Bottom left) Posterior median estimates of social contact intensities from the age-age parameterised HSGP model. (Bottom right) Posterior median estimates of social contact intensities from the difference-in-age parameterised HSGP model.
Results
Contact patterns by 1-year age band can be estimated
Fig 2 illustrates the simulated contact patterns in the pre-COVID-19 scenario by 1-year age bands, along with the corresponding data by coarse 5 and 10-year age bands for a sample size of 2,000 participants, and the fits with the Bayesian rate consistency model that aim to recover the fine age structure by 1-year age bands. Fig 3 shows our results for the in-COVID-19 scenario. The age-age parameterisation performed poorly for both simulation scenarios, especially in regions of the contact matrix where the degree of age aggregation is large, i.e., 10-year age intervals as opposed to 5-year age intervals. For such large reporting intervals of age groups, the contact intensity patterns that we could estimate with the age-age parameterisation showed idiosyncratic bimodal patterns along the main diagonal of the contact intensity matrices (bottom left panels in Figs 2 and 3). In comparison, the difference-in-age parameterisation captured age-assortative contact patterns and the sub-diagonal parent-children contact patterns with much better accuracy. The estimated contact intensity patterns for other gender combinations and simulation scenarios were qualitatively very similar and reported in S4 and S5 Figs.
In several previous social contact surveys, participants were asked to report the exact age of their contacts. This allowed us to further test the Bayesian rate consistency model by artificially withholding the exact age of each contact in real-world data. Specifically, we tested how closely the reconstructed fine-age contact intensities inferred on the masked data with the age of contact masked into broad age bands match those inferred from the actual data with the exact age of contacts known. Fig 4 shows the outcomes of this test on data from 1,292 participants who reported 31,670 contacts in Germany as part of the POLYMOD study [1]. The contact intensities learned from the data with the exact age of contact and those learned from the data with age of contact masked into broad age bands were qualitatively very similar, and the mean absolute difference across each age-age gender-gender group was 0.145 in the context of a mean contact intensity of 0.120 in each age-age gender-gender group. We repeated this test on data from 1,122 participants from the Manicaland HIV/STD Prevention Study [31], a general population cohort study carried out since 1998 in Manicaland, the easternmost province of Zimbabwe. In 2013, participants were asked to report on their social contacts, which were either physical or non-physical, in two-way conversations with three or more words in the physical presence of another person, and the exact age was reported for 24,480 contacts. We chose these data for a second test because Manicaland, Zimbabwe’s population structure and social contact patterns differ markedly from those seen in Europe [1, 31]. Fig 5 shows our test outcomes on the Manicaland data. The Bayesian rate consistency model generated again very similar estimates on the data with the age of contact masked into broad age bands as compared to the actual data with the exact age of contact known, and the mean absolute difference across each age-age gender-gender group was 0.179 in the context of a mean contact intensity of 0.125 in each age-age gender-gender group.
(Top row) Crude empirical contact intensity patterns. (Middle row) Posterior median contact intensity estimates from the Bayesian rate consistency model applied to fine-age contact data meaning the age of contacts was not aggregated into large age bands. (Bottom row) Posterior median contact intensity estimates from the Bayesian rate consistency model applied to coarse-age contact data meaning the age of contacts was aggregated into larger age bands similarly to the COVIMOD study.
(Top row) Crude empirical contact intensity patterns. (Middle row) Posterior median contact intensity estimates from the Bayesian rate consistency model applied to fine-age contact data (i.e., the age of contacts is not aggregated into large age bands). (Bottom row) Posterior median contact intensity estimates from the Bayesian rate consistency model applied to coarse-age contact data (i.e., the age of contacts is aggregated into large age bands in a similar fashion to COVIMOD).
Gaussian process approximations enable fast Bayesian inference
In Table 1 and S6 and S7 Figs, we compare the performance of the Bayesian rate consistency model with various parameterisations for different sample sizes and scenarios in terms of estimation accuracy and computing runtimes. Specifically, to assess how well HSGP models can approximate full-rank 2DGP models, we ran simulations for both scenarios and parameterisations with a sample size fixed at 2000. We compared the model fits to the simulation truth in terms of the mean absolute error (MAE) based on the age-age-specific inferred contact intensities, expected log posterior density (ELPD), the percentage that the predicted values are inside the 95% prediction intervals according to posterior predictive check (PPC), and the median running time.
The difference-in-age parameterisation achieved significantly better accuracy than the age-age parameterisation but also required more time to fit due to the introduction of additional A2 − A nuisance parameters. The computational toll of the difference-in-age parameterisation was most strongly reflected in the median runtimes under the full-rank 2DGPs, which could take more than half a day to fit. Using HSGPs, we reduced median runtimes by more than 15-fold while maintaining an accuracy close to that of full-rank 2DGP models. The bottom section of Table 1 compares the accuracy of the difference-in-age HSGP models across survey sample sizes. In general, smaller sample sizes led to less accurate estimates.
Modelling marked structures in age-specific contact patterns
Social contacts are strongly structured by age, reflecting common behaviour and social norms around family size, reproductive age, schooling, and other factors [22]. In turn, smooth process kernels such as the squared exponential (Eq (10b)) may not be well suited to describe marked changes in contact intensities. On our simulated contact scenarios, we find indeed that Matérn and Matérn
kernels performed better in comparison to squared exponential kernels in terms of accuracy (S1 Table). The difference in accuracy between Matérn
and Matérn
was small and qualitatively indistinguishable (S4 and S5 Figs), and in the following results, we considered the Matérn
kernel.
Model-based estimates of contact patterns in Germany by 1-year age groups
Fig 6 shows the crude empirical age- and gender-specific contact intensities for the first wave of the COVIMOD survey in early May 2020, calculated from the data without any statistical modelling via
where
denote the total number of contacts from participants of gender g and age group a to individuals of gender h and age category c,
are age- and gender-specific sample size, and
are age- and gender-specific proportion of reports with complete age and gender information. Fig 6 also shows the contact intensity estimates obtained via bootstrapping with the socialmixr package [11], and the estimates from the Bayesian rate consistency model. We found that the crude estimates are sparse and fluctuate greatly, even between neighbouring age groups. Next, the socialmixr estimates aggregate the observations by the large age categories in which the contacts were reported and do not borrow available information through the exact age of participants to obtain higher resolution estimates. The socialmixr estimates are adjusted for symmetry in contact rates but are not adjusted for reporting fatigue or missing & aggregate contact reports (see respectively Eqs (7) and (9). We also observed that contacts with missing age information are imputed by sampling the missing age only from all contacts of the participants of the same age group [11].
(Top row) Crude empirical social contact intensity patterns. There are some age groups with no participants, and they are represented by white vertical columns. (Middle row) Contact intensity patterns as estimated by the socialmixr R package [11]. (Bottom row) Posterior median contact intensity estimates from the Bayesian rate consistency model. For the crude estimates and socialmixr estimates, estimates are divided by the number of integer ages in the corresponding age group to make the estimates more comparable (e.g., if the age of contacts is 20 − 24 than the contact estimate is divided by 5 while if the age of contacts is 25 − 34 the contact estimate is divided by 10). The exact runtime arguments for this comparison are given in script figure-6.R on our accompanying GitHub repository.
In comparison, the contact patterns estimated with the Bayesian rate consistency model align with those estimated with socialmixr, but provide finer age resolution. We achieve this higher resolution by logical constraints on who contacts whom in a closed population (recall Eq (1)), and not via imputation. These constraints imply that data on the exact age of survey participants provide information on the exact age of contacts even though they are reported in coarse age brackets. The patterns reveal strong age-assortativeness in mixing patterns that is indicated by the high intensities on the main diagonals of the contact intensity matrices shown in the bottom row of Fig 6. Lying approximately 30 years away from the main diagonal, two strips of high contact intensity fade with increasing age and correspond to inter-generational contacts between parents and children. While this age-dependent pattern persists over time (S9 and S12 Figs), we will also show below that the increases in social contact intensities in subsequent waves were far from uniform across age.
Controlling for time-varying reporting effects
We can sum the contacts’ age dimension of estimated contact intensities to obtain the average number of contacts from one person of age a per day. For brevity, we call these “marginal” contact intensities. In Fig 7, we show the estimated marginal contact intensities under the Bayesian rate consistency model, which simultaneously accounts for the time-varying reporting effects that emerge through reporting fatigue (repeat participation in the longitudinal COVIMOD survey) and missing & aggregate contact reports (participants unable to list contacts individually), and which are clearly present in the data as shown in Figs 1 and S1. We compare these (adjusted) marginal contact intensities in Fig 7 to those obtained without adjusting for missing & aggregate contact reports (but adjusting for reporting fatigue), those obtained without adjusting for reporting fatigue (but adjusting for missing & aggregate contact reports), and those obtained without any adjustments for missing & aggregate contact reports or reporting fatigue. Furthermore, we compared the marginal contact intensities of wave 4 given by the final longitudinal model, which adjusts for reporting fatigue with a cross-sectional model fitted on data from the same wave but with repeating participants excluded from the data (S13 Fig). We found that the reporting effect sizes can be estimated simultaneously with all other age, gender and time parameters, meaning that they are mathematically consistent with the symmetry constraints in contact rates in closed populations. The results clearly show that adjusting for missing & aggregate contact reports and reporting fatigue significantly increased the estimated marginal contact intensities in a non-trivial manner that depends on the contribution of repeat survey participants to each survey wave. It is implausible that the actual contact intensities did not increase as contact reduction measures were progressively eased between wave 1 to wave 5 of the COVIMOD survey, as is suggested by the estimates without adjustments. This led us to conclude that longitudinal contact patterns must be estimated from the available data with a model-based framework that can adjust for fatigue effects and missingness.
Posterior median estimates (line) and 95% credible intervals (ribbon) of the number of social contacts in the last 24 hours are shown by age and sex of participants and survey wave in red under the Bayesian rate consistency model that adjusts for both missing & aggregate contact reports and reporting fatigue. For comparison, posterior median estimates are shown for the model incorporating no adjustments (long dashed blue), adjustments for reporting fatigue but not missing & aggregated contact reports (green dashed lines), and adjustments for missing & aggregated contact reports but not reporting fatigue (turquoise small dashed lines).
Social contact intensities from May to July 2020 in Germany remained largely below pre-pandemic levels
With these adjustments in place, we next compared the contact intensities seen during the first 5 waves of the COVIMOD study to those observed in the pre-pandemic POLYMOD study conducted between 2006 and 2008 [1] (Figs 8 and S14). Social contact intensities in the first 5 waves remained substantially below pre-pandemic levels, which illustrates the stark impact that the COVID-19 pandemic and non-pharmaceutical interventions had on social interactions when protective vaccines were unavailable, and mortality following COVID-19 infection was high [32]. Interestingly, the POLYMOD data also suggest that women between ages 20 and 50 had more contacts than men of the same age range before the pandemic (Fig 8). Another point that emerges from the fine-age analysis is that by wave 5, the marginal contact intensities of individuals aged 70 were very similar to pre-pandemic levels. This could reflect that the relatively few social contacts of older individuals aged 70 and above are essential human interactions that are challenging to reduce.
Solid lines represent posterior median estimates of the marginal contact intensities by age and gender of participants, and after adjusting for missing & aggregate contacts and reporting fatigue for the COVIMOD survey. Dashed lines represent posterior median estimates from the pre-pandemic POLYMOD study and, unlike the estimates reported in [1], also include missing & aggregate contacts. Shaded ribbons represent 95% credible intervals.
Non-homogeneous increases in contact intensities from May to July 2020 in Germany
Fig 7 shows that the contact intensities increased consecutively from wave 1 to wave 5, with the increases being most substantial from wave 1 to wave 2. These increases are primarily due to marked increases in the number of contacts that were reported in aggregate over the first five survey waves (S1–S3 Figs). Fig 9 illustrates the relative percentage increase in the marginal contact intensities relative to those in wave 1. Although there were marked differences in the contact patterns between men and women (Fig 7), the relative increases showed no significant differences between the two genders, which suggests that the gender differences in contact patterns may arise from underlying gender-dependent social contact structures that are captured in the survey, rather than non-pharmaceutical interventions. Comparing waves 2 and 3 to wave 1, contact intensities tended to increase more in older age groups. For wave 4, we observe a sharp increase in contacts among men approximately 20 years of age. However, we find that this pattern is sensitive to data pre-processing criteria, as we explain below. Comparing wave 5 to wave 1, we found the strongest evidence that contact intensities did not increase homogeneously for all age groups over time and instead tended to increase more in older age groups.
The red and blue lines represent posterior median estimates of the relative percentage change in marginal contact intensity estimates in women and men by the age of participants. Shaded ribbons represent 95% credible intervals.
Differential rebound of age-specific social contacts
We next focus on characterising the dynamics in contact intensities for specific age groups. Intuitively, this corresponds to slicing the contact intensity matrices reported in Fig 6 across rows for a fixed column, and for brevity, we call these the “conditional” contact intensities. In Fig 10 (top), we illustrate the conditional contact intensities for individuals aged 10, 20, 35, and 70 years to represent the contact intensities of school children, young adults, the working, and the ageing population. We observe two peaks for participants aged 10 and 20: a larger sharp peak corresponding to contacts between peers and a shorter rounded peak with individuals approximately 45–50 years older, representing contacts with their parents. For participants aged 35, we observe an additional third peak with individuals aged 60 to 70, predominantly corresponding to contacts with their parents. Participants aged 70 generally mixed with individuals of a similar age, with some contact between individuals aged approximately 40 but almost no contact with individuals under 20. These core patterns of social contacts are present across all survey waves.
The posterior median estimates for conditional contact intensities (top row) and relative change in the conditional contact intensities in waves 2 to 5 relative to those in wave 1 (bottom row) for individuals aged 10, 20, 35 and 70, respectively. Conditional contact intensities were aggregated across men and women. The colours represent different COVIMOD survey waves, and the shaded ribbons represent 95% credible intervals. We only show credible intervals for wave 5 to reduce overlaps and ease interpretation.
Fig 10 (bottom) shows the ratio of the conditional contact intensities in survey waves 2, 3, 4 and 5 relative to those in wave 1. Ratios above 1 thus indicate increases in social contacts. We find that the increases in social contacts were not homogeneous by the age of contacted individuals. Focusing on wave 5 relative to wave 1, we find that for children aged 10, the conditional contact intensities in individuals of the same age and roughly aged 70 rose particularly strongly. For individuals aged 20, increases in their social contacts were more homogeneous, except for those with young children, which remained similar to those seen in wave 1. For individuals aged 35, increases in their social contacts were concentrated in slighter younger and all older individuals, but not their peers, reflecting that individuals in the 35-year age group retained social contact with their peers during intense non-pharmaceutical interventions in wave 1. Individuals aged 70 homogeneously increased their conditional contact intensities with younger individuals. These findings suggest that the age patterns in social contacts changed in a structured and complex manner after the first COVID-19 wave in Germany.
Discussion
We developed the Bayesian rate consistency model in order to regain the ability to quantify and characterise social contact patterns at high age resolution from contemporary, longitudinal survey data on social contacts. The main contributions of our model-based approach to estimating the trends in human contact patterns are as follows. The Bayesian rate consistency model can reasonably accurately estimate high-resolution social contact patterns from data that aggregate the age of contacts into large age bands. The model also enables adjusting for the confounding effects due to the aggregated reporting of contacts and reporting fatigue in longitudinal social contact surveys. These advancements are particularly relevant to COVID-era social contact studies for which the age of contacts is reported in coarse age brackets and to which participants contributed to multiple survey waves [7, 13–15]. We draw from methodologies which serve as principal workhorses in spatial statistics to map the landscape of social contacts [33, 34] and incorporate a recently developed GP approximation technique to alleviate the computational bottleneck which often plagues such spatial models [16, 25].
In applying our model to the first five waves of the COVIMOD study, we gained insights into the age- and gender-specific social contact dynamics and their evolution over time at a resolution of 1-year age bands. We found relatively limited rebounds in contact intensities after lifting contact reduction measures and that social contact intensities remained substantially below pre-pandemic values until the end of our study period in July 2020 (Fig 7). These results are consistent with social contact survey data in England [13], Belgium [14], the Netherlands [15], and elsewhere. Relative to pre-pandemic estimates, the largest contact reductions occurred among children and young adults (Fig 7). Still, reductions were substantial across all age groups, indicating the strong impact of non-pharmaceutical interventions on social interactions and successfully reduced infection risk during the first months of the COVID-19 pandemic [32].
Over the first five COVIMOD survey waves between May and July 2020, we observed structured changes in contact patterns, with social contact intensities rebounding more rapidly with the increasing age of individuals (Fig 7). These increases in social contacts were also far from uniform across population age bands of contacted individuals. Most increases tended to centre on the off-diagonals of the age-age social contact intensity matrix (Fig 6). These findings are hard to obtain without detailed inferences of social contact patterns by 1-year age bands [13–15]. We also noted that despite a continued increase in contact intensity, COVID-19 cases remained stable during the analysis period (Fig 1). This may be because contact counts remained much lower than pre-pandemic estimates despite the ease in non-pharmaceutical interventions [9]. Other protective measures such as face masks, hygiene regulations including surface disinfection, remote work, and warmer summer weather may also have contributed to keeping infections at bay [5]. These observations demonstrate that longitudinal data collection on social contacts and tailored tools to analyse these data are essential to characterise the impact of non-pharmaceutical interventions and the possible social contact routes through which pathogens can spread because the observed dynamics in social contacts do not follow easily predictable patterns.
Our work is not without limitations. First, some issues arise from the sampling methodology of the COVIMOD survey. Participants were recruited from an online panel for market research with email invitations, and although quota sampling was performed, the final samples were not fully representative of the population [9]. Previous work proposed using post-stratification weights to re-scale the data, but a sensitivity analysis did not reveal large differences between weighted and unweighted estimates [9]. Participants consisted only of those with internet access who possibly adhered more to social distancing rules as such a demographic is more likely to respond to health surveys. Participants received guidance only through the text within the questionnaires, which may have been misinterpreted, and participants may report more contacts in paper-based surveys than in an online survey [35]. Additionally, we truncated aggregate contact reports at 60, but different thresholds may lead to slight changes in the inference results (S1 Fig). Next, the difference-in-age parameterisation may not be appropriate if social contacts do not follow the pattern where high intensities lie on the contact matrix’s main diagonal and sub-diagonals. This is relevant when investigators wish to conduct analyses for other contexts, e.g., work and transport, where contact patterns may not depend on age and age difference. However, it is easy for investigators to revert to the classical age-age parameterisation if they deem it more appropriate. We provide several template Stan model files in the accompanying GitHub repository. Furthermore, our fully Bayesian modelling framework is currently limited to analysing approximately 10 longitudinal survey rounds. While the recently proposed Hilbert Space Gaussian Process priors enable fast Bayesian inferences on cross-sectional data [16, 24, 25], additional research is needed to scale up the approach to survey data from 30 waves or more. Potential avenues include HSGP modelling over the time domain within our approach, Integrated Nested Laplace Approximations, penalised spline-based regressions or even variational autoencoders [2, 10, 36, 37]. Finally, and most importantly, the Bayesian rate consistency model requires participant age information to be reported by 1-year age bands. The exact age of participants is usually recorded without error but not necessarily made publicly available [7].
Conclusion
In summary, we provide a new statistical method to strengthen the global pandemic preparedness toolkit, which enables epidemiologists and policymakers to obtain a clearer picture of how infectious respiratory pathogens such as SARS-CoV-2 can spread through populations in near real-time. The Bayesian rate consistency model can analyse contemporary, longitudinal social contact survey data and estimate contact intensity by 1-year age bands. We validated the model on simulated social contact data for different scenarios and real-world data from Europe and Africa where contacts have been recorded in 1-year age bands, and any inference errors made by artificially coarsening the data can be assessed. The outputs—contact intensity matrices by gender and 1-year age band with uncertainty quantification—are central to estimating the effective reproduction number of pathogens in real-time with greater precision than currently possible from contact estimates by coarse age bands [6, 10, 13]. The outputs are also crucial for parameterising infectious disease models and more accurately forecasting cases, hospitalisations and deaths [32, 38, 39], and understanding the drivers of disease spread [38, 40]. Contact estimates from the first waves of the COVIMOD and CoMix social contact surveys [7] may be of particular interest as they represent the patterns during a time of stay-at-home orders across Europe and subsequent relaxation of non-pharmaceutical interventions, and could be used as templates of social contact patterns in future pandemic emergencies when such data are not immediately available. Vaccine allocation strategies also depend critically on understanding how pathogens spread through structured populations. Optimal allocations count the direct benefits of allocating life-saving vaccines to individuals with the highest fatality risk and the indirect benefits that accrue through prioritised allocations to population groups that would otherwise drive spread disproportionately through their contact and mixing patterns [41, 42]. We thus find the Bayesian rate consistency model promises to aid the understanding of contact behaviour, more realistic parameterisations of infectious disease models, and a deeper understanding of how infectious respiratory diseases are propagated through populations.
Supporting information
S1 Fig. The number of complete and aggregated contact reports by age, gender, and COVIMOD wave.
Pink bars represent missing & aggregate contacts, light blue bars represent non-household contacts, and dark blue bars represent household contacts. Missing & aggregate contacts were truncated at 60 (90th percentile in the primary data) to remove the effects of extreme outliers.
https://doi.org/10.1371/journal.pcbi.1011191.s001
(JPEG)
S2 Fig. Contribution of individually reported contacts with the known age of contact among all contacts reported.
Dark blue bars represent individually reported contacts for whom the age of the contact was specified by age bands. Pink bars represent reported contacts that were either reported in aggregate or individually without detail on the age or gender of the contact.
https://doi.org/10.1371/journal.pcbi.1011191.s002
(JPEG)
S3 Fig. The Average number of contacts for first-time and all participants by wave and contact type.
Red, light blue, green, and dark blue points represent the average number of household, non-household excluding aggregated & missing contacts, non-household including aggregated & missing contacts, and all contacts, respectively. Error bars represent 95% bootstrap confidence intervals.
https://doi.org/10.1371/journal.pcbi.1011191.s003
(JPEG)
S4 Fig. Simulation experiment results for the pre-COVID-19 scenario with different covariance kernels.
From top to bottom: results for the squared exponential kernel, results for the Matérn kernel, and results for the Matérn
kernel. All experiments were run with HSGP using the difference-in-age parameterisation models with M1 = 40 (Number of eigenfunctions on the difference-in-age dimension) and M2 = 20 (Number of eigenfunctions on the contacts’ age dimension). The sample size was fixed at N = 2000.
https://doi.org/10.1371/journal.pcbi.1011191.s004
(JPEG)
S5 Fig. Simulation experiment results for the in-COVID19 scenario with different covariance kernels.
From top to bottom: results for the squared exponential kernel, results for the Matérn kernel, and results for the Matérn
kernel. All experiments were run with HSGP using the difference-in-age parameterisation models with M1 = 40 (Number of eigenfunctions on the difference-in-age dimension) and M2 = 20 (Number of eigenfunctions on the contacts’ age dimension). The sample size was fixed at N = 2000.
https://doi.org/10.1371/journal.pcbi.1011191.s005
(JPEG)
S6 Fig. Simulated data, estimated contact intensity matrix, and estimated marginal contact intensities of different sample sizes for the pre-COVID-19 scenario.
(Top row) Simulated contact data for the pre-COVID-19 scenario. (Middle row) Posterior median contact intensity estimates. (Bottom row) Posterior median marginal contact intensity estimates. Results were obtained using the Bayesian rate consistency model with difference-in-age parameterisation with M1 = 30 and M2 = 20.
https://doi.org/10.1371/journal.pcbi.1011191.s006
(PNG)
S7 Fig. Simulated data, estimated contact intensity matrix, and estimated marginal contact intensities of different sample sizes for the in-COVID-19 scenario.
(Top row) Simulated contact data for the in-COVID-19 scenario. (Middle row) Posterior median contact intensity estimates. (Bottom row) Posterior median marginal contact intensity estimates. Results were obtained using the Bayesian rate consistency model with difference-in-age parameterisation with M1 = 30 and M2 = 20.
https://doi.org/10.1371/journal.pcbi.1011191.s007
(PNG)
S8 Fig. Trace plots of the parameters with the smallest effective sample size and maximum
convergence diagnostic statistic.
(Top row) Trace plot for parameter with the lowest effective sample size (1892), gp_rho_2[5, 2]: the length-scale parameter of the Gaussian process over the contact age dimension of the Female-Female contact matrix for wave 5. (Bottom row) Trace plot for the parameter with the largest convergence diagnostic statistic (1.01), z[1, 21, 6]: the coefficient for the 21st HSGP basis function over the difference-in-age dimension of the Female-Female contact matrix for wave 1.
https://doi.org/10.1371/journal.pcbi.1011191.s008
(PNG)
S9 Fig. Empirical and estimated social contact intensity patterns for COVIMOD wave 2.
(Top row) Crude empirical social contact intensity patterns, with crude contact intensities above a value of 3 truncated for visualisation purposes. There are some age groups with no participants, and they are represented by white vertical columns. (Middle row) Contact intensity patterns as estimated by the socialmixr R package [11]. (Bottom row) Contact intensity patterns are given by our Bayesian model.
https://doi.org/10.1371/journal.pcbi.1011191.s009
(JPEG)
S10 Fig. Empirical and estimated social contact intensity patterns for COVIMOD wave 3.
(Top row) Crude empirical social contact intensity patterns, with crude contact intensities above a value of 3 truncated for visualisation purposes. There are some age groups with no participants, and they are represented by white vertical columns. (Middle row) Contact intensity patterns as estimated by the socialmixr R package [11]. (Bottom row) Contact intensity patterns are given by our Bayesian model.
https://doi.org/10.1371/journal.pcbi.1011191.s010
(JPEG)
S11 Fig. Empirical and estimated social contact intensity patterns for COVIMOD wave 4.
(Top row) Crude empirical social contact intensity patterns, with crude contact intensities above a value of 3 truncated for visualisation purposes. There are some age groups with no participants, and they are represented by white vertical columns. (Middle row) Contact intensity patterns as estimated by the socialmixr R package [11]. (Bottom row) Contact intensity patterns are given by our Bayesian model.
https://doi.org/10.1371/journal.pcbi.1011191.s011
(JPEG)
S12 Fig. Empirical and estimated social contact intensity patterns for COVIMOD wave 5.
(Top row) Crude empirical social contact intensity patterns, with crude contact intensities above a value of 3 truncated for visualisation purposes. There are some age groups with no participants, and they are represented by white vertical columns. (Middle row) Contact intensity patterns as estimated by the socialmixr R package [11]. (Bottom row) Contact intensity patterns are given by our Bayesian model.
https://doi.org/10.1371/journal.pcbi.1011191.s012
(JPEG)
S13 Fig. Comparison of marginal contact intensity estimates: Final longitudinal model adjusting for reporting fatigue v.s. a cross-sectional model with repeating participants excluded from the data.
The red lines and ribbons show the posterior median estimates and 95% credible interval from the final longitudinal model on the entire COVIMOD dataset, adjusting for reporting fatigue. The green lines and ribbons display the estimated median and 95% credible interval from a cross-sectional model fitted on COVIMOD wave 4 data, excluding participants who answered the survey in previous waves.
https://doi.org/10.1371/journal.pcbi.1011191.s013
(JPEG)
S14 Fig. Ratio of female to male marginal contact intensities for POLYMOD and COVIMOD.
Lines represent posterior median estimates of the female-to-male marginal contact intensity ratios, i.e., . A ratio of 1 (dashed lines) indicates no difference in contact intensities between genders. Shaded ribbons represent 95% credible intervals.
https://doi.org/10.1371/journal.pcbi.1011191.s014
(JPEG)
S1 Text. The construction of simulated social contact patterns.
A detailed description of how the contact intensity patterns used in the simulation experiments are generated.
https://doi.org/10.1371/journal.pcbi.1011191.s015
(PDF)
S1 Table. Comparison of different covariance kernels and number of basis functions for HSGP models.
Results were obtained with models using the difference-in-age parameterisation. The sample size was fixed at N = 2000 throughout. M1: The number of HSGP basis functions on the difference-in-age dimension. M2: The number of HSGP basis functions on the contacts’ age dimension. aMean absolute error, bExpected log posterior density, cPosterior predictive check, dMedian runtime, epre-COVID19, fin-COVID19 scenario.
https://doi.org/10.1371/journal.pcbi.1011191.s016
(PDF)
Acknowledgments
The authors acknowledge Christopher Jarvis, Kevin Van Zandvoort, Amy Gimma, John Edmunds, and the entire CoMix team for allowing the COVIMOD team to use an adapted version of the CoMix questionnaire for COVIMOD and for their cooperation. The authors also thank the team at IPSOS-Mori for their work on implementing the COVIMOD survey.
References
- 1. Mossong J, Hens N, Jit M, Beutels P, Auranen K, Mikolajczyk R, et al. Social Contacts and Mixing Patterns Relevant to the Spread of Infectious Diseases. PLOS Medicine. 2008;5(3):e74. pmid:18366252
- 2. Goeyvaerts N, Hens N, Ogunjimi B, Aerts M, Shkedy Z, Damme PV, et al. Estimating infectious disease parameters from data on social contacts and serological status. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2010;59(2):255–277.
- 3. Eichner M, Schwehm M, Hain J, Uphoff H, Salzberger B, Knuf M, et al. 4Flu—an individual based simulation tool to study the effects of quadrivalent vaccination on seasonal influenza in Germany. BMC infectious diseases. 2014;14:365. pmid:24993051
- 4. Schmidt-Ott R, Schwehm M, Eichner M. Influence of social contact patterns and demographic factors on influenza simulation results. BMC Infectious Diseases. 2016;16(1):1–11. pmid:27821137
- 5. Leung NHL. Transmissibility and transmission of respiratory viruses. Nature Reviews Microbiology. 2021;19(8):528–545. pmid:33753932
- 6. Wallinga J, van Boven M, Lipsitch M. Optimizing infectious disease interventions during an emerging epidemic. Proceedings of the National Academy of Sciences. 2010;107(2):923–928. pmid:20080777
- 7. Verelst F, Hermans L, Vercruysse S, Gimma A, Coletti P, Backer JA, et al. SOCRATES-CoMix: a platform for timely and open-source contact mixing data during and in between COVID-19 surges and interventions in over 20 European countries. BMC Medicine. 2021;19(1):254. pmid:34583683
- 8. Feehan DM, Mahmud AS. Quantifying population contact patterns in the United States during the COVID-19 pandemic. Nature Communications. 2021;12(1):893. pmid:33563992
- 9. Tomori DV, Rübsamen N, Berger T, Scholz S, Walde J, Wittenberg I, et al. Individual social contact data and population mobility data as early markers of SARS-CoV-2 transmission dynamics during the first wave in Germany—an analysis based on the COVIMOD study. BMC Medicine. 2021;19(1):271. pmid:34649541
- 10. van de Kassteele J, van Eijkeren J, Wallinga J. Efficient estimation of age-specific social contact rates between men and women. The Annals of Applied Statistics. 2017;11(1):320–339.
- 11.
Funk S, Dunbar MBN, Pearson CAB, Clifford S, Jarvis C, Robert A. socialmixr: Social Mixing Matrices for Infectious Disease Modelling; 2020. Available from: https://CRAN.R-project.org/package=socialmixr.
- 12. Farrington CP, Whitaker HJ. Contact Surface Models for Infectious Diseases. Journal of the American Statistical Association. 2005;100(470):370–379.
- 13. Gimma A, Munday JD, Wong KLM, Coletti P, Zandvoort Kv, Prem K, et al. Changes in social contacts in England during the COVID-19 pandemic between March 2020 and March 2021 as measured by the CoMix survey: A repeated cross-sectional study. PLOS Medicine. 2022;19(3). pmid:35231023
- 14. Coletti P, Wambua J, Gimma A, Willem L, Vercruysse S, Vanhoutte B, et al. CoMix: comparing mixing patterns in the Belgian population during and after lockdown. Scientific Reports. 2020;10(1):21885. pmid:33318521
- 15. Backer JA, Mollema L, Vos ER, Klinkenberg D, Klis FRvd, Melker HEd, et al. Impact of physical distancing measures against COVID-19 on contacts and mixing patterns: repeated cross-sectional surveys, the Netherlands, 2016–17, April 2020 and June 2020. Eurosurveillance. 2021;26(8):2000994. pmid:33632374
- 16. Solin A, Särkkä S. Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing. 2020;30(2):419–446.
- 17.
Ipsos iSay [Internet]. Ipsos; c2022 [cited 2022 Oct 17]. Available from: https://www.ipsosisay.com.
- 18. Jarvis CI, Van Zandvoort K, Gimma A, Prem K, Auzenbergs M, O’Reilly K, et al. Quantifying the impact of physical distance measures on the transmission of COVID-19 in the UK. BMC Medicine. 2020;18(1):124. pmid:32375776
- 19. Hale T, Angrist N, Goldszmidt R, et al. A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker). Nature Humam Behaviour. 2021;5(4):529–538 pmid:33686204
- 20.
ACAPS. COVID-19 Government Measures Dataset; 2021 [cited 2022 Oct 17]. Available from: https://www.acaps.org/covid-19-government-measures-dataset
- 21. Hens N, Goeyvaerts N, Aerts M, Shkedy Z, Van Damme P, Beutels P. Mining social mixing patterns for infectious disease models based on a two-day population survey in Belgium. BMC Infectious Diseases. 2009;9(1):1–18. pmid:19154612
- 22.
Vandendijck Y, Gressani O, Faes C, Camarda CG, Hens N. Cohort-based smoothing methods for age-specific contact rates. BioRxiv [Preprint]. 2022 bioRxiv 290551 [posted 2022 April 11; cited 2022 Oct 17]. Available from: https://www.biorxiv.org/content/10.1101/290551v2.
- 23. Wallinga J, Teunis P, Kretzschmar M. Using Data on Social Contacts to Estimate Age-specific Transmission Parameters for Respiratory-spread Infectious Agents. American Journal of Epidemiology. 2006;164(10):936–944. pmid:16968863
- 24.
Xi X, Spencer S, Hall M, Grabowski K, Kagaayi J, Ratmann O. Inferring the sources of HIV infection in Africa from deep sequence data with semi-parametric Bayesian Poisson flow models. arXiv:2110.12273 [stat.AP]. 2022 [posted 2021 Oct 23; revised 2021 Oct 29; revised 2021 Dec 18; revised 2022 Jan 5; cited 2022 Oct 17]. Available from: http://arxiv.org/abs/2110.12273.
- 25.
Riutort-Mayol G, Bürkner PC, Andersen MR, Solin A, Vehtari A. Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. arXiv:2004.11408 [stat.CO]. 2020 [posted 2020 Apr 23; revised 2022 Mar 22; cited 2022 Oct 17]. Available from: https://arxiv.org/abs/2004.11408
- 26.
Carl Edward Rasmussen, Christopher K I Williams. Gaussian Processes for Machine Learning. The MIT Press; 2006. Available from: https://gaussianprocess.org/gpml/.
- 27. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, et al. Stan: A Probabilistic Programming Language. Journal of Statistical Software. 2017;76:1–32. pmid:36568334
- 28. Vehtari A, Gelman A, Simpson D, Carpenter B, Bürkner PC. Rank-Normalization, Folding, and Localization: An Improved Ȓ for Assessing Convergence of MCMC (with Discussion). Bayesian Analysis. 2021;16(2):667–718.
- 29.
Betancourt M. A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv:1701.02434 [stat.ME]. 2017 [posted 2017 Jan 10; revised 2028 Jul 16; cited 2022 Oct 17]. Available from: http://arxiv.org/abs/1701.02434.
- 30.
Statistische Ämter des Bundes und der Länder. ZENSUS2011 - Bevölkerungs- und Wohnungszählung 2011; 2011. Available from: https://www.zensus2011.de/DE/Home/home_node.html.
- 31. Melegaro A, Fava ED, Poletti P, Merler S, Nyamukapa C, Williams J, et al. Social Contact Structures and Time Use Patterns in the Manicaland Province of Zimbabwe. PLOS ONE. 2017;12(1). pmid:28099479
- 32. Flaxman S, Mishra S, Gandy A, Unwin HJT, Mellan TA, Coupland H, et al. Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. Nature. 2020;584(7820):257–261. pmid:32512579
- 33.
Diggle PJ, Ribeiro PJ. Model-based Geostatistics. Springer Series in Statistics. New York, NY: Springer; 2007. Available from: http://link.springer.com/10.1007/978-0-387-48536-2.
- 34.
Ton JF, Flaxman S, Sejdinovic D, Bhatt S. Spatial Mapping with Gaussian Processes and Nonstationary Fourier Features. arXiv:1711.05615 [stat.ML]. 2017 [cited 2022 Oct 17]. Available from: http://arxiv.org/abs/1711.05615.
- 35. Beutels P, Shkedy Z, Aerts M, Damme PV. Social mixing patterns for transmission models of close contact infections: exploring self-evaluation and diary-based data collection through a web-based interface. Epidemiology & Infection. 2006;134(6):1158–1166. pmid:16707031
- 36. Semenova E, Xu Y, Howes A, Rashid T, Bhatt S, Mishra S, et al. PriorVAE: encoding spatial priors with variational autoencoders for small-area estimation. Journal of The Royal Society Interface. 2022;19(191):20220094. pmid:35673858
- 37.
Mishra S, Flaxman S, Berah T, Zhu H, Pakkanen M, Bhatt S. piVAE: a stochastic process prior for Bayesian deep learning with MCMC. arXiv:2002.06873 [cs.LG]. 2020 [posted 2020 Feb 17; last revised 2022 Sep 13; cited 2020 Oct 22]. Available from: http://arxiv.org/abs/2002.06873.
- 38. Baguelin M, Flasche S, Camacho A, Demiris N, Miller E, Edmunds WJ. SARS-CoV-2 transmission across age groups in France and implications for control Nature Communications. 2021;12(1):6895
- 39. Wikle NB, Tran TNA, Gentilesco B, Leighow SM, Albert E, Strong ER, et al. SARS-CoV-2 epidemic after social and economic reopening in three U.S. states reveals shifts in age structure and clinical characteristics Science Advances. 2022;8(4):eabf9868 pmid:35080987
- 40. Monod M, Blenkinsop A, Xi X, Hebert D, Bershan S, Tietze S, et al. Age groups that sustain resurging COVID-19 epidemics in the United States. Science. 2021;371(6536):eabe8372. pmid:33531384
- 41.
Baguelin M, Flasche S, Camacho A, Demiris N, Miller E, Edmunds WJ. Assessing Optimal Target Populations for Influenza Vaccination Programmes: An Evidence Synthesis and Modelling Study PLOS Medicine
- 42. Matrajt L, Eaton J, Leung T, Brown ER. Vaccine optimization for COVID-19: Who to vaccinate first? Science Advances. 2021;7(6):eabf1374 pmid:33536223