Abstract
Online social networks like Twitter and Facebook are among the most popular sites on the Internet. Most online social networks exhibit some specific features, including reciprocity, transitivity and degree heterogeneity. Such networks are called scale-free networks and have drawn much attention in research. The aim of this paper is to develop a novel methodology for directed network embedding within the latent space model (LSM) framework. It is known that the link probability between two individuals may increase as their features become more similar, a property referred to as homophily attributes. To this end, penalized pair-specific attributes, acting as a distance measure, are introduced to provide a more powerful interpretation and improve link prediction accuracy; the resulting models are named penalized homophily latent space models (PHLSM). The proposed models also capture the in-degree heterogeneity of directed scale-free networks by embedding popularity scales. We also introduce a LASSO-based PHLSM to produce an accurate and sparse model for high-dimensional covariates. We make Bayesian inference using MCMC algorithms. The finite sample performance of the proposed models is evaluated on three benchmark simulation datasets and two real data examples. Our methods are competitive and interpretable, and they outperform existing approaches for fitting directed networks.
Citation: Yang H, Xiong W, Zhang X, Wang K, Tian M (2021) Penalized homophily latent space models for directed scale-free networks. PLoS ONE 16(8): e0253873. https://doi.org/10.1371/journal.pone.0253873
Editor: Lei Shi, Yunnan University of Finance and Economics, CHINA
Received: February 10, 2021; Accepted: June 14, 2021; Published: August 2, 2021
Copyright: © 2021 Yang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting information files.
Funding: The research of WX was supported by National Natural Science Foundation of China (NNSFC) grants No.12001101 and the Fundamental Research Funds for the Central Universities in UIBE (CXTD10-09) and 20YQ18. MT’s work was partially supported by the National Natural Science Foundation of China (No.11861042), and the China Statistical Research Project (No.2020LZ25). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Network analysis is increasingly prevalent in various scientific disciplines, ranging from anthropology, sociology and social psychology to physics, mathematics and computer science, among others. Networks provide useful representations for non-Euclidean data and have been employed to analyze interpersonal relationships, academic co-authorships and citations, protein interactions and traffic flows, etc. Among these studies, social networks have received extensive attention, in which nodes typically represent individuals and edges represent social relationships [1–3]. In more general cases, nodes can also be used to denote large social units (for example, families, organizations, governments), objects (airports, servers, locations) or abstract entities (concepts, texts, tasks, random variables), and edges then indicate certain relations, states, contents or features of nodes. To date, however, much attention has been paid to modeling undirected networks.
The aim of this paper is to focus on directed networks with degree heterogeneity, such as social sharing sites (YouTube, QQzone) and microblogs (Twitter, Weibo). Formally, we use G = (V, E, A) to represent an acyclic directed graph with n nodes, where V = {v1, …, vn} and E ⊆ V × V respectively denote the sets of nodes and edges, and A is the attribute matrix of nodes. The topology of a graph can be described by an adjacency matrix Y = (yij)n×n, where yij ∈ {0, 1} indicates the presence or absence of an edge on each ordered pair of nodes (vi, vj), i, j = 1, …, n and i ≠ j. Edges connecting a node to itself are not allowed, thus yii = 0 for i = 1, …, n. Throughout this paper, we use “vi → vj” to indicate yij = 1.
Many probabilistic models have been proposed to capture the topology of graphs by exploiting their local properties. The simplest one is the Erdős–Rényi Bernoulli random graph model, in which edges are considered to be independent of each other [4]. Given two arbitrary nodes vi and vj in a directed social network, it is more likely for vi to follow vj when vj is following vi, or when both vi and vj are connected to another node vk. In other words, the conditional link probabilities P(yji = 1|yij = 1) and P(yij = 1|yik ykj = 1) are larger than the marginal link probability P(yji = 1) [5, 6]. These two properties are called link reciprocity and transitivity. Unfortunately, neither of them is considered in the Erdős–Rényi model. To incorporate reciprocity, a log-linear statistical model (i.e. the p1 model) is proposed [7] and the stochastic blockmodel is introduced [8], which can also fit the block structure, or network communities, by partitioning nodes into different subgroups [9]. The stochastic blockmodel has since developed rapidly in various fields [10–12] and is still of great interest in recent research [13–16]. Despite these strengths, stochastic blockmodels cannot accommodate complex dependence structures such as transitivity, due to the pairwise independence assumption. As a result, the exponential random graph model (ERGM) is proposed as a flexible alternative [17–19]. Estimation methods such as the maximum pseudo-likelihood [20] and the maximum likelihood with Markov chain Monte Carlo (MCMC) algorithms [21, 22] are further developed, with a comprehensive comparison conducted in [23].
Another line of network research is the latent space model (LSM), which assumes that each node of a network has a position, denoted as zi ∈ ℝ^d, in an unobserved latent space [6]. Usually, the dimension d of the latent space is small, for example, d = 2. To measure the closeness between nodes, the latent positions enter the model through latent distances ‖zi − zj‖ (which could be replaced by any distance). The edge probability P(yij = 1) is then modeled as a function of these positions and node attributes. The above-mentioned properties, reciprocity and transitivity, are inherently captured in LSM due to the symmetry of pairwise distances. Handcock et al. introduce the latent position cluster model to capture community structure via a multivariate Gaussian mixture model [24], which is further extended to allow for degree heterogeneity by embedding node-level random effects [25]. Sewell and Chen generalize the static model to the dynamic latent space model (DLSM), which accounts for relations drifting over time under the LSM framework [26]. Such dynamic networks are also studied in [27]. The LSM has been developed in other directions as well. For instance, Austin et al. propose the covariate-defined latent space random effects model to predict the latent positions of new nodes entering a fitted network [28]. Sewell and Chen extend the model to networks with weighted edges, meaning that the edges connecting nodes are no longer binary variables but can take multiple values [29].
Besides reciprocity and transitivity, degree heterogeneity and homophily attributes are also of great interest in social networks. This work considers all of these properties within the LSM framework. For large-scale social networks, it is reasonable to assume that the degrees of different nodes vary over a wide range. Such networks are referred to as scale-free networks (SFN), in which node degrees follow a power law. This phenomenon is quite common in online social networks [30]. For example, Facebook, Twitter, LinkedIn and Weibo are popular sites built on social networks, providing communication, storage and social applications for hundreds of millions of users. On these social platforms, it is common to see a few celebrities capturing substantial numbers of followers, giving rise to degrees that follow a power law or a power law with exponential cutoff. In directed networks, degrees comprise in-degrees and out-degrees, defined for node vi as ∑j yji and ∑j yij respectively. The link probability should be strongly related to the heterogeneity of node in-degrees. Taking vi and vj as ordinary nodes and vj* as a celebrity with an extremely high in-degree, the marginal link probability P(yij* = 1) is expected to be much larger than P(yij = 1). However, it is unlikely for a celebrity to pay close attention to its followers. Thus the conditional link probability P(yj*i = 1|yij* = 1) should be smaller than P(yji = 1|yij = 1). On the other hand, out-degree heterogeneity has only a limited impact in online social networks, because the number of users one can follow is usually bounded above (e.g. 5000 on Twitter and 2000 on Weibo), while the total number of nodes in the network is practically countless. As a result, even if node vi has a high out-degree, the link probability for vi to follow vj remains around zero. Thus the heterogeneity of out-degrees can be ignored. We call these networks semi-SFN in this paper. This phenomenon is also discussed in [31], where the popularity scaled latent space model (PSLSM) is proposed for large-scale directed network formulation. However, due to the use of the probit function, PSLSM only considers a one-dimensional latent space and restricts the latent positions to a standard Normal distribution, which is a quite restrictive assumption. To this end, this paper introduces a novel latent space modeling procedure for directed semi-SFN, where the latent distances are scaled by popularity factors γ = (γ1, …, γn) to capture in-degree heterogeneity. The use of logistic regression makes our proposed model considerably more general. Specifically, the dimension and distribution of latent positions are theoretically unrestricted, and homophily attributes are given particular emphasis in this paper.
It is well known that the link probability is related to homophily node attributes. Therefore, pair-specific covariates, acting as a distance measure, are introduced in our model. Note that the classic LSM proposed by [6] also allows for covariates and has been applied in several studies [24, 32]. To the best of our knowledge, however, we are the first to propose a specific formulation of covariate processing within the LSM framework. In this way, social relationships between nodes can be better represented through latent distances, since the effects of node attributes have been fully extracted. Additionally, to deal with the possibly high or ultrahigh dimensionality of covariates, regularization with both ridge and LASSO penalties is discussed under a Bayesian framework, and thus we propose the penalized homophily latent space model (PHLSM). Posterior estimation is performed with MCMC algorithms, which are particularly appropriate in this context since they allow the uncertainty of model parameters to be explored through a posterior distribution. Our experiments show that this approach performs well in simulations and real semi-SFN examples compared to competing models that also involve degree heterogeneity and homophily attributes.
The major contributions of this paper are as follows:
- We propose a novel latent space model as an alternative network embedding, which comprehensively accommodates the significant properties of directed social networks, including reciprocity, transitivity, degree heterogeneity and homophily attributes.
- The popularity factors are introduced as denominator scales of latent distances so as to model the heterogeneity of node in-degrees in scale-free networks.
- For different dimensions of covariate spaces, the Normal and Laplace priors for regression coefficients are discussed separately as ridge and LASSO regularization within a Bayesian framework.
- For large-scale online social networks, we randomly sample ego-networks for real data analysis, each of which is formed by a single hub and its followers and retains the scale-free characteristic. Experimental results demonstrate the superior performance of our approach.
The rest of the paper is organized as follows. A basic description of our proposed models, together with a brief illustration for multivariate and high-dimensional features, is given in the next section. Parameter estimation in the Bayesian framework is then introduced. Several numerical simulation examples are performed and two real network datasets are fitted. We close with conclusions.
Penalized homophily latent space models
We consider a directed network with n nodes. Given a d-dimensional latent space, a specific position zi ∈ ℝ^d, d ≥ 1, is allocated to each node. We use Z = (z1, …, zn)′ to denote the latent position matrix. The data to model consist of a binary adjacency matrix Y = (yij)n×n, where yij = 1 if vi follows vj and yij = 0 otherwise, and a pairwise covariate matrix X = (xij), which is derived from a node-specific attribute matrix A. We then propose two probabilistic models for different covariate dimensions p. Note that this paper focuses only on binary-valued relations, though the proposed method can be extended to more complex relational data by replacing the Bernoulli distribution of ties.
PHLSM for multi-covariates
We first discuss the multivariate case, namely p ≪ n. Assuming edges yij to be conditionally independent, the PHLSM is defined as
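$$\operatorname{logit} P(y_{ij} = 1 \mid z_i, z_j, x_{ij}, \Theta) = \beta_0 + \beta' x_{ij} - \frac{\lVert z_i - z_j \rVert}{\gamma_j} \tag{1}$$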
where β = (β1, …, βp)′ is a p-dimensional vector of regression coefficients and γ = (γ1, …, γn)′ is a popularity vector for the n nodes. Θ = {β0, β, γ} is the collection of all parameters. Intuitively, as ‖zi − zj‖ increases, the link probability for both vi → vj and vj → vi declines. This symmetry accommodates the reciprocity of networks. Throughout this paper, we assume that the latent positions Z are independently and identically generated from a 2-dimensional multivariate Normal distribution with mean 0 and an equal-variance covariance matrix, i.e.
$$z_i \overset{\text{i.i.d.}}{\sim} N_2(0, \sigma^2 I_2), \quad i = 1, \dots, n, \tag{2}$$
where I2 is an identity matrix. Moreover, γj ∈ (0, 1) is a node-specific popularity scale: the larger γj, the greater the social popularity. Considering the extreme cases, if γj → 0, the probability for vi to follow vj tends to 0; when γj → 1, we are back to the LSM. In this way, the in-degree heterogeneity of semi-SFN can be modeled, meaning that an ordinary node tends to follow a celebrity with high popularity, yet the opposite is not true. For model identification, the intercept β0 and the sum ∑j γj are each constrained to be 1.
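As a minimal sketch of how the popularity scale enters the link probability, the following Python snippet evaluates an assumed logistic form of model (1), logit P(yij = 1) = β0 + β′xij − ‖zi − zj‖/γj. The function name `phlsm_link_prob` and all toy values are illustrative assumptions, not part of the original work.

```python
import numpy as np

def phlsm_link_prob(Z, gamma, X=None, beta=None, beta0=1.0):
    """Link probabilities under the ASSUMED logistic form
    logit P(y_ij = 1) = beta0 + beta' x_ij - ||z_i - z_j|| / gamma_j."""
    Z = np.asarray(Z, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    # Pairwise Euclidean distances between latent positions (symmetric).
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    # Column j is scaled by the popularity of the receiving node v_j.
    eta = beta0 - dist / gamma[None, :]
    if X is not None:
        # X has shape (n, n, p); beta has shape (p,).
        eta = eta + np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    P = 1.0 / (1.0 + np.exp(-eta))
    np.fill_diagonal(P, 0.0)  # self-loops are not allowed
    return P

# Toy example: node 0 plays the role of a popular hub.
Z_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
gamma_demo = np.array([0.5, 0.25, 0.25])
P_demo = phlsm_link_prob(Z_demo, gamma_demo)

# One simulated adjacency matrix drawn from these probabilities.
rng = np.random.default_rng(0)
Y_demo = (rng.random(P_demo.shape) < P_demo).astype(int)
```

In this toy setting nodes 0 and 1 are at distance 1 from each other, yet the probability of 1 → 0 exceeds that of 0 → 1 because node 0 has the larger popularity scale.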
The p-dimensional pairwise covariate vectors xij are obtained using an element-wise operator. Specifically, for continuous attributes, the attribute matrix A is first normalized column-wise and then
(3)
For discrete attributes,
$$x_{ij,k} = \begin{cases} 0, & a_{ik} = a_{jk}, \\ 1, & a_{ik} \neq a_{jk}. \end{cases} \tag{4}$$
Notably, attributes play vital roles in our model. In some social networks, the probability of a relational tie between two individuals may increase as the characteristics of the individuals become more similar. Therefore, in this framework the relative difference between two nodes is of interest. In detail, for a continuous attribute normalized to (0, 1), an entropy-like covariate xij is proposed in (3) to measure the relative information diversity. For a discrete attribute, (4) defines a binary covariate xij indicating whether nodes vi and vj belong to the same category (0 for the same category and 1 otherwise). The purpose of using absolute values of differences is to eliminate directional factors. In the case p ≪ n, we employ ridge regularization of the regression coefficients, which is equivalent to a Normal prior for β, i.e.
$$\beta_k \sim N(0, \tau_k^2), \quad k = 1, \dots, p. \tag{5}$$
The feature-specific variance τk² serves as the tuning parameter of the L2 norm penalty within the Bayesian framework. Note that when τk² → ∞, the ridge penalty degenerates to a non-penalized form, which can lead to an unbiased estimate of βk.
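The construction of pair-specific covariates from node attributes can be sketched as follows. The discrete case follows the indicator described for (4). For the continuous case, the entropy-like transformation (3) is not reproduced here, so the sketch substitutes a plain absolute difference of min-max normalized values; this substitution, the function name `pairwise_covariates` and the toy attribute matrix are assumptions made purely for illustration.

```python
import numpy as np

def pairwise_covariates(A, is_discrete):
    """Build an (n, n, p) array of pair-specific covariates
    from an (n, p) node attribute matrix A."""
    A = np.asarray(A, dtype=float)
    n, p = A.shape
    X = np.zeros((n, n, p))
    for k in range(p):
        a = A[:, k]
        if is_discrete[k]:
            # Indicator covariate: 0 for the same category, 1 otherwise.
            X[:, :, k] = (a[:, None] != a[None, :]).astype(float)
        else:
            # ASSUMPTION: min-max normalization followed by an absolute
            # difference stands in for the entropy-like covariate in (3).
            span = a.max() - a.min()
            a = (a - a.min()) / span if span > 0 else np.zeros_like(a)
            X[:, :, k] = np.abs(a[:, None] - a[None, :])
    return X

# Toy attribute matrix: one continuous and one binary attribute.
A_demo = np.array([[0.2, 1.0], [0.9, 0.0], [0.4, 1.0]])
X_demo = pairwise_covariates(A_demo, is_discrete=[False, True])
```

Both covariates are symmetric in (i, j), which matches the stated purpose of removing directional factors.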
With the implementation of (3) and (4), model (1) has a simple interpretation:
- For nodes vi and vj equidistant from vk, the log odds ratio of vi → vk versus vj → vk is β′(xik − xjk), that is, the probability of being followed depends on the similarity of node attributes.
- For nodes vi and vj equidistant from vk, the log odds ratio of vk → vi versus vk → vj depends on β′(xki − xkj) and the popularity scales γi and γj, thus both attributes and popularity determine the probability of following.
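The two interpretations above can be checked with a short derivation. The sketch below assumes that model (1) has the logistic form logit P(yij = 1) = β0 + β′xij − ‖zi − zj‖/γj, with latent distances divided by the receiver's popularity scale as described in the text. The shorthand η for the log odds and d for the common distance are introduced here only for illustration.

```latex
\begin{aligned}
\eta_{ij} &:= \operatorname{logit} P(y_{ij} = 1)
  = \beta_0 + \beta' x_{ij} - \frac{\lVert z_i - z_j \rVert}{\gamma_j},
  \qquad d := \lVert z_i - z_k \rVert = \lVert z_j - z_k \rVert, \\
\eta_{ik} - \eta_{jk} &= \beta'(x_{ik} - x_{jk}) - \frac{d - d}{\gamma_k}
  = \beta'(x_{ik} - x_{jk}), \\
\eta_{ki} - \eta_{kj} &= \beta'(x_{ki} - x_{kj})
  - d\left(\frac{1}{\gamma_i} - \frac{1}{\gamma_j}\right).
\end{aligned}
```

Under this assumed form, the popularity term cancels when the receiver is the same node vk, and survives when the receivers differ.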
LASSO-based PHLSM for high-dimensional covariates
With the explosion of information, numerous predictors are involved in social network analysis for accurate link prediction, for instance, user preferences in recommender systems, protein connections in protein interactomes and potential communities in social networks. A major challenge in this situation is the high-dimensional regime, where the number of available nodes is typically much smaller than the number of features. It is thus imperative to consider a properly sparse model with low computational complexity.
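The log likelihood for (1) is
$$\ell(\Theta) = \sum_{i \neq j} \left\{ y_{ij} \log p_{ij} + (1 - y_{ij}) \log(1 - p_{ij}) \right\}, \qquad p_{ij} = P(y_{ij} = 1 \mid z_i, z_j, x_{ij}, \Theta).$$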
To reduce dimensionality, the maximum likelihood estimator with regularization is defined as
(6)
where
is some penalty function with tuning parameter λk ≥ 0 to be determined for each βk. In terms of the ridge regression case (5), the penalty function is described as
In this section we discuss high-dimensional cases, where the adaptive LASSO penalty (7) is mainly considered due to its simple expression and nice properties:
(7)
(8)
Other penalties such as SCAD [33] and MCP [34] are also applicable.
This work performs Bayesian estimation. In the Bayesian framework, the L1 norm penalty in (7) is equivalent to a Laplace prior (also referred to as the double exponential distribution) for the parameter βk [35], namely
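$$f(\beta_k \mid \lambda_k) = \frac{\lambda_k}{2} \exp(-\lambda_k \lvert \beta_k \rvert), \quad k = 1, \dots, p. \tag{9}$$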
It is essential in regularized likelihood methods to determine the tuning parameter λk appropriately, since it controls the trade-off between bias and variance in the resulting estimators [36, 37]. Selecting an appropriate tuning parameter is therefore an important issue, both theoretically and practically. The most common method for choosing the hyperparameter is cross-validation [38]. Unfortunately, it is difficult to apply in LSM, since the latent coordinate matrix estimated from the training set cannot be used to fit the test set. Rather than setting a fixed number, [39] employs hierarchical priors and assumes that the tuning parameter follows a Gamma prior, which is the conjugate prior of the exponential distribution. A Gibbs sampling algorithm can thus be implemented for Bayesian estimation, as described in the next section. In our model, we simply extend this hierarchical approach to the adaptive LASSO. Specifically, letting f⋅(⋅) denote probability density functions, the full conditional posterior distribution for λk is given as
where ξ is the shape parameter and δ is the rate parameter of the Gamma distribution.
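To make the conjugacy argument concrete, the following derivation sketches how a Gamma full conditional can arise. It assumes a Laplace prior with density f(βk | λk) = (λk/2) exp(−λk|βk|) and a Gamma(ξ, δ) prior on λk with shape ξ and rate δ. The exact expression used by the authors may differ in detail, so this should be read as an illustration of the mechanism only.

```latex
\begin{aligned}
f(\lambda_k \mid \beta_k)
  &\propto f(\beta_k \mid \lambda_k)\, f(\lambda_k)
   \propto \lambda_k \exp(-\lambda_k \lvert \beta_k \rvert)
   \cdot \lambda_k^{\xi - 1} \exp(-\delta \lambda_k) \\
  &= \lambda_k^{(\xi + 1) - 1}
     \exp\{-(\delta + \lvert \beta_k \rvert)\, \lambda_k\}, \\
\lambda_k \mid \beta_k &\sim \text{Gamma}(\xi + 1,\ \delta + \lvert \beta_k \rvert).
\end{aligned}
```

Because the resulting kernel is again a Gamma density, λk can be drawn directly in a Gibbs step without a Metropolis-Hastings correction.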
Estimation methodology
We employ a Bayesian approach to estimate the parameters in (1) using MCMC algorithms. In the Bayesian treatment, a prior distribution π(Θ) is placed on Θ and the quantity of interest is the posterior distribution π(Θ|Y) ∝ π(Y|Θ)π(Θ). In this paper, the Metropolis-Hastings (MH) within Gibbs algorithm [40] is adopted for posterior sampling.
Posterior sampling
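We set the priors on the parameters as follows:
$$\begin{aligned}
z_i \mid \sigma^2 &\sim N_2(0, \sigma^2 I_2), & \sigma^2 &\sim IG(\nu, \phi), \\
\beta_k \mid \tau_k^2 &\sim N(0, \tau_k^2), & \tau_k^2 &\sim IG(\xi_\tau, \delta_\tau) \quad \text{(ridge)}, \\
\beta_k \mid \lambda_k &\sim \text{Laplace}(0, 1/\lambda_k), & \lambda_k &\sim \text{Gamma}(\xi_\lambda, \delta_\lambda) \quad \text{(LASSO)}, \\
\gamma &\sim \text{Dirichlet}(\alpha).
\end{aligned}$$
Here IG denotes the inverse Gamma distribution. α = (α1, α2, …, αn) is a vector of strictly positive hyperparameters for the Dirichlet prior. For convenience of notation, all the parameters of PHLSM are collected in Ψr = {Z, β, γ, σ², τ², α, ν, ϕ, ξτ, δτ} and ΨL = {Z, β, γ, σ², λ, α, ν, ϕ, ξλ, δλ}.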
The hyperparameters are discussed as follows. For the inverse Gamma prior of σ², ν and ϕ are expected to be small. Moreover, E(σ²) = ϕ/(ν − 1) for ν > 1, which should approach the sample variance of the initial latent positions. Thus we set ν = 2 and ϕ equal to the sample variance of the initial latent positions zi(0), where the superscript (0) indicates the initial value of zi. For the ridge regression version, it can be shown
for δτ > 0, ξτ > 2, meaning that a large ξτ together with a small δτ results in low variability for βk [39]. The same holds for ξλ and δλ in the LASSO version. As a proposal, we set δτ = 0.05, δλ = 0.1, ξτ = 4, ξλ = 8 for categorical variables and ξτ = 2, ξλ = 4 for continuous variables. Last, the Dirichlet prior for γ is set to be uninformative, thus a flat Dirichlet distribution, given as Dirichletn(1, …, 1), is proposed.
Practically, the number of MCMC iterations to reach convergence can be greatly reduced by proper initial values of the latent positions and model parameters. Details for selection of initial values are discussed in the next subsection.
Define
the posterior kernels or full conditional distributions of ridge PHLSM parameters are expressed as
(10)
(11)
(12)
(13)
(14)
where the notation “…” indicates that the parameters we do not list are independent of the corresponding variable.
Given posterior distributions of model parameters, the MCMC algorithm can be written as follows:
Algorithm 1: MCMC algorithm for PHLSM
0. Set initial values of Ψr.
1. For i = 1, …, n, draw zi via MH using a random walk proposal.
2. Draw σ² via Gibbs sampling from its full conditional distribution (11).
3. For k = 1, …, p, draw βk via MH using a Normal random walk proposal.
4. For k = 1, …, p, draw τk² via Gibbs sampling from its full conditional distribution (13).
5. Draw γ via MH using a Dirichlet proposal.
Repeat steps 1–5.
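A reduced sketch of the MH-within-Gibbs loop is given below. It covers only steps 1 and 2 of Algorithm 1 for a model without covariates and holds γ fixed, so it is not the full algorithm. It assumes the logit β0 − ‖zi − zj‖/γj with β0 = 1, a Normal prior zi ∼ N(0, σ²I) and an inverse Gamma prior σ² ∼ IG(ν, ϕ); the function names, step size and toy data are hypothetical choices for illustration.

```python
import numpy as np

def log_lik(Y, Z, gamma, beta0=1.0):
    """Bernoulli log-likelihood under the ASSUMED logit
    beta0 - ||z_i - z_j|| / gamma_j (no covariates)."""
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    eta = beta0 - dist / gamma[None, :]
    ll = Y * eta - np.logaddexp(0.0, eta)
    np.fill_diagonal(ll, 0.0)  # self-loops carry no information
    return ll.sum()

def mh_within_gibbs(Y, Z0, gamma, n_iter=200, step=0.01,
                    nu=2.0, phi=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    Z = Z0.copy()
    n, d = Z.shape
    sigma2 = Z.var()
    cur = log_lik(Y, Z, gamma)
    sigma2_draws = []
    for _ in range(n_iter):
        # Step 1: Normal random-walk MH update for each latent position.
        # The likelihood is recomputed in full for clarity, not speed.
        for i in range(n):
            prop = Z.copy()
            prop[i] = Z[i] + step * rng.normal(size=d)
            new = log_lik(Y, prop, gamma)
            log_prior_diff = -(prop[i] @ prop[i] - Z[i] @ Z[i]) / (2.0 * sigma2)
            if np.log(rng.random()) < new - cur + log_prior_diff:
                Z, cur = prop, new
        # Step 2: Gibbs update of sigma^2 from its inverse Gamma full
        # conditional IG(nu + n d / 2, phi + sum ||z_i||^2 / 2),
        # drawn as the reciprocal of a Gamma variate.
        shape = nu + 0.5 * n * d
        rate = phi + 0.5 * np.sum(Z ** 2)
        sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
        sigma2_draws.append(sigma2)
    return Z, np.array(sigma2_draws)

# Toy data generated from the same assumed model.
rng = np.random.default_rng(1)
n = 15
gamma = np.full(n, 1.0 / n)
Z_true = rng.normal(scale=0.02, size=(n, 2))
dist = np.linalg.norm(Z_true[:, None, :] - Z_true[None, :, :], axis=-1)
P = 1.0 / (1.0 + np.exp(-(1.0 - dist / gamma[None, :])))
Y = (rng.random((n, n)) < P).astype(int)
np.fill_diagonal(Y, 0)
Z_hat, sigma2_draws = mh_within_gibbs(Y, Z_true, gamma, n_iter=50)
```

The updates for β, τk² and γ would slot into the same loop as further MH or Gibbs steps.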
As for the adaptive LASSO version (7), using a maximum pseudo-likelihood approximation, the posterior distributions for β and λ can be expressed as
(15)
(16)
Other parameters are the same as the ridge penalty version. The MCMC algorithm is given as Algorithm 2:
Algorithm 2: MCMC algorithm for LASSO-based PHLSM
0. Set initial values of ΨL.
1. For i = 1, …, n, draw zi via MH, using a Normal random walk proposal.
2. Draw σ² via Gibbs sampling using the posterior distribution given in (11).
3. For k = 1, 2, …, p, draw βk via MH using a Laplace random walk proposal.
4. For k = 1, 2, …, p, draw λk via Gibbs sampling using the posterior distribution in (16).
5. Draw γ via MH using a Dirichlet proposal.
Repeat steps 1–5.
As an aside, there are two remarks for the proposed MCMC algorithms.
Remark 1. The posterior of coordinate matrix Z is not unique due to the invariance property of distances in a two-dimensional Euclidean latent space by rotation, reflection or translation. To deal with this, the Procrustes transformation [6] is applied in each step.
Remark 2. For Algorithm 2, we use the Dirichlet proposal introduced in [26] to draw γ. Due to the constraint |γ|1 = 1, all components of γ must be accepted or rejected simultaneously in each iteration. To accelerate convergence, we set α(t) = Mγ(t−1) at the t-th iteration, where M is a sufficiently large positive number.
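The Dirichlet proposal in Remark 2 is not symmetric, so an MH step needs a Hastings correction. The sketch below draws a candidate from Dirichlet(Mγ) and returns the log correction term. The helper names are hypothetical, and the target density (likelihood times prior) is left to the caller.

```python
import math
import numpy as np

def log_dirichlet_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at x, for x > 0 with sum(x) = 1."""
    return (math.lgamma(alpha.sum())
            - sum(math.lgamma(a) for a in alpha)
            + np.sum((alpha - 1.0) * np.log(x)))

def propose_gamma(gamma_cur, M, rng):
    """Draw gamma_new ~ Dirichlet(M * gamma_cur) and return it together with
    log q(gamma_cur | gamma_new) - log q(gamma_new | gamma_cur)."""
    gamma_new = rng.dirichlet(M * gamma_cur)
    log_q_forward = log_dirichlet_pdf(gamma_new, M * gamma_cur)
    log_q_backward = log_dirichlet_pdf(gamma_cur, M * gamma_new)
    return gamma_new, log_q_backward - log_q_forward

def mh_accept(log_target_new, log_target_cur, log_hastings, rng):
    """Accept or reject the whole vector at once."""
    return np.log(rng.random()) < log_target_new - log_target_cur + log_hastings

rng = np.random.default_rng(0)
gamma_cur = np.full(10, 0.1)
gamma_new, log_hastings = propose_gamma(gamma_cur, M=5e6, rng=rng)
```

A large M concentrates the proposal around the current value, so candidates stay on the simplex and differ only slightly from γ(t−1).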
Initialization strategies
As mentioned before, the number of iterations for MCMC to reach convergence can be dramatically reduced by setting appropriate initial values of the parameters Ψr or ΨL. Below we give some ad hoc initialization strategies.
1. The initial values of the latent positions Z can be found using the classical multidimensional scaling (MDS) method [41]. Typically, MDS transforms an n × n symmetric matrix of association coefficients between individuals into a unique coordinate matrix in Euclidean space via principal components analysis. In practice, we use the geodesic distances in the directed graph, rescaled by 1/n, as the input distance matrix. The output coordinate matrix can then be employed as the initial latent positions after centering.
2. For σ², a reasonable initial value is the sample variance of the initial latent positions zi(0), given as
where the superscript (0) indicates the initial value.
3. We use the maximum likelihood estimates of the regression coefficients β as their initial values. Furthermore, the initial values of τk² and λk can be simply obtained via Gibbs sampling with
.
4. Typically for an edge vi → vj, we expect the value of γj to be significantly associated with the in-degree of the end node, i.e. ∑i yij, hence the initial value for γj is proposed as
$$\gamma_j^{(0)} = \frac{\sum_{i} y_{ij} + 1}{\sum_{i} \sum_{j} y_{ij} + n}.$$
The added 1 in the numerator guarantees a strictly positive value for γj(0), and the corresponding n in the denominator ensures that the initial values sum to 1.
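Two of the initialization strategies above can be sketched in a few lines of NumPy. The first function computes the initial popularity scales as (in-degree + 1) divided by (total number of edges + n). The second is a generic classical MDS that expects a symmetric distance matrix; computing and symmetrizing the geodesic distances of the directed graph is left out and would have to be supplied by the caller. Function names are hypothetical.

```python
import numpy as np

def init_gamma(Y):
    """Initial popularity scales: (in-degree + 1) / (number of edges + n)."""
    Y = np.asarray(Y)
    in_deg = Y.sum(axis=0)  # column sums count the followers of each node
    return (in_deg + 1.0) / (in_deg.sum() + Y.shape[0])

def classical_mds(D, d=2):
    """Classical MDS: embed a symmetric (n, n) distance matrix D into R^d."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:d]         # indices of the d largest
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

Y_demo = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [1, 1, 0, 0],
                   [1, 0, 0, 0]])
gamma0 = init_gamma(Y_demo)

# Sanity check of the MDS step on points whose distances are exactly Euclidean.
rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 2))
D_demo = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
Z0 = classical_mds(D_demo, d=2)
```

The MDS output is centered by construction, and its sample variance could then serve as the starting value for σ².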
Simulation examples
For evaluation, three different benchmark directed network datasets are considered. In each dataset several nodes are randomly selected as popular hubs to model the heterogeneity of in-degrees in semi-SFN. For each of them we apply the MCMC algorithms proposed in Algorithm 1 and Algorithm 2. The link sparsity and reciprocity of each adjacency matrix are measured using empirical link probabilities given as follows,
where vj* denotes popular hubs with high in-degrees and m is the number of them. The first two equations reflect the global sparsity of a network, and the last two reflect the empirical reciprocity between two arbitrary nodes and from a popular hub to another node, respectively.
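One plausible way to compute such empirical probabilities from an adjacency matrix is sketched below. The exact estimators are an assumption here: links are averaged over all ordered pairs, links to hubs over all pairs ending in one of the m hubs, and reciprocity is the share of existing edges that are returned. Hubs are taken to be the m nodes with the largest in-degrees, and the function and key names are hypothetical.

```python
import numpy as np

def empirical_link_probs(Y, m=5):
    """Empirical sparsity and reciprocity summaries (ASSUMED estimators).
    Requires at least one edge overall and one edge pointing to a hub."""
    Y = np.asarray(Y)
    n = Y.shape[0]
    in_deg = Y.sum(axis=0)
    hubs = np.argsort(in_deg)[::-1][:m]   # the m nodes with largest in-degree
    mutual = Y * Y.T                      # 1 where y_ij = y_ji = 1
    return {
        "p_link": Y.sum() / (n * (n - 1)),
        "p_link_to_hub": Y[:, hubs].sum() / (m * (n - 1)),
        "p_recip": mutual.sum() / Y.sum(),
        # Among edges i -> hub, the share that the hub reciprocates.
        "p_recip_from_hub": mutual[:, hubs].sum() / Y[:, hubs].sum(),
    }

Y_demo = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [1, 1, 0, 0],
                   [1, 0, 0, 0]])
probs = empirical_link_probs(Y_demo, m=1)
```

In the toy matrix every other node follows node 0, while node 0 follows back only one of its three followers.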
PHLSM with no covariates
In this example, we consider model (17) without attribute effects,
(17)
The nodes with the top 5 in-degrees are considered popular hubs. We generate 20 adjacency matrices to characterize directed social networks, each of which contains 500 nodes. For data generation, we set σ² = 3 × 10⁻⁴ and γ ∼ Dirichlet(α1, …, αn), where the αi are drawn from a power-law distribution, given as
(18)
A larger θ is more likely to produce popular nodes. Three values of θ are considered in this example for comparison, θ ∈ {1.7, 2.0, 2.3}. The means and standard deviations (sd) of the empirical link probabilities for all simulation networks are given in Table 1.
It is shown in Table 1 that the first two empirical probabilities are close to 0. Conversely, the empirical reciprocity conditional probability between arbitrary nodes is much larger, while for an edge sent by a popular hub, the conditional probability remains small.
Fig 1 also presents the latent positions scaled by node popularity, which follows a power-law distribution (18). We can see that with θ increasing, the node popularity differences gradually decrease. For θ = 1.7, an enormous circle appears near the origin, while the other circles seem to be relatively similar in size, much smaller than the hub. As for θ = 2.0 and 2.3, a growing number of moderate-sized circles emerge.
The radius of a circle indicates the value of γi for the corresponding latent position. (a) θ = 1.7; (b) θ = 2.0; (c) θ = 2.3.
To investigate the power-law behavior of in-degrees, the logarithmic in-degree distribution curves of all simulation networks are depicted in Fig 2. As expected, the empirical logarithmic distribution curves are approximately linear, indicating that the in-degrees follow a power law, especially when θ is relatively large. Note that here we employ the complementary cumulative distribution function (CCDF) rather than the probability density function (PDF) because it is more robust against fluctuations resulting from finite sample sizes [42].
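The complementary cumulative distribution used for this kind of diagnostic can be computed directly from the in-degrees. The sketch below returns the empirical P(degree ≥ k) and a least-squares slope on the log-log scale; the function names and toy degrees are illustrative, and a roughly constant negative slope is only an informal indication of power-law behavior.

```python
import numpy as np

def empirical_ccdf(degrees):
    """Return the sorted distinct degree values k and P(degree >= k)."""
    deg = np.asarray(degrees)
    values, counts = np.unique(deg, return_counts=True)
    # A reversed cumulative sum counts observations at or above each value.
    tail = np.cumsum(counts[::-1])[::-1] / deg.size
    return values, tail

def loglog_slope(values, tail):
    """Least-squares slope of log CCDF against log degree (degrees > 0 only)."""
    keep = values > 0
    slope, _ = np.polyfit(np.log(values[keep]), np.log(tail[keep]), 1)
    return slope

in_deg_demo = np.array([1, 1, 2, 3, 3, 3])
values, tail = empirical_ccdf(in_deg_demo)
slope = loglog_slope(values, tail)
```

For an adjacency matrix Y, the in-degrees passed to `empirical_ccdf` would be the column sums `Y.sum(axis=0)`.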
To examine the efficiency and accuracy of our proposed methods, we adopt Algorithm 1 to estimate model (17) and set M = 5 × 10⁶. Other hyperparameters and initial values are set as described above. We run 15,000 iterations for initial burn-in and another 50,000 iterations for monitoring. In each iteration, the Procrustes transformation is performed as described in Remark 1. Posterior means of the estimates, with their standard deviations over 20 simulations, are shown in Table 2. It seems that the proposed model performs better for fitting a light-tailed directed semi-SFN.
We use the following two ratios to compare between the estimates and the truth. For any edge vi → vj, define and
. For each ratio, we depict the density curves of 20 simulation data in Figs 3 and 4. From these two figures we can observe the ratios all concentrate near 1, indicating the superiority of our proposed methods. Furthermore, the trace plots of the estimated popularity and true in-degrees are presented in Fig 5, which show significant positive correlations. Such results empirically verify that the degree heterogeneity and other node-specific random effects can be modeled by rescaling latent distances.
(a) θ = 1.7; (b) θ = 2.0; (c) θ = 2.3.
(a) θ = 1.7; (b) θ = 2.0; (c) θ = 2.3.
For a careful measurement, total correct rate (TCR), true positive rate (TPR), false positive rate (FPR), and AUC (the area under ROC) are applied to evaluate prediction accuracy. Results are reported in Table 3, which suggests our proposed method performs better with smaller θ.
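The four accuracy measures can be computed from the true edges and the fitted link probabilities as sketched below. The 0.5 classification threshold is an assumption, since the cut-off is not stated here, and the AUC is obtained from the Mann-Whitney rank statistic. Only off-diagonal entries of the adjacency matrix should be passed in; the function name is hypothetical.

```python
import numpy as np

def link_prediction_metrics(y_true, p_hat, threshold=0.5):
    """TCR, TPR, FPR and AUC for binary links (needs both classes present)."""
    y = np.asarray(y_true).astype(bool).ravel()
    p = np.asarray(p_hat, dtype=float).ravel()
    pred = p >= threshold                 # ASSUMED classification rule
    tcr = np.mean(pred == y)              # total correct rate
    tpr = np.sum(pred & y) / np.sum(y)    # true positive rate
    fpr = np.sum(pred & ~y) / np.sum(~y)  # false positive rate
    # AUC via the Mann-Whitney rank statistic, using midranks for ties.
    order = np.argsort(p)
    ranks = np.empty(p.size)
    ranks[order] = np.arange(1, p.size + 1)
    for v in np.unique(p):
        tie = p == v
        ranks[tie] = ranks[tie].mean()
    n_pos, n_neg = y.sum(), (~y).sum()
    auc = (ranks[y].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return {"TCR": tcr, "TPR": tpr, "FPR": fpr, "AUC": auc}

metrics = link_prediction_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

In the toy call, three of the four positive-negative pairs are ranked correctly, which gives an AUC of 0.75.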
Finally, to examine the dependence of the MCMC algorithm on initial values, we take θ = 2.0 as a trial. We use uninformative priors for all the parameters. Specifically, initial values of Z and γ are randomly drawn from a standard Normal distribution and a flat Dirichlet distribution. The mean (sd) of the estimated σ² is 3.019 × 10⁻⁴ (0.247 × 10⁻⁴), and the AUC value is 0.896 (0.036), which is very close to the results in Tables 2 and 3 with informative priors. Thus the MCMC algorithm is robust to the initial values, although it takes longer to reach convergence.
PHLSM with multi-covariates
In this example, two attributes a1, a2 are considered to analyze the node attribute effects. The model for simulation data generation is specified as
(19)
where β1 = 0.5 and β2 = −1. The attributes a1 and a2 are assumed to be continuous and binary, generated from a Normal and a Bernoulli distribution respectively, i.e. a1 ∼ N(0, 1) and a2 ∼ B(1, 0.5). By the proposed transformations (3) and (4), we then obtain xij,1 and xij,2. For parameter estimation, 20 simulation datasets are generated. In each replication, we set θ = 2 and σ² = 3 × 10⁻⁴ as in example 1. Hyperparameters and initial values for implementing Algorithm 1 are set as discussed before. Experimental results are reported in Table 4.
From Table 4, we can observe that the proposed MCMC algorithm had a good performance in parameter estimation. The posterior means of and
get very close to true values with quite small standard deviations. In addition, the means (sd) of MSE for
and
are 9.845 × 10⁻⁵ (3.111 × 10⁻⁶) and 5.165 × 10⁻³ (1.286 × 10⁻⁴) respectively. The means (sd) of link prediction accuracy are TCR = 0.972 (0.007), TPR = 0.845 (0.020), FPR = 0.025 (0.011), AUC = 0.905 (0.094). Compared with the predictive results in example 1, this suggests that the proposed PHLSM can be significantly improved by adding node attributes into the original model.
LASSO-based PHLSM with high-dimensional covariates
This example focuses on the high-dimensional covariate case. For evaluation and comparison analysis, two groups of simulation experiments are conducted, each of which consists of 20 independent datasets with fixed sample size n = 50 and θ = 2. All the simulation data come from model (20),
(20)
For the first group, we consider p = 40, where a5, a15, a25, a35 are significant and the other coefficients are 0. The former 20 attributes are binary and generated from a Bernoulli distribution, i.e.
. The latter 20 attributes are continuous and generated from a Normal distribution, i.e.
. In the second group we consider a higher-dimensional case by setting p = 150 and all attributes are produced the same way as in the first group, that is, half of them are binary and the others are continuous, each of which contains 7 significant attributes.
Given the sparse and high-dimensional setting, the proposed Algorithm 2 is applied here for posterior estimation with 15,000 iterations for initial burn-in and 50,000 iterations for monitoring. Hyperparameters and initial values are selected as before. For comparison, we also employ Algorithm 1 to fit the simulation data. To investigate the performance of LASSO-based PHLSM on variable selection, we use C to denote the number of non-zero coefficients correctly estimated as non-zero, and IC to denote the number of zero coefficients incorrectly estimated as non-zero. Furthermore, the proportion of the 20 simulations excluding non-zero coefficients from the model is denoted as Under-fit, the proportion including zero coefficients is denoted as Over-fit, and the proportion with correct coefficient selection is denoted as Correct-fit [43]. Results are presented in Table 5. As expected, the LASSO version shows considerable advantages in fitting a sparse model, especially when p is large. In terms of prediction accuracy, both models behave similarly, although the LASSO version performs slightly worse. Nevertheless, it is worthwhile to obtain a simpler and more interpretable model at the cost of a little prediction accuracy.
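The selection summaries can be tabulated across replications as in the following sketch. Two points are assumptions: a coefficient counts as selected when its estimate exceeds a small tolerance in absolute value (posterior means under a Laplace prior are not exactly zero, so some such rule is needed), and Over-fit is counted only for runs that retain every true signal. The function name and toy estimates are hypothetical.

```python
import numpy as np

def selection_summary(beta_true, beta_hat_runs, tol=1e-8):
    """Average C and IC, plus Under-fit, Over-fit and Correct-fit proportions.
    beta_hat_runs has shape (n_runs, p)."""
    truth = np.asarray(beta_true) != 0
    selected = np.abs(np.asarray(beta_hat_runs, dtype=float)) > tol
    c = (selected & truth).sum(axis=1)      # true signals kept in each run
    ic = (selected & ~truth).sum(axis=1)    # noise variables kept in each run
    missed = c < truth.sum()                # at least one signal dropped
    return {
        "C": c.mean(),
        "IC": ic.mean(),
        "Under-fit": np.mean(missed),
        "Over-fit": np.mean(~missed & (ic > 0)),
        "Correct-fit": np.mean(~missed & (ic == 0)),
    }

beta_true_demo = [1.0, 0.0, 0.0, 2.0]
runs_demo = [[0.9, 0.0, 0.0, 1.8],   # correct fit
             [1.1, 0.3, 0.0, 2.1],   # over-fit
             [0.0, 0.0, 0.0, 1.9]]   # under-fit
summary = selection_summary(beta_true_demo, runs_demo)
```

With one run of each kind, the three proportions in the toy example are each one third.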
Real data analysis
For model evaluation, we fit the proposed models to two real data examples. In the first example, we mainly discuss the multi-covariate situation and employ the ridge PHLSM for node representation and link prediction. To compare our model with state-of-the-art methods, we also consider DLSM, a network model that likewise accounts for degree heterogeneity within the LSM framework. The second example focuses on the high-dimensional covariate case. Both regularization versions are fitted to evaluate the feature screening performance of the different penalties. We also modify the proposed models by extending the Normal prior of the latent positions to a mixture Normal distribution, so as to accommodate the community structure of the network data.
Pokec data
Pokec is the largest and most popular Twitter-type online social network in Slovakia. It has connected more than 1.6 million users and has remained popular even after the emergence of Facebook. An in-depth understanding of Pokec is necessary to evaluate current systems and to understand the impact of social networks on the Internet. Most Pokec users are ordinary individuals, but there also exist official accounts of governments, enterprises, media, and celebrities. The platform allows individuals to extend and maintain social relationships with others sharing similar interests, and allows institutions to make announcements and advertise to the public. The raw data extracted by [44] contain the profiles of 1,632,803 users and 30,622,564 directed binary relationships across the whole platform. Using yij = 1 to indicate that user vi follows user vj, we can estimate the empirical probability , which shows that the directed network is extremely sparse. In addition, the maximum out-degree is only 8,763, whereas the maximum in-degree reaches 13,733. In fact, most of the hubs with huge numbers of followers are official accounts of media or companies that use the network for publicity.
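As a rough check of the sparsity claim, the edge density implied by the two counts quoted above can be computed directly. The sketch assumes that the empirical link probability is the number of directed links divided by the number of ordered user pairs, n(n − 1); this definition and the function name `edge_density` are our assumptions.

```python
def edge_density(n_nodes, n_edges):
    """Fraction of ordered pairs (i, j), i != j, that carry a directed link."""
    return n_edges / (n_nodes * (n_nodes - 1))

# Counts quoted in the text for the full Pokec network.
density = edge_density(1_632_803, 30_622_564)
print(f"{density:.2e}")
```

Under this definition the density is on the order of 10^-5, that is, roughly one link per hundred thousand ordered pairs.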
To adapt this network to the proposed PHLSM, we draw a sample by randomly selecting 5 popular users and establishing a subnetwork from their followees. After eliminating nodes with missing attributes, the final sample size of our subnetwork is n = 695. The logarithmic complementary CDFs of the node degrees are presented in Fig 6. The outliers in the tails correspond to the popular hubs selected for the sample network (two of them have the same in-degree). As can be observed, although both degree distributions are approximately power-law (ignoring the non-linear head), the range of in-degrees is larger than that of out-degrees. The absolute slope of the linear part, namely the exponent θ in (18), is larger for out-degrees than for in-degrees. This means that the tail of the in-degree distribution is fatter, so there are more users with either extremely small or extremely large in-degree. Furthermore, the empirical link probabilities of this subnetwork are and
, indicating the sparsity of the network. Taking vj* as celebrities, the empirical reciprocity conditional probabilities are
and
. Accordingly, we regard the subnetwork sample as a semi-SFN and employ PHLSM for node representation and link prediction.
Fig 6. Logarithmic complementary CDFs of node degrees. The solid lines are fitted to the scatter points, excluding the non-linear parts at the heads and the outliers at the tails.
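A plot of this kind can be reproduced in outline as follows. The Python sketch computes the empirical complementary CDF of a degree sequence on the log-log scale and fits a straight line to a user-chosen middle range, so that the absolute slope plays the role of the exponent discussed above. The function names and the cutoffs `x_min` and `x_max` are hypothetical; the paper does not state how the linear part was delimited, and a least-squares line on log-log axes is only a descriptive device (see [42] for more careful estimators).

```python
import numpy as np

def log_ccdf(degrees):
    """Return log10 of the distinct degrees and log10 of the empirical P(D >= d)."""
    d = np.asarray(degrees)
    d = d[d > 0]                      # zero degrees cannot be shown on a log axis
    values = np.unique(d)
    ccdf = np.array([(d >= v).mean() for v in values])
    return np.log10(values), np.log10(ccdf)

def fit_linear_part(log_x, log_y, x_min, x_max):
    """Least-squares line through the points whose degree lies in [x_min, x_max]."""
    keep = (log_x >= np.log10(x_min)) & (log_x <= np.log10(x_max))
    slope, intercept = np.polyfit(log_x[keep], log_y[keep], 1)
    return slope, intercept

# Toy degree sequence whose CCDF is exactly linear on the log-log scale.
lx, ly = log_ccdf([1, 1, 2, 4])
slope, intercept = fit_linear_part(lx, ly, x_min=1, x_max=4)
print(slope)   # close to -1 for this toy sequence
```

Running the two functions separately on in-degrees and out-degrees, with the head and the hub outliers excluded through `x_min` and `x_max`, gives the two slopes that are compared in the text.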
The full user profiles of the Pokec data contain 60 user attributes, including user id, gender, region, whether all friendships are public, completion percentage of the user profile, time of last login, time of registration, age, and other free-text fields filled in by users. Because the user profiles are severely incomplete, we include only 4 attributes in our model, namely gender (binary), region (categorical), age (continuous), and registration time (continuous). Specifically, the regions are categorized at the state level (within Slovakia) or the country level (outside Slovakia), and any sample with zero age is treated as missing and deleted. We then propose to estimate
where yij = 1 if user vj is a friend of vi (although user vi is not necessarily a friend of vj), and Θ = {β, γ, σ2, τ2} is the collection of parameters. The continuous and discrete attributes are processed according to (3) and (4), respectively.
We run 100,000 iterations, including 30,000 for initial burn-in and 70,000 for monitoring. The trace plots for parameters β and σ2 are given in Fig 7. Posterior estimates of parameters are . Typically, homophily relationships are expected, that is, nodes sharing similar attributes are more likely to form ties. In this experiment, results of
,
and
suggest that region, age and registration time are homophily attributes, although the last has only a slight effect. On the other hand, the result of
indicates that the gender attribute is heterophilous, meaning that, on average, users of different genders tend to be closer. This is reasonable, since people are usually more interested in the opposite sex during social activities. By contrast, users from nearby regions or of similar ages are more likely to share common topics and become friends.
Our models (with and without covariates) are compared with the DLSM proposed by [26]. Specifically, we simplify the dynamic approach to fit a static network by ignoring the time t for each latent position, and the covariates enter in the same way as in PHLSM, given as
where Θ = (β, βin, βout, r) and r = (r1, …, rn) is a vector of node-specific influence factors. Experimental results are reported in Table 6, and ROC curves of the three models are depicted in Fig 8. Introducing the node attributes dramatically improves prediction accuracy in terms of, for example, TPR and AUC, and our model performs better than DLSM in fitting the semi-SFN on all four predictive indices. For inference, the PHLSM Markov chain converges in fewer than 10,000 iterations, as shown in the trace plots (see Fig 7), whereas DLSM needs more than 60,000 iterations to converge. The running time for estimating PHLSM using the MCMC algorithm with 100,000 iterations is 5.48 hours in R on a 2.6 GHz processor, whereas that for DLSM is 6.34 hours because it has more parameters to estimate.
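For readers who want to reproduce indices such as TPR and AUC from a fitted model, the following Python sketch shows one way to score predicted link probabilities against an observed directed adjacency matrix. It is a minimal illustration under our own assumptions: self-loops are excluded, TPR is evaluated at a hypothetical cutoff of 0.5, and AUC is computed by brute-force pairwise comparison, which is only practical for small networks. The remaining indices reported in Table 6 are not specified in this section and are not reproduced.

```python
import numpy as np

def off_diagonal(mat):
    """Flatten a square matrix row by row, dropping the diagonal (self-loops)."""
    mat = np.asarray(mat)
    return mat[~np.eye(mat.shape[0], dtype=bool)]

def tpr(y_true, scores, threshold=0.5):
    """Share of observed links whose predicted probability reaches the cutoff."""
    y = np.asarray(y_true).astype(bool)
    pred = np.asarray(scores, dtype=float) >= threshold
    return (pred & y).sum() / y.sum()

def auc(y_true, scores):
    """Probability that a random link is scored above a random non-link (ties count 1/2)."""
    y = np.asarray(y_true).astype(bool)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y], s[~y]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Toy 3-node directed network with made-up predicted probabilities.
Y = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
P = np.array([[0.0, 0.9, 0.2], [0.1, 0.0, 0.8], [0.4, 0.3, 0.0]])
y, p = off_diagonal(Y), off_diagonal(P)
print(tpr(y, p), auc(y, p))
```

For a network of the size used here (n = 695), a rank-based AUC routine from a statistics library would be a more memory-friendly choice than the pairwise comparison above.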
Twitter ego-network data
Almost everyone encounters hundreds or thousands of people since childhood, but the number of friends one can keep in touch with simultaneously is very limited. The anthropologist Dunbar points out that there is an upper limit on the number of social relationships a human being can maintain, which is about 150 [45]. This upper limit is determined by the physiological characteristics of primates. Recent studies have shown that the limit has not been breached despite the higher communication efficiency offered by mobile phones and social networking sites (see, for instance, [46]). Regarding a person (ego) and his/her friends as nodes and the friendships between this person and his/her friends as edges, we obtain an ego-centered network, or more briefly, an ego-network. Ego-networks are very important in anthropology. They are not only helpful for the detailed study of individual characteristics, but can also be extended to the study of the structure and function of social networks.
In this example, we consider 3 sets of ego-network data crawled from Twitter [47], with 28, 10 and 12 users respectively. In each ego-network, the users are relatively closely related because of the small circle size, and the ego is assumed to be followed by every other user in the circle. However, users from different ego-networks are barely connected, giving rise to the classical community structure of social networks. It is inappropriate to apply the original PHLSM here because the egos can only be considered hubs in their own circles, rather than global hubs. To adapt our model to such clustered networks, we follow [24] and assume the latent positions to be drawn from a mixture of multivariate Normal distributions, described as
(21)
where G is the number of clusters (G = 3 in this example), δg is the prior probability that node vi belongs to cluster g, and μg,
denote the mean and variance of cluster g. The posterior probability of the clustering labels is then given as
where ki denotes the clustering label of node vi. The prior distributions for δg, μg, and
are chosen as conjugate priors, namely the Dirichlet, Normal, and Inverse Gamma distributions, respectively.
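To make the label update concrete: under a spherical Gaussian mixture prior of the kind used in [24], the posterior probability that node vi carries label g, given its latent position and the mixture parameters, is proportional to δg times the Normal density of that position under cluster g. The Python sketch below assumes this standard form. The argument names `z` (latent positions) and `sigma2` (cluster variances) are our own labels, since the original notation is not reproduced in this section.

```python
import numpy as np

def label_posterior(z, delta, mu, sigma2):
    """Posterior membership probabilities under a spherical Gaussian mixture.

    z:      (n, d) latent positions.
    delta:  (G,)   mixing proportions, summing to 1.
    mu:     (G, d) cluster means.
    sigma2: (G,)   cluster variances (covariance assumed to be sigma2[g] * I).
    Returns an (n, G) matrix whose rows sum to 1.
    """
    z = np.asarray(z, dtype=float)
    mu = np.asarray(mu, dtype=float)
    delta = np.asarray(delta, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    d = z.shape[1]

    sq_dist = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, G)
    log_w = (np.log(delta)
             - 0.5 * d * np.log(2.0 * np.pi * sigma2)
             - 0.5 * sq_dist / sigma2)
    log_w -= log_w.max(axis=1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

# Two well-separated toy clusters in two dimensions.
probs = label_posterior(z=[[0.0, 0.0], [5.0, 5.0]],
                        delta=[0.5, 0.5],
                        mu=[[0.0, 0.0], [5.0, 5.0]],
                        sigma2=[1.0, 1.0])
print(probs.argmax(axis=1))
```

Within a Gibbs step, each label ki would be drawn from the corresponding row of this matrix before the conjugate updates of the mixture parameters.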
A further issue is the identifiability problem known as the “label switching” problem [48]: the mixture model is invariant to the order of the clustering labels, because the likelihood of (21) is the same for all permutations of the labels. In this example, we post-process the MCMC posterior samples by selecting the permutation of clustering labels that minimizes the Kullback-Leibler divergence. See [24] for more details.
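One common way to carry out such a post-processing step, in the spirit of [48], alternates between averaging the aligned membership probabilities over draws and re-choosing, for each draw, the label permutation closest to that average in Kullback-Leibler divergence. The Python sketch below illustrates the idea for a small number of clusters by enumerating all permutations. It is an illustration under our assumptions and not necessarily the exact procedure of [24]; the function name `relabel_kl` and the input layout are hypothetical.

```python
from itertools import permutations

import numpy as np

def relabel_kl(probs, n_iter=20, eps=1e-12):
    """Choose one label permutation per MCMC draw by minimizing KL divergence.

    probs: (T, n, G) array; probs[t, i, g] is the membership probability of
           node i for label g at draw t (before relabeling).
    Returns a list of T permutations (lists of column indices).
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    n_draws, _, n_labels = probs.shape
    perms = [list(p) for p in permutations(range(n_labels))]
    chosen = [perms[0]] * n_draws                     # start from the identity

    for _ in range(n_iter):
        # Reference matrix: average of the currently aligned draws.
        q = np.mean([probs[t][:, chosen[t]] for t in range(n_draws)], axis=0)
        updated = []
        for t in range(n_draws):
            costs = [np.sum(probs[t][:, p] * np.log(probs[t][:, p] / q))
                     for p in perms]
            updated.append(perms[int(np.argmin(costs))])
        if updated == chosen:                         # no permutation changed
            break
        chosen = updated
    return chosen

# Toy example: the middle draw has its two labels swapped.
base = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
draws = np.stack([base, base[:, ::-1], base])
print(relabel_kl(draws))
```

Enumerating permutations costs G! evaluations per draw, which is negligible for G = 3 as in this example but would need an assignment solver for many clusters.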
The node attributes are the hashtags (#) and mentions (@) extracted from each user’s tweets. In this experiment, we consider 112 attributes in total, each of which is a binary feature indicating whether the user’s tweets include a particular hashtag or mention. In practice, it is reasonable to conjecture that most of the features are insignificant, so a sparse model is sought via the LASSO-based PHLSM. To evaluate the feature screening and link prediction of the proposed models, we also fit the ridge PHLSM to obtain a full model, in which all covariates are retained. Both proposed models are modified by replacing the latent position prior with a mixture Normal distribution to accommodate the community structure.
To estimate PHLSM, we run the proposed MCMC algorithms for 60,000 iterations, with 10,000 for initial burn-in and 50,000 for monitoring. In the end, 7 significant features are selected in the sparse model, as listed in Table 7. The Twitter topics of greatest interest appear to be distracted driving, photos and ttot.
Cumulative mean plots for all regression coefficients and tuning parameters are depicted in Fig 9. Results of link prediction are reported in Table 8. For comparison, we also fit the latent cluster random effects model (LCREM) proposed by [25], which incorporates degree heterogeneity by adding node-specific random terms to the log odds. The predictive results are very similar for the two forms of PHLSM, but the sparse model includes only 7 covariates and is therefore much simpler than the full version with 112 covariates. These results reflect the advantage of the LASSO method for feature screening. On the other hand, LCREM performs poorly, especially in predicting true positive links. ROC curves of the three models are presented in Fig 10.
Fig 9. Cumulative mean plots. (a) Covariate regression coefficients; (b) tuning parameters.
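A cumulative mean plot tracks the running average of each parameter over the retained draws, and a curve that flattens out suggests that the posterior mean has stabilized. The short Python sketch below shows how such curves can be computed from stored MCMC output. The function name `cumulative_mean` and the array layout (iterations in rows, parameters in columns) are our assumptions.

```python
import numpy as np

def cumulative_mean(chain, burn_in=0):
    """Running mean of the draws kept after discarding the first `burn_in` iterations.

    chain: array with iterations along the first axis; either 1-D (one parameter)
           or 2-D (one column per parameter).
    """
    kept = np.asarray(chain, dtype=float)[burn_in:]
    steps = np.arange(1, kept.shape[0] + 1)
    steps = steps.reshape((-1,) + (1,) * (kept.ndim - 1))   # broadcast over columns
    return np.cumsum(kept, axis=0) / steps

# Toy chain: the first draw is treated as burn-in.
print(cumulative_mean([10.0, 1.0, 2.0, 3.0], burn_in=1))
```

For the run described above, `burn_in=10_000` would leave the 50,000 monitored iterations, and plotting each column of the result against the iteration index gives one curve per coefficient or tuning parameter.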
The directed graph for the fitted ego-networks is shown in Fig 11. The circles are positioned according to the estimated latent positions, and the directed edges denote the true relations among users. The colors and sizes of the circles denote the true user clustering labels and the estimated popularity scales, respectively. Specifically, most of the popular users, shown as large circles, concentrate near the center of a community, while those on the borders have only a few followers and are shown as small circles. In addition, the latent positions from different communities are clearly separated, suggesting the importance of community detection in fitting such multi-ego-networks.
Fig 11. Directed graph for the fitted ego-networks. The circles are positioned according to the estimated latent positions, and the directed edges denote the true relations among users. The colors and sizes of the circles denote the true user clustering labels and the estimated popularity scales, respectively.
Conclusions
This paper introduces penalized homophily latent space models for directed social networks. The proposed Bayesian inferential approaches achieve superior performance in fitting two real data examples. The proposed models accommodate typical network properties, such as reciprocity and transitivity, within the LSM framework. The first major innovation of the proposed methods is to broaden applicability and improve predictive accuracy by introducing pairwise node attributes. In addition, popularity scales are included to capture the heterogeneity of node in-degrees. The model performs well for node representation and link prediction on semi-SFNs. It also yields an alternative approach to network visualization, which reflects the social relationships among individuals as well as their popularity in a social network. For model evaluation, we compare our models with other network modeling frameworks such as DLSM. Our models, with a more concise form and lower computational cost, outperform these state-of-the-art approaches in our experiments.
References
- 1. Goodreau S. M. (2007). Advances in exponential random graph (p*) models applied to a large social network. Social Networks, 29(2), 231–248. pmid:18449326
- 2. Hunter D. R. (2007). Curved exponential family models for social networks. Social Networks, 29(2), 216–230. pmid:18311321
- 3. Robins G., Pattison P., Kalish Y., and Lusher D. (2007). An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2), 173–191.
- 4. Erdös P., and Rényi A. (1960). On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci, 5, 17–61.
- 5. Faust K. (1988). Comparison of methods for positional analysis: Structural and general equivalences. Social Networks, 10(4), 313–341.
- 6. Hoff P. D., Raftery A. E., and Handcock M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460), 1090–1098.
- 7. Holland P. W., and Leinhardt S. (1981). An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, 76(373), 33–50.
- 8. Holland P. W., Laskey K. B., and Leinhardt S. (1983). Stochastic blockmodels: First steps. Social Networks, 5(2), 109–137.
- 9. Newman M. E. (2003). The structure and function of complex networks. SIAM Review, 45(2), 167–256.
- 10. Nowicki K., and Snijders T. A. B. (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455), 1077–1087.
- 11. Wang Y. J., and Wong G. Y. (1987). Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397), 8–19.
- 12. Wasserman S., and Anderson C. (1987). Stochastic a posteriori blockmodels: Construction and assessment. Social Networks, 9(1), 1–36.
- 13. Bickel P., Choi D., Chang X., and Zhang H. (2013). Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics, 41(4), 1922–1943.
- 14. Choi D. S., Wolfe P. J., and Airoldi E. M. (2012). Stochastic blockmodels with a growing number of classes. Biometrika, 99(2), 273–284. pmid:23843660
- 15. Karrer B., and Newman M. E. (2011). Stochastic blockmodels and community structure in networks. Physical Review E, 83(1): 016107. pmid:21405744
- 16. Rohe K., Qin T., and Yu B. (2016). Co-clustering directed graphs to discover asymmetries and directional communities. Proceedings of the National Academy of Sciences, 113(45), 12679–12684. pmid:27791058
- 17. Frank O., and Strauss D. (1986). Markov graphs. Journal of the American Statistical Association, 81(395), 832–842.
- 18. Robins G., Snijders T., Wang P., Handcock M., and Pattison P. (2007). Recent developments in exponential random graph (p*) models for social networks. Social Networks, 29(2), 192–215.
- 19. Wasserman S., and Pattison P. (1996). Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika, 61(3), 401–425.
- 20. Strauss D., and Ikeda M. (1990). Pseudolikelihood estimation for social networks. Journal of the American Statistical Association, 85(409), 204–212.
- 21. Geyer C. J., and Thompson E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society: Series B (Methodological), 54(3), 657–683.
- 22. Hunter D. R., and Handcock M. S. (2006). Inference in curved exponential family models for networks. Journal of Computational and Graphical Statistics, 15(3), 565–583.
- 23. Van Duijn M. A., Gile K. J., and Handcock M. S. (2009). A framework for the comparison of maximum pseudo-likelihood and maximum likelihood estimation of exponential family random graph models. Social Networks, 31(1), 52–62. pmid:23170041
- 24. Handcock M. S., Raftery A. E., and Tantrum J. M. (2007). Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2), 301–354.
- 25. Krivitsky P. N., Handcock M. S., Raftery A. E., and Hoff P. D. (2009). Representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models. Social Networks, 31(3), 204–213. pmid:20191087
- 26. Sewell D. K., and Chen Y. (2015). Latent space models for dynamic networks. Journal of the American Statistical Association, 110(512), 1646–1657.
- 27. Sarkar P., and Moore A. W. (2005). Dynamic social network analysis using latent space models. ACM SIGKDD Explorations, 7(2), 31–40.
- 28. Austin A., Linkletter C., and Wu Z. (2013). Covariate-defined latent space random effects model. Social Networks, 35(3), 338–346.
- 29. Sewell D. K., and Chen Y. (2016). Latent space models for dynamic networks with weighted edges. Social Networks, 44, 105–116.
- 30. Faloutsos M., Faloutsos P., and Faloutsos C. (1999). On power-law relationships of the internet topology. In ACM SIGCOMM Computer Communication Review, 29(4), 251–262. ACM.
- 31. Chang X., Huang D., and Wang H. (2019). A popularity scaled latent space model for large-scale directed social network. Statistica Sinica, 29, 1277–1299.
- 32. Gormley I. C., and Murphy T. B. (2010). A mixture of experts latent position cluster model for social network data. Statistical Methodology, 7(3), 385–405.
- 33. Fan J., and Li R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
- 34. Zhang C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2), 894–942.
- 35. Tibshirani R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
- 36. Fan J., and Lv J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1), 101–148. pmid:21572976
- 37. Hastie T., Tibshirani R., and Friedman J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.
- 38. Genkin A., Lewis D. D., and Madigan D. (2007). Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3), 291–304.
- 39. Biswas S., and Lin S. (2012). Logistic Bayesian LASSO for identifying association with rare haplotypes and application to age-related macular degeneration. Biometrics, 68(2), 587–597. pmid:21955118
- 40. Geweke J., and Tanizaki H. (2001). Bayesian estimation of state-space models using the Metropolis-Hastings algorithm within Gibbs sampling. Computational Statistics & Data Analysis, 37(2), 151–170.
- 41. Gower J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53(3-4), 325–338.
- 42. Clauset A., Shalizi C. R., and Newman M. E. (2009). Power-law distributions in empirical data. SIAM Review, 51(4), 661–703.
- 43. Zou H., and Li R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4), 1509–1533. pmid:19823597
- 44. Takac L., and Zabovsky M. (2012). Data analysis in public social networks. In International Scientific Conference and International Workshop Present Day Trends of Innovations, 1(6).
- 45. Dunbar R. I. (1998). The social brain hypothesis. Evolutionary Anthropology: Issues, News, and Reviews, 6(5), 178–190.
- 46. Goncalves B., Perra N., and Vespignani A. (2011). Modeling users’ activity on Twitter networks: Validation of Dunbar’s number. PLoS ONE, 6(8): e22656. pmid:21826200
- 47. Leskovec J., and Mcauley J. J. (2014). Learning to discover social circles in ego networks. ACM Transactions on Knowledge Discovery from Data, 8(1): 539–547.
- 48. Stephens M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4), 795–809.