random graphs with communities and arbitrary degrees
Classical Erdös-Renyi random graphs lack many of the properties commonly associated with networks observed in the real world. In particular, they are restricted to binomial or Poisson degree distributions and are generated without a particular modular structure in mind (i.e., no group of nodes is treated any differently than any other group). Several additional random graph models have been proposed over the years that seek to alleviate these shortcomings. On the one hand, we have the configuration and models that produce random graphs with an arbitrary degree sequence (or expected degree sequence). There are also random graphs that incorporate community structure, such as the hierarchical random graph (HRG) or the stochastic block model (SBM). In fact, these models can produce non-Poisson degree distributions as well, but they often suffer from overfitting problems. In the case of the SBM the maximum likelihood estimator for a particular graph is one in which each edge exists with probability if those nodes are connected in the observed network and if they are not, so the overfitting is particularly bad.
We can avoid the overfitting problems by restricting our attention to the -parameter SBM. In this model, each node is assigned to one of distinct communities. An edge between nodes and exists with probability if and are both in community and with probability otherwise. This is a generalization of the 2-parameter SBM used by Hofman and Wiggins in their recent paper on Bayesian community detection, and is suitable for a network with a single layer of community structure. However, the resulting degree distribution is still fundamentally Poisson (now actually the sum of several Poisson distributions).
In my upcoming senior thesis on network community detection, I propose a generalization of the -parameter SBM that allows for an arbitrary degree sequence, effectively extending the configuration model to the realm of community structure. It can also be viewed as a slight generalization of the benchmark graph proposed by Lancichinetti et al that is suitable for maximum likelihood inference. In addition, this new model (which I am tentatively calling the 2-layer configuration model)can be naturally associated with the generative process behind the modularity function (more on this later).
Intuitively, the 2-layer configuration model works as follows. Each node is given a degree and is assigned to one of distinct communities, which means that each community has a total degree . Furthermore, each community is given an “affinity” in the range . We then assign edges to each community chosen by randomly picking two stubs within that community and connecting them with an edge (i.e., each community is treated as a separate configuration network with degree sequence ). The remaining edges are assigned by randomly choosing two of the unused stubs from anywhere in the network and connecting them with an edge (i.e., we then treat the entire network as a configuration network with degree sequence ). This yields a network with the desired degree sequence that also contains the desired community structure.
For reasons outlined in a previous post, it is often desirable to have a model in which each edge exists independently of all the others. This can be achieved (as in the case of the configuration model) by relaxing the degree requirements to merely specifying the expected degree for each node. However, because of the caveats outlined in that previous post, this model turns out to be rather complicated to write down. With a little work, one can show that the edge probabilities should be given by
A detailed derivation will be given in the thesis.