Regional Connectivity Analysis Over Large Graphs
Large-scale, networked systems such as the World Wide Web or Online Social Networks (OSNs) can be represented as graphs where nodes represent individual entities (e.g., web pages or users), and directed or undirected edges represent relations between these entities (e.g., interaction or friendship between users). Characterizing the connectivity structure of such a graph, in particular at scale, often provides deeper insight into the corresponding networked system and has motivated many researchers to analyze graph representations of large networked systems.
It is often very useful to obtain a coarse view of the connectivity structure of a huge graph that shows a few major tightly connected components or regions of the graph along with the inter- and intra-region connectivity. Such a regional view also enables a natural top-down approach to the analysis of large graphs, where one first examines the regional connectivity of a huge graph and then zooms in to individual regions to explore their structure in further detail. However, capturing a regional view of a huge graph is a non-trivial task that existing tools and techniques are not able to achieve. While many techniques exist for graph clustering, graph partitioning, and community detection, these approaches do not work well for discovering coarse regional views in very large graphs. These methods usually scale poorly, force regions to have similar size, or find communities that are too small. For example, existing techniques (e.g., Louvain) are likely to identify tens of thousands of communities in the structure of a large OSN that is still too complex for high-level analysis to determine the full picture of inter-community connectivity.
WalkAboutis technique to infer a coarse view of connectivity in very large graphs; that is, identify well-connected "regions" with different edge densities and determine the corresponding inter- and intra- region connectivity. We leverage the transient behavior of many short random walks (RW) on a large graph that is assumed to have regions of varying edge density but whose structure is otherwise unknown. The key idea is that as RWs approach the mixing time of a region, the ratio of the number of visits by all RWs to the degree for nodes in that region converges to a value proportional to the average node degree in that region. Leveraging this indirect sign of connectivity enables our proposed framework to effectively scale with graph size.
If individual regions in a graph have a different average degrees, we can use the visit-to-degree ratio across visited nodes to identify the regions and their corresponding nodes. Using this indirect sign of connectivity enables this framework to scale with graph size. We leverage this phenomenon in our proposed framework and address practical considerations.
As theory has shown, in an undirected, connected, and non-bipartite graph, the probability that a sufficiently long RW would be a particular node x converges to deg(x)/2|E|, where deg(x) is the degree of node x, and |E| is the number of edges in the graph. The mixing time of a graph is the walk length at which the probability of being at each node is within of the stationary distribution. In other words, when we start N RW and their length grows longer than the mixing time, deg(x)/Vis(x) = 2|E|/N, where vis(x) is number of visits to node(x) by the RWs, N is the total number of nodes. Now if we set N=|V| to previous equation turns into:
$$\frac{deg(x)}{Vis(x)} = \frac{2|E|}{|V|}$$
Now we consider a case where the graph has multiple regions that are loosely connected. In this case, RWs tend to stay inside the region before leaving the region an mix in the entire graph. Therefor RWs mix inside one region before mixing in the graph. We refer to the ratio of deg(x) over the number of visit to node x as dvr(x). We show that for a node in region i the following equation holds (Please refer to our related paper for more information regarding this.)
$$dvr_{i}(x) = \frac{deg(x)}{E[Vis(x,wl)]} = \frac{2|E_{i}|}{|V_{i}|}$$
E[Vis(x,wl)] represents the expected value of the number of visits from RWs with length wl, |E_{i}| is the number of edges in region i, and |V_{i}| is the number of nodes in region i. As the length of RWs increases, RWs start mixing in graph and the dvr will converge to 2|E|/|V| of the entire graph. Figure 1 shows this transient phenomenon in the dvr histogram of Twitter. The goal of WalkAbout is to capture this transient effect and use it to identity regions in the graph. Figure 3 shows the dvr histogram for the graph of Twitter for RW walk length of 18. It clearly shows that graph have various groups of nodes with similar dvr values. As a result the value of dvr can be used to group nodes based on the average degree of the region to which the node belongs.
In our empirical evaluation, we apply WalkAbout to three major OSNs: Flickr, Twitter and Google+. Compared to Louvain, the gold standard for scalable community detection, WalkAbout runs faster and finds larger, coarser regions. Most communities discovered by Louvain can be mapped to a single one of WalkAbout’s regions, suggesting that WalkAbout is providing a higher-level view of the network than Louvain. Finally, we analyze the regions in Flickr and show that different regions discovered by WalkAbout correspond to different interest groups, providing a meaningful coarse view of this OSN.
The WalkAbout code is implemented in C++. However, we have implemented a Python front end that can interactively communicate with the compiled C++ module. The python module is capable of displaying the dvr histogram. In addition, users can interactively chose the region's core borders. Based on the the identified cores by the Python module, graph is split into regions and the regional view of the graph is generated in GEXF which can be visualized by different graph visualization tools, e.g., Gephi. An example of such representation can be found in Figure 2. Please visit the instructions page for further details about this code.
This project is funded by the National Science Foundation (NSF) grant no. IIS-0917381 and IIS-1342477. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.