Zachary's Karate Club Technical Journal 3

General Overview

The Girvan-Newman Algorithm, named after Michelle Girvan and Mark Newman, is a hierarchical method used to detect communities in a graph depending on the iterative elimination of edges with the highest number of shortest paths that go through them. Essentially the algorithm is the process of repetitive calculations of edge betweenness centrality of all nodes and cutting the edge with the highest centrality. In pseudo-code:
Repeat until no edges are left:
  1. Calculate edge betweenness for every node in the graph
  2. Remove the edge with the highest edge betweenness
  3. Recalculate edge betweenness for all remaining nodes
  4. Connected nodes are considered communities
This method works because groups are separated from one another, revealing the underlying community structure of the network. Edge betweenness is the driving factor of the algorithm but it must be recalculated after each completed iteration. In sentences, edge betweenness works like the following: for every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such that, in our case (for an unweighted graph), several edges that the path passes through is minimized. The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex, as mentioned before. After calculating edge betweenness for all nodes in the graph, edge values are added together respectively. If specific edges are bi-directional, meaning that it can be traveled from both directions, divide the sum by two. In the end, a cut should be made for the edge(s) with the highest edge betweenness, effectively splitting the graph into separate communities.

Key Components

The most important step for the Girvan-Newman Algorithm is calculating the edge-betweenness of every node on the graph. But first, the while-loop must be instantiated so the algorithm repeats until there are no more edges to be cut. Then we run networkx's edge betweenness centrality function. The first step is to determine the shortest path for every node and assign the values to the respected node. A visual formula is shown below in an image:
Image result for edge betweenness
Calculating the shortest path on a graph

The next step is to find the value of each edge on the graph. You start from the bottom of a converted graph (pictured below) and divide the above node's value by the bottom node's value, getting a fraction for the value. From there on out, the edge values are the (the previous edge-value + the node it leads into's value) / the previous node. An example from the image below would be FI = (0.5 + 2) / 3 = 0.833.
 Image result for edge betweenness centrality
Example of a "converted" shortest-path graph with root node as A

The result is a completed edge betweenness graph of a single node as the root. This needs to be done for all nodes in the graph. The second to the last step is to add the respective edge values together. If the edges are bi-directional, divide the edge values by 2 and finally cut the edge(s) with the highest value. In math terms: the betweenness centrality of a node v is where  is the total number of shortest paths from node  to node  and  is the number of those paths that pass through .

Walkthrough

Above is a super simple, yet effective, graph to help guide us through the Girvan-Newman Algorithm. Our first step is to choose a root node and use the shortest path on the graph. After that, we re-format the graph in a more helpful structure. The first step is depicted below:

The next step is to calculate the edge values using the method I mentioned before in the key components. Below is a completed edge betweenness of node A in the graph.
Repeat for nodes B, C, D, E and F.

Next, we add all the values together to get a final graph:
We then plot them onto the original graph. And since this graph is bi-directional, we divide all values by two.
With the naked human eye, we can see that four of the edges have exactly the same value and that number happened to be the highest of the bunch, resulting in 4 cuts to effectively split the graph. That was just one iteration of the Girvan-Newman Algorithm. Obviously, it would continue until all edges are cut but it is up to the user to stop it given a case for an optimal result.

Implementation

I realized midway through my implementation that we should run the algorithm exactly once since we only want two clusters. Thus, my code only breaks down to a couple lines:
We calculate the edge betweenness of graph G and store it into bw. bw is an array of those edge values and because we want to find the max betweenness centrality, we run max(bw.values()) and then we cut all edges related to max.

Analysis

When we replot the newly cut graph (shown below), we notice a small variance between the result and the ground truth. It turns out it gets an accuracy of 0.9411764705882353 or 94.11764705882353%, missing nodes 2 and 8, nodes that are typically missed by a lot of algorithms. My guess as to why it missed those is because they were so typical of both communities that it had a hard time deciding where to cut it. It may also be because those karate members stuck with someone nothing could predict merely because it was just human nature. The advantages to this algorithm are the fact that you can get derived more and more communities out of a single graph if so desired and it works for unweighted and weighted graphs. However, it lacks the ability to accommodate data points which are distance-based.
Plotted graph for the Girvan-Newman Algorithm
The image below is a bar chart of the accuracy of all four algorithms that I tried for Zachary's Karate Club dataset. Obviously, because it missed two, it was a little bit less accurate than some of the rest. It did not do the best but it is good by data scientists' standards. In the future, I want to figure out how agglomerative clustering is applied to the dataset since all I've seen is it being a distance-based algorithm.

My summarized video: http://youtu.be/qJPMjUUfdZk?hd=1


Comments

Popular posts from this blog

Journal

Technical Journal - Research Papers

Zachary's Karate Club Technical Journal 1