Zachary's Karate Club Technical Journal 3
General Overview
The Girvan-Newman Algorithm, named after Michelle Girvan and Mark Newman, is a hierarchical method used to detect communities in a graph depending on the iterative elimination of edges with the highest number of shortest paths that go through them. Essentially the algorithm is the process of repetitive calculations of edge betweenness centrality of all nodes and cutting the edge with the highest centrality. In pseudo-code:
Repeat until no edges are left:
- Calculate edge betweenness for every node in the graph
- Remove the edge with the highest edge betweenness
- Recalculate edge betweenness for all remaining nodes
- Connected nodes are considered communities
Key Components
The most important step for the Girvan-Newman Algorithm is calculating the edge-betweenness of every node on the graph. But first, the while-loop must be instantiated so the algorithm repeats until there are no more edges to be cut. Then we run networkx's edge betweenness centrality function. The first step is to determine the shortest path for every node and assign the values to the respected node. A visual formula is shown below in an image:

Calculating the shortest path on a graph
The next step is to find the value of each edge on the graph. You start from the bottom of a converted graph (pictured below) and divide the above node's value by the bottom node's value, getting a fraction for the value. From there on out, the edge values are the (the previous edge-value + the node it leads into's value) / the previous node. An example from the image below would be FI = (0.5 + 2) / 3 = 0.833.
Example of a "converted" shortest-path graph with root node as A
The result is a completed edge betweenness graph of a single node as the root. This needs to be done for all nodes in the graph. The second to the last step is to add the respective edge values together. If the edges are bi-directional, divide the edge values by 2 and finally cut the edge(s) with the highest value. In math terms: the betweenness centrality of a node v is where is the total number of shortest paths from node to node and is the number of those paths that pass through .
Walkthrough
Above is a super simple, yet effective, graph to help guide us through the Girvan-Newman Algorithm. Our first step is to choose a root node and use the shortest path on the graph. After that, we re-format the graph in a more helpful structure. The first step is depicted below:
The next step is to calculate the edge values using the method I mentioned before in the key components. Below is a completed edge betweenness of node A in the graph.
Repeat for nodes B, C, D, E and F.
Next, we add all the values together to get a final graph:
We then plot them onto the original graph. And since this graph is bi-directional, we divide all values by two.
With the naked human eye, we can see that four of the edges have exactly the same value and that number happened to be the highest of the bunch, resulting in 4 cuts to effectively split the graph. That was just one iteration of the Girvan-Newman Algorithm. Obviously, it would continue until all edges are cut but it is up to the user to stop it given a case for an optimal result.
Implementation
I realized midway through my implementation that we should run the algorithm exactly once since we only want two clusters. Thus, my code only breaks down to a couple lines:
We calculate the edge betweenness of graph G and store it into bw. bw is an array of those edge values and because we want to find the max betweenness centrality, we run max(bw.values()) and then we cut all edges related to max.
Analysis
When we replot the newly cut graph (shown below), we notice a small variance between the result and the ground truth. It turns out it gets an accuracy of 0.9411764705882353 or 94.11764705882353%, missing nodes 2 and 8, nodes that are typically missed by a lot of algorithms. My guess as to why it missed those is because they were so typical of both communities that it had a hard time deciding where to cut it. It may also be because those karate members stuck with someone nothing could predict merely because it was just human nature. The advantages to this algorithm are the fact that you can get derived more and more communities out of a single graph if so desired and it works for unweighted and weighted graphs. However, it lacks the ability to accommodate data points which are distance-based.
Plotted graph for the Girvan-Newman Algorithm
The image below is a bar chart of the accuracy of all four algorithms that I tried for Zachary's Karate Club dataset. Obviously, because it missed two, it was a little bit less accurate than some of the rest. It did not do the best but it is good by data scientists' standards. In the future, I want to figure out how agglomerative clustering is applied to the dataset since all I've seen is it being a distance-based algorithm.
My summarized video: http://youtu.be/qJPMjUUfdZk?hd=1
Comments
Post a Comment