GitHub has made its data available on Google BigQuery and has recently launched a data challenge, which sounded like a good opportunity to experiment with some data visualizations in NodeXL. It’s really easy to extract relational data from this dataset and load it into a NodeXL template! All it takes is a simple SQL query, a click on the CSV download button in Google BigQuery, and a data import in NodeXL.
One of my queries returned a list of organisations and the number of projects each of them had on GitHub, broken down by programming language. From a network perspective, this data forms a bipartite graph with two types of nodes, organisations and programming languages, where a link indicates that an organisation has projects in the language it is connected to. Additional queries gave me the total number of projects per organisation and per language, which I used to set the size of the nodes in the graph visualization. The resulting graph is shown below.
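The data preparation described above can be sketched in a few lines of Python. This is only an illustration, not the queries or the NodeXL workflow actually used: the column names (`organisation`, `language`, `projects`) and the sample rows are hypothetical, and the node totals are derived here by summing the edge weights rather than by running separate queries.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of a CSV export; the column names and values
# are assumptions made for illustration, not the real GitHub data.
SAMPLE_CSV = """organisation,language,projects
acme,JavaScript,120
acme,Ruby,40
globex,JavaScript,75
globex,Python,60
"""


def build_bipartite(csv_text):
    """Return (edges, org_totals, lang_totals) from organisation/language rows.

    Each edge is (organisation, language, project_count). The totals are
    the sums of the edge weights per node, usable as node sizes.
    """
    edges = []
    org_totals = defaultdict(int)
    lang_totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        count = int(row["projects"])
        edges.append((row["organisation"], row["language"], count))
        org_totals[row["organisation"]] += count
        lang_totals[row["language"]] += count
    return edges, dict(org_totals), dict(lang_totals)


edges, org_totals, lang_totals = build_bipartite(SAMPLE_CSV)
```

The `edges` list corresponds to what would go into an edge worksheet, with one organisation-language pair per row, while the two totals dictionaries play the role of the per-node sizes.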
The graph shows organisations (in blue) and programming languages (in orange) that have at least 250 projects. A histogram in the top-left corner of the image ranks the languages by popularity on GitHub: JavaScript is by far the most popular language!
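One way to apply a cut-off like the 250-project threshold is sketched below. It assumes the threshold applies to each node's total project count, which is one possible reading; the function names and the sample figures are hypothetical.

```python
def filter_by_min_projects(totals, min_projects=250):
    """Keep only the nodes whose total project count reaches the threshold."""
    return {name: n for name, n in totals.items() if n >= min_projects}


def keep_edges(edges, orgs, langs):
    """Keep only the edges whose two endpoints both survived the filter."""
    return [(o, l, w) for o, l, w in edges if o in orgs and l in langs]


# Hypothetical totals and edges, for illustration only.
org_totals = {"acme": 400, "globex": 90}
lang_totals = {"JavaScript": 350, "Ruby": 260, "Python": 80}
edges = [
    ("acme", "JavaScript", 200),
    ("acme", "Ruby", 200),
    ("globex", "JavaScript", 50),
    ("globex", "Python", 40),
]

big_orgs = filter_by_min_projects(org_totals)
big_langs = filter_by_min_projects(lang_totals)
visible_edges = keep_edges(edges, big_orgs, big_langs)
```

Filtering the nodes first and then dropping edges with a missing endpoint keeps the graph consistent, so no link points to a node that is not drawn.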
About the data
The data was retrieved from Google BigQuery on May 7th, 2012 and visualised with NodeXL (version 1.0.1.209). The NodeXL data file can be downloaded here: github_lang_org.
