Seventh assignment, 20 Apr 2013

This assignment will involve the NetworkX software, as described in class on 19 Apr. The assignment involves finding an interesting network and answering some simple questions about it. Towards the end of this notebook there are some examples (from the NetworkX docs) of using the class Graph for simple undirected graphs (no arrows on links, no self loops or multiple edges).

The object of this assignment is to download a dataset and analyze some of its network features.

One possibility would be the actor network from the Internet Movie Database (imdb.com). An IMDb "big" data set (along with some other interesting datasets, including Medline co-authorship data) is available from the CMU Auton Lab's SNBN Datasets as imdb_b.csv.zip, and an actor network can be constructed from this data. (Nodes represent actors, and a link between two nodes indicates that they appeared together in at least one movie.) A fun question to ask about this network is which actor has the smallest average shortest-path length to all other nodes in the network, and what that smallest average value is. (It will probably not turn out to be Kevin Bacon, nor a length of six degrees.) For a coauthor network (nodes are authors, and links indicate at least one article coauthored together), the analogous path length, measured to the mathematician Paul Erdős, is termed the "Erdős number".
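As a rough illustration of the construction described above, here is a minimal sketch that builds a co-appearance graph and finds the node with the smallest average shortest-path length. The `casts` dictionary and its actor names are made-up toy data, not the layout of imdb_b.csv, and `average_distance` is a hypothetical helper; the real file would need its own parsing step. The sketch also assumes you restrict attention to the largest connected component, since path lengths between disconnected nodes are undefined.

```python
from itertools import combinations

import networkx as nx

# Hypothetical toy data: movie -> list of actors appearing in it.
# (The real imdb_b.csv has its own layout and needs separate parsing.)
casts = {
    "movie_a": ["Ann", "Bob", "Cy"],
    "movie_b": ["Cy", "Dee"],
    "movie_c": ["Dee", "Eve"],
}

# Link every pair of actors who share at least one movie.
G = nx.Graph()
for actors in casts.values():
    G.add_edges_from(combinations(actors, 2))

# Assumption: work on the largest connected component only.
largest = max(nx.connected_components(G), key=len)
H = G.subgraph(largest)


def average_distance(graph, source):
    """Mean shortest-path length from source to every other node."""
    lengths = nx.single_source_shortest_path_length(graph, source)
    return sum(lengths.values()) / (len(lengths) - 1)


avg = {node: average_distance(H, node) for node in H}
best = min(avg, key=avg.get)
print(best, avg[best])
```

On a network the size of the full actor graph, running a breadth-first search from every node can be slow, so it may be worth testing on a sample of the data first.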

The NetworkX software has built in all of the algorithms you'll need for building and analyzing the graphs, e.g., centrality, clique, clustering, communities, components, distance measures, and in particular shortest paths.

Questions you should explore in addition to the above include: which actor or author has the highest degree (number of links)? Which has the highest betweenness centrality? You should also visualize the ego network of this node, determine how many communities are in the network and infer their properties, and so on. (Check the algorithms for possibilities.)
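The questions above might be approached along the following lines. This is only a sketch: it uses NetworkX's small built-in karate club graph as a stand-in for your own dataset, and `greedy_modularity_communities` is just one of several community detection routines you could choose (it requires a reasonably recent NetworkX version).

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Stand-in network; replace with the graph built from your own data.
G = nx.karate_club_graph()

# Node with the highest degree (number of links).
degrees = dict(G.degree())
top_degree_node = max(degrees, key=degrees.get)

# Node with the highest betweenness centrality.
betweenness = nx.betweenness_centrality(G)
top_betweenness_node = max(betweenness, key=betweenness.get)

# Ego network: the chosen node, its neighbors, and the links among them.
ego = nx.ego_graph(G, top_betweenness_node)

# One possible community partition (a list of node sets).
communities = greedy_modularity_communities(G)

print("highest degree:", top_degree_node, degrees[top_degree_node])
print("highest betweenness:", top_betweenness_node)
print("ego network size:", ego.number_of_nodes())
print("number of communities:", len(communities))
```

To visualize the ego network, `nx.draw(ego, with_labels=True)` should work in a notebook with matplotlib available. Different community algorithms can give different partitions, so the number of communities you report depends on the method used.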

Other possible data sources to explore, if closer to your interests: J. Leskovec has collected some wonderful datasets (including a facebook egonet and a wikipedia votenet, as well as the twitter data mentioned in class, citation networks, web graphs, networks with ground truth communities, and others). Some mobile data is available from the MIT Human Dynamics Lab. More datasets are linked from M. Newman and infovis cbi, and from many other places. (Saeed had suggested datamob.org, but it has since gone down; here's an archived link.)