Sixth assignment, 5 Apr 2013

This assignment will involve decision tree classifiers, as described in class. (There was some minor difficulty getting the dataset together -- I decided at the last moment to switch to json format, to give experience with that.)

You have a choice regarding which decision tree software to use. In the nltk book, there's an example of decision tree usage here, with further description here. The nltk software prints pseudocode for the tree, but not a full diagram. scikit-learn (part of the enthought distribution most of you have installed) also has decision tree code built in, with examples here. (Note that to use the scikit-learn visualization you will need to install GraphViz, also needed for the networkx software we'll use next week. The packages for Mac OS are here; you may need to ctrl-click and choose "Open" in order to get around the Mac security restrictions.)

I left a simple demo of the scikit-learn visualization and nltk pseudocode here, along with some functions for reading files from the dataset.
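For orientation, here is a minimal sketch of the scikit-learn side: fit a tree on the built-in iris data and write out a GraphViz .dot file. (This follows the standard scikit-learn example rather than reproducing the demo notebook itself.)

    from sklearn.datasets import load_iris
    from sklearn import tree

    # Fit a decision tree on the built-in iris data and dump a GraphViz .dot file,
    # which can then be rendered with GraphViz, e.g. "dot -Tpng iris.dot -o iris.png".
    iris = load_iris()
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(iris.data, iris.target)

    with open('iris.dot', 'w') as f:
        tree.export_graphviz(clf, out_file=f)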

You are also welcome to suggest better decision tree software if you're aware of any.

To start the assignment, read the above descriptions and install any necessary software (as mentioned, if you've already installed enthought python and nltk, you should already have everything except GraphViz).
Note: you need to be using the latest version of the IPython notebook for the GraphViz visualization to work properly in the browser; see the upgrade instructions.

In looking for a fun exercise using decision trees, I was intrigued by the above nltk example, which uses the "suffix" of a word (defined as the last one, two, or three characters) in a decision tree to determine part of speech. I wondered whether the "prefix", defined as, say, the first three letters, could instead convey content. (The intuition is that words starting with, say, 'phy' or 'bio' strongly convey content.)
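For reference, here is a rough sketch of that suffix idea, along the lines of the nltk book example (the exact feature encoding and the corpus slice are my choices here, not the book's verbatim code; it assumes the brown corpus has been downloaded via nltk):

    import nltk
    from nltk.corpus import brown

    # In the spirit of the nltk book example: predict part of speech from word suffixes.
    def pos_features(word):
        return {'suffix(1)': word[-1:],
                'suffix(2)': word[-2:],
                'suffix(3)': word[-3:]}

    # A few thousand tagged words keep the (slow) decision tree training manageable.
    tagged = brown.tagged_words(categories='news')[:5000]
    featuresets = [(pos_features(w), tag) for (w, tag) in tagged]
    cutoff = int(0.9 * len(featuresets))
    train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

    classifier = nltk.DecisionTreeClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    print(classifier.pseudocode(depth=4))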

So I assembled a dataset of recent abstracts from a small subset of eleven "categories":
categories=['astro-ph', 'cond-mat', 'cs', 'gr-qc', 'hep', 'math', 'nucl', 'physics.optics', 'q-bio', 'q-fin', 'quant-ph']
with 200 abstracts from each (for a total of 2200).

The dataset is here: arxiv_data.json. The .json format can be loaded as indicated in the assignment notebook (but also have a look directly at the file; the format should be transparent).
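A minimal loading sketch follows; the key layout below is an assumption on my part, so check it against the actual file or the assignment notebook:

    import json

    # Load the dataset; the structure assumed below (category name -> list of records,
    # each with 'title' and 'abstract' fields) should be verified against the file.
    with open('arxiv_data.json') as f:
        data = json.load(f)

    for category, records in data.items():
        print('%s: %d abstracts' % (category, len(records)))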

You have many choices of features. In the example in the notebook I used the highest-frequency features in two categories, for a binary classifier. It is also possible to experiment with using all of them, and with multiclass possibilities (all eleven at once?).
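As one possible starting point (not the notebook's exact feature set), here is a sketch of a binary classifier on two of the categories, using the most frequent three-letter prefixes in their abstracts as binary features. It assumes the data layout from the loading sketch above.

    import random
    import nltk
    from collections import Counter

    # Hypothetical binary setup: tell 'q-bio' abstracts from 'q-fin' abstracts
    # using the most frequent three-letter word prefixes as binary features.
    def prefixes(text):
        return [w[:3].lower() for w in text.split() if len(w) >= 3]

    cats = ['q-bio', 'q-fin']
    counts = Counter()
    for cat in cats:
        for rec in data[cat]:
            counts.update(prefixes(rec['abstract']))
    top_prefixes = [p for p, n in counts.most_common(200)]

    def doc_features(text):
        present = set(prefixes(text))
        return dict(('has(%s)' % p, p in present) for p in top_prefixes)

    featuresets = [(doc_features(rec['abstract']), cat)
                   for cat in cats for rec in data[cat]]
    random.shuffle(featuresets)
    cutoff = int(0.8 * len(featuresets))
    train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

    classifier = nltk.DecisionTreeClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    print(classifier.pseudocode(depth=4))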

In the above features, only words from the abstracts were used. For further intuition, you could also try using full words from the titles (also in the .json file) and computing the mutual information between those words and the categories, to see which are the biggest contributors (most discriminating), as explained in these slides from lecture 10.
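A rough sketch of that mutual information computation, treating each title as a set of lowercase words and scoring I(word presence; category) for each word (field names again follow the assumed layout of the loading sketch above):

    import math
    from collections import Counter

    def title_words(rec):
        return set(rec['title'].lower().split())

    docs = [(cat, title_words(rec)) for cat in data for rec in data[cat]]
    N = float(len(docs))
    cat_counts = Counter(cat for cat, words in docs)
    word_counts = Counter(w for cat, words in docs for w in words)   # documents containing w
    joint = Counter((cat, w) for cat, words in docs for w in words)  # documents in cat containing w

    def mutual_information(word):
        # Sum over categories and over word present/absent: p(w,c) log2[ p(w,c) / (p(w) p(c)) ]
        mi = 0.0
        p_w = word_counts[word] / N
        for cat, n_c in cat_counts.items():
            p_c = n_c / N
            for present, p_state in ((True, p_w), (False, 1.0 - p_w)):
                n_joint = joint[(cat, word)] if present else n_c - joint[(cat, word)]
                p_joint = n_joint / N
                if p_joint > 0:
                    mi += p_joint * math.log(p_joint / (p_c * p_state), 2)
        return mi

    # Rank reasonably common title words by how much they tell us about the category.
    common = [w for w in word_counts if word_counts[w] >= 5]
    ranked = sorted(common, key=mutual_information, reverse=True)
    print(ranked[:20])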