Second assignment
- Write a non-trivial program that gets and processes some text data, and saves data and/or results to a file
- Why did you choose this data source? Why is it interesting? What alternatives did you consider?
- Why did you choose this question and processing technique? Why is it interesting? What alternatives did you consider?
- We'll use these to organize the group and class discussion a little more.
(The random thought mentioned in class was weather data, e.g., Ithaca Jan 1900-2012, but anything is possible.
See also open data on the web, or consider some of the nltk data described below.)
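As a rough illustration of the "get, process, save" shape described above, here is a minimal sketch in Python. It is not a model solution: the sample sentence stands in for whatever data source you pick, and the function names `word_counts` and `save_counts` and the output filename `word_counts.csv` are made up for this example.

```python
import csv
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, pull out runs of letters/apostrophes, and count them."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def save_counts(counts, path, top_n=10):
    """Write up to top_n of the most common words to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(counts.most_common(top_n))


# Stand-in data so the sketch runs offline; replace with text from your own source.
sample_text = "It was the best of times, it was the worst of times."
counts = word_counts(sample_text)
save_counts(counts, "word_counts.csv")
```

Counting words is only a placeholder for the processing step; the point is the separation between getting the text, computing something from it, and writing the result out.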
- Install nltk if you don't already have it.
- Instructions are here, but note that if you have the Enthought distribution of Python, you can skip steps 1--5 (since you already have easy_install, pip, numpy, pyyaml), and just run
sudo pip install -U nltk
- Update: when I installed on my newer Mac running OS X 10.8.3, I found I had to do the following first (because the latest version of pip had openssl problems):
sudo easy_install pip==1.2.1
And then, after that, run the install as above:
sudo pip install -U nltk
I also needed to install X11 from XQuartz.
- Install as much of the nltk data as you want; instructions for the interactive installer are here (if you install all of it, it's < 1 GB, dominated by the corpora, the largest of which is the thesaurus)
- Browse the first two chapters of the nltk book, and let us know if you see any interesting problems among those at the end of the chapters
- Have a look at the Norvig follow-up
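Once the install steps above are done, a quick check along these lines can confirm that nltk imports and that at least one corpus is in place. This is only a sketch: the helper name `check_nltk` is made up here, and `corpora/gutenberg` is an assumed example resource, so substitute whichever corpus you actually downloaded.

```python
def check_nltk(resource="corpora/gutenberg"):
    """Return a short status string describing the local nltk setup."""
    try:
        import nltk
    except ImportError:
        return "nltk is not installed"
    try:
        # nltk.data.find raises LookupError when the resource is not on disk.
        nltk.data.find(resource)
    except LookupError:
        return "nltk %s found, but %s is missing" % (nltk.__version__, resource)
    return "nltk %s found, with %s" % (nltk.__version__, resource)


status = check_nltk()
print(status)
```

If the resource is reported missing, running `nltk.download()` from a Python prompt opens the interactive installer mentioned above.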