First assignment
  1. install Python if you don't already have it (instructions)
  2. play with it
  3. exercise in trigram probabilities (details below)
  4. think about online data that might be fun to harvest and analyze
  5. find an interesting Python script online to discuss in class

Trigram probabilities (a.k.a. Mark V Shaney)

As explained in class, p(w3 | w1 w2) is the conditional probability that w3 occurs following the two words "w1 w2". It can be estimated from a training corpus by counting the number of times the three-word sequence "w1 w2 w3" occurs and dividing by the number of times "w1 w2" occurs, giving the fraction of times that "w1 w2" is followed by w3 in the training corpus.
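A minimal counting sketch in Python, assuming the training text has already been split into a list of word tokens called words (the function name trigram_probs and the plain-dictionary representation are just illustrative choices, not a required design):

    from collections import defaultdict

    def trigram_probs(words):
        """Estimate p(w3 | w1 w2) from a list of word tokens."""
        trigram_counts = defaultdict(int)
        bigram_counts = defaultdict(int)
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            trigram_counts[(w1, w2, w3)] += 1
            bigram_counts[(w1, w2)] += 1
        # p(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
        return {(w1, w2, w3): c / bigram_counts[(w1, w2)]
                for (w1, w2, w3), c in trigram_counts.items()}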

The object of this exercise is to gain some experience in identifying a training corpus, learning how to read text into a program, and using random numbers to generate text from a language model. (Some of this you can base on the spell-correct reading for next week's class. It is also fine to discuss methodologies on Piazza. If you really don't know where to start, there are many hints given in a similar assignment here.) Once you've chosen text on which to train the trigram probabilities, you can generate random text by starting from a random bigram (or from one known to start a sentence in the training set).
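For the generation step, here is one possible sketch, again assuming a list of word tokens called words. Rather than storing probabilities explicitly, build_successors maps each bigram to the list of words that followed it in the training text (repetitions kept, so random.choice samples each continuation in proportion to its count); generate is a hypothetical name for the sampling loop:

    import random
    from collections import defaultdict

    def build_successors(words):
        """Map each bigram to the list of words that followed it (with repeats)."""
        successors = defaultdict(list)
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            successors[(w1, w2)].append(w3)
        return successors

    def generate(successors, n=100):
        """Generate up to n words, starting from a random bigram seen in training."""
        w1, w2 = random.choice(list(successors))
        output = [w1, w2]
        for _ in range(n):
            candidates = successors.get((w1, w2))
            if not candidates:      # dead end: no trigram starts with this bigram
                break
            w3 = random.choice(candidates)
            output.append(w3)
            w1, w2 = w2, w3
        return ' '.join(output)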

Here are samples generated using a Perl version, the first trained on computer science abstracts, the second on high energy physics abstracts. Note that you'll face various design choices regarding the handling of punctuation, etc.
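For example, one simple (though by no means the only) choice is to split punctuation off into tokens of its own, so that a period becomes a "word" and the model can learn where sentences tend to end. A hypothetical tokenize helper along those lines:

    import re

    def tokenize(text):
        """Split text into word and punctuation tokens.
        Treating '.' as its own token lets the model learn sentence endings."""
        return re.findall(r"[A-Za-z0-9']+|[.,!?;]", text)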