Cornell Information Science


INFO 295: Mathematical Methods for Information Science

Fall 2007
Tue/Thu 2:55-4:10 PM Hollister 110
4 credits, S/U Optional
Co-requisite: Math 231 or equivalent

This course teaches basic mathematical concepts in information modeling. Topics to be covered include graph theory, discrete probability, finite automata, Markov models, and hidden Markov models. We will be guided by examples and applications from various areas of information science, such as the structure of the Web, genome sequences, natural languages, and signal processing.


Professor: Paul Ginsparg (301 College Ave, 5-7371, ginsparg@cornell.edu)
Office hours: Wednesdays 2-3 PM (or by appointment)
Course website: http://www.infosci.cornell.edu/Courses/info295/2007fa/ (this page)

Lecture 1 (Thu 23 Aug 07)

Intro, overview of topics to be discussed. Background discussion of Set Theory.
(See also INFO 295 Fall 2006)

Lecture 2 (Tue 28 Aug 07)

Probability and Counting 1 (Probability space, events, binomial coefficients, conditional probability).
See also counting examples.

Problem Set 1 (due in class Thurs, 6 Sep 07)

Lecture 3 (Thu 30 Aug 07)

More on conditional probability: Bayes' Law and examples. (see also Bayesian inference)
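
For a concrete illustration, here is a minimal Python sketch of the usual rare-condition test example of Bayes' Law; the numbers are invented for illustration and are not from the lecture.

    # Bayes' Law: P(A|B) = P(B|A) P(A) / P(B).  All numbers below are made up.
    p_disease = 0.01             # prior P(disease)
    p_pos_given_disease = 0.99   # sensitivity, P(+ | disease)
    p_pos_given_healthy = 0.05   # false positive rate, P(+ | no disease)

    # Total probability of a positive test.
    p_pos = (p_pos_given_disease * p_disease
             + p_pos_given_healthy * (1 - p_disease))

    # Posterior probability of disease given a positive test.
    p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
    print(p_disease_given_pos)   # about 0.167, despite the 99% sensitive test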

Lecture 4 (Tue 4 Sep 07)

Finish Bayes' Law and examples.
More on independent events, binomial distributions, random variables, expectation and variance (Probability and Counting 2)

Lecture 5 (Thu 6 Sep 07)

Continue Probability and Counting 2 notes on independent events, expectation and variance of Bernoulli distributions

Problem Set 2 (due in class Thurs, 20 Sep 07)

Lecture 6 (Tue 11 Sep 07)

Some comments on normal distributions and the central limit theorem (see also), and review of exponentials and logarithms.

Lecture 7 (Thu 13 Sep 07)

Discuss www-admin spam, introduce A plan for spam (Graham, 2002) (plus "naive Bayes" and combining probabilities).
(See also section 6.3 of Rosen, and "naive Bayes" document classification and Bayesian spam filtering.)

Lecture 8 (Tue 18 Sep 07)

Completed A plan for spam (Graham, 2002) and follow-up (Graham, 2003), with more details on "naive Bayes".
(Additional updates on this approach, including FAQ and "white papers" can be found at dspam resources.)
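
For reference, a minimal Python sketch of the equal-priors naive Bayes combining rule described in Graham's essay; the per-token probabilities below are invented.

    from math import prod  # Python 3.8+

    def combine(probs):
        """Combine per-token spam probabilities a la 'A plan for spam':
        naive Bayes with equal priors on spam and non-spam."""
        p = prod(probs)
        q = prod(1 - x for x in probs)
        return p / (p + q)

    # Hypothetical per-token P(spam | token) values for one message.
    print(combine([0.99, 0.95, 0.20]))  # ~0.998, dominated by the strong spam tokens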

Lecture 9 (Thu 20 Sep 07)

Some final comments on combining probabilities (p(L or R | O) vs. p(O | L and R)).
Start Graph Theory 1 (Basic definitions, isomorphisms, paths)

Suggested Tue 16 Oct as likely date for in-class midterm exam.

Problem Set 3 (due in class Thurs, 4 Oct 07)
Added note: here is a hint for [optional] 5b

Lecture 10 (Tue 25 Sep 07)

Graph Theory 2 (Eulerian and Hamiltonian circuits, see also Hamiltonian circuit)
Comments on De Bruijn sequences and magic tricks.
Example of Gray Coding from course text (pp 642-643).
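
As a quick illustration (not the book's presentation), a small Python sketch of the standard binary-reflected Gray code, in which consecutive codewords differ in exactly one bit:

    def gray_code(n):
        """Return the n-bit binary-reflected Gray code as a list of bit strings."""
        return [format(i ^ (i >> 1), f"0{n}b") for i in range(2 ** n)]

    print(gray_code(3))
    # ['000', '001', '011', '010', '110', '111', '101', '100']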

Lecture 11 (Thu 27 Sep 07)

Graph Theory 3 (Trees, planarity, colorings, Euler's formula)

Lecture 12 (Tue 2 Oct 07)

(Comments on hint for [optional] 5b)
6 color theorem (and 5 color theorem). Started Graph Theory 4 (Graph algorithms).
Plus some comments on Milgram's "small world" experiment.

Lecture 13 (Thu 4 Oct 07)

Continue Graph Theory 4 (Graph algorithms), proof of minimum spanning tree algorithms, and application to clustering.
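
For reference, a minimal Python sketch of Kruskal's minimum spanning tree algorithm with union-find; stopping early so that k components remain is one standard way to read the clustering application (single-linkage clustering). The example edge list is invented.

    def kruskal(n, edges, components=1):
        """Kruskal's algorithm with union-find on vertices 0..n-1.
        edges: list of (weight, u, v).  Stopping when `components` pieces
        remain yields single-linkage clusters instead of a full spanning tree."""
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        tree, pieces = [], n
        for w, u, v in sorted(edges):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                tree.append((u, v, w))
                pieces -= 1
                if pieces == components:
                    break
        return tree

    print(kruskal(4, [(1, 0, 1), (2, 1, 2), (9, 2, 3), (3, 0, 2)]))
    # [(0, 1, 1), (1, 2, 2), (2, 3, 9)]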


Fall Break (Tue 9 Oct 07)

Lecture 14 (Thu 11 Oct 07)

Finish Graph Theory 4 (Graph algorithms). Some comments on Erdős–Rényi random graphs, a brief discussion of quantitative methods applied to social networks, node and link centrality (browse sections 1.1, 1.2, 1.3.6, and 1.4 [pp 1-3, 14-25] of Who is the best connected Scientist? (Newman, 2003)), more on Milgram's "small world" experiment, and Collective dynamics of 'small-world' networks (Watts/Strogatz, 1998).

Problem Set 4 (due in class Tues, 30 Oct 07)

Lecture 15 (Tue 16 Oct 07)

Midterm

Lecture 16 (Thu 18 Oct 07)

Final comments on small world networks, clustering coefficient. Comments on random graphs and Poisson distributions (problem set 4). Some additional comments on Poisson distributions.
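
As a quick numerical illustration (parameters chosen arbitrarily), a short Python sketch comparing the binomial degree distribution of a G(n, p) random graph with its Poisson approximation:

    from math import comb, exp, factorial

    n, p = 1000, 0.003     # illustrative random-graph parameters
    lam = (n - 1) * p      # expected degree

    for k in range(6):
        binom = comb(n - 1, k) * p**k * (1 - p)**(n - 1 - k)
        poisson = exp(-lam) * lam**k / factorial(k)
        print(k, round(binom, 4), round(poisson, 4))   # the two columns nearly agree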

Lecture 17 (Tue 23 Oct 07)

(Graph of C(17,i)/2^17: probability of only four of seventeen last names in class starting in the latter half of the alphabet < ~10%)
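
A quick check of that figure, assuming each last name independently has probability 1/2 of starting in the latter half of the alphabet:

    from math import comb

    n = 17
    probs = [comb(n, i) / 2**n for i in range(n + 1)]
    print(sum(probs[:5]))   # P(at most 4 of 17) ~ 0.0245, comfortably under 10%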

Recap brief history of the Web, start to discuss the PageRank algorithm; see also the AMS feature column.

Example iterations from the n=4 example at the end of class.
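
For reference, a minimal Python sketch of PageRank by power iteration; the 4-node link structure below is invented (not the in-class example), and 0.85 is the customary damping factor.

    def pagerank(links, d=0.85, iters=50):
        """Power iteration for PageRank on a dict node -> list of out-links.
        Damped update: r(v) = (1-d)/n + d * sum of incoming rank shares.
        (Assumes every page has at least one out-link; dangling nodes
        would need special handling.)"""
        nodes = list(links)
        n = len(nodes)
        rank = {v: 1.0 / n for v in nodes}
        for _ in range(iters):
            new = {v: (1 - d) / n for v in nodes}
            for u in nodes:
                share = d * rank[u] / len(links[u])
                for v in links[u]:
                    new[v] += share
            rank = new
        return rank

    # Hypothetical 4-page web: edges point from a page to the pages it links to.
    web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(pagerank(web))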

See also historical background material:
0) W3C "Little History"
1) Vannevar Bush, As We May Think, Atlantic Monthly, July 1945
2) The World Wide Web: A very short personal history, Tim Berners-Lee
3) The Anatomy of a Large-Scale Hypertextual Web Search Engine (Brin/Page, 1998), equivalent html (sections 1-3, pp 1-6)

Lecture 18 (Thu 25 Oct 07)

Continue discussion of the PageRank algorithm; see also the AMS feature column.

(Note: office hours shifted to Monday 2 pm, at the 301 College Ave office, for the weeks of 15, 22, and 29 Oct)

Lecture 19 (Tue 30 Oct 07)

Complete discussion of the PageRank algorithm and the Perron–Frobenius theorem (for the mathematically inclined, see also The $25B Eigenvector).

Problem Set 5 (due in class Thu 15 Nov)

Lecture 20 (Thu 1 Nov 07)

Discuss other features used in ordering heuristics, mention "Google bombing" (NYTimes, 22 Jan 04) and Authoritative sources in a hyperlinked environment (Kleinberg, 1998).
Begin Finite State Automata 1 (alphabet, DFA, regular sets)

Some additional resources (suggested by a student last year as useful, feel free to forward others you've found): Wikipedia PageRank, Webworkshop pagerank, efactory pagerank-algorithm, ...

Lecture 21 (Tue 6 Nov 07)

Continue Finite State Automata 1 (alphabet, DFA, regular sets), examples of DFAs.

Lecture 22 (Thu 8 Nov 07)

Finite State Automata 2 (NFA, subset construction) and example of subset construction. Finite State Automata 3 (Regular expressions)
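
For reference, a compact Python sketch of the subset construction on an epsilon-free NFA; the example NFA (strings over {0,1} ending in "01") is a standard textbook one, not necessarily the one used in class.

    def subset_construction(nfa, start, alphabet):
        """Convert an NFA (no epsilon-moves) into a DFA via the subset construction.
        nfa maps (state, symbol) -> set of next states; DFA states are frozensets.
        A DFA state is accepting iff it contains an accepting NFA state."""
        dfa, todo = {}, [frozenset([start])]
        while todo:
            current = todo.pop()
            if current in dfa:
                continue
            dfa[current] = {}
            for a in alphabet:
                nxt = frozenset(s for q in current for s in nfa.get((q, a), ()))
                dfa[current][a] = nxt
                todo.append(nxt)
        return dfa

    # Hypothetical NFA over {0,1} accepting strings ending in "01" (accepting NFA state "2").
    nfa = {("0", "0"): {"0", "1"}, ("0", "1"): {"0"}, ("1", "1"): {"2"}}
    for state, moves in subset_construction(nfa, "0", "01").items():
        print(set(state), {a: set(t) for a, t in moves.items()})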

Lecture 23 (Tue 13 Nov 07)

Ultimate periodicity and constraints on the lengths of elements of regular sets. Connection between NFAs and regular expressions (examples). Construction of an automaton for A* from an automaton for language A.

Lecture 24 (Thu 15 Nov 07)

More examples of the regexp/NFA connection, examples of language parsing and number recognition. Colorless green ideas sleep furiously.
Started discussion of Markov Chains 1 (Basic definitions, transition matrix), more on PageRank as a Markov process.

Problem Set 6 (due in class Thurs 29 Nov)

Lecture 25 (Tue 20 Nov 07)

Relation between Markov chains and random walks on weighted graphs, worked example, genomics example from Durbin et al pp 46-57 (handout in class).
Examples of Markov chains to determine the expected number of coin flips to reach a given number of heads in a row.
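
A minimal Python sketch of that calculation, treating the current run of heads as the state of a Markov chain and applying first-step analysis (fair coin assumed), with a Monte Carlo sanity check:

    import random

    def expected_flips(k, p=0.5):
        """Expected flips of a p-coin to see k heads in a row.  First-step
        analysis on the chain whose state is the current run of heads gives
        E[k] = E[k-1] + 1 + (1-p) * E[k], i.e. E[k] = (E[k-1] + 1) / p."""
        e = 0.0
        for _ in range(k):
            e = (e + 1) / p
        return e

    def simulate(k, trials=100_000):
        """Monte Carlo check of the same quantity for a fair coin."""
        total = 0
        for _ in range(trials):
            run = flips = 0
            while run < k:
                flips += 1
                run = run + 1 if random.random() < 0.5 else 0
            total += flips
        return total / trials

    print(expected_flips(3))   # 14.0 for a fair coin
    print(simulate(3))         # close to 14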

Lecture 26 (Tue 27 Nov 07)

"Mark V Shaney" and other examples generated from trigram probability distributions, Markov Chains 2 (Ergodic chains, stationary distributions).
Introductory parts on Hidden Markov Models from A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition (Rabiner, 1989), pp 257-261.
Set up example of Viterbi by hand.
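
For reference, a compact Python sketch of the Viterbi recursion in the delta/psi notation of the Rabiner tutorial; the two-state model below is invented, not the example worked in class.

    def viterbi(obs, states, start, trans, emit):
        """Most likely hidden-state path for an observation sequence.
        start[s], trans[s][t], emit[s][o] are probabilities."""
        # delta[s] = probability of the best path ending in state s; psi backtracks.
        delta = {s: start[s] * emit[s][obs[0]] for s in states}
        psi = []
        for o in obs[1:]:
            prev = delta
            delta, back = {}, {}
            for t in states:
                best = max(states, key=lambda s: prev[s] * trans[s][t])
                delta[t] = prev[best] * trans[best][t] * emit[t][o]
                back[t] = best
            psi.append(back)
        # Backtrack from the best final state.
        last = max(states, key=delta.get)
        path = [last]
        for back in reversed(psi):
            path.append(back[path[-1]])
        return list(reversed(path)), delta[last]

    # Hypothetical fair/loaded-die style model with two hidden states.
    states = ("F", "L")
    start = {"F": 0.5, "L": 0.5}
    trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
    emit = {"F": {"6": 1/6, "x": 5/6}, "L": {"6": 0.5, "x": 0.5}}
    print(viterbi(["x", "6", "6", "6"], states, start, trans, emit))  # best path and its probability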

Lecture 27 (Thu 29 Nov 07)

Finish example of Viterbi by hand.
Example of max probability state not on maximum probability path.
Example of CpG islands from Durbin et al pp 46-57 (handout).
Example from Bursty and Hierarchical Structure in Streams (Kleinberg, 2002).


Final Exam

Take Home: available 2 p.m. Tues 4 Dec, due 2 p.m. Fri 7 Dec (should take no more than a few hours...). Contact me by e-mail with any questions, and I'll have office hours on Wed afternoon, 2-4.


Background text for Lectures 1-5: Chpt 2, Sections 5.3, 5.4, and Chpt 6 of Rosen, Discrete Mathematics and its Applications (6th edition), on reserve at the engineering library, also used by CS 280. (These are the sections on Sets, Permutations and Combinations, Binomial Coefficients, Probability Theory, Bayes' Theorem, Expected Value and Variance. Note that in earlier editions of the book these sections were in Chpts 1 and 4.)
For lectures 6--8, see Rosen sections 6.3, 6.4 (Bayes' Thm, Expected Value and Variance)
For lectures 9--14, see Rosen chapters 9,10 (Graphs, Trees)
For lectures 21--24, see Rosen sections 12.2-12.4


Advising Notes

This course is required for the new Information Science (IS) major in the College of Arts and Sciences and the College of Agriculture and Life Sciences. It can also be used to satisfy the math/stats course requirement for the Information Science minor/concentration.
INFO295 and CS280 may not both be taken for credit towards the IS major.