Tue/Thu 2:55-4:10 PM, Hollis 110

4 credits, S/U Optional

This course teaches basic mathematical methods for information science, with applications to data science. Topics include discrete probability, Bayesian methods, graph theory, power law distributions, Markov models, and hidden Markov models. Uses examples and applications from various areas of information science such as the structure of the web, genomics, social networks, natural language processing, and signal processing. Some assignments require python programming.

Announcements: | |

– 30 Nov | Problem set 7 (due 6-8 Dec, plus ps7.ipynb notes) |

– 14 Nov | Problem set 6 (due in class Tue 25 Nov, plus ps6.ipynb notes) |

– 28 Oct | Problem set 5 (problems 5.1-5.4 due in class Thu 6 Nov), plus problem 5.5 (programming part) due Wed 12 Nov midnight |

– 8 Oct | Problem set 4 (problems 4.1-4.5 due in class Tue 21 Oct), plus problems 4.6,4.7 (programming part) due Tue 21 Oct midnight |

– 27 Sep | According to the registrar, the final exam will take place Sat, 13 Dec 2014, at 2p.m., Olin Hall 218 |

– 23 Sep | Problem set 3 (problems 3.1-3.4 due in class Thu 2 Oct), plus problem 3.5 (programming part) due Tue 7 Oct midnight |

– 23 Sep | Midterm scheduled for 23 Oct (in class, open book/computer but not open e-mail/texting) |

– 12 Sep | Problem Set 2 (due Tue 23 Sep midnight) |

– 5 Sep | Another Python ref: Introduction to Python for Science, by David Pine, appears to give a good intro to the basic python and matplotlib routines we'll be using in the IPython environment in this course |

– 3 Sep | Problem Set 1 (due Thu 11 Sep, in class) |

– 29 Aug | Problem Set "0" (due Thu 4 Sep) |

**Syllabus:**

Date | Lecture | Reading | |
---|---|---|---|

Week 1 | Tue 8/26 | 1. Course introduction, set theory | Rosen ch 2.1-2.2 (online here) Notes on set theory Ipython notebook demos: lec1a.ipynb (intro), lec1b.ipynb (sets) There are some python resources listed on the Piazza course site, the python.org site has a tutorial, and there are other online resources, including the book Think Python. |

Thu 8/28 | 2. Probability and counting | Some instructions for installing python Notes on probability and dice.ipynb Rosen ch 6 'Counting' (ch 5 in earlier editions, link given in class) Problem Set "0": simple exercise to be posted on Fri 29 Aug ("due" Thu 4 Sep) | |

Week 2 | Tue 9/2 | 3. Conditional Probability | Continue notes on probability Re "birthday" problem, see lec3.ipynb, and as well nytimes popular version (Strogatz). (Rosen ch 7.1-7.2 ['discrete probability', ch 6 in older editions, link available]) Problem Set 1 (due Thu 11 Sep, in class) |

Thu 9/4 | 4. Bayes' theorem and applications |
notes on cond'l prob and Bayes (Rosen ch 7.2-7.3) | |

Week 3 | Tue 9/9 | 5. Probability conundra + naive Bayes | mentioned Wason selection task additional selected comments from above notes and pp 1-4 of these notes on Bayesian spam filters (Rosen 7.3) |

Thu 9/11 | 6. Expected value, Variance | A few more "naive Bayes" resources:
Bayesian
spam filtering,
A plan for spam,
and nytimes popular Bayes (Strogatz). mentioned Doomsday Argument Then pp 5-7 of these notes (Rosen 7.4) Problem Set 2 posted (due Tue 23 Sep midnight) | |

Week 4 | Tue 9/16 | 7. Binomial and Normal distributions |
Then pp 8-10 of these notes (Rosen 7.4) Discussed lec7.ipynb, see also these simulations links for deviations from normal: Poincare bread and on-line dating |

Thu 9/18 | 8. Regexps, more on normal distribution |
Data janitoring (nytimes) and
Data carpentry (Mimno) plus comments on regular expressions Mentioned Bernoulli->normal derivation, then more from lec7.ipynb, and discussed lec8.ipynb, plus "xkcd on p=.05" | |

Week 5 | Tue 9/23 | 9. Election polls, exponentials and logarithms |
Discussed addendum to
lec8.ipynb in great detail, then notes on exponentials see also popular logarithms (Strogatz nytimes, includes video link) Problem set 3 (probems 3.1-3.4 due in class Thu 2 Oct) plus problem 3.5 (programming part) due Tue 7 Oct midnight |

Thu 9/25 | 10. Poisson distributions | intro to "data carpentry" component of problem 3.5, finish exponential notes, then some notes on Poisson distribution | |

Week 6 | Tue 9/30 | 11. Poisson cont'd, Graph Theory |
Today's Bayes vs Frequentism (Nytimes) Finish notes on Poisson distribution, and real data: arxpoi.ipynb start graph theory (Rosen ch 10.1-10.4, and these notes1) |

Thu 10/2 | 12. Graph theory: Eulerian and Hamiltonian circuits | Rosen ch 10.5, and notes2
(see also Hamiltonian circuit
and Seven Bridges of Königsberg), and start of notes4 (Rosen 11.3-11.4), then "Why your friends have more friends than you do" (Feld, 1991; slide, plus nytimes popular version [Strogatz]) | |

Week 7 | Tue 10/7 | 13. Random graphs, Trees, Spanning Trees, graph algorithms |
notes3 (Rosen ch 10.7, 11.1), notes4 (Rosen 11.3-11.4) (DFS, BFS), and the graph partitioning algorithm described in pp 69-83 of Easley/Kleinberg Chpt 3 (not covered in info 2040), final result was fig 3.20 on p.81 Problem Set 3 due tonight, Problem set 4 out tomorrow (due 21 Oct) |

Thu 10/9 | 14. Applications of graphs | Discussed various aspects of problem 4.7 (due 21 Oct),
then last two pages of notes4 (Rosen 11.5): Kruskal's algorithm (see proof of minimum spanning tree algorithms) and topological sort, then Dijkstra algorithm (Rosen 10.6) (another example) | |

Week 8 | Tue 10/14 | fall break | |

Thu 10/16 | 15. More graphs and power laws | Discussed features of solutions to problem 3.5 ("simulating Silver"), discussed the wikipedia dataset in problems 4.6,4.7 (programming part, due Tue 21 Oct midnight), including the conversions to utc, and further discussion of the random networks problem. See Easley/Kleinberg Chpt 3 (section 3.1, p.48) for "triadic closure". Started discussion of power laws | |

Week 9 | Tue 10/21 | 16. More power laws, start "big data" discussion |
Continued discussion of power laws
(nytimes city math [Strogatz]),
specifically the "long tail"
(see also Wikipedia entry,
and later book). see lec15 slides Then started "big data" discussion: for spell correct, see lec18 slides (Prob Set 4 due) |

Thu 10/23 | 17. Midterm exam (in class, open book/computer but not open e-mail/texting) | Probability and counting (independent events, conditional probability), statistics (mean, variance), Bayes theorem (medical tests, lie detector, dishonest casino), Binary Classifiers (spam, text), Graphs and graph algorithms | |

Week 10 | Tue 10/28 | 18. More big data, spell check |
Discuss midterm, then continue "big data" discussion:
for spell correct,
see lec18 slides. See The Unreasonable Effectiveness of Data (copy here -- please read, it's fun)Problem set 5 (problems 5.1-5.4 due in class Thu 6 Nov) |

Thu 10/30 | 19. Big data algorithms |
Finished lec18 slides on big data. Discussed The Anatomy of the Facebook Social Graph, and commented on Milgrm's "small world" experiment. for comments on linear regression and logistic regression, see lec19.ipynb | |

Week 11 | Tue 11/4 | 20. Linear regression, logistic regression, fitting power law data |
Final redux of election simulation, more discussion of linear regression, Pearson's r, logistic regression: see lec19.ipynb, and extended discussion of fitting power law data: see lec20.ipynb |

Thu 11/6 | 21. Preferential attachment |
start with these notes on preferential attachment
(see also E/K section 18.7, pp 555-559) then recommender systems | |

Week 12 | Tue 11/11 | 22. Recommender systems | Covered
lec22.ipynb on collective filtering Then started Markov chains notes1 |

Thu 11/13 | 23. Markov Models |
Finished Markov chain notes1, then worked example then text example after "Mark V Shaney" and illustrated here: Markov.ipynb (some other examples generated from trigram probability distributions, see also last year's what would i say) | |

Week 13 | Tue 11/18 | 24. Steady State distributions | Covered these lec 24 notes, and also many comments about genomic data (see ps6 #3,#4, and ps6.ipynb notes) |

Thu 11/20 | 25. Simulating Markov processes | Briefly these notes on Spearman correlation, Perron-Frobenius theorem, then back to Markov.ipynb and further discussion of ps6.ipynb notes | |

Week 14 | Tue 11/25 | 26. Page Rank: history and algorithm |
Started with the "memex" from
As we may think
(1945, required reading for info majors ...). Then Markov/PageRank notes: stationary states, power method, relation to eigenvalues, ..., finished with example here (see also The $25B Eigenvector (Bryan & Leise)) |

Thu 11/27 | Thanksgiving break | ||

Week 15 | Tue 12/2 | 27. Hidden Markov Models, Viterbi algorithm |
Worked through example of Viterbi by hand, mentioned example from Bursty and Hierarchical Structure in Streams. Described implemention of algorithm in viterbi.ipynb, plus some other examples. |

Thu 12/4 | 28. Data Science Redux |
Described implementation of Viterbi algorithm in viterbi.ipynb. Finished with these slides, including final exam review and mainly notes taken from chpt 1 of this recent book: "Doing Data Science" (Oct 2013). Referred to "The Rise of Big Data" (Foreign Affairs, May/Jun 2013), "The Sexiest Job ..." (Harvard Business Review, Oct 2012), the data science "Venn diagram", the Wikipedia Data_Science entry, and media hype (NYTimes, 11 Apr 2013). (A couple of other related refs: "What is Data Science?" (Jun 2010), Data Skepticism" (Apr 2013)) |

**Grading:**

50% Problem Sets

25% Final exam

20% Midterm exam (in-class)

5% Class participation

**Academic integrity policy:**

You are expected to abide by the
Cornell University Code of Academic
Integrity.
It is your responsibility to understand and follow these policies.
(In particular, the work you submit for course assignments must be your own.
You may discuss homework assignments with other students at a high
level, by for example discussing general methods or strategies to
solve a problem, but you must cite the other student in your submission.
Any work you submit must be your own understanding of the solution,
the details of which you personally and individually worked out, and
written in your own words.)

This course is a required course for the Information Science (IS) major in the College of Arts and Sciences and the College of Agriculture and Life Sciences. It can also be used to satisfy the math/stats course requirement for the Information Science minor/concentration. INFO2950 and CS2800 may not both be taken for credit towards the IS major.