Cornell Information Science

Info 2950: Introduction to Data Science

Fall 2014
Tue/Thu 2:55-4:10 PM, Hollis 110
4 credits, S/U Optional

This course teaches basic mathematical methods for information science, with applications to data science. Topics include discrete probability, Bayesian methods, graph theory, power law distributions, Markov models, and hidden Markov models. Uses examples and applications from various areas of information science such as the structure of the web, genomics, social networks, natural language processing, and signal processing. Some assignments require python programming.

Announcements:
 – 30 Nov Problem set 7 (due 6-8 Dec, plus ps7.ipynb notes)
 – 14 Nov Problem set 6 (due in class Tue 25 Nov, plus ps6.ipynb notes)
 – 28 Oct Problem set 5 (problems 5.1-5.4 due in class Thu 6 Nov), plus problem 5.5 (programming part) due Wed 12 Nov midnight
 – 8 Oct Problem set 4 (problems 4.1-4.5 due in class Tue 21 Oct), plus problems 4.6,4.7 (programming part) due Tue 21 Oct midnight
 – 27 Sep According to the registrar, the final exam will take place Sat, 13 Dec 2014, at 2p.m., Olin Hall 218
 – 23 Sep Problem set 3 (problems 3.1-3.4 due in class Thu 2 Oct), plus problem 3.5 (programming part) due Tue 7 Oct midnight
 – 23 Sep Midterm scheduled for 23 Oct (in class, open book/computer but not open e-mail/texting)
 – 12 Sep Problem Set 2 (due Tue 23 Sep midnight)
 – 5 Sep Another Python ref: Introduction to Python for Science, by David Pine, appears to give a good intro to the basic python and matplotlib routines we'll be using in the IPython environment in this course
 – 3 Sep Problem Set 1 (due Thu 11 Sep, in class)
 – 29 Aug Problem Set "0" (due Thu 4 Sep)


Professor: Paul Ginsparg (242 Gates Hall, ginsparg "at" cornell.edu)
Office hours: Wed 2-3 PM (or by appointment)
TA: Sam Tung (sat83@..., office hours TBD)
Course website: http://courses.cit.cornell.edu/info2950_2014fa/ (this page)
Prerequisite: An introductory statistics course (from approved list) and an introductory programming class (e.g., CS 1110), or permission of instructor Corequisite: Math 2310 or a similar linear algebra class
Textbook:

Syllabus:
Date Lecture Reading
Week 1 Tue 8/261. Course introduction, set theory Rosen ch 2.1-2.2 (online here)
Notes on set theory
Ipython notebook demos: lec1a.ipynb (intro), lec1b.ipynb (sets)
There are some python resources listed on the Piazza course site, the python.org site has a tutorial, and there are other online resources, including the book Think Python.
Thu 8/282. Probability and counting Some instructions for installing python
Notes on probability and dice.ipynb
Rosen ch 6 'Counting' (ch 5 in earlier editions, link given in class)
Problem Set "0": simple exercise to be posted on Fri 29 Aug ("due" Thu 4 Sep)
Week 2 Tue 9/23. Conditional Probability Continue notes on probability
Re "birthday" problem, see lec3.ipynb, and as well nytimes popular version (Strogatz).
(Rosen ch 7.1-7.2 ['discrete probability', ch 6 in older editions, link available])
Problem Set 1 (due Thu 11 Sep, in class)
Thu 9/44. Bayes' theorem and applications notes on cond'l prob and Bayes
(Rosen ch 7.2-7.3)
Week 3 Tue 9/95. Probability conundra + naive Bayes mentioned Wason selection task
additional selected comments from above notes
and pp 1-4 of these notes on Bayesian spam filters (Rosen 7.3)
Thu 9/11 6. Expected value, Variance A few more "naive Bayes" resources: Bayesian spam filtering, A plan for spam, and nytimes popular Bayes (Strogatz).
mentioned Doomsday Argument
Then pp 5-7 of these notes (Rosen 7.4)

Problem Set 2 posted (due Tue 23 Sep midnight)
Week 4 Tue 9/167. Binomial and Normal distributions Then pp 8-10 of these notes (Rosen 7.4)
Discussed lec7.ipynb, see also these simulations
links for deviations from normal: Poincare bread and on-line dating
Thu 9/18 8. Regexps, more on normal distribution Data janitoring (nytimes) and Data carpentry (Mimno) plus comments on regular expressions
Mentioned Bernoulli->normal derivation, then more from lec7.ipynb, and discussed lec8.ipynb,
plus "xkcd on p=.05"
Week 5 Tue 9/23 9. Election polls, exponentials and logarithms Discussed addendum to lec8.ipynb in great detail,
then notes on exponentials
see also popular logarithms (Strogatz nytimes, includes video link)

Problem set 3 (probems 3.1-3.4 due in class Thu 2 Oct)
plus problem 3.5 (programming part) due Tue 7 Oct midnight
Thu 9/25 10. Poisson distributions intro to "data carpentry" component of problem 3.5,
finish exponential notes,
then some notes on Poisson distribution
Week 6 Tue 9/3011. Poisson cont'd, Graph Theory Today's Bayes vs Frequentism (Nytimes)
Finish notes on Poisson distribution, and real data: arxpoi.ipynb
start graph theory (Rosen ch 10.1-10.4, and these notes1)
Thu 10/2 12. Graph theory: Eulerian and Hamiltonian circuits Rosen ch 10.5, and notes2 (see also Hamiltonian circuit and Seven Bridges of Königsberg),
and start of notes4 (Rosen 11.3-11.4),
then "Why your friends have more friends than you do" (Feld, 1991; slide, plus nytimes popular version [Strogatz])
Week 7 Tue 10/7 13. Random graphs, Trees, Spanning Trees, graph algorithms notes3 (Rosen ch 10.7, 11.1),
notes4 (Rosen 11.3-11.4) (DFS, BFS),
and the graph partitioning algorithm described in pp 69-83 of Easley/Kleinberg Chpt 3 (not covered in info 2040), final result was fig 3.20 on p.81

Problem Set 3 due tonight, Problem set 4 out tomorrow (due 21 Oct)
Thu 10/9 14. Applications of graphs Discussed various aspects of problem 4.7 (due 21 Oct), then last two pages of notes4 (Rosen 11.5):
Kruskal's algorithm (see proof of minimum spanning tree algorithms) and topological sort,
then Dijkstra algorithm (Rosen 10.6) (another example)
Week 8 Tue 10/14fall break
Thu 10/16 15. More graphs and power laws Discussed features of solutions to problem 3.5 ("simulating Silver"), discussed the wikipedia dataset in problems 4.6,4.7 (programming part, due Tue 21 Oct midnight), including the conversions to utc, and further discussion of the random networks problem. See Easley/Kleinberg Chpt 3 (section 3.1, p.48) for "triadic closure". Started discussion of power laws
Week 9 Tue 10/2116. More power laws, start "big data" discussion Continued discussion of power laws (nytimes city math [Strogatz]), specifically the "long tail" (see also Wikipedia entry, and later book).
see lec15 slides
Then started "big data" discussion: for spell correct, see lec18 slides
(Prob Set 4 due)
Thu 10/2317. Midterm exam (in class, open book/computer but not open e-mail/texting) Probability and counting (independent events, conditional probability), statistics (mean, variance), Bayes theorem (medical tests, lie detector, dishonest casino), Binary Classifiers (spam, text), Graphs and graph algorithms
Week 10 Tue 10/2818. More big data, spell check Discuss midterm, then continue "big data" discussion: for spell correct, see lec18 slides.
See The Unreasonable Effectiveness of Data (copy here -- please read, it's fun)
Problem set 5 (problems 5.1-5.4 due in class Thu 6 Nov)
Thu 10/3019. Big data algorithms Finished lec18 slides on big data.
Discussed The Anatomy of the Facebook Social Graph, and commented on Milgrm's "small world" experiment.
for comments on linear regression and logistic regression, see lec19.ipynb
Week 11 Tue 11/4 20. Linear regression, logistic regression, fitting power law data Final redux of election simulation,
more discussion of linear regression, Pearson's r, logistic regression: see lec19.ipynb,
and extended discussion of fitting power law data: see lec20.ipynb
Thu 11/6 21. Preferential attachment start with these notes on preferential attachment (see also E/K section 18.7, pp 555-559)
then recommender systems
Week 12 Tue 11/11 22. Recommender systems Covered lec22.ipynb on collective filtering
Then started Markov chains notes1
Thu 11/13 23. Markov Models Finished Markov chain notes1, then worked example
then text example after "Mark V Shaney" and illustrated here: Markov.ipynb (some other examples generated from trigram probability distributions, see also last year's what would i say)
Week 13 Tue 11/18 24. Steady State distributionsCovered these lec 24 notes, and also many comments about genomic data (see ps6 #3,#4, and ps6.ipynb notes)
Thu 11/20 25. Simulating Markov processes Briefly these notes on Spearman correlation, Perron-Frobenius theorem, then back to Markov.ipynb and further discussion of ps6.ipynb notes
Week 14 Tue 11/25 26. Page Rank: history and algorithm Started with the "memex" from As we may think (1945, required reading for info majors ...).
Then Markov/PageRank notes: stationary states, power method, relation to eigenvalues, ..., finished with example here (see also The $25B Eigenvector (Bryan & Leise))
Thu 11/27 Thanksgiving break
Week 15 Tue 12/2 27. Hidden Markov Models, Viterbi algorithm Worked through example of Viterbi by hand,
mentioned example from Bursty and Hierarchical Structure in Streams.
Described implemention of algorithm in viterbi.ipynb, plus some other examples.
Thu 12/4 28. Data Science Redux Described implementation of Viterbi algorithm in viterbi.ipynb.
Finished with these slides, including final exam review and mainly notes taken from chpt 1 of this recent book: "Doing Data Science" (Oct 2013).
Referred to "The Rise of Big Data" (Foreign Affairs, May/Jun 2013), "The Sexiest Job ..." (Harvard Business Review, Oct 2012), the data science "Venn diagram", the Wikipedia Data_Science entry, and media hype (NYTimes, 11 Apr 2013).
(A couple of other related refs: "What is Data Science?" (Jun 2010), Data Skepticism" (Apr 2013))

Grading:
50% Problem Sets
25% Final exam
20% Midterm exam (in-class)
5% Class participation

Academic integrity policy:
You are expected to abide by the Cornell University Code of Academic Integrity. It is your responsibility to understand and follow these policies. (In particular, the work you submit for course assignments must be your own. You may discuss homework assignments with other students at a high level, by for example discussing general methods or strategies to solve a problem, but you must cite the other student in your submission. Any work you submit must be your own understanding of the solution, the details of which you personally and individually worked out, and written in your own words.)

Advising Notes:
This course is a required course for the Information Science (IS) major in the College of Arts and Sciences and the College of Agriculture and Life Sciences. It can also be used to satisfy the math/stats course requirement for the Information Science minor/concentration. INFO2950 and CS2800 may not both be taken for credit towards the IS major.