Cornell Information Science

Info 2950: Introduction to Data Science

Spring 2016
Tue/Thu 10:10-11:25 AM, Phillips Hall 203
4 credits, S/U Optional

This course teaches basic mathematical methods for information science, with applications to data science. Topics include discrete probability, Bayesian methods, graph theory, power law distributions, Markov models, and hidden Markov models. Uses examples and applications from various areas of information science such as the structure of the web, genomics, social networks, natural language processing, and signal processing. Assignments require python programming.

Announcements:
 – 24 Feb: The registrar has designated the final exam date/time for this course as Sat, 21 May 2016, 2:00-4:30 PM in Hollister Hall B14


Professor: Paul Ginsparg (242 Gates Hall, ginsparg "at" cornell.edu)
Office hours: Wed 1-2 PM (or by appointment)
TAs: Alisha S. (ahs264), Sean M. (srm274), Linna L. (ll577), Shibo Z. (sz428); office hours listed on the Piazza course pages
Course website: http://courses.cit.cornell.edu/info2950_2016sp/ (this page)
Prerequisites: An introductory statistics course (from approved list) and an introductory programming class (e.g., CS 1110), or permission of instructor
Highly Recommended: Math 2310 or a similar linear algebra class (but no longer a prereq)
Textbooks: Rosen, Discrete Mathematics and Its Applications, 6th edition, is a basic reference for some of the earlier non-programming material

Syllabus:
Date Lecture Reading
Week 1 Thu 1/28 1. Course introduction – Most recent class syllabus: Fall '14
Ipython notebook class demo: lec1.ipynb (intro)
Instructions for downloading anaconda python are here
There will be some python resources listed on the Piazza course site, the python.org site has a tutorial, and there are other online resources, including the book Think Python.
Rosen ch 2.1-2.2 (online here)
Week 2 Tue 2/2 2. Probability and counting – Notes (l1) on set theory and lec2.ipynb
Notes (l2-3) on probability
Rosen ch 6 'Counting' (ch 5 in earlier editions, link given in class)
Thu 2/4 3. Conditional Probability – Continued notes (l2-3) on probability
(Rosen ch 7.1-7.2 ['discrete probability', ch 6 in older editions, link available])
Week 3 Tue 2/9 4. Bayes' theorem and applications – Finished these notes (l2-3) for the birthday/jackpot problem, then started notes (l4-5) on cond'l prob and Bayes
(Rosen ch 7.2-7.3)
Re "birthday" problem, see lec4.ipynb, and as well nytimes popular version (Strogatz).
Thu 2/11 5. Probability conundra + naive Bayes – mentioned Wason selection task,
continued these notes (l4-5), including Bayesian spam filters (Rosen 7.3; a numeric sketch follows this entry).
Introduced the first part of Problem Set 2 (not due until a week after break)
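A tiny numeric illustration of the Bayes'-theorem step behind a one-word spam filter (the word frequencies below are made up for the example):

    def p_spam_given_word(p_word_given_spam, p_word_given_ham, p_spam=0.5):
        # Bayes' theorem: P(spam | word) = P(word | spam) P(spam) / P(word)
        p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
        return p_word_given_spam * p_spam / p_word

    # assumed numbers: the word appears in 25% of spam and 1% of legitimate mail
    print(p_spam_given_word(0.25, 0.01))   # roughly 0.96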
Week 4 Tue 2/16 February Break (no class)
Thu 2/18 6. Expected value, Variance – A few more "naive Bayes" resources: Bayesian spam filtering, A plan for spam, and nytimes popular Bayes (Strogatz).
mentioned Doomsday Argument
Then started these notes (Rosen 7.4); a small numeric check of expectation and variance follows this entry.
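A quick numeric check of the expectation and variance definitions, using a fair six-sided die and exact fractions:

    from fractions import Fraction

    # E[X] = sum x p(x), Var(X) = E[X^2] - (E[X])^2 for a fair six-sided die
    p = Fraction(1, 6)
    ex = sum(x * p for x in range(1, 7))        # 7/2
    ex2 = sum(x * x * p for x in range(1, 7))   # 91/6
    print(ex, ex2 - ex**2)                      # 7/2 and 35/12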
Week 5 Tue 2/23 7. Binomial and Normal distributions – Finished these notes (Rosen 7.4),
and discussed lec7.ipynb; see also these simulations, and the sketch after this entry

(see the Riddler column for the probability problem mentioned at the outset)
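A small simulation along these lines, comparing a binomial's sample mean and variance to the n p and n p (1-p) used in the normal approximation (parameters chosen arbitrarily):

    import random

    n, p, trials = 100, 0.5, 20000
    samples = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((x - mean) ** 2 for x in samples) / trials
    print(mean, var)                 # should land near 50 and 25
    print(n * p, n * p * (1 - p))    # the normal approximation uses these as mu and sigma^2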
Thu 2/25 8. Regexps, more expectation/variance of Bernoulli processes – comments on regular expressions (a small example follows this entry),
more from lec7.ipynb, and started discussion of lec8.ipynb
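A one-off regular-expression example in the spirit of those comments (toy pattern, not RFC-grade):

    import re

    # pull email-like strings out of a line of text
    text = "contact: ginsparg at cornell dot edu, or course-staff@example.org"
    print(re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text))   # ['course-staff@example.org']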
Week 6 Tue 3/1 9. Normal distributions and Election polls – Clarified the notion of a Bernoulli trial, introduced the Galton machine (time-lapse video),
mentioned Bernoulli->normal derivation and What is Life?,
then deviations from normal: on-line dating and Poincaré bread,
plus "xkcd on p=.05" and pre-cognition.
More on normal distribution from lec8.ipynb
(closed with continued discussion of addendum from that notebook.)
Thu 3/3 10. Data Wrangling, Visualizations, and PS3 – Most of the class was devoted to an interactive tutorial on the two parts of problem set 3 (since it was interactive, there are no notes other than about this question, so it was unfortunate if you missed it). Also recalled ENIAC (1946), and then described the early UNIVAC data science foray during the 1952 election. In relation to PS3 prob 5, mentioned Data janitoring (nytimes) and Data carpentry (Mimno).
Also completed discussion of lec8.ipynb and recalled these simulations (linked from lecture 7)
Week 7 Tue 3/8 11. Poisson distributions – notes on exponentials, see also popular logarithms (Strogatz nytimes, includes video link),
then some notes on Poisson distribution, and simulation of balls_in_bins.ipynb
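The balls-in-bins simulation can be reproduced in a few lines; this sketch (assumed parameters, not the notebook's code) compares empirical bin occupancies to the Poisson(m/n) prediction:

    import math, random
    from collections import Counter

    m, n = 2000, 1000                   # m balls thrown uniformly into n bins
    bins = Counter(random.randrange(n) for _ in range(m))
    occupancy = Counter(bins[i] for i in range(n))
    lam = m / n
    for k in range(5):
        predicted = math.exp(-lam) * lam**k / math.factorial(k)
        print(k, occupancy[k] / n, round(predicted, 3))   # empirical vs Poisson probability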
Thu 3/10 12. Poisson, Multivariate Normal, and PS4 – Mentioned the covariance matrix of the multinomial distribution,
UTC (last leap second was 30 June 2015 23:59:60), json,
and more description of real data: arxpoi.ipynb (to provide framework for PS4 Q1)
Week 8 Tue 3/15 13. Graph Theory – started graph theory (Rosen ch 10.1-10.4, and these notes1),
Eulerian and Hamiltonian circuits (Rosen ch 10.5, and notes2, see also Hamiltonian circuit and Seven Bridges of Königsberg),
then "Why your friends have more friends than you do" (Feld, 1991; slide, plus nytimes popular version [Strogatz])
Thu 3/17 14. Social vs Random Graphs – Finished discussion of Feld, 1991 (slide), discussed implications for vaccination (see also popular version).
Discussed properties of social graphs that distinguish them from random graphs; see Easley/Kleinberg Chpt 3 for "triadic closure" and "clustering coefficient" (section 3.1, p.48);
and discussed various counting problems on graphs involving binomial coefficients (n choose m) and the Poisson distribution.
Week 9 Tue 3/22 15. Graph Algorithms – distance matrix, adjacency matrix, spanning trees, notes3 (Rosen ch 10.7, 11.1), notes4 (Rosen 11.3-11.5) (DFS, BFS, topological sort)
Tue 3/22 Midterm 7:30-9:00 PM, Upson B17
Thu 3/24 16. More Graph Algorithms – Comments on the midterm (see piazza for links).
Kruskal's algorithm (notes4, Rosen 11.5, see also proof of minimum spanning tree algorithms)
Dijkstra's algorithm (Rosen 10.6 example 1, another example; a sketch of the algorithm follows this entry)
graph partitioning algorithm described in pp. 69-83 of Easley/Kleinberg Chpt 3 (not covered in info 2040); the final result will be fig. 3.20 on p. 81 (to be completed next lecture)
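A compact sketch of Dijkstra's algorithm (heap-based, on an assumed toy weighted graph), for comparison with the worked examples:

    import heapq

    def dijkstra(graph, source):
        # graph: dict mapping node -> list of (neighbor, edge weight)
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float('inf')):
                continue                    # stale heap entry
            for v, w in graph[u]:
                if d + w < dist.get(v, float('inf')):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return dist

    g = {'a': [('b', 4), ('c', 2)], 'b': [('d', 5)], 'c': [('b', 1), ('d', 8)], 'd': []}
    print(dijkstra(g, 'a'))                 # {'a': 0, 'b': 3, 'c': 2, 'd': 8}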
Week 10 Tue 3/29 Spring Break
Thu 3/31
Week 11 Tue 4/5 17. Graph Applications, Power Law Data – recalled the Cornell University Code of Academic Integrity (note the guidelines near the bottom of this page).
completed example from end of last lecture (pp. 69-83 of Easley/Kleinberg Chpt 3),
also finished topological sort (last page notes4).
Started discussion of power laws, and discussed this example of cumulative advantage.
Thu 4/7 18. More power laws – Corrected topological sort conventions from last time.
Continued power law discussion (lec18.pdf),
specifically the "long tail" (see also Wikipedia entry, and later book, and see nytimes city math [Strogatz])
Week 12 Tue 4/12 19. Yet more power laws – Discussed rich-get-richer model in 18.3 (Easley/Kleinberg p.547), finished lec18.pdf slides,
mentioned the Google Book (1913), and the google as source for Barney Google, as source for googol, as source for google.com,
and "analyzed" the complete works of Shakespeare (follow "Plain Text UTF-8" link),
introduced big data discussion
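The Shakespeare word-count exercise boils down to a few lines; a sketch assuming you have saved the Gutenberg plain-text file locally (the filename below is a placeholder):

    import re
    from collections import Counter

    # Zipf-style check: the r-th most frequent word has count roughly proportional to 1/r,
    # so rank * count should stay roughly constant down the list
    with open('shakespeare.txt', encoding='utf-8') as f:    # placeholder filename
        words = re.findall(r"[a-z']+", f.read().lower())
    counts = Counter(words)
    for rank, (word, count) in enumerate(counts.most_common(10), start=1):
        print(rank, word, count, rank * count)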
Thu 4/14 20. Small worlds and preferential attachment – Commented on Milgram's "small world" experiment,
discussed The Anatomy of the Facebook Social Graph,
then (only) p.3 of these notes on preferential attachment (see also E/K section 18.7, pp. 555-559; a simulation sketch follows this entry).
Continued the big data discussion (see The Unreasonable Effectiveness of Data -- copy here [please read, it's fun])
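A minimal preferential-attachment simulation (one new edge per node, the simplest version of the model in the notes), showing the heavy-tailed degree distribution:

    import random
    from collections import Counter

    # each new node attaches to an existing node chosen with probability
    # proportional to its current degree; the 'targets' list holds one entry
    # per edge endpoint, so a uniform choice from it is degree-weighted
    targets = [0, 1]                       # seed graph: a single edge 0-1
    for new_node in range(2, 10000):
        targets.extend([new_node, random.choice(targets)])
    degree = Counter(targets)
    tail = Counter(degree.values())
    for d in sorted(tail)[:8]:
        print(d, tail[d])                  # counts fall off roughly as a power of d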
Week 13 Tue 4/19 21. Big Data / spell-correct – Started with linear regression (regression.ipynb, see also Linear_regression and LeastSquaresFitting),
covered spell correct with lec21.pdf slides, as a model for other big data algorithms
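A stripped-down spell-correct sketch in that spirit (edit-distance-1 candidates scored by corpus frequency; the toy corpus is made up and far too small to be useful):

    from collections import Counter

    corpus = "the quick brown fox jumps over the lazy dog and the quick red fox"
    counts = Counter(corpus.split())       # word frequencies stand in for a language model
    letters = 'abcdefghijklmnopqrstuvwxyz'

    def edits1(word):
        # all strings within one delete, transpose, replace, or insert of the input
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
        inserts = [a + c + b for a, b in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word):
        candidates = [w for w in edits1(word) | {word} if w in counts]
        return max(candidates, key=counts.get) if candidates else word

    print(correct('teh'), correct('quik'))   # 'the' 'quick'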
Thu 4/21 22. Big data algorithms – reprised reading off power laws from log-log plots.
re power of modern language models, mentioned Chomsky and Pereira (see also Norvig)
continued "big data" slides (9--17),
more discussion of regression and cells 5--9 of regression.ipynb for Pearson's r,
illustrated with example of midterm data for this class.
finished with a hurried discussion of an alternative method for fitting power-law data; extended discussion here: power_law_fit.ipynb (more comments next time)
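For the log-log picture: fitting a power law y ≈ C x^a amounts to a straight-line fit of log y against log x, and Pearson's r on those logs measures how straight the data are. A small self-contained sketch (the data points are made up to follow roughly y ≈ 100/x):

    import math

    def pearson_r(xs, ys):
        # Pearson correlation: covariance normalized by the two standard deviations
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    xs = [1, 2, 4, 8, 16, 32]
    ys = [100, 51, 24, 13, 6, 3]             # made-up, roughly y ~ 100 * x**-1
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    print(pearson_r(lx, ly))                 # close to -1 for a clean power law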
Week 14 Tue 4/26 23. Recommender Systems – Comments on logistic regression, see last part of regression.ipynb,
finished comments on power_law_fit.ipynb,
more comments on Pearson's r, then recommender systems lec23.ipynb (on collaborative filtering)
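A minimal user-based collaborative-filtering sketch (cosine similarity on a made-up ratings table), just to fix ideas; lec23.ipynb presumably does something more substantial:

    import math

    ratings = {                          # made-up ratings: user -> {item: score}
        'ann': {'A': 5, 'B': 3, 'C': 4},
        'bob': {'A': 4, 'B': 3, 'D': 5},
        'cat': {'B': 1, 'C': 5, 'D': 2},
    }

    def cosine(u, v):
        # cosine similarity computed over the items both users have rated
        common = set(u) & set(v)
        if not common:
            return 0.0
        dot = sum(u[i] * v[i] for i in common)
        norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
        return dot / (norm(u) * norm(v))

    target = 'ann'
    nearest = max((u for u in ratings if u != target), key=lambda u: cosine(ratings[target], ratings[u]))
    # recommend whatever the most similar user rated that the target hasn't seen
    print(nearest, [item for item in ratings[nearest] if item not in ratings[target]])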
Thu 4/28 24. Markov Chains – Briefly covered these notes on Spearman correlation,
Then started Markov chains notes1,
then a text example after "Mark V Shaney", illustrated here (after 5 min of unfortunate technical miscues): Markov.ipynb
(some other examples generated from trigram probability distributions, see also "what would i say" from a few years back)
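A "Mark V Shaney"-style generator is only a few lines once the bigram table is built; a toy sketch (tiny made-up corpus, so the output is not very exciting):

    import random
    from collections import defaultdict

    text = "the cat sat on the mat and the dog sat on the rug and the cat ran".split()
    follows = defaultdict(list)              # word -> list of words observed to follow it
    for w1, w2 in zip(text, text[1:]):
        follows[w1].append(w2)

    word, output = 'the', ['the']
    for _ in range(12):
        if not follows[word]:                # dead end: word never seen with a successor
            break
        word = random.choice(follows[word])  # sample the next word given the current one
        output.append(word)
    print(' '.join(output))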
Week 15 Tue 5/3 25. Simulating Markov processes – Comments on multi-gram language models (for web or books),
Recalled Markov chain notes1, then worked example,
then steady state distributions
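The steady state can be seen numerically by simply iterating the chain; a two-state sketch with an assumed transition matrix:

    # P[i][j] = probability of moving from state i to state j (made-up example)
    P = [[0.9, 0.1],
         [0.5, 0.5]]
    pi = [1.0, 0.0]                        # start entirely in state 0
    for _ in range(100):
        pi = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]
    print(pi)                              # converges to the steady state [5/6, 1/6]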
Thu 5/5 26. Hidden Markov Models, Viterbi algorithm – Returned to Markov.ipynb for further discussion of simulating Markov chains (see starting at the cell labelled 49, about halfway through).
Discussion of genome sequence data, and CpG Islands, and inference problem for observed sequence data (see additional notes in ps8.ipynb)
(mentioned HMM example from Bursty and Hierarchical Structure in Streams and article)
Started working through example of Viterbi by hand
(see also Viterbi algorithm, notebook to be covered next time)
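For comparison with the hand-worked example, a compact Viterbi sketch with made-up CpG-island-flavored probabilities; the class notebook (viterbi.ipynb, covered next lecture) is the version actually used:

    def viterbi(obs, states, start_p, trans_p, emit_p):
        # V[t][s] = (probability of the best path ending in state s at step t, that path)
        V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
        for o in obs[1:]:
            V.append({s: max(((prob * trans_p[prev][s] * emit_p[s][o], path + [s])
                              for prev, (prob, path) in V[-1].items()), key=lambda t: t[0])
                      for s in states})
        return max(V[-1].values(), key=lambda t: t[0])

    states = ('island', 'background')
    start = {'island': 0.5, 'background': 0.5}
    trans = {'island': {'island': 0.8, 'background': 0.2},
             'background': {'island': 0.2, 'background': 0.8}}
    emit = {'island': {'C': 0.4, 'G': 0.4, 'A': 0.1, 'T': 0.1},
            'background': {'C': 0.2, 'G': 0.2, 'A': 0.3, 'T': 0.3}}
    print(viterbi('CGCGATAT', states, start, trans, emit))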
Week 16 Tue 5/10 27. Data Science Redux – Finished working through the example of Viterbi by hand.
Described implementation of Viterbi algorithm in viterbi.ipynb.
Finished with these slides,
including a final exam review and mainly notes taken from chapter 1 of this recent book: "Doing Data Science" (Oct 2013).
Referred to "Teaching Data Science" (arXiv:1604.07397, Apr 2016), "The Rise of Big Data" (Foreign Affairs, May/Jun 2013), "The Sexiest Job ..." (Harvard Business Review, Oct 2012), the data science "Venn diagram", the Wikipedia Data_Science entry, and media hype (NYTimes, 11 Apr 2013).
(A couple of other related refs: "What is Data Science?" (Jun 2010), "Data Skepticism" (Apr 2013))
Week 17 Sat 5/21 Final 2:00-4:30 PM, Hollister Hall B14

Grading:
50% Problem Sets
25% Final exam
20% Midterm exam (in-class)
5% Class participation

Academic integrity policy:
You are expected to abide by the Cornell University Code of Academic Integrity. It is your responsibility to understand and follow these policies. (In particular, the work you submit for course assignments must be your own. You may discuss homework assignments with other students at a high level, for example by discussing general methods or strategies to solve a problem, but you must cite the other student in your submission. Any work you submit must be your own understanding of the solution, the details of which you personally and individually worked out, and written in your own words.)

Advising Notes:
This course is a required course for the Information Science (IS) major in the College of Arts and Sciences and the College of Agriculture and Life Sciences. It can also be used to satisfy the math/stats course requirement for the Information Science minor/concentration. INFO2950 and CS2800 may not both be taken for credit towards the IS major.