Cornell Information Science

Info 2950: Introduction to Data Science

Spring 2017
Tue/Thu 10:10-11:25 AM, Kennedy Hall 116 (Call Auditorium)
4 credits, S/U Optional

This course teaches basic mathematical methods for information science, with applications to data science. Topics include discrete probability, Bayesian methods, graph theory, power law distributions, Markov models, and hidden Markov models. Uses examples and applications from various areas of information science such as the structure of the web, genomics, social networks, natural language processing, and signal processing. Assignments require python programming.

14 Feb: The registrar has designated the final exam date/time for this course as Sun, 21 May 2017, 2:00-4:30 PM in Olin Hall 155

Professor: Paul Ginsparg (242 Gates Hall, ginsparg "at"
Office hours: Wed 1-2 PM (or by appointment)
TAs: Jialu B, Carter B, Marc G, Haotian L, Dean O, Qinru S, Alex W, Kai Y (office hours on Piazza pages)
Course website: (this page)
Prerequisites: An introductory statistics course (from approved list) and an introductory programming class (e.g., CS 1110), or permission of instructor
Highly Recommended: Math 2310 or a similar linear algebra class (but no longer a prereq)
Textbooks: Rosen, Discrete Mathematics and Its Applications, 6th edition, is a basic reference for some of the earlier non-programming material

Date Lecture Reading
Week 1 Thu 1/26 1. Course introduction Lecture 1 slides, IPython notebook class demo: lec1.ipynb (intro)
Instructions for downloading anaconda python are here
There will be some python resources listed on the Piazza course site, the site has a tutorial, and there are other online resources, including the book Think Python.
Rosen ch 2.1-2.2 (online here)
Most recent class syllabus (just for reference): Spring '16
Week 2 Tue 1/31 2. Probability and counting ps0 due on Thu; note Piazza polls for stats and python bootcamps
Repeat Lecture 1 set examples using python: lec2.ipynb
Lecture 2 slides on probability
Rosen ch 6 'Counting' (link given in piazza)
Thu 2/2 3. Conditional Probability, Bayes' Thm Lecture 3 slides on conditional probability and Bayes' thm
(Rosen ch 7.1-7.3, see link in piazza)
Re birthday problem, see nytimes popular version (Strogatz)
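(Illustrative aside, not from the course materials: the birthday problem mentioned above can be checked directly. The function name below is just for this sketch.)

```python
def birthday_collision_prob(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming all days are equally likely and birthdays independent."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (days - k) / days
    return 1.0 - p_distinct

# With only 23 people the probability already exceeds 1/2:
# birthday_collision_prob(23) is about 0.507
```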
Week 3 Tue 2/7 4. Bayes' theorem and applications Lecture 4 slides on applications of Bayes' thm, Bayesian spam filters (Rosen 7.3)
A few more "naive Bayes" resources: Bayesian spam filtering, A plan for spam, and nytimes popular Bayes (Strogatz)
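(Illustrative sketch, not from the lecture slides: the single-word core of a Bayesian spam filter is just Bayes' theorem. The function name and parameters below are invented for this example.)

```python
def p_spam_given_word(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Bayes' theorem for one word: P(spam | word), given the word's
    frequency in spam and in ham, and a prior P(spam)."""
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_ham * (1.0 - p_spam)
    return numerator / denominator
```

With a flat prior (p_spam = 0.5), a word four times more common in spam than in ham yields P(spam | word) = 0.8.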
Thu 2/9 5. Expected value, Variance, Bernoulli trials mentioned Wason selection task,
Lecture 5 slides on E[X], Var[X], Bernoulli distribution (Rosen 7.4),
see also these simulations
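(A minimal simulation sketch in the spirit of the ones linked above, with invented names: sample means and variances of Bernoulli(p) draws should approach the theoretical E[X] = p and Var[X] = p(1-p).)

```python
import random

def simulate_bernoulli(p, trials=100_000, seed=0):
    """Estimate E[X] and Var[X] for X ~ Bernoulli(p) by simulation."""
    rng = random.Random(seed)
    xs = [1 if rng.random() < p else 0 for _ in range(trials)]
    mean = sum(xs) / trials
    var = sum((x - mean) ** 2 for x in xs) / trials
    return mean, var
```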
Week 4 Tue 2/14 6. Text as data, more on Bernoulli First went over ps2 and ps2_supp (urlopen(), Counter(), re.findall(), ...)
Then Lecture 6 slides
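(The re.findall() / Counter() combination mentioned above reduces to a one-liner; this is an illustrative sketch, not the exact ps2 code.)

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase word tokens in a text, using re.findall()
    to tokenize and Counter() to tally."""
    return Counter(re.findall(r"[a-z']+", text.lower()))
```

In ps2-style use, the text would first be fetched with urlopen() and decoded before counting.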
Thu 2/16 7. Statistical significance, p-hacking, and HARKing Discussed midterm poll as Bernoulli process, ps1 solutions, how to calculate word-word correlations in ps2
mentioned xkcd on "p-hacking", possible stats problems in foodlab [also dietary news]
(and 2011 Cornell precognition results, see also time-travel and psych stats and this 2017 overview)
then Lecture 7 slides on exp and log (see also popular logarithms (Strogatz nytimes, includes video link))
Week 5 Tue 2/21 February Break
Thu 2/23 8. more text analysis, more simulations some programming comments plus text analysis (from Berkeley data8 class): lit_chars.ipynb,
some simulations to see role of mean and variance: lec8.ipynb
Week 6 Tue 2/28 9. Geodata, normal distributions, central limit theorem started with basemap demo,
recalled last three slides from lecture 6 on normal distribution (see also sd and normal)
mentioned Galton machine (time-lapse video),
and the Bernoulli->normal derivation, then deviations from normal: on-line dating and Poincaré bread,
started lec9-11.ipynb
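(Illustrative sketch of the Bernoulli->normal story above, with invented names: sums of n Bernoulli(p) trials are approximately normal with mean np and variance np(1-p), which a simulation can confirm.)

```python
import random

def binomial_mean_var(n, p, trials=20_000, seed=1):
    """Simulate sums of n Bernoulli(p) trials and return the sample
    mean and variance of the sums; the central limit theorem says the
    sums are roughly normal with mean np and variance np(1-p)."""
    rng = random.Random(seed)
    sums = [sum(1 for _ in range(n) if rng.random() < p)
            for _ in range(trials)]
    mean = sum(sums) / trials
    var = sum((s - mean) ** 2 for s in sums) / trials
    return mean, var
```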
Thu 3/2 10. visualizing pointwise mutual information, simulating normal distributions more on ps3, then continued lec9-11.ipynb
Week 7 Tue 3/7 11. Normal -> Poisson Continued discussion of Gaussian, Mendel, colorblindness and airport delays from lec9-11.ipynb,
then Lecture 11 slides on Poisson distribution
Thu 3/9 12. Poisson -> Poll Data Continued Lecture 11 slides on Poisson distribution, and simulation of balls_in_bins.ipynb
then Lecture 12 slides on poll data
(Also recalled ENIAC (1946), and then described early UNIVAC data science foray during 1952 election)
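(Illustrative sketch in the spirit of the balls_in_bins simulation above, names invented here: with n balls thrown uniformly into n bins, the count per bin is approximately Poisson with mean 1, so the fraction of empty bins approaches e^-1.)

```python
import math
import random

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def empty_bin_fraction(n_balls, n_bins, trials=2_000, seed=0):
    """Throw n_balls into n_bins uniformly at random, repeatedly;
    the empty-bin fraction should approach poisson_pmf(0, n_balls/n_bins)."""
    rng = random.Random(seed)
    empty = 0
    for _ in range(trials):
        counts = [0] * n_bins
        for _ in range(n_balls):
            counts[rng.randrange(n_bins)] += 1
        empty += sum(1 for c in counts if c == 0)
    return empty / (trials * n_bins)
```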
Week 8 Tue 3/14 13. Time + Graph Theory I discussed ps4 and ps4_supp,
including UTC (last leap second was 31 Dec 2016 23:59:60).
In relation to ps4 prob 3 mentioned Data janitoring (nytimes) and Data carpentry (Mimno).
started graph theory in Lecture 13 slides (Rosen ch 10.1-10.4)
Thu 3/16 14. Graph Theory II Continued graph theory in Lecture 14 slides,
including "Why your friends have more friends than you do" (Feld, 1991; see also nytimes popular version [Strogatz]), and
Eulerian and Hamiltonian circuits (Rosen ch 10.5, see also Seven Bridges of Königsberg and Hamiltonian circuit)
Then discussed "addendum" and "bonus supplement" from ps4_supp [Note: that notebook was updated after lecture to include the class material]
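(Illustrative sketch of Euler's criterion from the lecture: a connected (multi)graph has an Eulerian circuit iff every vertex has even degree, which is exactly why the seven bridges of Königsberg cannot all be crossed in a closed walk. Names below are invented for this example.)

```python
from collections import defaultdict

def has_eulerian_circuit(edges):
    """Euler's criterion: connected and every vertex of even degree."""
    degree = defaultdict(int)
    adjacent = defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacent[u].add(v)
        adjacent[v].add(u)
    if any(d % 2 for d in degree.values()):
        return False
    # connectivity check over the vertices that appear in edges
    start = next(iter(degree))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adjacent[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(degree)

# Königsberg: four land masses A-D joined by seven bridges; the odd
# degrees (5, 3, 3, 3) rule out an Eulerian circuit.
konigsberg = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
              ("A", "D"), ("B", "D"), ("C", "D")]
```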
Week 9 Tue 3/21 15. Graph Algorithms discussed midterm and various counting problems on graphs involving {n \choose m} and Poisson
Lecture 15 slides on planar graphs, distance matrix, adjacency matrix, spanning trees (Rosen ch 10.7, 11.1), and DFS, BFS (Rosen 11.3-11.5)
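(Illustrative sketch of the BFS idea from the slides, with invented names: breadth-first search from a source computes fewest-edge distances, the unweighted version of the distance matrix row for that source.)

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Breadth-first search: shortest-path (fewest-edge) distance from
    source to every reachable node; adjacency maps node -> neighbors."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```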
Thu 3/23 Midterm
Week 10 Tue 3/28 16. Graph Algorithms II, more data scraping Lecture 16 slides on BFS, DFS, DAGs, topological sort, weighted graphs, Kruskal's algorithm (Rosen 11.3-11.5, see also proof of minimal spanning tree algorithms)
Then ps5_supp.ipynb on scraping data for ps5 #3,#4 (c.f. above links to Data janitoring (nytimes) and Data carpentry (Mimno))
Thu 3/30 17. Graph Applications, Power Law Data Lecture 17 slides on Dijkstra's shortest path algorithm (Rosen 10.6), edge/node betweenness centrality, efficient algorithm (pp. 69-83 of Easley/Kleinberg Chpt 3).
Resumed discussion of power law data
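(Illustrative sketch of Dijkstra's algorithm from the slides, with invented names: a priority queue repeatedly settles the closest unvisited node; weights must be non-negative.)

```python
import heapq

def dijkstra(adjacency, source):
    """Dijkstra's shortest paths; adjacency maps each node to a list
    of (neighbor, edge_weight) pairs with non-negative weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already settled closer
        for v, w in adjacency.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```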
Week 11 Tue 4/4 Spring Break
Thu 4/6 Spring Break
Week 12 Tue 4/11 18. Social graphs and small worlds Reprised edge betweenness from updated Lecture 17 slides,
then Lecture 18 slides on properties of social graphs that distinguish them from random graphs (see Easley/Kleinberg Chpt 3 for "triadic closure" and "clustering coefficient" (section 3.1, p.48)), and discussed this example of cumulative advantage
Continued power law discussion (Easley/Kleinberg Chpt 18)
Thu 4/13 19. More power laws Lecture 19 slides:
Commented on Milgram's "small world" experiment, and discussed The Anatomy of the Facebook Social Graph.
Extended power law discussion, specifically the "long tail" (see also Wikipedia entry, and later book, and see nytimes city math [Strogatz])
Then started discussion of preferential attachment (rich-get-richer) model from 18.3 (Easley/Kleinberg p.547)
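(Illustrative sketch of the rich-get-richer idea from E/K 18.3, with invented names: each new node attaches to an existing node chosen with probability proportional to its degree, which sampling uniformly from the list of edge endpoints achieves; the resulting degree distribution is heavy-tailed.)

```python
import random

def preferential_attachment(n_nodes, seed=0):
    """Grow a graph one node at a time; each new node links to one
    earlier node chosen proportionally to current degree."""
    rng = random.Random(seed)
    endpoints = [0, 1]          # start from a single edge 0-1;
                                # each vertex's multiplicity = its degree
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        endpoints.extend([new, target])
    degree = {}
    for v in endpoints:
        degree[v] = degree.get(v, 0) + 1
    return degree
```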
Week 13 Tue 4/18 20. Big Data / spell-correct Started with discussion of heaplaw.ipynb for ps6, recalled look at complete works of Shakespeare (follow "Plain Text UTF-8" link) from lec 8 (mentioned 24:00 min point of radiolab podcast for small sample of coinages).
Then Lecture 20 slides: finished example E/K section 18.7 (pp 555-559),
introduced big data discussion (see The Unreasonable Effectiveness of Data -- copy here [please read, it's fun]),
started to cover spell correct as model for other big data algorithms (see also Norvig video)
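(The spell-correct model above hinges on generating candidate corrections; this is a minimal sketch of the one-edit-away set in the style of Norvig's essay, not the course's exact code.)

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit away from word: deletions, adjacent
    transpositions, single-letter replacements, and insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
```

A corrector would intersect this candidate set with a dictionary of word counts and pick the most frequent candidate.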
Thu 4/20 21. Big data algorithms Continued discussion of "big data" algorithms: Lecture 21 slides,
then returned to linear regression.ipynb (see also Linear_regression and LeastSquaresFitting),
showed some spurious examples, and started discussion of Pearson's r
Week 14 Tue 4/25 22. Correlated data Lecture 22 slides: including more on Pearson's r (illustrated with example of midterm data for this class),
also Spearman correlation (see also these notes), closed with some comments on ps7
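(Illustrative sketch, names invented: Pearson's r measures linear association, and Spearman correlation is just Pearson's r applied to ranks, so it also catches monotone nonlinear relationships. The simple rank routine below ignores tie handling.)

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman_r(xs, ys):
    """Spearman correlation: Pearson's r on the ranks (no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson_r(ranks(xs), ranks(ys))
```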
Thu 4/27 23. More correlated data Lecture 23 slides: Spearman, MAP, MLE (see power_law_fit.ipynb), logistic regression (see also last part of regression.ipynb), Challenger O-ring data,
then started Markov chains, mentioned "Mark V Shaney" and illustrated here: Markov.ipynb
(some other examples generated from trigram probability distributions, see also "what would i say" from a few years back)
Week 15 Tue 5/2 24. Language Models and Markov Chains Lecture 24 slides: Comments on ps7#4, comments on multi-gram language models (for web or books)
re power of modern language models, mentioned Chomsky and Pereira (see also Norvig)
Then Markov chain worked example, and intro to genomic data for probset 8 Markov problem.
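(Illustrative sketch of the worked-example machinery, names invented: repeatedly applying a row-stochastic transition matrix to any starting distribution converges to the chain's steady state.)

```python
def steady_state(P, iterations=200):
    """Approximate the steady-state distribution of a Markov chain
    with row-stochastic transition matrix P (list of lists) by
    iterating pi <- pi P from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi
```

For P = [[0.9, 0.1], [0.5, 0.5]] the balance condition 0.1*pi_1 = 0.5*pi_2 gives the steady state (5/6, 1/6).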
Thu 5/4 25. Hidden Markov Models, Viterbi algorithm Returned to Markov.ipynb for further discussion of simulating Markov chains (see starting at cell labelled 49, about halfway through)
Then Lecture 25 slides: Simulating Markov processes, steady state distributions
Discussion of genome sequence data, and CpG Islands, and inference problem for observed sequence data (see additional notes in ps8.ipynb)
(mentioned HMM example from Bursty and Hierarchical Structure in Streams and article)
example of Viterbi by hand (see also viterbi.ipynb, notebook to be covered next time; or see the three urn example broken out, with in extenso text version)
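(Compact illustrative sketch of the Viterbi recursion done by hand above; the two-state setup in the usage note is invented for this example, not the course's urn model.)

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence,
    via the standard dynamic-programming recursion."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            p, path = max(
                (best[r][0] * trans_p[r][s], best[r][1]) for r in states
            )
            new_best[s] = (p * emit_p[s][o], path + [s])
        best = new_best
    return max(best.values())[1]
```

For example, with sticky states "H"/"C" that mostly emit "1"/"0" respectively, the observations "1", "1", "0" decode to the path H, H, C.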
Week 16 Tue 5/9 26. Data Science Redux Reviewed some added lec25 slides on Markov chains and hidden Markov models.
Finished with these lecture 26 slides:
including final exam review and mainly notes taken from chpt 1 of this recent book: "Doing Data Science" (Oct 2013).
Referred to "Teaching Data Science" (arXiv:1604.07397, Apr 2016), "The Rise of Big Data" (Foreign Affairs, May/Jun 2013), "The Sexiest Job ..." (Harvard Business Review, Oct 2012), the data science "Venn diagram", the Wikipedia Data_Science entry, and media hype (NYTimes, 11 Apr 2013).
(A couple of other related refs: "What is Data Science?" (Jun 2010), "Data Skepticism" (Apr 2013))
Week 18 Sun 5/21 Final exam, 2:00-4:30 PM, Olin Hall 155

Grading:
50% Problem Sets
25% Final exam
20% Midterm exam (in-class)
5% Subjective

Academic integrity policy:
You are expected to abide by the Cornell University Code of Academic Integrity, and it is your responsibility to understand and follow these policies. In particular, the work you submit for course assignments must be your own. You may discuss homework assignments with other students at a high level (for example, discussing general methods or strategies for solving a problem), but you must cite the other student in your submission. Any work you submit must reflect your own understanding of the solution, with details you personally and individually worked out, written in your own words.

Advising Notes:
This course is a required course for the Information Science (IS) major in the College of Arts and Sciences and the College of Agriculture and Life Sciences. It can also be used to satisfy the math/stats course requirement for the Information Science minor/concentration.
[Note for InfoSci majors: INFO2950 and CS2800 may not both be taken for credit towards the IS major.
If (and only if) for some reason you have already taken CS2110 and CS2800, that combination can be petitioned to be used in place of INFO2950.]