This course teaches basic mathematical methods for information science, with applications to data science. Topics include discrete probability, Bayesian methods, graph theory, power law distributions, Markov models, and hidden Markov models. Uses examples and applications from various areas of information science such as the structure of the web, genomics, social networks, natural language processing, and signal processing. Assignments require Python programming.
Announcements:
– 14 Feb | The registrar has designated the final exam date/time for this course as Sun, 21 May 2017, 2:00-4:30pm in Olin Hall 155
Syllabus:
Week | Date | Lecture | Reading
---|---|---|---
Week 1 | Thu 1/26 | 1. Course introduction |
Lecture 1 slides,
Ipython notebook class demo:
lec1.ipynb (intro) Instructions for downloading anaconda python are here There will be some python resources listed on the Piazza course site, the python.org site has a tutorial, and there are other online resources, including the book Think Python. Rosen ch 2.1-2.2 (online here) Most recent class syllabus (just for reference): Spring '16 |
Week 2 | Tue 1/31 | 2. Probability and counting | ps0 due on Thu, note Piazza polls for stats and Python bootcamps. Repeat Lecture 1 set examples using Python: lec2.ipynb. Lecture 2 slides on probability. Rosen ch 6 'Counting' (link given in Piazza) |
Thu 2/2 | 3. Conditional Probability, Bayes' Thm | Lecture 3 slides on conditional probability and Bayes' thm (Rosen ch 7.1-7.3, see link in piazza) Re birthday problem, see nytimes popular version (Strogatz) | |
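The birthday problem from lecture can be checked both exactly and by simulation; a small illustrative sketch (not from the course materials):

```python
import random

def birthday_exact(n):
    """Exact P(shared birthday) via the complement: 1 - 365!/(365^n (365-n)!)."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (365 - k) / 365
    return 1 - p_distinct

def birthday_sim(n, trials=20_000, seed=0):
    """Estimate the same probability by simulating rooms of n random birthdays."""
    rng = random.Random(seed)
    hits = sum(
        len(set(rng.randrange(365) for _ in range(n))) < n  # a repeat occurred
        for _ in range(trials))
    return hits / trials

print(round(birthday_exact(23), 3))  # 0.507
print(birthday_sim(23))              # close to the exact value
```

With 23 people the probability of a shared birthday already exceeds 1/2, the counterintuitive result discussed in the Strogatz piece.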
Week 3 | Tue 2/7 | 4. Bayes' theorem and applications | Lecture 4 slides on applications of Bayes' thm,
Bayesian spam filters (Rosen 7.3) A few more "naive Bayes" resources: Bayesian spam filtering, A plan for spam, and nytimes popular Bayes (Strogatz) |
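The Rosen 7.3 spam-filter calculation reduces to a single application of Bayes' theorem; a minimal sketch (the likelihoods below are made up for illustration):

```python
def spam_posterior(p_word_spam, p_word_ham, prior_spam=0.5):
    """P(spam | word appears), via Bayes' theorem."""
    num = p_word_spam * prior_spam
    return num / (num + p_word_ham * (1 - prior_spam))

# If "offer" appears in 25% of spam but only 1% of ham (equal priors):
print(round(spam_posterior(0.25, 0.01), 3))  # 0.962
```

A naive Bayes filter combines many such word probabilities by (naively) treating the words as independent given the class.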
Thu 2/9 | 5. Expected value, Variance, Bernoulli trials | mentioned Wason selection task, Lecture 5 slides on E[X], Var[X], Bernoulli distribution (Rosen 7.4), see also these simulations | |
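The Bernoulli formulas E[X] = p and Var[X] = p(1-p) are easy to confirm by simulation, in the spirit of the linked demos (a seeded sketch, not course code):

```python
import random

rng = random.Random(1)
p, n = 0.3, 100_000
xs = [1 if rng.random() < p else 0 for _ in range(n)]   # Bernoulli(p) draws
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
print(round(mean, 2), round(var, 2))  # close to p = 0.3 and p(1-p) = 0.21
```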
Week 4 | Tue 2/14 | 6. Text as data, more on Bernoulli |
First went over ps2 and ps2_supp
(urlopen(), Counter(), re.findall(), ...) Then Lecture 6 slides |
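The ps2 pipeline (fetch text, tokenize with re.findall(), count with Counter()) can be illustrated offline, with a literal string standing in for the urlopen() result; the sample text here is made up:

```python
import re
from collections import Counter

text = "The quick brown fox jumps over the lazy dog. The dog barks."
words = re.findall(r"[a-z']+", text.lower())   # tokenize into lowercase words
counts = Counter(words)
print(counts.most_common(2))  # [('the', 3), ('dog', 2)]
```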
Thu 2/16 | 7. Statistical significance, p-hacking, and HARKing |
Discussed midterm poll as a Bernoulli process, ps1 solutions, and how to calculate word-word correlations in ps2; mentioned xkcd on "p-hacking", possible stats problems in foodlab [also dietary news] (and 2011 Cornell precognition results, see also time-travel and psych stats and this 2017 overview), then Lecture 7 slides on exp and log (see also popular logarithms (Strogatz nytimes, includes video link)) | |
Week 5 | Tue 2/21 | A Break | |
Thu 2/23 | 8. more text analysis, more simulations | some programming comments plus text analysis (from Berkeley data8 class):
lit_chars.ipynb, some simulations to see role of mean and variance: lec8.ipynb | |
Week 6 | Tue 2/28 | 9. Geodata, normal distributions, central limit theorem | started with basemap demo, recalled last three slides from lecture 6 on normal distribution (see also sd and normal) mentioned Galton machine (time-lapse video), and Bernoulli->normal derivation then deviations from normal: on-line dating and Poincare bread, started lec9-11.ipynb |
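The Galton machine is a physical central limit theorem: each ball's final bin is a sum of independent left/right steps, so the bin counts approach a normal curve. A seeded sketch (illustrative only):

```python
import random

rng = random.Random(2)
n, reps = 100, 20_000
# Each ball takes n left/right steps; the final bin is Binomial(n, 1/2),
# which the central limit theorem says is approximately Normal(n/2, n/4).
sums = [sum(rng.randrange(2) for _ in range(n)) for _ in range(reps)]
mu, sd = n / 2, (n / 4) ** 0.5
within_1sd = sum(abs(s - mu) <= sd for s in sums) / reps
print(within_1sd)  # near the normal value 0.68 (a bit higher since the binomial is discrete)
```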
Thu 3/2 | 10. visualizing pointwise mutual information, simulating normal distributions | more on ps3, then continued lec9-11.ipynb | |
Week 7 | Tue 3/7 | 11. Normal -> Poisson | Continued discussion of Gaussian, Mendel, colorblindness and airport delays from
lec9-11.ipynb, then Lecture 11 slides on Poisson distribution |
Thu 3/9 | 12. Poisson -> Poll Data |
Continued Lecture 11 slides on Poisson distribution,
and simulation of
balls_in_bins.ipynb, then Lecture 12 slides on poll data (Also recalled ENIAC (1946), and then described the early UNIVAC data science foray during the 1952 election) | |
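The balls-in-bins simulation has a classic Poisson check: with n balls thrown into n bins, each bin's count is approximately Poisson(1), so a fraction e^-1 of bins stay empty. A small sketch along those lines (not the course notebook):

```python
import math
import random

rng = random.Random(3)
n = 100_000
bins = [0] * n
for _ in range(n):              # throw n balls into n bins uniformly at random
    bins[rng.randrange(n)] += 1
empty_frac = bins.count(0) / n
# Poisson(1) gives P(count = 0) = e^{-1} ≈ 0.368
print(round(empty_frac, 3), round(math.exp(-1), 3))
```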
Week 8 | Tue 3/14 | 13. Time + Graph Theory I | discussed ps4 and ps4_supp, including UTC (last leap second was 31 Dec 2016 23:59:60). In relation to ps4 prob 3 mentioned Data janitoring (nytimes) and Data carpentry (Mimno). Started graph theory in Lecture 13 slides (Rosen ch 10.1-10.4) |
Thu 3/16 | 14. Graph Theory II | Continued graph theory in Lecture 14 slides, including "Why your friends have more friends than you do" (Feld, 1991; see also nytimes popular version [Strogatz]), and Eulerian and Hamiltonian circuits (Rosen ch 10.5, see also Seven Bridges of Königsberg and Hamiltonian circuit) Then discussed "addendum" and "bonus supplement" from ps4_supp [Note: that notebook was updated after lecture to include the class material] | |
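Feld's "friends have more friends" observation can be seen on a toy star graph: sampling a random *friend* is biased toward high-degree nodes. A sketch on a hypothetical 5-node graph:

```python
from collections import defaultdict

# Hypothetical star graph: node 0 knows everyone else.
edges = [(0, i) for i in range(1, 5)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

deg = {v: len(nbrs) for v, nbrs in adj.items()}
avg_degree = sum(deg.values()) / len(deg)
# Average degree of a randomly chosen *friend* (endpoint of a random edge stub):
avg_friend_degree = sum(deg[w] for v in adj for w in adj[v]) / sum(deg.values())
print(avg_degree, avg_friend_degree)  # 1.6 2.5 -- your friends have more friends
```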
Week 9 | Tue 3/21 | 15. Graph Algorithms |
discussed midterm and various counting problems on graphs involving {n \choose m} and Poisson Lecture 15 slides on planar graphs, distance matrix, adjacency matrix, spanning trees (Rosen ch 10.7, 11.1), and DFS, BFS (Rosen 11.3-11.5) |
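BFS from the lecture computes shortest path lengths layer by layer using a queue; a minimal sketch on a made-up 5-node graph:

```python
from collections import deque

def bfs_distances(adj, source):
    """Breadth-first search: shortest path lengths (in edges) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:           # first visit is via a shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_distances(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

DFS differs only in replacing the queue with a stack, which explores depth-first rather than by layers.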
Thu 3/23 | Midterm | |
Week 10 | Tue 3/28 | 16. Graph Algorithms II, more data scraping | Lecture 16 slides on BFS, DFS, DAGs, topological sort, weighted graphs, Kruskal's algorithm
(Rosen 11.3-11.5, see also proof of
minimal spanning tree algorithms) Then ps5_supp.ipynb on scraping data for ps5 #3,#4 (c.f. above links to Data janitoring (nytimes) and Data carpentry (Mimno)) |
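Kruskal's algorithm from the slides sorts edges by weight and keeps each edge that joins two different components, tracked with union-find; a sketch on a hypothetical 4-node weighted graph:

```python
def kruskal(n, edges):
    """Minimum spanning tree via Kruskal's algorithm with union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    mst, total = [], 0
    for w, u, v in sorted(edges):           # consider edges in weight order
        ru, rv = find(u), find(v)
        if ru != rv:                        # keep it only if it joins two trees
            parent[ru] = rv
            mst.append((u, v))
            total += w
    return mst, total

edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (3, 1, 3), (5, 2, 3)]  # (weight, u, v)
print(kruskal(4, edges))  # ([(0, 1), (1, 2), (1, 3)], 6)
```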
Thu 3/30 | 17. Graph Applications, Power Law Data |
Lecture 17 slides
Dijkstra's shortest path algorithm (Rosen 10.6), edge/node betweenness centrality, efficient algorithm
(pp. 69-83 of Easley/Kleinberg Chpt 3). Resumed discussion of power law data | |
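Dijkstra's algorithm (Rosen 10.6) maintains a priority queue of tentative distances, always settling the closest unsettled node next; a minimal heapq sketch on a made-up weighted graph:

```python
import heapq

def dijkstra(adj, source):
    """Shortest-path distances from source on a non-negatively weighted graph."""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue                        # stale queue entry, skip it
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

adj = {'a': [('b', 4), ('c', 2)],
       'b': [('c', 1), ('d', 5)],
       'c': [('b', 1), ('d', 8)],
       'd': []}
print(dijkstra(adj, 'a'))  # shortest distances: a=0, b=3, c=2, d=8
```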
Week 11 | Tue 4/4 | Spring Break | |
Thu 4/6 | |||
Week 12 | Tue 4/11 | 18. Social graphs and small worlds |
Reprised edge betweenness from updated Lecture 17 slides, then Lecture 18 slides on properties of social graphs that distinguish them from random graphs (see Easley/Kleinberg Chpt 3 for "triadic closure" and "clustering coefficient" (section 3.1, p.48)), and discussed this example of cumulative advantage. Continued power law discussion (Easley/Kleinberg Chpt 18) |
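The clustering coefficient from E/K section 3.1 is the fraction of a node's neighbor pairs that are themselves linked (triadic closure makes it high in social graphs); a sketch on a made-up 4-node graph:

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = list(adj[v])
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)

adj = {'a': {'b', 'c', 'd'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}, 'd': {'a'}}
print(round(clustering_coefficient(adj, 'a'), 3))  # 1 of 3 neighbor pairs linked: 0.333
```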
Thu 4/13 | 19. More power laws |
Lecture 19 slides: Commented on Milgram's "small world" experiment, and discussed The Anatomy of the Facebook Social Graph. Extended power law discussion, specifically the "long tail" (see also Wikipedia entry, and later book, and see nytimes city math [Strogatz]). Then started discussion of the preferential attachment (rich-get-richer) model from 18.3 (Easley/Kleinberg p.547) | |
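The rich-get-richer model of E/K 18.3 can be simulated with one trick: keeping a flat list of edge endpoints makes a uniform draw automatically proportional to degree. A seeded sketch (graph size and seed are made up):

```python
import random
from collections import Counter

rng = random.Random(4)
# A uniform draw from the endpoint list picks a node with probability
# proportional to its degree (preferential attachment).
endpoints = [0, 1]                  # start from a single edge 0--1
for new in range(2, 10_000):
    target = rng.choice(endpoints)  # rich-get-richer choice
    endpoints.extend([new, target])

deg = Counter(endpoints)
print(max(deg.values()), round(sum(deg.values()) / len(deg), 1))  # hub degree vs avg ~2
```

The heavy-tailed degree distribution (a few huge hubs, most nodes with degree 1 or 2) is the power law behavior discussed in lecture.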
Week 13 | Tue 4/18 | 20. Big Data / spell-correct |
Started with discussion of
heaplaw.ipynb for ps6, recalled look at complete works of Shakespeare (follow "Plain Text UTF-8" link) from lec 8 (mentioned 24:00 min point of
radiolab podcast for small sample of coinages). Then Lecture 20 slides: finished example E/K section 18.7 (pp 555-559), introduced big data discussion (see The Unreasonable Effectiveness of Data -- copy here [please read, it's fun]), started to cover spell correct as model for other big data algorithms (see also Norvig video) |
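Norvig's spell-corrector picks the most frequent known word within a small edit distance of the input; a stripped-down sketch with a toy corpus (the real version trains on a large text and also considers distance-2 edits):

```python
from collections import Counter

WORDS = Counter("the cat sat on the mat the cat ran".split())

def edits1(word):
    """All strings one delete/transpose/replace/insert away from word."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Most frequent known word within one edit (or the word itself)."""
    if word in WORDS:
        return word
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=lambda w: WORDS[w])

print(correction('teh'), correction('caat'))  # the cat
```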
Thu 4/20 | 21. Big data algorithms |
Continued discussion of "big data" algorithms:
Lecture 21 slides, then returned to linear regression.ipynb (see also Linear_regression and LeastSquaresFitting), showed some spurious examples, and started discussion of Pearson's r | |
Week 14 | Tue 4/25 | 22. Correlated data | Lecture 22 slides: including more on
Pearson's r
(illustrated with example of midterm data for this class), also Spearman correlation (see also these notes), closed with some comments on ps7 |
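Pearson's r measures linear association of the raw values; Spearman's correlation is just Pearson's r applied to the ranks. A sketch with made-up data (no tie correction):

```python
def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman_rho(xs, ys):
    """Pearson's r on ranks (ties broken by position in this sketch)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson_r(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 3), round(spearman_rho(xs, ys), 3))  # 0.775 0.9
```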
Thu 4/27 | 23. More correlated data |
Lecture 23 slides:
Spearman, MAP, MLE (see power_law_fit.ipynb),
logistic regression (see also last part of
regression.ipynb), Challenger o-ring data, then started Markov chains, mentioned "Mark V Shaney" and illustrated here: Markov.ipynb (some other examples generated from trigram probability distributions, see also "what would i say" from a few years back) | |
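The Markov.ipynb text-generation idea in miniature: learn which words follow which, then random-walk the chain (the sample sentence below is made up, and this sketch uses single-word states; Mark V Shaney used two-word states):

```python
import random
from collections import defaultdict

text = ("the cat sat on the mat and the cat ran to the dog "
        "and the dog sat on the mat").split()

# Bigram Markov chain: record which words follow each word in the source.
chain = defaultdict(list)
for w1, w2 in zip(text, text[1:]):
    chain[w1].append(w2)

rng = random.Random(0)
word, out = "the", ["the"]
for _ in range(8):              # random-walk the chain for 8 steps
    word = rng.choice(chain[word])
    out.append(word)
print(" ".join(out))            # plausible-looking nonsense in the source's style
```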
Week 15 | Tue 5/2 | 24. Language Models and Markov Chains |
Lecture 24 slides:
Comments on ps7#4, comments on multi-gram language models (for web or books) regarding the power of modern language models, mentioned Chomsky and Pereira (see also Norvig). Then a Markov chain worked example, and intro to genomic data for the ps8 Markov problem.
Thu 5/4 | 25. Hidden Markov Models, Viterbi algorithm |
Returned to Markov.ipynb for further discussion of simulating Markov chains (see starting at cell labelled 49, about halfway through). Then Lecture 25 slides: Simulating Markov processes, steady state distributions. Discussion of genome sequence data, and CpG Islands, and the inference problem for observed sequence data (see additional notes in ps8.ipynb) (mentioned HMM example from Bursty and Hierarchical Structure in Streams and article). Example of Viterbi by hand (see also viterbi.ipynb, notebook to be covered next time; or see the three-urn example broken out, with an in extenso text version) | |
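The by-hand Viterbi computation translates directly to code: for each state, keep the probability of the best path ending there. A sketch on a hypothetical two-coin HMM (a fair coin and one biased toward heads; this is not the three-urn example from lecture):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden state path for an observation sequence."""
    # V[t][s] = (prob of best path ending in s at time t, that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][r][0] * trans_p[r][s] * emit_p[s][o], V[-1][r][1])
                for r in states)
            layer[s] = (prob, path + [s])
        V.append(layer)
    return max(V[-1].values())

states = ('fair', 'biased')
start_p = {'fair': 0.5, 'biased': 0.5}
trans_p = {'fair': {'fair': 0.9, 'biased': 0.1},
           'biased': {'fair': 0.1, 'biased': 0.9}}
emit_p = {'fair': {'H': 0.5, 'T': 0.5},
          'biased': {'H': 0.9, 'T': 0.1}}
prob, path = viterbi('HHHHTTTT', states, start_p, trans_p, emit_p)
print(path)  # 'biased' for the four heads, then 'fair' for the four tails
```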
Week 16 | Tue 5/9 | 26. Data Science Redux |
Reviewed some added lec25 slides on Markov chains and hidden Markov models. Finished with these lecture 26 slides: including final exam review and mainly notes taken from chpt 1 of this recent book: "Doing Data Science" (Oct 2013). Referred to "Teaching Data Science" (arXiv:1604.07397, Apr 2016), "The Rise of Big Data" (Foreign Affairs, May/Jun 2013), "The Sexiest Job ..." (Harvard Business Review, Oct 2012), the data science "Venn diagram", the Wikipedia Data_Science entry, and media hype (NYTimes, 11 Apr 2013). (A couple of other related refs: "What is Data Science?" (Jun 2010), "Data Skepticism" (Apr 2013)) |
Week 18 | Sun 5/21 | Final | 2:00PM-4:30PM, Olin Hall 155 |
Grading:
50% Problem Sets
25% Final exam
20% Midterm exam (in-class)
5% Subjective
Academic integrity policy:
You are expected to abide by the
Cornell University Code of Academic
Integrity.
It is your responsibility to understand and follow these policies.
(In particular, the work you submit for course assignments must be your own.
You may discuss homework assignments with other students at a high
level, for example by discussing general methods or strategies to
solve a problem, but you must cite the other student in your submission.
Any work you submit must be your own understanding of the solution,
the details of which you personally and individually worked out, and
written in your own words.)