Cornell

Information Science


INFO 6010: Computational Methods For Information Science Research

Spring 2013
Fri 9:30am-noon , Location: 301 College Ave (large conf room)
3 credits, S/U Optional


Professor: Paul Ginsparg (452 Phys.Sci.Bldg, ginsparg@cornell.edu)
Office hours: Tue 3-4 PM (or by appointment)
Course website: http://courses.cit.cornell.edu/info6010/ (this page)

Course blurb: Computation is an essential tool for many facets of information science research. Examples of its utility include capture, access and analysis of digital data; visualization of that data for analysis, interpretation and information extraction; construction of user-focused applications; and analysis of textual and sensor-derived information to detect patterns and dynamics of human activities, social interactions and social networks. Effective use of computation requires a mixture of skills including structuring data, accessing data, programming, choosing and applying computational analysis methods, and designing visualizations. This course covers the mixture of these skills with the goal of providing information science graduate and masters students with the appreciation of their utility and the ability to employ them in future research. The course is project-based, allowing students to understand the use of computational methods to pursue research interests.

Prerequisites: Graduate standing. Basic programming experience (at the level of CS1110 or CS1112 or INFO1100, including variables, arrays, strings, loops, conditionals, methods and functions, basic recursion, file IO, object-oriented design, debugging), plus introductory-level background in probability and statistics. Prior knowledge of Python is not required. This course will not teach programming per se, but rather the use of computation methods and tools for data-oriented research tasks.

Note: This course draws both from Physics 7682 / CIS 6229 ("Computational Methods for Nonlinear Systems") and from the former Info 6307 ("Learning from Web Data", which it replaces).

Course topics:


Meeting 1 (Fri 25 Jan 13)

Course overview. Be sure to come with a full-featured laptop.
27 Jan 2013: I'm in the process of setting up Piazza pages for this course, will send links when available.
In the meantime, here are the mentioned instructions for installing python.
The python.org site has a tutorial, and there are other resources listed in the left margin here.
Please post any other useful pedagogic python resources you find on the course Piazza site.
The ipython demo I ran is here: demo1.ipynb, and the matplotlib gallery is here.
A recent article (subtitled "Should data have a conscience?") about the mentioned map of gun ownership is here.

Here are some notes for assignment 1

Meeting 2 (Fri 1 Feb 13)

More notes to be posted re assignment and readings, but here are the demos from class to import into notebook: trigram.ipynb, lecture2.ipynb. The texts used for the demo were 40textfiles.zip (from Info 4300) and others (Oz, Sherlock, Decl Ind, truncated Sherlock) retrieved from here (which also has useful python "nanotutorials").

Note that it's important to get your trigram assignment 1 to me via email, not for grading but so that I have an impression of where everyone stands in order to calibrate the next few weeks. (Also let me know your rough impression, e.g., "easy and fun", "difficult and pointless", "already did it in high school", ...)

We started discussing text as data (ubiquitous, useful), went over Norvig's spell-correct, emphasizing how "big-data" facilitates simple algorithms (see also The Unreasonable Effectiveness of Data), and the assignment has instructions for installing the nltk (Natural Language Toolkit) module.
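As a rough illustration of how word counts make a simple corrector workable, here is a minimal sketch in the spirit of Norvig's spell-correct. The toy corpus and the names `CORPUS`, `WORD_COUNTS`, `edits1`, `known` and `correct` are assumptions made for this example, and it only considers candidates one edit away; a real corrector would count words from a much larger text collection.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real corrector would use a large text collection.
CORPUS = "the quick brown fox jumps over the lazy dog the dog barks"
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that were seen in the corpus."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Prefer the word itself if known, else the most frequent known word one edit away."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

For example, `correct("teh")` picks `"the"` here because it is the only corpus word one edit away; an unknown word with no known neighbors is returned unchanged.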

Assignment 2 is here.

Meeting 3 (Fri 8 Feb 13)

During class, I used these slides to continue discussion of "big data" and power laws; and this notebook: assnmt2.ipynb, for my second assignment. We also had some assignment 2 demos from students.
(The article by Pereira I mentioned is here, and the article I'd seen the day before with geographic visualizations of twitter data was this one -- of course there are many of these).
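For a quick look at whether word frequencies in a text follow a power law, one option is to fit a straight line to log(frequency) against log(rank). The sketch below is only a rough diagnostic under that assumption (a least-squares fit on log-log axes is not a rigorous power-law estimator), and the function names `rank_frequency` and `loglog_slope` are made up for this example.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank), ranks starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope near -1 on real text would be consistent with Zipf-like behavior; feeding in exactly Zipfian frequencies such as `[1000 / r for r in range(1, 51)]` recovers a slope of -1 up to rounding.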

Here is assignment 3.

For assignments that involve code, you should email to me (or post to Piazza, it permits files up to 20Mb) an archive file (zip or tar.gz) containing:

Meeting 4 (Fri 15 Feb 13)

Will post assignment 3 notebook and some more notes.

In the meantime, here are slides (didn't make it to the end, will pick up next time)
Assignment 4 is here (important: everyone needs to turn in code for all of the first assignments, this one is not coding to permit catch-up on those)

Meeting 5 (Fri 22 Feb 13)

Finished up slides from last time.
Elizabeth's slides on visualization are on the Piazza site.

Meeting 6 (1 Mar 13)

See notes on Piazza site, including link to slides, and refs to Mitchell's demo, and see assignment 5

Meeting 7 (8 Mar 13)

Note that it is not necessary to leave class before noon in order to make it to the Fri AI lunch seminar, which starts at 12:15. (As announced on the first day of class, the timing of this class has been arranged so that the instructor can go to that specific seminar, and has been going every week, never once late ... .)

We went through these notebooks: stylometrics and sentiment analysis, and these slides on k-means, etc. (More info re assignment will be available on Piazza site.)
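To make the k-means idea concrete, here is a minimal pure-Python sketch of the usual alternation between assigning points to the nearest center and moving each center to the mean of its points. It is not taken from the slides; the names `nearest` and `kmeans`, and the choice to pass the starting centers in explicitly, are assumptions for this example.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return distances.index(min(distances))


def kmeans(points, centers, max_iterations=100):
    """Run k-means on tuples of floats, starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster
        # (a center with no points stays where it is).
        new_centers = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else center
            for center, cluster in zip(centers, clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels
```

The result depends on the starting centers, so in practice one would typically try several initializations and keep the best clustering.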

Meeting (15 Mar 13)

(cancelled due to travel)


22 Mar: "spring" break

Meeting 8 (29 Mar 13)

First here are the notes regarding dimensional reduction (used ubiquitously in data analysis), clarifying a bit the part towards the end, and updated to include as well the Shannon information and decision tree material.
The audio synced to slides I mentioned is here (for brief overall flavor check the roughly two minutes from 37:15-39:15).
These are the readings for next Fri, please read in advance and come prepared to discuss:

Meeting 9 (5 Apr 13)

In lecture we continued the notes regarding the Shannon information, mutual information and decision tree material.
We started the discussion of the above readings, and will continue that discussion at the beginning next time (so please have another look over them).
Preliminary notes for assignment 6
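As a small illustration of the Shannon information and decision tree material from lecture, the sketch below computes the entropy of a list of labels and the information gain from splitting on a feature, which is the mutual information between that feature and the labels. The function names `entropy` and `information_gain` are assumptions for this example, and this is not the code for assignment 6.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction from splitting on a feature: H(labels) - H(labels | feature)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - conditional
```

A decision tree built greedily would, at each node, split on the feature with the largest information gain; a feature that separates the labels perfectly gains the full entropy of the labels, and one independent of them gains nothing.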

Meeting 10 (12 Apr 13)

Here are the slides about mutual information for finding informative terms,
and here are some of the links discussed:

Meeting 11 (19 Apr 13)

Some links to things discussed in class:

A couple of notebooks used in class:

Here are some notes for assignment 7

Meeting 12 (26 Apr 13)

Mentioned 30 Apr colloquium on Data Privacy (by author of Netflix de-anonymization articles mentioned two weeks ago).
In context of assignment 7, discussed Christopher Lee.
For python stylistic issues, gave overview of What Makes Code Hard to Understand?

Here is the notebook on recommender systems, using del.icio.us and movielens data (adapted from Chpt2 of Programming Collective Intelligence)
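For readers without the notebook at hand, here is a minimal sketch of user-based collaborative filtering of the kind covered in that chapter: score how similarly two users rate shared items with a Pearson correlation, then predict a missing rating as a similarity-weighted average. The toy `RATINGS` data and the names `pearson` and `predict` are hypothetical and are not taken from the notebook or from the del.icio.us and movielens data.

```python
import math

# Hypothetical toy data: user -> {item: rating}.
RATINGS = {
    "ann": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 3.0, "D": 5.0},
    "cat": {"A": 1.0, "B": 5.0, "C": 2.0, "D": 1.0},
}


def pearson(ratings, u, v):
    """Pearson correlation of two users' ratings over the items both have rated."""
    shared = [item for item in ratings[u] if item in ratings[v]]
    if len(shared) < 2:
        return 0.0
    xs = [ratings[u][item] for item in shared]
    ys = [ratings[v][item] for item in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def predict(ratings, user, item):
    """Similarity-weighted average of the item's ratings from positively correlated users."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        similarity = pearson(ratings, user, other)
        if similarity > 0:
            weighted_sum += similarity * ratings[other][item]
            weight_total += similarity
    return weighted_sum / weight_total if weight_total else None
```

In this toy data "ann" and "bob" rate the shared items in the same pattern, so `predict(RATINGS, "ann", "D")` is driven entirely by bob's rating of "D", while the negatively correlated "cat" is ignored.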

Meeting 13 (3 May 13)

After presentations from Andy, Stephanie, and Saeed, I went over the node and link betweenness algorithms following pp.78-82 of E/K Chpt.3, then inserted same graph in this notebook, and described using networkx to navigate the movie actor network.
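The node betweenness computation can be sketched in plain Python as a breadth-first search from every node that counts shortest paths, followed by a pass back from the farthest nodes that shares credit among predecessors. This is a minimal sketch for undirected, unweighted graphs given as adjacency dictionaries; the function name `node_betweenness` is an assumption for this example, it does not cover link betweenness, and it is not the notebook's code. The networkx functions `betweenness_centrality` and `edge_betweenness_centrality` cover both cases.

```python
from collections import deque


def node_betweenness(graph):
    """Unnormalized betweenness of each node; graph maps node -> iterable of neighbors."""
    betweenness = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search from source, counting shortest paths to every node.
        order = []
        predecessors = {node: [] for node in graph}
        num_paths = {node: 0 for node in graph}
        num_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in graph[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    num_paths[neighbor] += num_paths[node]
                    predecessors[neighbor].append(node)
        # Walk back from the farthest nodes, passing credit to predecessors
        # in proportion to the number of shortest paths through each.
        credit = {node: 0.0 for node in graph}
        for node in reversed(order):
            for pred in predecessors[node]:
                credit[pred] += num_paths[pred] / num_paths[node] * (1.0 + credit[node])
            if node != source:
                betweenness[node] += credit[node]
    # Each unordered pair of endpoints was counted once from each end.
    return {node: value / 2.0 for node, value in betweenness.items()}
```

On a three-node path the middle node lies on the single shortest path between the two ends, so it gets betweenness 1 and the ends get 0.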

(I will also schedule open meetings at 301 college ave so that I can see the rest of the projects, and other students will be welcome to sit in.)