INFO 6010 Spring 2013

Information Science

INFO 6010: Computational Methods For Information Science Research

Spring 2013
Fri 9:30am-noon, Location: 301 College Ave (large conf room)
3 credits, S/U Optional

Professor: Paul Ginsparg (452 Phys.Sci.Bldg, ginsparg@cornell.edu)
Office hours: Tue 3-4 PM (or by appointment)
Course website: http://courses.cit.cornell.edu/info6010/ (this page)

Course blurb: Computation is an essential tool for many facets of information science research. Examples of its utility include capture, access and analysis of digital data; visualization of that data for analysis, interpretation and information extraction; construction of user-focused applications; and analysis of textual and sensor-derived information to detect patterns and dynamics of human activities, social interactions and social networks. Effective use of computation requires a mixture of skills including structuring data, accessing data, programming, choosing and applying computational analysis methods, and designing visualizations. This course covers the mixture of these skills with the goal of providing information science graduate and masters students with the appreciation of their utility and the ability to employ them in future research. The course is project-based, allowing students to understand the use of computational methods to pursue research interests.

Prerequisites: Graduate standing. Basic programming experience (at the level of CS1110 or CS1112 or INFO1100, including variables, arrays, strings, loops, conditionals, methods and functions, basic recursion, file IO, object-oriented design, debugging), plus introductory-level background in probability and statistics. Prior knowledge of Python is not required. This course will not teach programming per se, but rather the use of computation methods and tools for data-oriented research tasks.

Note: This course draws both from Physics 7682 / CIS 6229 ("Computational Methods for Nonlinear Systems") and from the former Info 6307 ("Learning from Web Data", which it replaces).

Course topics:

Programming
- Intro to Python and quick review of basic programming skills
- Throughout the semester, introduction/review of advanced programming concepts as needed, e.g.
  - use of, but not the construction of, basic data structures (graphs, trees, hash tables, lists)
  - accessing and formatting data
  - regular expressions and parsing
  - APIs
  - data markup (XML and JSON)
Data manipulation and understanding
- techniques for acquiring data from online sources and/or sensors
- methods for clustering data
- methods for classifying data
- working with preference data (e.g., recommender systems)
- working with textual data (e.g., vector space model)
- working with network data and computation with graphs
- working with usage data (e.g., weblogs)
- working with sensor data (e.g., gps traces, audio, video)
Visualization
- principles and theory
- toolsets

Meeting 1 (Fri 25 Jan 13)

Course overview. Be sure to come with a full-featured laptop.
27 Jan 2013: I'm in the process of setting up Piazza pages for this course, and will send links when available.
In the meantime, here are the mentioned instructions for installing python.
The python.org site has a tutorial, and there are other resources listed in the left margin here.
Please post any other useful pedagogic python resources you find on the course Piazza site.
The ipython demo I ran is here: demo1.ipynb, and the matplotlib gallery is here.
A recent article (subtitled "Should data have a conscience?") about the mentioned map of gun ownership is here.

Here are some notes for assignment 1

Meeting 2 (Fri 1 Feb 13)

More notes to be posted re assignment and readings, but here are the demos from class to import into notebook: trigram.ipynb, lecture2.ipynb. The texts used for the demo were 40textfiles.zip (from Info 4300) and others (Oz, Sherlock, Decl Ind, truncated Sherlock) retrieved from here (which also has useful python "nanotutorials").

Note that it's important to send me your trigram assignment 1 via email, not for grading but so that I have an impression of where everyone stands in order to calibrate the next few weeks. (Let me know also some rough impression, e.g., "easy and fun", "difficult and pointless", "already did it in high school", ...)
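
For orientation, here is a minimal sketch of one common trigram exercise: record which word follows each pair of consecutive words, then generate text by sampling from those followers. This is an assumption about the general idea, not the assignment's actual specification; the function names and the toy sentence are hypothetical.

```python
import random
from collections import defaultdict

def build_trigram_table(words):
    """Map each pair of consecutive words to the list of words that followed it."""
    table = defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        table[(w1, w2)].append(w3)
    return table

def generate(table, seed_pair, length=20, rng=None):
    """Extend seed_pair by repeatedly sampling a follower of the last two words."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    out = list(seed_pair)
    while len(out) < length:
        followers = table.get(tuple(out[-2:]))
        if not followers:  # this pair only occurs at the end of the text
            break
        out.append(rng.choice(followers))
    return " ".join(out)

# Toy text (hypothetical); the class demos used much longer texts.
words = "the cat sat on the mat and the cat sat on the hat".split()
table = build_trigram_table(words)
generated = generate(table, ("the", "cat"), length=8)
print(generated)
```

Keeping repeated followers in the list (rather than a set) means frequent continuations are sampled proportionally more often.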

We started discussing text as data (ubiquitous, useful), and went over Norvig's spell-correct, emphasizing how "big-data" facilitates simple algorithms (see also The Unreasonable Effectiveness of Data). The assignment has instructions for installing the nltk (Natural Language Toolkit) module.

Assignment 2 is here.

Meeting 3 (Fri 8 Feb 13)

During class, I used these slides to continue discussion of "big data" and power laws; and this notebook: assnmt2.ipynb, for my second assignment. We also had some assignment 2 demos from students.
(The article by Pereira I mentioned is here, and the article I'd seen the day before with geographic visualizations of twitter data was this one -- of course there are many of these).

Here is assignment 3.

For assignments that involve code, you should email to me (or post to Piazza, it permits files up to 20Mb) an archive file (zip or tar.gz) containing:

The code, which should be runnable and readably commented. (Other students may want to read or borrow from it.)
A README file describing how to run the code (including required files/libraries, pointers to documentation for APIs used, etc). Acknowledge any piece of code you borrow, and every source, ideally with URL. (It's OK to use others' code in this class, but important to acknowledge it. Note for example that these guidelines for assignments are adapted from Danco's i 6307 ...)
A short (1 page or so) postmortem about your experiences doing the assignment. What was easy, fun, interesting, useful? What was hard, broken, confusing, pointless? How would you do it now if you were starting from scratch? Most importantly, when you're telling other people about what you did, what are the one to three key points they (and you) should remember for future projects?

Meeting 4 (Fri 15 Feb 13)

Will post the assignment 3 notebook and some more notes.

In the meantime, here are slides (didn't make it to the end, will pick up next time)
Assignment 4 is here (important: everyone needs to turn in code for all of the first assignments; this one involves no coding, to permit catch-up on those)

Meeting 5 (Fri 22 Feb 13)

Finished up slides from last time.
Elizabeth's slides on visualization are on the Piazza site.

Meeting 6 (1 Mar 13)

See notes on Piazza site, including link to slides, and refs to Mitchell's demo, and see assignment 5

Meeting 7 (8 Mar 13)

Note that it is not necessary to leave class before noon in order to make it to the Fri AI lunch seminar, which starts at 12:15. (As announced on the first day of class, the timing of this class has been arranged so that the instructor can go to that specific seminar, and has been going every week, never once late ... .)

We went through these notebooks: stylometrics and sentiment analysis, and these slides on k-means, etc. (More info re assignment will be available on Piazza site.)
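
As a rough illustration of k-means (not the code from the slides or notebooks), the sketch below alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The sample points, the random initialization, and all names are assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments can no longer change
            break
        centroids = new_centroids
    return centroids, clusters

# Two hypothetical, well-separated groups of 2-d points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, 2)
print(centroids)
```

The result depends on the initial centroids, so in practice it is common to rerun with several seeds and keep the solution with the smallest total within-cluster distance.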

Meeting (15 Mar 13)

(cancelled due to travel)

22 Mar: "spring" break

Meeting 8 (29 Mar 13)

First here are the notes regarding dimensional reduction (used ubiquitously in data analysis), clarifying a bit the part towards the end, and updated to include as well the Shannon information and decision tree material.
The audio synced to slides I mentioned is here (for brief overall flavor check the roughly two minutes from 37:15-39:15).
These are the readings for next Fri, please read in advance and come prepared to discuss:

Chi (2002)
Loukides (2010)
Boyd/Crawford (2011)
- typical big data news (3 Apr): mobile ads and hardware
- Anderson (2008) 'The end of Theory'
Few (2013)
- Hearst (2008)

Meeting 9 (5 Apr 13)

In lecture we continued the notes regarding the Shannon information, mutual information and decision tree material.
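
For reference, a small sketch of the two quantities, using the standard definitions H(X) = -sum p log2 p and I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from observed counts. The information gain used to pick a decision-tree split is the mutual information between a feature and the class label. The toy labels below are hypothetical and not from the lecture notes.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), in bits, from paired observations."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Hypothetical example: does a term's presence (1/0) tell us the class?
labels      = ["spam", "spam", "ham", "ham"]
informative = [1, 1, 0, 0]   # term appears exactly in the spam documents
useless     = [1, 0, 1, 0]   # term appears equally often in both classes

print(entropy(labels))
print(mutual_information(informative, labels))
print(mutual_information(useless, labels))
```

A feature that determines the label carries the label's full entropy as mutual information, while a feature that is independent of the label carries none.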
We started the discussion of the above readings, and will continue that discussion at the beginning of the next meeting (so please have another look over them).
Preliminary notes for assignment 6

Meeting 10 (12 Apr 13)

Here are the slides about mutual information for finding informative terms,
and here are some of the links discussed:

Maps of Computer Science
"hot new field" of data science (venn diagram)
some of the backlash anticipated by the Boyd/Crawford article:
- data skepticism (and dangerous k-means)
- don't forget intuition
- risks of data science
nytimes re use of zipcodes
It knows (Googlisation of everything, etc)
shodan (expose on-line devices)
Some older articles about tracking open bluetooth devices: Brief encounter networks (2007), The privacy implications of Bluetooth (2008)
How To Break Anonymity of the Netflix Prize Dataset (2006), De-anonymizing Social Networks (2009), Link Prediction by De-anonymization (2011)

Meeting 11 (19 Apr 13)

Some links to things discussed in class:

slate obesity map
For the discussion of weblog data: "From Cookies to Cooks: Insights on Dietary Patterns via Analysis of Web Usage Logs", 1304.3742
As entry point to the social network discussion (and recalls an earlier assignment): "Friendship Paradox Redux: Your Friends Are More Interesting Than You", 1304.3480
and mentioned briefly some local articles that appeared this week: 1304.4837 Ego nets for recommendation (Cosley et al), 1304.4602 discussion threads (Lee, Kleinberg et al)

A couple of notebooks used in class:

code for US map, and
notes on regexps, unix_time, and networkx.
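
As a small illustration of combining regular expressions with unix time for weblog data, the sketch below parses one line in the widely used Apache "common log" style and converts its timestamp to seconds since 1970. The log line, the pattern, and the field names are hypothetical and are not taken from the class notebooks.

```python
import re
from datetime import datetime

# Hypothetical log line in Apache "common log" style.
line = '10.0.0.1 - - [19/Apr/2013:09:30:00 -0400] "GET /recipes/soup HTTP/1.1" 200 5120'

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ '                       # client, identity, user
    r'\[(?P<when>[^\]]+)\] '                        # timestamp in brackets
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '       # request line
    r'(?P<status>\d{3}) (?P<size>\d+|-)'            # status code, bytes sent
)

def parse_line(text):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(text)
    if match is None:
        return None
    record = match.groupdict()
    # %z reads the UTC offset, so timestamp() gives true unix time.
    when = datetime.strptime(record["when"], "%d/%b/%Y:%H:%M:%S %z")
    record["unix_time"] = int(when.timestamp())
    record["status"] = int(record["status"])
    return record

record = parse_line(line)
print(record["host"], record["path"], record["status"], record["unix_time"])
```

Storing times as integers makes it easy to sort requests and compute gaps between them, for example when splitting one user's requests into sessions.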

Here are some notes for assignment 7

Meeting 12 (26 Apr 13)

Mentioned 30 Apr colloquium on Data Privacy (by author of Netflix de-anonymization articles mentioned two weeks ago).
In context of assignment 7, discussed Christopher Lee.
For python stylistic issues, gave overview of What Makes Code Hard to Understand?

Here is the notebook on recommender systems, using del.icio.us and movielens data (adapted from Chpt2 of Programming Collective Intelligence)
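
A minimal sketch of user-based collaborative filtering in the same spirit: score how similar two users are with the Pearson correlation of their shared ratings, then predict unseen items as a similarity-weighted average of other users' ratings. The ratings, user names, and function names are made up for illustration and this is not the notebook's code.

```python
import math

# Hypothetical ratings: user -> {item: score}.
ratings = {
    "ann": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 3.0, "D": 5.0},
    "cat": {"A": 1.0, "B": 5.0, "C": 2.0, "D": 1.0},
}

def pearson(prefs, u, v):
    """Pearson correlation of two users over the items both have rated."""
    shared = [item for item in prefs[u] if item in prefs[v]]
    n = len(shared)
    if n < 2:
        return 0.0
    xs = [prefs[u][i] for i in shared]
    ys = [prefs[v][i] for i in shared]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def recommend(prefs, user):
    """Rank items the user has not rated by similarity-weighted average score."""
    totals, weights = {}, {}
    for other in prefs:
        if other == user:
            continue
        sim = pearson(prefs, user, other)
        if sim <= 0:  # ignore dissimilar users
            continue
        for item, score in prefs[other].items():
            if item not in prefs[user]:
                totals[item] = totals.get(item, 0.0) + sim * score
                weights[item] = weights.get(item, 0.0) + sim
    return sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)

recommendations = recommend(ratings, "ann")
print(recommendations)
```

Here "bob" rates consistently one point below "ann", so the correlation treats them as near-identical in taste, which is the advantage of Pearson over a raw distance.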

Meeting 13 (3 May 13)

After presentations from Andy, Stephanie, and Saeed, I went over the node and link betweenness algorithms following pp.78-82 of E/K Chpt.3, then inserted same graph in this notebook, and described using networkx to navigate the movie actor network.
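
The breadth-first approach to link betweenness can be sketched in plain Python as follows: from each node, count shortest paths on the way out, then push one unit of flow per reachable node back toward the source, splitting it in proportion to the path counts. This is an illustrative implementation under those assumptions, not the class notebook; the example graph (two triangles joined by a bridge) and all names are hypothetical.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Edge betweenness for an undirected, unweighted graph.

    adj maps each node to the set of its neighbours. The score of an edge
    counts unordered node pairs, weighted by the fraction of their shortest
    paths that use the edge.
    """
    score = defaultdict(float)
    for source in adj:
        # Breadth-first search, counting shortest paths to each node.
        dist = {source: 0}
        paths = defaultdict(float)
        paths[source] = 1.0
        parents = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    paths[w] += paths[v]
                    parents[w].append(v)
        # Work back from the farthest nodes, passing flow up to parents.
        flow_in = defaultdict(float)
        for w in reversed(order):
            for v in parents[w]:
                flow = paths[v] / paths[w] * (1.0 + flow_in[w])
                score[frozenset((v, w))] += flow
                flow_in[v] += flow
    # Every pair was counted once from each endpoint.
    return {edge: total / 2.0 for edge, total in score.items()}

# Two triangles (1,2,3) and (4,5,6) joined by the bridge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
         4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
betweenness = edge_betweenness(graph)
print(betweenness[frozenset((3, 4))])
```

The bridge carries every path between the two triangles, so it gets the highest score. networkx provides `edge_betweenness_centrality` and `betweenness_centrality` for the same quantities; note that it normalizes the scores by default.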

(I will also schedule open meetings at 301 college ave so that I can see the rest of the projects, and other students will be welcome to sit in.)