INFO 6010: Computational Methods for Information Science Research
Fri 9:30am-noon, Location: 301 College Ave (large conf room)
3 credits, S/U Optional
Professor: Paul Ginsparg (452 Phys.Sci.Bldg,
Office hours: Tue 3-4 PM (or by appointment)
Course website: http://courses.cit.cornell.edu/info6010/ (this page)
Computation is an essential tool for many facets of information science
research. Examples of its utility include capture, access and analysis
of digital data; visualization of that data for analysis,
interpretation and information extraction; construction of user-focused
applications; and analysis of textual and sensor-derived information
to detect patterns and dynamics of human activities,
social interactions and social networks.
Effective use of computation requires a mixture of skills including
structuring data, accessing data, programming, choosing and applying
computational analysis methods, and designing visualizations.
This course covers this mix of skills, with the goal of
giving information science graduate and master's students an
appreciation of their utility and the ability to employ them in future
research. The course is project-based, allowing students to
understand the use of computational methods to pursue research.
Prerequisites: Graduate standing. Basic programming
experience (at the level of CS1110 or CS1112 or INFO1100, including
variables, arrays, strings, loops, conditionals, methods and
functions, basic recursion, file IO, object-oriented design,
debugging), plus introductory-level background in probability and
statistics. Prior knowledge of Python is not required. This course
will not teach programming per se, but rather the use of computational
methods and tools for data-oriented research tasks.
This course draws both from
Physics 7682 / CIS 6229
("Computational Methods for Nonlinear Systems")
and from the former Info 6307 ("Learning from Web Data", which it replaces).
- Intro to Python and quick review of basic programming skills
- Throughout the semester, introduction/review of advanced programming concepts as needed, e.g.
- use of, but not the construction of, basic data structures (graphs, trees, hash tables, lists)
- accessing and formatting data
- regular expressions and parsing
- data markup (XML and JSON)
- Data manipulation and understanding
- techniques for acquiring data from online sources and/or sensors
- methods for clustering data
- methods for classifying data
- working with preference data (e.g., recommender systems)
- working with textual data (e.g., vector space model)
- working with network data and computation with graphs
- working with usage data (e.g., weblogs)
- working with sensor data (e.g., gps traces, audio, video)
- principles and theory
Meeting 1 (Fri 25 Jan 13)
Course overview. Be sure to come with a full-featured laptop.
27 Jan 2013: I'm in the process of setting up Piazza pages for this course,
and will send links when available.
In the meantime, here are the mentioned instructions for installing python.
The python.org site has a tutorial, and there are other resources listed in the left margin here.
Please post any other useful pedagogic python resources you find on the course Piazza site.
The ipython demo I ran is here: demo1.ipynb, and
the matplotlib gallery is here.
A recent article (subtitled "Should data have a conscience?") about the mentioned
map of gun ownership is here.
Here are some notes for assignment 1
Meeting 2 (Fri 1 Feb 13)
More notes to be posted re assignment and readings,
but here are the demos from class to import into notebook:
The texts used for the demo were 40textfiles.zip (from Info 4300) and others (Oz,
Sherlock, Decl Ind, truncated Sherlock)
here (which also has useful
Note that it's important to get your trigram assignment 1 to me via email, not for grading but so that I have an impression of where everyone stands in order to calibrate the next few weeks. (Let me know also some rough impression, e.g., "easy and fun", "difficult and pointless", "already did it in high school", ...)
We started discussing text as data (ubiquitous, useful),
went over Norvig's spell-correct, emphasizing how "big-data" facilitates simple algorithms (see also
The Unreasonable Effectiveness of Data), and the assignment has instructions for installing the nltk (Natural Language Toolkit) module.
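As a reminder of how little code the spell-correct idea needs, here is a minimal sketch in the spirit of Norvig's corrector. The tiny CORPUS string is made up for illustration (the real corrector draws its word counts from a large text file), and the helper names are this sketch's own choices.

```python
import re
from collections import Counter

# Toy corpus: an assumption for illustration only. More text gives better counts.
CORPUS = "the quick brown fox jumps over the lazy dog the dog barks"

def words(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

COUNTS = Counter(words(CORPUS))

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    """Keep only the candidates that occur in the corpus."""
    return {w for w in candidates if w in COUNTS}

def correct(word):
    """Prefer the word itself, then known words one edit away, by corpus frequency."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: COUNTS[w])

print(correct("teh"), correct("lazzy"), correct("fox"))
```

The only "model" here is a table of word counts, which is the point: a larger corpus improves the corrections without any change to the algorithm.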
Assignment 2 is here.
Meeting 3 (Fri 8 Feb 13)
During class, I used these slides to continue discussion of "big data" and power laws; and this notebook: assnmt2.ipynb, for my second assignment. We also had some assignment 2 demos from students.
(The article by Pereira I mentioned is here, and the article I'd seen the day before with geographic visualizations of twitter data was this one -- of course there are many of these).
Here is assignment 3.
For assignments that involve code, you should email me (or post to Piazza, which permits files up to 20 MB) an archive file (zip or tar.gz) containing:
- The code, which should be runnable and readably commented. (Other students may want to read or borrow from it.)
- A README file describing how to run the code (including required
files/libraries, pointers to documentation for APIs used, etc).
Acknowledge any piece of code you borrow, and every source, ideally with URL.
(It's OK to use others' code in this class, but important to acknowledge it. Note for example that these guidelines for assignments are adapted from Danco's i 6307 ...)
- A short (1 page or so) postmortem about your experiences doing the
assignment. What was easy, fun, interesting, useful? What was hard,
broken, confusing, pointless? How would you do it now if you were
starting from scratch? Most importantly, when you're telling other
people about what you did, what are the one to three key points they
(and you) should remember for future projects?
Meeting 4 (Fri 15 Feb 13)
Will post the assignment 3 notebook and some more notes.
In the meantime, here are slides (didn't make it to the end, will pick up next time)
Assignment 4 is here
(important: everyone needs to turn in code for all of the first assignments; this one involves no coding, to permit catch-up on those)
Meeting 5 (Fri 22 Feb 13)
Finished up slides from last time.
Elizabeth's slides on visualization are on the Piazza site.
Meeting 6 (1 Mar 13)
See notes on Piazza site, including link to slides, and refs to Mitchell's demo, and see assignment 5
Meeting 7 (8 Mar 13)
Note that it is not necessary to leave class before noon in order to make it to
the Fri AI lunch seminar, which starts at 12:15. (As announced on the first day of class, the timing of this class has been arranged so that the instructor can go to that specific seminar, and has been going every week, never once late ... .)
We went through these notebooks:
sentiment analysis, and these
slides on k-means, etc. (More info re assignment will be available on Piazza site.)
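For reference, k-means alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The sketch below is a plain-Python version of that loop (Lloyd's algorithm), not the formulation from the slides. The initialization from the first k points and the sample points are assumptions made to keep the example deterministic.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of equal-length tuples; returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points (an assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
sample = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(sample, k=2)
print(labels, centroids)
```

In practice the result depends on the starting centroids, so random restarts (or a library implementation) are the usual choice for real data.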
Meeting (15 Mar 13)
(cancelled due to travel)
22 Mar: "spring" break
Meeting 8 (29 Mar 13)
First here are the notes regarding dimensional reduction (used ubiquitously in data analysis), clarifying a bit the part towards the end, and updated to include as well the Shannon information and decision tree material.
The audio synced to slides I mentioned is here (for brief overall flavor check the roughly two minutes from 37:15-39:15).
These are the readings for next Fri, please read in advance and come prepared to discuss:
Meeting 9 (5 Apr 13)
In lecture we continued the notes regarding the Shannon information, mutual information and decision tree material.
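A minimal sketch of how these quantities connect: the information gain used to choose a decision-tree split is the drop in Shannon entropy of the class label after splitting on a feature, which equals the mutual information between feature and label. The function names and the four-row toy table are assumptions for illustration, not taken from the notes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H = -sum p_i log2 p_i of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Reduction in entropy of `target` from splitting `rows` (dicts) on `feature`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    # Weighted average entropy of the label within each branch of the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical table: "windy" determines the label, "humid" says nothing about it.
rows = [
    {"windy": True, "humid": True, "play": "no"},
    {"windy": True, "humid": False, "play": "no"},
    {"windy": False, "humid": True, "play": "yes"},
    {"windy": False, "humid": False, "play": "yes"},
]
print(information_gain(rows, "windy", "play"), information_gain(rows, "humid", "play"))
```

A greedy tree builder would split on the feature with the largest gain and then recurse on each branch.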
We started the discussion of the above readings, and will continue that discussion at the beginning next time (so please have another look over them).
Preliminary notes for
Meeting 10 (12 Apr 13)
Here are the slides about mutual information for finding informative terms,
and here are some of the links discussed:
Meeting 11 (19 Apr 13)
Some links to things discussed in class:
- slate obesity map
- For the discussion of weblog data:
"From Cookies to Cooks: Insights on Dietary Patterns via Analysis of Web Usage Logs",
- As entry point to the social network discussion (and recalls an earlier assignment):
"Friendship Paradox Redux: Your Friends Are More Interesting Than You",
- and mentioned briefly some local articles that appeared this week: 1304.4837 Ego nets for recommendation (Cosley et al),
1304.4602 discussion threads (Lee, Kleinberg et al)
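For the social network entry point, the classic friendship paradox (on average, your friends have more friends than you do) is easy to check numerically. The sketch below compares the mean degree with the mean of each node's friends' average degree. The star-shaped edge list is a made-up example, and this is not necessarily how the earlier assignment framed it.

```python
def friendship_paradox(edges):
    """Return (mean degree, mean over nodes of their friends' average degree)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    degree = {n: len(f) for n, f in neighbors.items()}
    mean_degree = sum(degree.values()) / len(degree)
    mean_friend_degree = sum(
        sum(degree[f] for f in friends) / len(friends) for friends in neighbors.values()
    ) / len(neighbors)
    return mean_degree, mean_friend_degree

# Hypothetical star network: one hub connected to three otherwise isolated nodes.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
print(friendship_paradox(star))
```

The gap arises because high-degree nodes appear in many friend lists, so they are overcounted when averaging over friends.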
A couple of notebooks used in class:
- code for US map, and
- notes on regexps, unix_time, and networkx.
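As a small illustration of combining regexps with unix time for weblog data, the sketch below parses one log line and converts its timestamp to seconds since the epoch. The log line is made up and assumed to be in Apache common log format. The class notebooks may have used a different format, and the field names are this sketch's own.

```python
import re
from datetime import datetime, timezone

# Named groups for the fields of a common-log-format line (assumed format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of fields plus unix_time and utc, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    when = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["unix_time"] = int(when.timestamp())  # seconds since 1970-01-01 UTC
    rec["utc"] = when.astimezone(timezone.utc).isoformat()
    rec["status"] = int(rec["status"])
    return rec

# Hypothetical log line.
line = '127.0.0.1 - - [19/Apr/2013:09:30:00 -0400] "GET /info6010/ HTTP/1.1" 200 2326'
print(parse_log_line(line))
```

Storing times as integers makes it simple to sort requests and compute gaps between them, regardless of the time zone each server logged in.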
Here are some notes for assignment 7
Meeting 12 (26 Apr 13)
Mentioned 30 Apr colloquium on Data Privacy (by author of Netflix de-anonymization articles mentioned two weeks ago).
In context of assignment 7, discussed Christopher Lee.
For python stylistic issues, gave overview of
What Makes Code Hard to Understand?
Here is the notebook on recommender systems, using del.icio.us and movielens data (adapted from Chpt2 of
Programming Collective Intelligence)
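The core of a user-based recommender of this kind fits in a few lines: score how similar two users are over the items both have rated, then rank unseen items by a similarity-weighted average of other users' ratings. The sketch below uses Pearson correlation as the similarity. The ratings dictionary is invented for illustration, and the code is a simplified sketch, not the notebook's code.

```python
from math import sqrt

def pearson(prefs, a, b):
    """Pearson correlation of two users' ratings over the items both have rated."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[a][i] for i in shared]
    ys = [prefs[b][i] for i in shared]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / var if var else 0.0

def recommend(prefs, user):
    """Rank items the user has not rated by similarity-weighted average rating."""
    totals, weights = {}, {}
    for other in prefs:
        if other == user:
            continue
        sim = pearson(prefs, user, other)
        if sim <= 0:
            continue  # ignore users with no or negative correlation
        for item, rating in prefs[other].items():
            if item not in prefs[user]:
                totals[item] = totals.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    return sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"A": 5, "B": 1, "C": 4},
    "bob": {"A": 4, "B": 2, "C": 5, "D": 5},
    "cat": {"A": 1, "B": 5, "D": 1, "E": 4},
    "dan": {"A": 5, "B": 1},
}
print(recommend(ratings, "dan"))
```

With only two shared items the correlation is always +1, -1, or 0, so real data needs more overlap (or a minimum-overlap threshold) before the similarities mean much.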
Meeting 13 (3 May 13)
After presentations from Andy, Stephanie, and Saeed, I went over the node and link betweenness algorithms following pp.78-82 of
E/K Chpt.3, then inserted same graph in this
notebook, and described using networkx to navigate the movie actor network.
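The link betweenness computation can be sketched without any library: from each node, run a breadth-first search that counts shortest paths, then walk back from the farthest nodes, giving each node one unit of flow and splitting it among its parents in proportion to their path counts. The code below is a plain-Python sketch of that procedure for unweighted, undirected graphs. The two-triangle example graph is made up and is not the graph used in class.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph given as {node: set(neighbors)}."""
    score = {}
    for source in adj:
        # Breadth-first search, counting shortest paths from source to each node.
        dist = {source: 0}
        paths = {source: 1}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    paths[w] = 0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
        # Walk back from the farthest nodes, splitting each node's flow among its parents.
        flow = {v: 1.0 for v in order}
        for w in reversed(order):
            for v in adj[w]:
                if dist[v] == dist[w] - 1:
                    share = flow[w] * paths[v] / paths[w]
                    edge = tuple(sorted((v, w)))
                    score[edge] = score.get(edge, 0.0) + share
                    flow[v] += share
    # Each pair of nodes was counted from both endpoints, so halve the totals.
    return {e: s / 2 for e, s in score.items()}

# Hypothetical graph: triangles 1-2-3 and 4-5-6 joined by the bridge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
scores = edge_betweenness(graph)
print(max(scores, key=scores.get), scores)
```

For real graphs, networkx provides `edge_betweenness_centrality` and `betweenness_centrality`, which are the practical choice; the sketch is only meant to make the bookkeeping visible.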
(I will also schedule open meetings at 301 college ave so that I can see the rest of the projects, and other students will be welcome to sit in.)