CS 5304: Data Science in the Wild

Cornell Tech


Instructors: Stephen Purpura (sp559(at)cornell.edu) and Gary Kazantsev (gk336(at)cornell.edu)

TAs: Sujay Bhatt Hodrali Ramesh and Dipendra Misa


Harvard Business Review calls Data Scientist the most sought-after career of the 21st century. The modern Data Scientist is a leader who assists organizations in building systems and processes to make data-driven decisions. In the wild, Data Scientists are measured on their ability to drive innovation and customer satisfaction cost-effectively, on their communication skills, and on their ability to uncover unique insights.

In this age of explosive growth in computational capacity and digitization, the amount of machine-readable data keeps following a roughly exponential curve, and the capture and analysis of that data grow faster each year. The task of a data scientist is to extract actionable knowledge from data: to understand what is possible and make it happen. In this course, we'll focus on the practical aspects of the field of Data Science. You will learn to work with a comprehensive set of processes and tools for organizing, analyzing, and extracting knowledge from data.

The course will cover the topics needed to solve data-science problems: problem formulation (business understanding), data preparation (collection, sampling, integration, cleaning), data modeling (characterization, model selection, and analysis), implementation (large-scale data processing, feedback loops, QA), and communication (data presentation, visualization). Advanced topics such as Deep Learning will also be presented. Throughout the course, students will complete a full data-science mission four times, carrying out every required step from problem formulation to result presentation.

The course includes programming assignments that use Spark.ML, Google TensorFlow, and other open-source tools for data processing and modeling. Tools for manipulating large amounts of data, such as Apache Spark, will be taught and used in all the assignments.


Lectures: Wednesday, 4:30-7:00PM

Location: Grizzly (3rd floor)

Office hours:
Stephen Purpura: Wednesday, 7:00PM (after the lecture when I teach)
Gary Kazantsev: Wednesday, 7:00PM (after the lecture when I teach)

Prerequisites

To be successful in this course, students need:
1. an undergraduate computer science education in probability, statistics, linear algebra, differential equations, and calculus.
2. exposure to the use of Bayes' rule.
3. the ability to write simple scripts in either Python or Scala.
4. the ability to use Amazon Web Services and to install and use open-source tools.
5. an AWS account. See the course Slack channel for information about obtaining a sponsored account.
6. familiarity with statistical analysis (at the undergraduate level), including the ability to read and create graphs that express distributions and relationships.
7. the ability to work with Spark and to come up to speed with the technology quickly.
Optional: the ability to come up to speed with Google TensorFlow.
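
As a rough self-check for items 2 and 3, the short Python sketch below applies Bayes' rule to a binary hypothesis. It is only an illustration of the expected level, not course material: the function name and the numbers (a 1% prior, a 95% true-positive rate, a 5% false-positive rate) are made up for this example.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) for a binary hypothesis H given evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    All three inputs are illustrative assumptions, not course data.
    """
    # Total probability of the evidence: P(E) = P(E|H)P(H) + P(E|~H)P(~H)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    # Bayes' rule: P(H|E) = P(E|H)P(H) / P(E)
    return likelihood * prior / evidence


p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))  # prints 0.161
```

If it is clear why the posterior stays near 16% even though the likelihood is 95%, the Bayes' rule prerequisite is probably met.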


Ng, Andrew, Machine Learning Yearning

Leskovec, Rajaraman and Ullman, Mining Massive Datasets, 2nd Edition

This is a collection of writings by an active Spark user. It is a reasonable starting point for examples with the Spark 2.0.2 API.
Laskowski, Jacek, Mastering Apache Spark 2.0

This book is an optional reference to provide context. It references an old version of the Spark API.
Nick Pentreath, Machine Learning With Spark

Andrew Ng's Stanford ML Course
Yaser Abu-Mostafa's Caltech course
Smola's CMU course (this is hard)

Additional Books:
Machine Learning: A Probabilistic Perspective, Murphy, MIT 2013
Information Theory, Inference, and Learning Algorithms, MacKay, Cambridge 2003 (free online)
Deep Learning, Goodfellow, Bengio, Courville, MIT 2016 (free online for now)

Papers: (will be discussed in class, more will be added)
High quality advice from a very clever practitioner: "A Few Useful Things to Know about Machine Learning", Domingos, CACM 2012
Required reading for implementation and management staff: "Machine Learning: The High Interest Credit Card of Technical Debt", Sculley et al., NIPS 2014
The mathematical intuition without peer: "On Discriminative vs. Generative Classifiers", Ng, Jordan, NIPS 2002
New-old-school feature learning: "Learning feature representations with k-means", Coates, Ng, 2012
Pitfalls of neural networks: "Intriguing properties of neural networks", Szegedy, Sutskever, Goodfellow et al., ICLR 2014