Instructors: Stephen Purpura (sp559(at)cornell.edu) and Gary Kazantsev (gk336(at)cornell.edu)
TAs: Sujay Bhatt Hodrali Ramesh and Dipendra Misa
Harvard Business Review calls Data Scientist the most sought-after career of the 21st century. The modern data scientist is a leader who helps organizations build systems and processes for making data-driven decisions. In the wild, data scientists are measured on their ability to drive innovation and customer satisfaction cost-effectively, on their communication skills, and on their ability to uncover unique insights.
In this age of explosive growth in computational capacity and digitization, the amount of machine-readable data follows a roughly exponential curve, and data capture and analysis grow faster each year. The task of a data scientist is to extract actionable knowledge from data: to understand what is possible and make it happen. In this course, we'll focus on the practical aspects of data science. You will learn to work with a comprehensive set of processes and tools for organizing, analyzing, and extracting knowledge from data.
The course will cover the topics needed to solve data-science problems: problem formulation (business understanding), data preparation (collection, sampling, integration, cleaning), data modeling (characterization, model selection, and analysis), implementation (large-scale data processing, feedback loops, QA), and communication (data presentation, visualization). Advanced topics such as deep learning will also be presented. Throughout the course, students will carry out a complete data-science mission four times, each covering all the required steps from problem formulation to result presentation.
The course includes programming assignments using Spark.ML, Google TensorFlow, and other open source tools for data processing and modeling. Tools for manipulating large amounts of data, such as Apache Spark, will be taught and used in all the assignments.
Lectures: Wednesday, 4:30-7:00PM
Location: Grizzly (3rd floor)
Office hours:
Stephen Purpura: Wednesday, 7:00PM (after the lecture when I teach)
Gary Kazantsev: Wednesday, 7:00PM (after the lecture when I teach)
To be successful in this course, students need:
1. an undergraduate computer science education covering probability, statistics, linear algebra, differential equations, and calculus.
2. exposure to the use of Bayes' rule.
3. the ability to write simple scripts in either Python or Scala.
4. the ability to use Amazon Web Services and to install and use open source tools.
5. an AWS account. See the course Slack channel for information about obtaining a sponsored account.
6. familiarity with statistical analysis (at the undergraduate level) and the ability to read and create graphs that express distributions and relationships.
7. the ability to work with Spark and come up to speed with the technology quickly.
Optional: the ability to come up to speed with Google TensorFlow.
Ng, Andrew, Machine Learning Yearning
Leskovec, Rajaraman and Ullman, Mining Massive Datasets, 2nd Edition
This is a collection of writings by an active Spark user. It is a reasonable starting point for examples with the Spark 2.0.2 API.
This book is an optional reference to provide context. It references an old version of the Spark API.
Papers: (will be discussed in class, more will be added)