Machine Learning with Probabilistic Programming

Fall 2020 | Columbia University

Instructor: Alp Kucukelbir
Course Assistant: Gurpreet Singh

Day and Time: Wednesdays, 4:10 p.m. to 6:00 p.m.
Location: Online (Adaptations to online instruction are presented in red.)


Overview

The world is full of noise and uncertainty. To make sense of it, we collect data and ask questions. Is there a tumor in this x-ray scan? What is the root cause of the quality issues at a manufacturing plant? How old is this planet I see through the telescope? Does this drug actually work? To pose and answer such questions, scientists must iterate through a cycle: probabilistically model a system, infer hidden patterns from data, and evaluate how well the model describes reality.

By the end of this course, you will know how to use probabilistic programming to iterate through this cycle effectively. Specifically, you will master

  • modeling real-world phenomena using probability models,
  • using advanced algorithms to infer hidden patterns from data, and
  • evaluating the effectiveness of your analysis.

You will learn to use (and perhaps even contribute to) Pyro throughout this course.
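
As a rough illustration of the model, infer, evaluate cycle described above, here is a minimal sketch in plain Python (deliberately not Pyro, so it runs with the standard library alone). It takes the "Does this drug actually work?" question and assumes a toy Beta-Bernoulli model. The data, the prior, and names such as `simulate_successes` are made up for illustration and are not course material.

```python
import random

# Toy data (made up for illustration): 1 = patient recovered, 0 = did not.
outcomes = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# 1. Model: each outcome ~ Bernoulli(theta), with a uniform Beta(1, 1) prior on theta.
prior_a, prior_b = 1.0, 1.0

# 2. Infer: the Beta prior is conjugate to the Bernoulli likelihood, so the
#    posterior is Beta(prior_a + successes, prior_b + failures) in closed form.
successes = sum(outcomes)
failures = len(outcomes) - successes
post_a = prior_a + successes
post_b = prior_b + failures
posterior_mean = post_a / (post_a + post_b)

# 3. Evaluate: a simple posterior predictive check. Simulate replicated datasets
#    from the posterior and record how often they contain at least as many
#    recoveries as the observed data.
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def simulate_successes(n):
    """Draw theta from the posterior, then count recoveries among n simulated patients."""
    theta = rng.betavariate(post_a, post_b)
    return sum(rng.random() < theta for _ in range(n))


replicated = [simulate_successes(len(outcomes)) for _ in range(2000)]
ppc_p_value = sum(r >= successes for r in replicated) / len(replicated)

print(f"posterior mean of theta: {posterior_mean:.3f}")
print(f"posterior predictive p-value: {ppc_p_value:.3f}")
```

In a probabilistic programming system such as Pyro, the closed-form step 2 would typically be replaced by a general-purpose inference algorithm, which is what lets the same cycle apply to models without conjugate structure.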


Intended Audience

This is a graduate-level course. You should be comfortable with probability and statistics, calculus and linear algebra, and basic numerical optimization. You should be familiar with probabilistic machine learning (for example, you took a class that used Bishop or Murphy). You must be proficient at writing robust software in Python to analyze data. You do not need to be familiar with Pyro.


Readings

The recommended textbook for this course is Essentials of Statistical Inference. In addition, I will distribute course notes and point to other readings as needed. These additional readings will be a critical part of online instruction; be prepared to spend up to an hour studying a paper before each lecture.


Problem Sets

There will be three problem sets, assigned on a two-week schedule. The problem sets are more theoretical in nature and involve minimal programming; they are meant to complement your final project. You are expected to complete all assigned questions. While your problem set grades are a relatively small component of your total grade, working through the problem sets, and often struggling at length with them, is a crucial part of the learning process and will have a major impact on your understanding of the material.

You must submit your problem sets by the end of the class in which they are due.

Moderate collaboration on the problem sets in the form of joint problem solving with one or two classmates is permitted, provided your writeup is your own. Please prepare all written work using LaTeX; I will distribute a template.


Final Project

The focus of this course is the final project. The goal is for you to choose a real-world problem and to loop through the probabilistic modeling cycle using probabilistic programming. You will be expected to write, document, and report your analysis and findings. This will involve a significant amount of programming. Depending on the number of students taking the course for credit, you will work in groups of either two or three. I will provide some suggestions; however, you are encouraged to find a problem in a field that excites you.

You will produce an 8-page final report in the form of a Jupyter notebook. You will present your findings to the class in a short presentation at the end of the term. This project will measure your cumulative understanding of the material while providing you with a supportive environment to try out your new skills. Each student within a group will receive an individual grade, corresponding to their involvement in the project.

The final presentations will involve a pre-recorded video plus a question and answer period on Zoom. Each team should be comfortable recording a presentation on a computer and compiling a video of a specified length.


Seminar Summaries

During the semester you will be expected to attend a seminar given in any department at the university. The speaker of the talk you select should be using probabilistic modeling and some form of statistical inference in their research. Please check with me beforehand to confirm that a particular talk is acceptable. After you have attended the talk, please write a two-page summary and submit it no later than one week after the talk. This summary must include: a review of the talk, a discussion of how probabilistic modeling was used in the proposed research, a critical evaluation of the talk, and suggestions for improvements using probabilistic programming.

Seminar summaries will not be part of the online version of this course.


Course Grade

Your course grade will be calculated as follows.

Component           Percentage
Problem Sets        36%
Project Proposal    10%
Final Project       50%
Participation       4%
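
As a worked example of how these weights combine, here is a minimal sketch. It assumes that each component is scored on a 0 to 100 scale and that the course grade is a plain weighted average; the syllabus does not spell out the exact formula, so treat both as assumptions. The function name `course_grade` and the example scores are hypothetical.

```python
# Weights from the table above, expressed as fractions of the course grade.
WEIGHTS = {
    "Problem Sets": 0.36,
    "Project Proposal": 0.10,
    "Final Project": 0.50,
    "Participation": 0.04,
}


def course_grade(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "Problem Sets": 80,
    "Project Proposal": 90,
    "Final Project": 85,
    "Participation": 100,
}

# 0.36 * 80 + 0.10 * 90 + 0.50 * 85 + 0.04 * 100 = 28.8 + 9.0 + 42.5 + 4.0
print(round(course_grade(example), 2))  # 84.3
```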


Schedule (tentative)

Date          Topic                                                                 Reading             Homework
September 9   Probabilistic programming introduction; statistical inference review  Ch. 1, 2.1, 3.6, 5  PSET1 out
September 16  Project setup and intro to Pyro                                       Pyro docs           PSET1 due
September 23  Statistical modeling                                                  [1]                 PSET2 out
September 30  Approximate Bayesian inference; point optimization                    [1]
October 7     Variational inference                                                 [2]                 PSET2 due
October 14    Markov chain Monte Carlo                                              Ch. 3.7, [3]
October 21    Predictive inference and evaluation                                   Ch. 3.9             PSET3 out; Project Proposal due [NOW DUE FRIDAY, OCTOBER 23]
October 28    Model criticism                                                       [4]
November 4    Guest lecture by Dr. Aaron Schein                                                         PSET3 due
November 11   Pitfalls of probabilistic programming
November 18   Causal inference and probabilistic programming
November 25   University holiday; no class
December 2    Final project presentations
December 9    Final project presentations