DS-GA 3001.009: Special Topics in Data Science:
Responsible Data Science

New York University, Center for Data Science, Spring 2020

Lecture: Mondays from 11am-12:40pm; Lab: Mondays from 3:30pm-4:20pm

Location: 60 5th Avenue, Room 110

Instructor: Julia Stoyanovich, Assistant Professor of Data Science, Computer Science and Engineering.
Office hours Mondays 2-3pm or by appointment, online.

Section Leader: Brina Seidel. Office hours Thursdays 3:30-4:30pm or by appointment, online.

Grader: Prasanthi Gurumurthy. Office hours Wednesdays, 10:30-11:30am or by appointment, online.

Syllabus: pdf

Course Description:

The first wave of data science focused on accuracy and efficiency – on what we can do with data. The second wave focuses on responsibility – on what we should and shouldn’t do. Irresponsible use of data science can cause harm on an unprecedented scale. Algorithmic changes in search engines can sway elections and incite violence; irreproducible results can influence global economic policy; models based on biased data can legitimize and amplify racist policies in the criminal justice system; algorithmic hiring practices can silently and scalably violate equal opportunity laws, exposing companies to lawsuits and reinforcing the feedback loops that lead to lack of diversity. Therefore, as we develop and deploy data science methods, we are compelled to think about the effects these methods have on individuals, on population groups, and on society at large.

Responsible Data Science is a technical course that tackles the issues of ethics, legal compliance, data quality, algorithmic fairness and diversity, transparency of data and algorithms, privacy, and data protection. The course is developed and taught by Julia Stoyanovich, Assistant Professor at the Center for Data Science and at the Tandon School of Engineering, and member of the NYC Automated Decision Systems Task Force.

Prerequisites: Introduction to Data Science, Introduction to Computer Science, or similar courses.

Lab Materials: Labs will be conducted using Jupyter Hub. Students should use their NYU NetID to log in, and click the “Assignments” tab to find the material for each week. After lab, links to the notebook for each class will be included on this page.

Background Reading (required)

Background Reading (optional)

Schedule

This weekly schedule is tentative and is subject to change.

Date Topic Materials Assignments  
Jan 27 Lecture: Introduction and background. Algorithmic fairness.
Topics: Course outline, aspects of responsibility in data science through recent examples. Fairness in classification. The importance of a socio-technical perspective: stakeholders and trade-offs.
Reading:
“Bias in Computer Systems”, Friedman and Nissenbaum (1996) ACM DL
“Machine Bias”, Angwin, Larson, Mattu, Kirchner (2016) ProPublica
“Data, Responsibly”, Abiteboul and Stoyanovich (2015) ACM SIGMOD blog
“Fairness through awareness”, Dwork, Hardt, Pitassi, Reingold, Zemel (2012) ACM DL
“On the (im)possibility of fairness”, Friedler, Scheidegger, Venkatasubramanian (2016) arXiv
slides    
Jan 27 Lab: Intro to Jupyter Hub, ProPublica’s Machine Bias notebook    
Feb 3 Lecture: Algorithmic fairness continued.
Topics: Fairness in risk assessment. Fairness in ranking.
Reading:
“Fair prediction with disparate impact: A study of bias in recidivism prediction instruments”, Chouldechova (2017) arXiv
“Inherent Trade-Offs in the Fair Determination of Risk Scores”, J. Kleinberg, S. Mullainathan, M. Raghavan (2017) pdf
“Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions”, Mitchell, Potash, Barocas (2018) arXiv
“Dissecting racial bias in an algorithm used to manage the health of populations”, Obermeyer, Powers, Vogel, Mullainathan (2019) Science
slides    
Feb 3 Lab: IBM’s AI Fairness 360 toolkit
Reading:
“AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias”, R. Bellamy et al. (2018) pdf
“Data preprocessing techniques for classification without discrimination”, F. Kamiran and T. Calders (2012) pdf
“Certifying and removing disparate impact”, M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015) pdf
notebook    
Feb 10 Lecture: Data cleaning
Topics: Overview of data cleaning
Reading:
“Profiling relational data: a survey”, Abedjan, Golab, Naumann (2015) pdf
“Quantitative data cleaning for large databases”, Hellerstein (2008) pdf
slides    
Feb 10 Lab: IBM’s AI Fairness 360 toolkit
Reading:
“FairPrep: Promoting Data to a First-Class Citizen in Studies on Fairness-Enhancing Interventions”, S. Schelter, Y. He, J. Khilnani, and J. Stoyanovich (2019) pdf
notebook HW1 assigned  
Feb 17 No class, university holiday      
Feb 24 Lecture (part 1): Fairness and causality
Topics: Counterfactual fairness
Reading:
“The long road to fairer algorithms”, M. Kusner, J. Loftus (2020) Nature
“Counterfactual fairness”, M. Kusner, J. Loftus, C. Russell, R. Silva (2017) pdf
Lecture (part 2): Data profiling
Topics: Types of data profiling tasks, overview of the relational model
slides(1) slides(2) HW1 due
 
Feb 24 Lab: Data profiling and data cleaning
course project discussion
notebook project assigned  
Mar 2 Lecture (part 1): Data profiling continued
Topics: Discovering uniques, frequent itemset and association rule mining
Lecture (part 2): Anonymity and privacy
Topics: Overview of responsible data sharing. Anonymization techniques; the limits of anonymization. Harms beyond re-identification.
Reading:
“The Belmont Report” (1979) pdf
“Critical questions for Big Data”, danah boyd and Kate Crawford (2012) pdf
slides(1) slides(2)    
Mar 2 Lab: Data profiling and data cleaning notebook    
Mar 9 Lecture: Anonymity and privacy
Topics: Differential privacy; privacy-preserving synthetic data generation; exploring the privacy / utility trade-off.
Reading:
“A firm foundation for private data analysis”, C. Dwork (2011) ACM DL
“Can a set of equations keep U.S. census data private?”, J. Mervis (2019) Science
slides project proposal due  
Mar 9 Lab: DataSynthesizer
Reading:
“DataSynthesizer: Privacy-Preserving Synthetic Datasets”, Ping, Stoyanovich, Howe (2017) ACM DL
notebook HW2 assigned  
Mar 16 No class, university holiday      
Mar 23 Lecture: Ethical frameworks
Reading: “The Belmont Report” (1979) pdf
“The Menlo Report” (2012) pdf
“Chapter 6: Ethics. Bit by Bit: Social Research in the Digital Age”, Matthew Salganik (2017) online
slides    
Mar 23 Lab: Ethical frameworks      
Mar 30 Lecture: Transparency
Topics: Auditing black-box models; explainable machine learning.
Reading:
“Why should I trust you? Explaining the predictions of any classifier”, Ribeiro, Singh, Guestrin (2016) pdf
“Algorithmic transparency via quantitative input influence: theory and experiments with learning systems”, Datta, Sen, Zick (2016) pdf
“A unified approach to interpreting model predictions”, Lundberg and Lee (2017) pdf
slides HW2 due  
Mar 30 Lab: LIME notebook    
Apr 6 Lecture: Transparency
Topics: Discrimination in online ad delivery.
Reading:
“Automated Experiments on Ad Privacy Settings”, Datta, Tschantz, Datta (2015) pdf
slides HW3 assigned;
project report draft due (extended to Apr 8)
 
Apr 6 Lab: SHAP notebook    
Apr 13 Lecture: Transparency
Topics: Discrimination in online ad delivery, continued.
Reading: “Discrimination through optimization: How Facebook’s ad delivery can lead to skewed outcomes”, Ali, Sapiezynski, Bogen, Korolova, Mislove, Rieke (2019) pdf
“Facebook has been charged with housing discrimination by the US government”, Russell Brandom for The Verge, Mar 28, 2019 read online
slides    
Apr 13 Lab: Course project discussion: working through an example of an ADS notebook    
Apr 20 Lecture: RDS in practice: Guest lecture by Robert Cheetham, President and CEO of Azavea
Topics: Project selection
Reading: “How Azavea selects projects”, Robert Cheetham (2019) link
“HunchLab: Under the hood” (2015) link
“Why we sold HunchLab” (2019) link
slides HW3 due  
Apr 20 Lab: Final exam review see slides on NYU Classes    
Apr 27 Lecture: Interpretability
Topics: What is interpretability?
Reading: “The Intuitive Appeal of Explainable Machines”, A. Selbst and S. Barocas (2018) SSRN
“Nutritional Labels for Data and Models”, J. Stoyanovich and B. Howe (2019) pdf
“The Imperative of Interpretable Machines”, J. Stoyanovich, J. Van Bavel, T. West (2020) link
slides Final exam assigned (take-home)  
Apr 27 Lab: Course project discussion: working through an example of an ADS notebook    
May 4 Lecture: Legal and regulatory frameworks
Topics: Data protection, algorithmic impact assessment, regulating Automated Decision Systems (ADS) and AI
Reading: GDPR link
Canadian Directive on Automated Decision-Making link
NYC ADS Task Force Report pdf
“Disparate Impact in Big Data Policing”, A. Selbst (2017) SSRN
“Ensuring a Future that Advances Equity in Algorithmic Employment Decisions”, J. Yang (2019) pdf
slides    
May 4 Lab: Slack, course project discussion      
May 11 Lecture: Project presentations   project report due  
May 11 Lab: Project presentations