DS-GA 3001.009: Special Topics in Data Science:
Responsible Data Science

New York University, Center for Data Science, Spring 2019

Lecture: Mondays from 11am-12:40pm; Lab: Thursdays from 5:20pm-6:10pm

Location: 60 5th Avenue, Room 110

Instructor: Julia Stoyanovich, Assistant Professor of Data Science, Computer Science and Engineering.
Office hours Mondays 1:30-3pm or by appointment, at 60 5th Avenue, Room 605.

Section Leader: Udita Gupta. Office hours Thursdays 4-5pm at 60 5th Avenue, Room 663.

Syllabus: pdf

Course Description:

The first wave of data science focused on accuracy and efficiency – on what we can do with data. The second wave focuses on responsibility – on what we should and shouldn’t do. Irresponsible use of data science can cause harm on an unprecedented scale. Algorithmic changes in search engines can sway elections and incite violence; irreproducible results can influence global economic policy; models based on biased data can legitimize and amplify racist policies in the criminal justice system; algorithmic hiring practices can silently and scalably violate equal opportunity laws, exposing companies to lawsuits and reinforcing the feedback loops that lead to a lack of diversity. Therefore, as we develop and deploy data science methods, we are compelled to think about the effects these methods have on individuals, on population groups, and on society at large.

Responsible Data Science is a technical course that tackles the issues of ethics, legal compliance, data quality, algorithmic fairness and diversity, transparency of data and algorithms, privacy, and data protection. The course is developed and taught by Julia Stoyanovich, Assistant Professor at the Center for Data Science and at the Tandon School of Engineering, and member of the NYC Automated Decision Systems Task Force.

Prerequisites: Introduction to Data Science, Introduction to Computer Science, or similar courses.

Background Reading (required)

Background Reading (optional)

Schedule

This weekly schedule is tentative and is subject to change.

Date Topic Materials Assignments
Jan 28 Lecture: Introduction and background
Topics: Course outline, aspects of responsibility in data science through recent examples.
Reading:
“Bias in Computer Systems”, Friedman and Nissenbaum (1996) ACM DL
“Machine Bias”, Angwin, Larson, Mattu, Kirchner (2016) ProPublica
“Data, Responsibly”, Abiteboul and Stoyanovich (2015) ACM SIGMOD blog
slides  
Jan 31 Lab: ProPublica’s Machine Bias jupyter notebook  
Feb 4 Lecture: Fairness
Topics: A taxonomy of fairness definitions; individual and group fairness. The importance of a socio-technical perspective: stakeholders and trade-offs.
Reading:
“Big Data’s Disparate Impact”, Barocas and Selbst (2016) pdf
“Fairness through awareness”, Dwork, Hardt, Pitassi, Reingold, Zemel (2012) ACM DL
“On the (im)possibility of fairness”, Friedler, Scheidegger, Venkatasubramanian (2016) arXiv
slides  
Feb 7 Lab: IBM’s AI Fairness 360 toolkit
Reading:
“Data preprocessing techniques for classification without discrimination”, Kamiran and Calders (2012) pdf
jupyter notebook slides  
Feb 11 Lecture: Fairness
Topics: Impossibility results; causal definitions; fairness beyond classification.
Reading:
“Fair prediction with disparate impact: A study of bias in recidivism prediction instruments”, Chouldechova (2017) arXiv
“Inherent Trade-Offs in the Fair Determination of Risk Scores”, Kleinberg, Mullainathan, Raghavan (2017) pdf
“Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions”, Mitchell, Potash, Barocas (2018) arXiv
slides  
Feb 14 Lab: IBM’s AI Fairness 360 toolkit
Reading:
“Certifying and removing disparate impact”, Feldman, Friedler, Moeller, Scheidegger, Venkatasubramanian (2015) pdf
jupyter notebook slides HW1 assigned
Feb 18 No class, university holiday    
Feb 21 Lab: Fairness and Causality slides  
Feb 25 Lecture: Anonymity and privacy, guest lecture by Daniela Hochfellner
Topics: Overview of responsible data sharing. Anonymization techniques; the limits of anonymization. Harms beyond re-identification.
Reading:
“The Belmont Report” (1979) pdf
“Critical questions for Big Data”, danah boyd and Kate Crawford (2012) pdf
slides HW1 due
Feb 28 Lab: Anonymity and privacy    
Mar 4 Lecture: Anonymity and privacy
Topics: Differential privacy; privacy-preserving synthetic data generation; exploring the privacy / utility trade-off.
   
Mar 7 Lab: Anonymity and privacy   HW2 assigned
Mar 11 Lecture: Data science lifecycle, data profiling
Topics: Overview of the data science lifecycle. Data profiling and validation. Is my dataset “biased”? The limits of data profiling. Data provenance.
   
Mar 14 Lab: Data profiling    
Mar 18 No class, university holiday    
Mar 21 No class, university holiday    
Mar 25 Lecture: Data cleaning
Topics: Qualitative and quantitative error detection. Missing attribute values and imputation. Outlier detection; duplicate detection. Documenting data cleaning transformations.
  HW2 due
Mar 28 Lab: Data cleaning   HW3 assigned
Apr 1 Lecture: Transparency
Topics: Auditing black-box models; explainable machine learning; software testing.
   
Apr 4 Lab: LIME    
Apr 8 Lecture: Transparency
Topics: Online price discrimination, transparency in online ad delivery.
  HW3 due
Apr 11 Lab: Quantitative Input Influence   HW4 assigned
Apr 15 Lecture: From transparency to accountability
Topics: Transparency and accountability. Legal frameworks: GDPR and the right to explanation; NYC ADS transparency law. From auditing to interpretability.
   
Apr 18 Lab: Final review    
Apr 22 Lecture: Final exam (in class)   HW4 due
Apr 25 Lab: Nutritional labels   Project assigned
Apr 29 Lecture: Diversity
Topics: Background on diversity in information retrieval, recommender systems and crowdsourcing; diversity models and algorithms; diversity vs. fairness; trade-offs between diversity and utility.
   
May 2 Lab: Diversity    
May 6 Lecture: Reproducibility    
May 9 Lab: Reproducibility    
May 13 Lecture: Project presentations   Project report due