DS-GA 3001.009: Special Topics in Data Science:
Responsible Data Science

New York University, Center for Data Science, Spring 2020

Lecture: Mondays from 11am-12:40pm; Lab: Mondays from 3:30pm-4:20pm

Location: 60 5th Avenue, Room 110

Instructor: Julia Stoyanovich, Assistant Professor of Data Science, Computer Science and Engineering.
Office hours Mondays 2-3pm or by appointment, online.

Section Leader: Brina Seidel. Office hours Thursdays 3:30-4:30pm or by appointment, online.

Grader: Prasanthi Gurumurthy. Office hours Wednesdays, 10:30-11:30am or by appointment, online.

Syllabus: pdf

Course Description:

The first wave of data science focused on accuracy and efficiency – on what we can do with data. The second wave focuses on responsibility – on what we should and shouldn’t do. Irresponsible use of data science can cause harm on an unprecedented scale. Algorithmic changes in search engines can sway elections and incite violence; irreproducible results can influence global economic policy; models based on biased data can legitimize and amplify racist policies in the criminal justice system; algorithmic hiring practices can silently and scalably violate equal opportunity laws, exposing companies to lawsuits and reinforcing the feedback loops that lead to lack of diversity. Therefore, as we develop and deploy data science methods, we are compelled to think about the effects these methods have on individuals, on population groups, and on society at large.

Responsible Data Science is a technical course that tackles the issues of ethics, legal compliance, data quality, algorithmic fairness and diversity, transparency of data and algorithms, privacy, and data protection. The course is developed and taught by Julia Stoyanovich, Assistant Professor at the Center for Data Science and at the Tandon School of Engineering, and member of the NYC Automated Decision Systems Task Force.

Prerequisites: Introduction to Data Science, Introduction to Computer Science, or similar courses.

Lab Materials: Labs will be conducted using Jupyter Hub. Students should use their NYU NetID to log in, and click the “Assignments” tab to find the material for each week. After lab, links to the notebook for each class will be included on this page.

Background Reading (required)

Barocas and Selbst (2016) “Big Data’s Disparate Impact” pdf
White House Report on Big Data (2014) “Big Data: Seizing Opportunities, Preserving Values” pdf
Brauneis and Goodman (2017) “Algorithmic Transparency for the Smart City” pdf
Kroll et al. (2017) “Accountable Algorithms” pdf

Background Reading (optional)

Matthew Salganik “Bit by Bit: Social Research in the Digital Age” (read online)
Cathy O’Neil “Weapons of Math Destruction”
Frank Pasquale “The Black Box Society”
Virginia Eubanks “Automating Inequality”

Schedule

See Spring 2019 schedule, slides, labs: DS-GA 3001.009: Special Topics in Data Science: Responsible Data Science

This weekly schedule is tentative and is subject to change.

Date	Topic	Materials	Assignments
Jan 27	Lecture: Introduction and background. Algorithmic fairness. Topics: Course outline, aspects of responsibility in data science through recent examples. Fairness in classification. The importance of a socio-technical perspective: stakeholders and trade-offs. Reading: “Bias in Computer Systems”, Friedman and Nissenbaum (1996) ACM DL “Machine Bias”, Angwin, Larson, Mattu, Kirchner (2016) ProPublica “Data, Responsibly”, Abiteboul and Stoyanovich (2015) ACM SIGMOD blog “Fairness through awareness”, Dwork, Hardt, Pitassi, Reingold, Zemel (2012) ACM DL “On the (im)possibility of fairness”, Friedler, Scheidegger, Venkatasubramanian (2016) arXiv	slides
Jan 27	Lab: Intro to Jupyter Hub, ProPublica’s Machine Bias	notebook
Feb 3	Lecture: Algorithmic fairness continued. Topics: Fairness in risk assessment. Fairness in ranking. Reading: “Fair prediction with disparate impact: A study of bias in recidivism prediction instruments”, Chouldechova (2017) arXiv “Inherent Trade-Offs in the Fair Determination of Risk Scores”, J. Kleinberg, S. Mullainathan, M. Raghavan (2017) pdf “Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions”, Mitchell, Potash, Barocas (2018) arXiv “Dissecting racial bias in an algorithm used to manage the health of populations”, Obermeyer, Powers, Vogel, Mullainathan (2019) Science	slides
Feb 3	Lab: IBM’s AI Fairness 360 toolkit Reading: “AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias”, R. Bellamy et al. (2018) pdf “Data preprocessing techniques for classification without discrimination”, F. Kamiran and T. Calders (2012) pdf “Certifying and removing disparate impact”, M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015) pdf	notebook
Feb 10	Lecture: Data cleaning Topics: Overview of data cleaning Reading: “Profiling relational data: a survey”, Abedjan, Golab, Naumann (2015) pdf “Quantitative data cleaning for large databases”, Hellerstein (2008) pdf	slides
Feb 10	Lab: IBM’s AI Fairness 360 toolkit Reading: “FairPrep: Promoting Data to a First-Class Citizen in Studies on Fairness-Enhancing Interventions”, S. Schelter, Y. He, J. Khilnani, and J. Stoyanovich (2019) pdf	notebook	HW1 assigned
Feb 17	No class, university holiday
Feb 24	Lecture (part 1): Fairness and causality Topics: Counterfactual fairness Reading: “The long road to fairer algorithms”, M. Kusner, J. Loftus (2020) Nature “Counterfactual fairness”, M. Kusner, J. Loftus, C. Russell, R. Silva (2017) pdf Lecture (part 2): Data profiling Topics: Types of data profiling tasks, overview of the relational model	slides(1) slides(2)	HW1 due
Feb 24	Lab: Data profiling and data cleaning; course project discussion	notebook	project assigned
Mar 2	Lecture (part 1): Data profiling continued Topics: Discovering uniques, frequent itemset and association rule mining Lecture (part 2): Anonymity and privacy Topics: Overview of responsible data sharing. Anonymization techniques; the limits of anonymization. Harms beyond re-identification. Reading: “The Belmont Report” (1979) pdf “Critical questions for Big Data”, danah boyd and Kate Crawford (2012) pdf	slides(1) slides(2)
Mar 2	Lab: Data profiling and data cleaning	notebook
Mar 9	Lecture: Anonymity and privacy Topics: Differential privacy; privacy-preserving synthetic data generation; exploring the privacy / utility trade-off. Reading: “A firm foundation for private data analysis”, C. Dwork (2011) ACM DL “Can a set of equations keep U.S. census data private?”, J. Mervis (2019) Science	slides	project proposal due
Mar 9	Lab: Data Synthesizer Reading: “DataSynthesizer: Privacy-Preserving Synthetic Datasets”, Ping, Stoyanovich, Howe (2017) ACM DL	notebook	HW2 assigned
Mar 16	No class, university holiday
Mar 23	Lecture: Ethical frameworks Reading: “The Belmont Report” (1979) pdf “The Menlo Report” (2012) pdf “Chapter 6: Ethics. Bit by Bit: Social Research in the Digital Age”, Matthew Salganik (2017) online	slides
Mar 23	Lab: Ethical frameworks
Mar 30	Lecture: Transparency Topics: Auditing black-box models; explainable machine learning. Reading: “Why should I trust you? Explaining the predictions of any classifier”, Ribeiro, Singh, Guestrin (2016) pdf “Algorithmic transparency via quantitative input influence: theory and experiments with learning systems”, Datta, Sen, Zick (2016) pdf “A unified approach to interpreting model predictions”, Lundberg and Lee (2017) pdf	slides	HW2 due
Mar 30	Lab: LIME	notebook
Apr 6	Lecture: Transparency Topics: Discrimination in online ad delivery. Reading: “Automated Experiments on Ad Privacy Settings”, Datta, Tschantz, Datta (2015) pdf	slides	HW3 assigned; project report draft due (extended to Apr 8)
Apr 6	Lab: SHAP	notebook
Apr 13	Lecture: Transparency Topics: Discrimination in online ad delivery, continued. Reading: “Discrimination through optimization: How Facebook’s ad delivery can lead to skewed outcomes”, Ali, Sapiezynski, Bogen, Korolova, Mislove, Rieke (2019) pdf “Facebook has been charged with housing discrimination by the US government”, Russell Brandom for The Verge, Mar 28, 2019 read online	slides
Apr 13	Lab: Course project discussion: working through an example of an ADS	notebook
Apr 20	Lecture: RDS in practice: Guest lecture by Robert Cheetham, President and CEO of Azavea Topics: Project selection Reading: “How Azavea selects projects”, Robert Cheetham (2019) link “HunchLab: Under the hood” (2015) link “Why we sold HunchLab” (2019) link	slides	HW3 due
Apr 20	Lab: Final exam review	see slides on NYU Classes
Apr 27	Lecture: Interpretability Topics: What is interpretability? Reading: “The Intuitive Appeal of Explainable Machines”, A. Selbst and S. Barocas (2018) SSRN “Nutritional Labels for Data and Models”, J. Stoyanovich and B. Howe (2019) pdf “The Imperative of Interpretable Machines”, J. Stoyanovich, J. Van Bavel, T. West (2020) link	slides	Final exam assigned (take-home)
Apr 27	Lab: Course project discussion: working through an example of an ADS	notebook
May 4	Lecture: Legal and regulatory frameworks Topics: Data protection, algorithmic impact assessment, regulating Automated Decision Systems (ADS) and AI Reading: GDPR link Canadian Directive on Automated Decision-Making link NYC ADS Task Force Report pdf “Disparate Impact in Big Data Policing”, A. Selbst (2017) SSRN “Ensuring a Future that Advances Equity in Algorithmic Employment Decisions”, J. Yang (2019) pdf	slides
May 4	Lab: Slack, course project discussion
May 11	Lecture: Project presentations		project report due
May 11	Lab: Project presentations

For additional information about our work, see Data, Responsibly
