This textbook for graduate students in statistics, data science, and public health deals with the practical challenges that come with big, complex, and dynamic data. It presents a scientific roadmap to translate real-world data science applications into formal statistical estimation problems by using the general template of targeted maximum likelihood estimators. These targeted machine learning algorithms estimate quantities of interest while still providing valid inference. Targeted learning methods within data science are a critical component for solving scientific problems in the modern age. The techniques can answer complex questions including optimal rules for assigning treatment based on longitudinal data with time-dependent confounding, as well as other estimands in dependent data structures, such as networks. Included in Targeted Learning in Data Science are demonstrations with software packages and real data sets that present a case that targeted learning is crucial for the next generation of statisticians and data scientists. This book is a sequel to the first textbook on machine learning for causal inference, Targeted Learning, published in 2011.

Authors

Mark van der Laan, PhD, is Jiann-Ping Hsu/Karl E. Peace Professor of Biostatistics and Statistics at UC Berkeley. His research interests include statistical methods in genomics, survival analysis, censored data, machine learning, semiparametric models, causal inference, and targeted learning. His applied research involves applications in HIV and safety analysis, among others. He has published over 250 journal articles, 4 books, and one handbook on big data. Dr. van der Laan is also co-founder and co-editor of the International Journal of Biostatistics and the Journal of Causal Inference and associate editor of a variety of journals. Dr. van der Laan received the 2004 Mortimer Spiegelman Award, the 2005 Van Dantzig Award, the 2005 COPSS Snedecor Award, the 2005 COPSS Presidential Award, and has graduated over 40 PhD students in biostatistics or statistics.

Sherri Rose, PhD, is Associate Professor of Health Care Policy (Biostatistics) at Harvard Medical School. Her work is centered on developing and integrating innovative statistical approaches to advance human health. Dr. Rose's methodological research focuses on nonparametric machine learning for causal inference and prediction. She has made major contributions to the development and application of targeted learning estimators, as well as adaptations to super learning for varied scientific problems. Within health policy, Dr. Rose works on comparative effectiveness research, health program impact evaluation, and computational health economics. She co-leads the Health Policy Data Science Lab and currently serves as an associate editor for the Journal of the American Statistical Association and Biostatistics.

Contents

Part I: Introductory Chapters

The Statistical Estimation Problem in Complex Longitudinal Data

Data Science and Statistical Estimation
Roadmap for Causal Effect Estimation
Role of Targeted Learning in Data Science
Observed Data
Causal Model and Causal Target Quantity
Statistical Model
Statistical Target Parameter
Statistical Estimation Problem

Longitudinal Causal Models

Structural Causal Models
Causal Graphs / DAGs
Nonparametric Structural Equation Models

Super Learner for Longitudinal Problems

Ensemble Learning
Sequential Regression

Longitudinal Targeted Maximum Likelihood Estimation (LTMLE)

Step-by-Step Demonstration of LTMLE
Scalable Inference for Big Data

Understanding LTMLE

Statistical Properties
Theoretical Background

Why LTMLE?

Landscape of Other Estimators
Comparison of Statistical Properties

Part II: Additional Core Topics

One-Step TMLE

General Framework
Theoretical Results

One-Step TMLE for the Effect Among the Treated

Demonstration for Effect Among the Treated
Simulation Studies

Online Targeted Learning

Batched Streaming Data
Online and One-Step Estimator
Theoretical Considerations

Networks

General Statistical Framework
Causal Model for Network Data
Counterfactual Mean Under Stochastic Intervention on the Network
Development of TMLE for Networks
Inference

Application to Networks

Differing Network Structures
Realistic Network Examples (e.g., effect of vaccination)
R Package Implementation of TMLE

Targeted Estimation of the Nuisance Parameter

Asymptotic Linearity
IPW
TMLE

Sensitivity Analyses

General Nonparametric Approach to Sensitivity Analysis
Measurement Error
Unmeasured Confounding
Informative Missingness of the Outcome
FDA Meta-Analysis

Part III: Randomized Trials

Community Randomized Trials for Small Samples

Introduction of SEARCH Community Randomized Trial
Adaptive Pair Matching
Data-Adaptive Selection of Covariates for Small Samples
TMLE Using Super Learning for Small Samples
Inference