Multiple Imputation and its Application
Statistics in Practice (Volume 1)

2nd edition, August 2023
464 pages, hardcover
Practitioner book
The most up-to-date edition of a bestselling guide to analyzing partially observed data
In this comprehensively revised Second Edition of Multiple Imputation and its Application, a team of distinguished statisticians delivers an overview of the issues raised by missing data, the rationale for multiple imputation as a solution, and the practicalities of applying it in a multitude of settings.
With an accessible and carefully structured presentation aimed at quantitative researchers, Multiple Imputation and its Application is illustrated with a range of examples and offers key mathematical details. The book includes a wide range of theoretical and computer-based exercises, tested in the classroom, which are especially useful for users of R or Stata. Readers will find:
* A comprehensive overview of one of the most effective and popular methodologies for dealing with incomplete data sets
* Careful discussion of key concepts
* A range of examples illustrating the key ideas
* Practical advice on using multiple imputation
* Exercises and examples designed for use in the classroom and/or private study
Written for applied researchers looking to use multiple imputation with confidence, and for methods researchers seeking an accessible overview of the topic, Multiple Imputation and its Application will also earn a place in the libraries of graduate students undertaking quantitative analyses.
Data acknowledgements xv
Acknowledgements xvii
Glossary xix
Part I Foundations 1
1 Introduction 3
1.1 Reasons for missing data 5
1.2 Examples 6
1.3 Patterns of missing data 7
1.3.1 Consequences of missing data 9
1.4 Inferential framework and notation 10
1.4.1 Missing completely at random (MCAR) 12
1.4.2 Missing at random (MAR) 13
1.4.3 Missing not at random (MNAR) 17
1.4.4 Ignorability 21
1.5 Using observed data to inform assumptions about the missingness mechanism 21
1.6 Implications of missing data mechanisms for regression analyses 24
1.6.1 Partially observed response 24
1.6.2 Missing covariates 27
1.6.3 Missing covariates and response 30
1.6.4 Subtle issues I: the odds ratio 30
1.6.5 Implication for linear regression 32
1.6.6 Subtle issues II: sub-sample ignorability 33
1.6.7 Summary: when restricting to complete records is valid 34
1.7 Summary 34
Exercises 35
2 The Multiple Imputation Procedure and Its Justification 39
2.1 Introduction 39
2.2 Intuitive outline of the MI procedure 40
2.3 The generic MI procedure 45
2.4 Bayesian justification of MI 48
2.5 Frequentist inference 50
2.5.1 Large number of imputations 50
2.5.2 Small number of imputations 51
2.5.3 Inference for vector β 53
2.5.4 Combining likelihood ratio tests 54
2.6 Choosing the number of imputations 55
2.7 Some simple examples 56
2.7.1 Estimating the mean with σ² known by the imputer and analyst 57
2.7.2 Estimating the mean with σ² known only by the imputer 59
2.7.3 Estimating the mean with σ² unknown 59
2.7.4 General linear regression with σ² known 61
2.8 MI in more general settings 64
2.8.1 Proper imputation 64
2.8.2 Congenial imputation and substantive model 64
2.8.3 Uncongenial imputation and substantive models 65
2.8.4 Survey sample settings 71
2.9 Constructing congenial imputation models 72
2.10 Discussion 73
Exercises 73
Part II Multiple Imputation for Simple Data Structures 79
3 Multiple Imputation of Quantitative Data 81
3.1 Regression imputation with a monotone missingness pattern 81
3.1.1 MAR mechanisms consistent with a monotone pattern 83
3.1.2 Justification 84
3.2 Joint modelling 85
3.2.1 Fitting the imputation model 85
3.2.2 Adding covariates 89
3.3 Full conditional specification 90
3.3.1 Justification 91
3.4 Full conditional specification versus joint modelling 92
3.5 Software for multivariate normal imputation 93
3.6 Discussion 93
Exercises 94
4 Multiple Imputation of Binary and Ordinal Data 96
4.1 Sequential imputation with monotone missingness pattern 96
4.2 Joint modelling with the multivariate normal distribution 98
4.3 Modelling binary data using latent normal variables 100
4.3.1 Latent normal model for ordinal data 104
4.4 General location model 108
4.5 Full conditional specification 108
4.5.1 Justification 109
4.6 Issues with over-fitting 110
4.7 Pros and cons of the various approaches 114
4.8 Software 116
4.9 Discussion 116
Exercises 117
5 Imputation of Unordered Categorical Data 119
5.1 Monotone missing data 119
5.2 Multivariate normal imputation for categorical data 121
5.3 Maximum indicant model 121
5.3.1 Continuous and categorical variable 123
5.3.2 Imputing missing data 125
5.4 General location model 125
5.5 FCS with categorical data 128
5.6 Perfect prediction issues with categorical data 130
5.7 Software 130
5.8 Discussion 130
Exercises 131
Part III Multiple Imputation in Practice 133
6 Non-linear Relationships, Interactions, and Other Derived Variables 135
6.1 Introduction 135
6.1.1 Interactions 137
6.1.2 Squares 137
6.1.3 Ratios 138
6.1.4 Sum scores 139
6.1.5 Composite endpoints 140
6.2 No missing data in derived variables 141
6.3 Simple methods 143
6.3.1 Impute then transform 143
6.3.2 Transform then impute/just another variable 143
6.3.3 Adapting standard imputation models and passive imputation 145
6.3.4 Predictive mean matching 146
6.3.5 Imputation separately by groups for interactions 148
6.4 Substantive-model-compatible imputation 152
6.4.1 The basic idea 152
6.4.2 Latent-normal joint model SMC imputation 157
6.4.3 Factorised conditional model SMC imputation 160
6.4.4 Substantive model compatible fully conditional specification 161
6.4.5 Auxiliary variables 162
6.4.6 Missing outcome values 163
6.4.7 Congeniality versus compatibility 163
6.4.8 Discussion of SMC imputation 164
6.5 Returning to the problems 165
6.5.1 Ratios 165
6.5.2 Splines 165
6.5.3 Fractional polynomials 166
6.5.4 Multiple imputation with conditional questions or 'skips' 169
Exercises 172
7 Survival Data 175
7.1 Missing covariates in time-to-event data 175
7.1.1 Approximately compatible approaches 176
7.1.2 Substantive model compatible approaches 181
7.2 Imputing censored event times 186
7.3 Non-parametric, or 'hot deck' imputation 188
7.3.1 Non-parametric imputation for time-to-event data 189
7.4 Case-cohort designs 191
7.4.1 Standard analysis of case-cohort studies 192
7.4.2 Multiple imputation for case-cohort studies 193
7.4.3 Full cohort 193
7.4.4 Intermediate approaches 193
7.4.5 Sub-study approach 194
7.5 Discussion 197
Exercises 197
8 Prognostic Models, Missing Data, and Multiple Imputation 200
8.1 Introduction 200
8.2 Motivating example 201
8.3 Missing data at model implementation 201
8.4 Multiple imputation for prognostic modelling 202
8.5 Model building 202
8.5.1 Model building with missing data 202
8.5.2 Imputing predictors when model building is to be performed 204
8.6 Model performance 204
8.6.1 How should we pool MI results for estimation of performance? 205
8.6.2 Calibration 205
8.6.3 Discrimination 206
8.6.4 Model performance measures with clinical interpretability 206
8.7 Model validation 206
8.7.1 Internal model validation 207
8.7.2 External model validation 208
8.8 Incomplete data at implementation 208
8.8.1 MI for incomplete data at implementation 208
8.8.2 Alternatives to multiple imputation 210
Exercises 212
9 Multi-level Multiple Imputation 213
9.1 Multi-level imputation model 213
9.1.1 Imputation of level-1 variables 216
9.1.2 Imputation of level 2 variables 219
9.1.3 Accommodating the substantive model 223
9.2 MCMC algorithm for imputation model 224
9.2.1 Ordered and unordered categorical data 226
9.2.2 Imputing missing values 227
9.2.3 Substantive model compatible imputation 227
9.2.4 Checking model convergence 229
9.3 Extensions 231
9.3.1 Cross-classification and three-level data 231
9.3.2 Random level 1 covariance matrices 232
9.3.3 Model fit 234
9.4 Other imputation methods 234
9.4.1 One-step and two-step FCS 234
9.4.2 Substantive model compatible imputation 235
9.4.3 Non-parametric methods 236
9.4.4 Comparisons of different methods 236
9.5 Individual participant data meta-analysis 237
9.5.1 Different measurement scales 239
9.5.2 When to apply Rubin's rules 239
9.5.3 Homoscedastic versus heteroscedastic imputation model 240
9.6 Software 241
9.7 Discussion 241
Exercises 242
10 Sensitivity Analysis: MI Unleashed 245
10.1 Review of MNAR modelling 246
10.2 Framing sensitivity analysis: estimands 249
10.2.1 Definition of the estimand 249
10.2.2 Two common estimands 250
10.3 Pattern mixture modelling with MI 251
10.3.1 Missing covariates 256
10.3.2 Sensitivity with multiple variables: the NAR FCS procedure 258
10.3.3 Application to survival analysis 260
10.4 Pattern mixture approach with longitudinal data via MI 263
10.4.1 Change in slope post-deviation 264
10.5 Reference based imputation 267
10.5.1 Constructing joint distributions of pre- and post-intercurrent event data 268
10.5.2 Technical details 269
10.5.3 Software 271
10.5.4 Information anchoring 275
10.6 Approximating a selection model by importance weighting 279
10.6.1 Weighting the imputations 281
10.6.2 Stacking the imputations and applying the weights 282
10.7 Discussion 289
Exercises 290
11 Multiple Imputation for Measurement Error and Misclassification 294
11.1 Introduction 294
11.2 Multiple imputation with validation data 296
11.2.1 Measurement error 297
11.2.2 Misclassification 297
11.2.3 Imputing assuming error is non-differential 299
11.2.4 Non-linear outcome models 299
11.3 Multiple imputation with replication data 301
11.3.1 Measurement error 302
11.3.2 Misclassification 306
11.4 External information on the measurement process 307
11.5 Discussion 308
Exercises 309
12 Multiple Imputation with Weights 312
12.1 Using model-based predictions in strata 313
12.2 Bias in the MI variance estimator 314
12.3 MI with weights 317
12.3.1 Conditions for the consistency of θ̂_MI 317
12.3.2 Conditions for the consistency of V̂_MI 318
12.4 A multi-level approach 320
12.4.1 Evaluation of the multi-level multiple imputation approach for handling survey weights 322
12.4.2 Results 325
12.5 Further topics 328
12.5.1 Estimation in domains 328
12.5.2 Two-stage analysis 328
12.5.3 Missing values in the weight model 329
12.6 Discussion 329
Exercises 330
13 Multiple Imputation for Causal Inference 333
13.1 Multiple imputation for causal inference in point exposure studies 333
13.1.1 Randomised trials 335
13.1.2 Observational studies 335
13.2 Multiple imputation and propensity scores 338
13.2.1 Propensity scores for confounder adjustment 338
13.2.2 Multiple imputation of confounders 340
13.2.3 Imputation model specification 342
13.3 Principal stratification via multiple imputation 343
13.3.1 Principal strata effects 344
13.3.2 Estimation 345
13.4 Multiple imputation for IV analysis 346
13.4.1 Instrumental variable analysis for non-adherence 346
13.4.2 Instrumental variable analysis via multiple imputation 348
13.5 Discussion 350
Exercises 351
14 Using Multiple Imputation in Practice 355
14.1 A general approach 355
14.1.1 Explore the proportions and patterns of missing data 356
14.1.2 Consider plausible missing data mechanisms 356
14.1.3 Consider whether missing at random is plausible 356
14.1.4 Choose the variables for the imputation model 357
14.1.5 Choose an appropriate imputation strategy and model/s 357
14.1.6 Set and record the seed of the pseudo-random number generator 357
14.1.7 Fit the imputation model 358
14.1.8 Iterate and revise the imputation model if necessary 358
14.1.9 Estimate Monte Carlo error 358
14.1.10 Sensitivity analysis 359
14.2 Objections to multiple imputation 359
14.3 Reporting of analyses with incomplete data 363
14.4 Presenting incomplete baseline data 364
14.5 Model diagnostics 365
14.6 How many imputations? 366
14.6.1 Using the jack-knife estimate of the Monte-Carlo standard error 368
14.7 Multiple imputation for each substantive model, project, or dataset? 369
14.8 Large datasets 370
14.8.1 Large datasets and joint modelling 371
14.8.2 Shrinkage by constraining parameters 372
14.8.3 Comparison of the two approaches 375
14.9 Multiple imputation and record linkage 375
14.10 Setting random number seeds for multiple imputation analyses 377
14.11 Simulation studies including multiple imputation 377
14.11.1 Random number seeds for simulation studies including multiple imputation 377
14.11.2 Repeated simulation of all data or only the missingness mechanism? 378
14.11.3 How many imputations for simulation studies? 379
14.11.4 Multiple imputation for data simulation 380
14.12 Discussion 381
Exercises 381
Appendix A Markov Chain Monte Carlo 384
A.1 Metropolis-Hastings sampler 385
A.2 Gibbs sampler 386
A.3 Missing data 387
Appendix B Probability Distributions 388
B.1 Posterior for the multivariate normal distribution 391
Appendix C Overview of Multiple Imputation in R, Stata 394
C.1 Basic multiple imputation using R 394
C.2 Basic MI using Stata 395
References 398
Author Index 419
Index of Examples 429
Subject Index 431
JONATHAN W. BARTLETT is a Professor of Medical Statistics at the London School of Hygiene & Tropical Medicine, UK.
TIM P. MORRIS is Principal Research Fellow in Medical Statistics at the MRC Clinical Trials Unit at UCL, UK.
ANGELA M. WOOD is Professor of Health Data Science in the Department of Public Health and Primary Care, University of Cambridge, UK.
MATTEO QUARTAGNO is Senior Research Fellow in Medical Statistics at the MRC Clinical Trials Unit at UCL, UK.
MICHAEL G. KENWARD retired in 2016 after sixteen years as GlaxoSmithKline Professor of Biostatistics at the London School of Hygiene & Tropical Medicine, UK.