Applied Univariate, Bivariate, and Multivariate Statistics Using Python

A Beginner's Guide to Advanced Data Analysis

Denis, Daniel J.

1st Edition, August 2021
304 Pages, Hardcover
Wiley & Sons Ltd

ISBN: 978-1-119-57814-7

John Wiley & Sons

A practical, "how-to" reference for anyone performing essential statistical analyses and data management tasks in Python

Applied Univariate, Bivariate, and Multivariate Statistics Using Python delivers a comprehensive introduction to a wide range of statistical methods performed using Python in a single, one-stop reference. The book contains user-friendly guidance and instructions on using Python to run a variety of statistical procedures without getting bogged down in unnecessary theory. Throughout, the author emphasizes a set of computational tools used in the discovery of empirical patterns, as well as several popular statistical analyses and data management tasks that can be immediately applied.

Most of the datasets used in the book are small enough to be easily entered into Python manually, though they can also be downloaded for free from www.datapsyc.com. Only minimal knowledge of statistics is assumed, making the book perfect for those seeking an easily accessible toolkit for statistical analysis with Python. Applied Univariate, Bivariate, and Multivariate Statistics Using Python represents the fastest way to learn how to analyze data with Python.
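
As a rough illustration of what entering a small dataset by hand can look like, the sketch below types five made-up scores into a Python list and computes their mean, sample standard deviation, and z-scores using only the standard library. The values and variable names are hypothetical and are not taken from the book, which may rely on other packages.

```python
import statistics

# Hypothetical scores typed in by hand (illustrative values, not from the book)
scores = [4, 6, 8, 10, 12]

mean = statistics.mean(scores)   # arithmetic mean
sd = statistics.stdev(scores)    # sample standard deviation (n - 1 denominator)

# z-score: distance of each value from the mean, in standard-deviation units
z_scores = [(x - mean) / sd for x in scores]

print("mean:", mean)
print("sd:", round(sd, 3))
print("z-scores:", [round(z, 3) for z in z_scores])
```

A plain list is enough for a dataset of this size; a dataframe library would be the more usual choice once several variables are involved.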

Readers will also benefit from the inclusion of:
* A review of essential statistical principles, including types of data, measurement, significance tests, significance levels, and type I and type II errors
* An introduction to Python, exploring how to communicate with Python
* A treatment of exploratory data analysis, basic statistics and visual displays, including frequencies and descriptives, q-q plots, box-and-whisker plots, and data management
* An introduction to topics including ANOVA, MANOVA and discriminant analysis, regression, principal components analysis, factor analysis, and cluster analysis, exploring what these techniques can and cannot do on a methodological level

Perfect for undergraduate and graduate students in the social, behavioral, and natural sciences, Applied Univariate, Bivariate, and Multivariate Statistics Using Python will also earn a place in the libraries of researchers and data analysts seeking a quick go-to resource for univariate, bivariate, and multivariate analysis in Python.

Preface

1 A Brief Introduction and Overview of Applied Statistics

1.1 How Statistical Inference Works

1.2 Statistics and Decision-Making

1.3 Quantifying Error Rates in Decision-Making: Type I and Type II Errors

1.4 Estimation of Parameters

1.5 Essential Philosophical Principles for Applied Statistics

1.6 Continuous vs. Discrete Variables

1.6.1 Continuity Is Not Always Clear-Cut

1.7 Using Abstract Systems to Describe Physical Phenomena: Understanding Numerical vs. Physical Differences

1.8 Data Analysis, Data Science, Machine Learning, Big Data

1.9 "Training" and "Testing" Models: What "Statistical Learning" Means in the Age of Machine Learning and Data Science

1.10 Where We Are Going From Here: How to Use This Book

Review Exercises

2 Introduction to Python and the Field of Computational Statistics

2.1 The Importance of Specializing in Statistics and Research, Not Python: Advice for Prioritizing Your Hierarchy

2.2 How to Obtain Python

2.3 Python Packages

2.4 Installing a New Package in Python

2.5 Computing z-Scores in Python

2.6 Building a Dataframe in Python: And Computing Some Statistical Functions

2.7 Importing a .txt or .csv File

2.8 Loading Data into Python

2.9 Creating Random Data in Python

2.10 Exploring Mathematics in Python

2.11 Linear and Matrix Algebra in Python: Mechanics of Statistical Analyses

2.11.1 Operations on Matrices

2.11.2 Eigenvalues and Eigenvectors

Review Exercises

3 Visualization in Python: Introduction to Graphs and Plots

3.1 Aim for Simplicity and Clarity in Tables and Graphs: Complexity is for Fools!

3.2 State Population Change Data

3.3 What Do the Numbers Tell Us? Clues to Substantive Theory

3.4 The Scatterplot

3.5 Correlograms

3.6 Histograms and Bar Graphs

3.7 Plotting Side-by-Side Histograms

3.8 Bubble Plots

3.9 Pie Plots

3.10 Heatmaps

3.11 Line Charts

3.12 Closing Thoughts

Review Exercises

4 Simple Statistical Techniques for Univariate and Bivariate Analyses

4.1 Pearson Product-Moment Correlation

4.2 A Pearson Correlation Does Not (Necessarily) Imply Zero Relationship

4.3 Spearman's Rho

4.4 More General Comments on Correlation: Don't Let a Correlation Impress You Too Much!

4.5 Computing Correlation in Python

4.6 T-Tests for Comparing Means

4.7 Paired-Samples t-Test in Python

4.8 Binomial Test

4.9 The Chi-Squared Distribution and Goodness-of-Fit Test

4.10 Contingency Tables

Review Exercises

5 Power, Effect Size, P-Values, and Estimating Required Sample Size Using Python

5.1 What Determines the Size of a P-Value?

5.2 How P-Values Are a Function of Sample Size

5.3 What is Effect Size?

5.4 Understanding Population Variability in the Context of Experimental Design

5.5 Where Does Power Fit into All of This?

5.6 Can You Have Too Much Power? Can a Sample Be Too Large?

5.7 Demonstrating Power Principles in Python: Estimating Power or Sample Size

5.8 Demonstrating the Influence of Effect Size

5.9 The Influence of Significance Levels on Statistical Power

5.10 What About Power and Hypothesis Testing in the Age of "Big Data"?

5.11 Concluding Comments on Power, Effect Size, and Significance Testing

Review Exercises

6 Analysis of Variance

6.1 T-Tests for Means as a "Special Case" of ANOVA

6.2 Why Not Do Several t-Tests?

6.3 Understanding ANOVA Through an Example

6.4 Evaluating Assumptions in ANOVA

6.5 ANOVA in Python

6.6 Effect Size for Teacher

6.7 Post-Hoc Tests Following the ANOVA F-Test

6.8 A Myriad of Post-Hoc Tests

6.9 Factorial ANOVA

6.10 Statistical Interactions

6.11 Interactions in the Sample Are a Virtual Guarantee: Interactions in the Population Are Not

6.12 Modeling the Interaction Term

6.13 Plotting Residuals

6.14 Randomized Block Designs and Repeated Measures

6.15 Nonparametric Alternatives

6.15.1 Revisiting What "Satisfying Assumptions" Means: A Brief Discussion and Suggestion of How to Approach the Decision Regarding Nonparametrics

6.15.2 Your Experience in the Area Counts

6.15.3 What If Assumptions Are Truly Violated?

6.15.4 Mann-Whitney U Test

6.15.5 Kruskal-Wallis Test as a Nonparametric Alternative to ANOVA

Review Exercises

7 Simple and Multiple Linear Regression

7.1 Why Use Regression?

7.2 The Least-Squares Principle

7.3 Regression as a "New" Least-Squares Line

7.4 The Population Least-Squares Regression Line

7.5 How to Estimate Parameters in Regression

7.6 How to Assess Goodness of Fit?

7.7 R² - Coefficient of Determination

7.8 Adjusted R²

7.9 Regression in Python

7.10 Multiple Linear Regression

7.11 Defining the Multiple Regression Model

7.12 Model Specification Error

7.13 Multiple Regression in Python

7.14 Model-Building Strategies: Forward, Backward, Stepwise

7.15 Computer-Intensive "Algorithmic" Approaches

7.16 Which Approach Should You Adopt?

7.17 Concluding Remarks and Further Directions: Polynomial Regression

Review Exercises

8 Logistic Regression and the Generalized Linear Model

8.1 How Are Variables Best Measured? Are There Ideal Scales on Which a Construct Should Be Targeted?

8.2 The Generalized Linear Model

8.3 Logistic Regression for Binary Responses: A Special Subclass of the Generalized Linear Model

8.4 Logistic Regression in Python

8.5 Multiple Logistic Regression

8.5.1 A Model with Only Lag1

8.6 Further Directions

Review Exercises

9 Multivariate Analysis of Variance (MANOVA) and Discriminant Analysis

9.1 Why Technically Most Univariate Models are Actually Multivariate

9.2 Should I Be Running a Multivariate Model?

9.3 The Discriminant Function

9.4 Multivariate Tests of Significance: Why They Are Different from the F-Ratio

9.4.1 Wilks' Lambda

9.4.2 Pillai's Trace

9.4.3 Roy's Largest Root

9.4.4 Lawley-Hotelling's Trace

9.5 Which Multivariate Test to Use?

9.6 Performing MANOVA in Python

9.7 Effect Size for MANOVA

9.8 Linear Discriminant Function Analysis

9.9 How Many Discriminant Functions Does One Require?

9.10 Discriminant Analysis in Python: Binary Response

9.11 Another Example of Discriminant Analysis: Polytomous Classification

9.12 Bird's Eye View of MANOVA, ANOVA, Discriminant Analysis, and Regression: A Partial Conceptual Unification

9.13 Models "Subsumed" Under the Canonical Correlation Framework

Review Exercises

10 Principal Components Analysis

10.1 What Is Principal Components Analysis?

10.2 Principal Components as Eigen Decomposition

10.3 PCA on Correlation Matrix

10.4 Why Icebergs Are Not Good Analogies for PCA

10.5 PCA in Python

10.6 Loadings in PCA: Making Substantive Sense Out of an Abstract Mathematical Entity

10.7 Naming Components Using Loadings: A Few Issues

10.8 Principal Components Analysis on USA Arrests Data

10.9 Plotting the Components

Review Exercises

11 Exploratory Factor Analysis

11.1 The Common Factor Analysis Model

11.2 Factor Analysis as a Reproduction of the Covariance Matrix

11.3 Observed vs. Latent Variables: Philosophical Considerations

11.4 So, Why is Factor Analysis Controversial? The Philosophical Pitfalls of Factor Analysis

11.5 Exploratory Factor Analysis in Python

11.6 Exploratory Factor Analysis on USA Arrests Data

Review Exercises

12 Cluster Analysis

12.1 Cluster Analysis vs. ANOVA vs. Discriminant Analysis

12.2 How Cluster Analysis Defines "Proximity"

12.2.1 Euclidean Distance

12.3 K-Means Clustering Algorithm

12.4 To Standardize or Not?

12.5 Cluster Analysis in Python

12.6 Hierarchical Clustering

12.7 Hierarchical Clustering in Python

Review Exercises

References

Index

Daniel J. Denis, PhD, is Professor of Quantitative Psychology at the University of Montana. He is the author of Applied Univariate, Bivariate, and Multivariate Statistics and Applied Univariate, Bivariate, and Multivariate Statistics Using R.