# Data Mining for Business Analytics

## Concepts, Techniques, and Applications in R

Data Mining for Business Analytics: Concepts, Techniques, and Applications in R presents an applied approach to data mining concepts and methods, using R software for illustration.

Readers will learn how to implement a variety of popular data mining algorithms in R (free, open-source software) to tackle business problems and opportunities.

This is the fifth version of this successful text, and the first using R. It covers both statistical and machine learning algorithms for prediction, classification, visualization, dimension reduction, recommender systems, clustering, text mining and network analysis. It also includes:

* Two new co-authors, Inbal Yahav and Casey Lichtendahl, who bring both expertise teaching business analytics courses using R, and data mining consulting experience in business and government

* Updates and new material based on feedback from instructors teaching MBA, undergraduate, diploma and executive courses, and from their students

* More than a dozen case studies demonstrating applications for the data mining techniques described

* End-of-chapter exercises that help readers gauge and expand their comprehension of, and competency with, the material presented

* A companion website (www.dataminingbook.com) with more than two dozen data sets, and instructor materials including exercise solutions, PowerPoint slides, and case solutions

Data Mining for Business Analytics: Concepts, Techniques, and Applications in R is an ideal textbook for graduate and upper-undergraduate level courses in data mining, predictive analytics, and business analytics. This new edition is also an excellent reference for analysts, researchers, and practitioners working with quantitative methods in the fields of business, finance, marketing, computer science, and information technology.

"This book has by far the most comprehensive review of business analytics methods that I have ever seen, covering everything from classical approaches such as linear and logistic regression, through to modern methods like neural networks, bagging and boosting, and even much more business-specific procedures such as social network analysis and text mining. If not the bible, it is at the least a definitive manual on the subject."

Gareth M. James, University of Southern California and co-author (with Witten, Hastie and Tibshirani) of the best-selling book An Introduction to Statistical Learning, with Applications in R

Galit Shmueli, PhD, is Distinguished Professor at National Tsing Hua University's Institute of Service Science. She has designed and instructed data mining courses since 2004 at University of Maryland, Statistics.com, Indian School of Business, and National Tsing Hua University, Taiwan. Professor Shmueli is known for her research and teaching in business analytics, with a focus on statistical and data mining methods in information systems and healthcare. She has authored over 70 publications including books.

Peter C. Bruce is President and Founder of the Institute for Statistics Education at Statistics.com. He has written multiple journal articles and is the developer of Resampling Stats software. He is the author of Introductory Statistics and Analytics: A Resampling Perspective (Wiley) and co-author of Practical Statistics for Data Scientists: 50 Essential Concepts (O'Reilly).

Inbal Yahav, PhD, is Professor at the Graduate School of Business Administration at Bar-Ilan University, Israel. She teaches courses in social network analysis, advanced research methods, and software quality assurance. Dr. Yahav received her PhD in Operations Research and Data Mining from the University of Maryland, College Park.

Nitin R. Patel, PhD, is Chairman and cofounder of Cytel, Inc., based in Cambridge, Massachusetts. A Fellow of the American Statistical Association, Dr. Patel has also served as a Visiting Professor at the Massachusetts Institute of Technology and at Harvard University. He is a Fellow of the Computer Society of India and was a professor at the Indian Institute of Management, Ahmedabad, for 15 years.

Kenneth C. Lichtendahl, Jr., PhD, is Associate Professor at the University of Virginia. He is the Eleanor F. and Phillip G. Rust Professor of Business Administration and teaches MBA courses in decision analysis, data analysis and optimization, and managerial quantitative analysis. He also teaches executive education courses in strategic analysis and decision-making, and managing the corporate aviation function.

Foreword by Gareth James

Foreword by Ravi Bapna

Preface to the R Edition

Acknowledgments

PART I PRELIMINARIES

CHAPTER 1 Introduction

1.1 What Is Business Analytics?

1.2 What Is Data Mining?

1.3 Data Mining and Related Terms

1.4 Big Data

1.5 Data Science

1.6 Why Are There So Many Different Methods?

1.7 Terminology and Notation

1.8 Road Maps to This Book

Order of Topics

CHAPTER 2 Overview of the Data Mining Process

2.1 Introduction

2.2 Core Ideas in Data Mining

Classification

Prediction

Association Rules and Recommendation Systems

Predictive Analytics

Data Reduction and Dimension Reduction

Data Exploration and Visualization

Supervised and Unsupervised Learning

2.3 The Steps in Data Mining

2.4 Preliminary Steps

Organization of Datasets

Predicting Home Values in the West Roxbury Neighborhood

Loading and Looking at the Data in R

Sampling from a Database

Oversampling Rare Events in Classification Tasks

Preprocessing and Cleaning the Data

2.5 Predictive Power and Overfitting

Overfitting

Creation and Use of Data Partitions

2.6 Building a Predictive Model

Modeling Process

2.7 Using R for Data Mining on a Local Machine

2.8 Automating Data Mining Solutions

Data Mining Software: The State of the Market (by Herb Edelstein)

Problems

PART II DATA EXPLORATION AND DIMENSION REDUCTION

CHAPTER 3 Data Visualization

3.1 Uses of Data Visualization

Base R or ggplot?

3.2 Data Examples

Example 1: Boston Housing Data

Example 2: Ridership on Amtrak Trains

3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots

Distribution Plots: Boxplots and Histograms

Heatmaps: Visualizing Correlations and Missing Values

3.4 Multidimensional Visualization

Adding Variables: Color, Size, Shape, Multiple Panels, and Animation

Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering

Reference: Trend Lines and Labels

Scaling up to Large Datasets

Multivariate Plot: Parallel Coordinates Plot

Interactive Visualization

3.5 Specialized Visualizations

Visualizing Networked Data

Visualizing Hierarchical Data: Treemaps

Visualizing Geographical Data: Map Charts

3.6 Summary: Major Visualizations and Operations, by Data Mining Goal

Prediction

Classification

Time Series Forecasting

Unsupervised Learning

Problems

CHAPTER 4 Dimension Reduction

4.1 Introduction

4.2 Curse of Dimensionality

4.3 Practical Considerations

Example 1: House Prices in Boston

4.4 Data Summaries

Summary Statistics

Aggregation and Pivot Tables

4.5 Correlation Analysis

4.6 Reducing the Number of Categories in Categorical Variables

4.7 Converting a Categorical Variable to a Numerical Variable

4.8 Principal Components Analysis

Example 2: Breakfast Cereals

Principal Components

Normalizing the Data

Using Principal Components for Classification and Prediction

4.9 Dimension Reduction Using Regression Models

4.10 Dimension Reduction Using Classification and Regression Trees

Problems

PART III PERFORMANCE EVALUATION

CHAPTER 5 Evaluating Predictive Performance

5.1 Introduction

5.2 Evaluating Predictive Performance

Naive Benchmark: The Average

Prediction Accuracy Measures

Comparing Training and Validation Performance

Lift Chart

5.3 Judging Classifier Performance

Benchmark: The Naive Rule

Class Separation

The Confusion (Classification) Matrix

Using the Validation Data

Accuracy Measures

Propensities and Cutoff for Classification

Performance in Case of Unequal Importance of Classes

Asymmetric Misclassification Costs

Generalization to More Than Two Classes

5.4 Judging Ranking Performance

Lift Charts for Binary Data

Decile Lift Charts

Beyond Two Classes

Lift Charts Incorporating Costs and Benefits

Lift as a Function of Cutoff

5.5 Oversampling

Oversampling the Training Set

Evaluating Model Performance Using a Non-oversampled Validation Set

Evaluating Model Performance if Only Oversampled Validation Set Exists

Problems

PART IV PREDICTION AND CLASSIFICATION METHODS

CHAPTER 6 Multiple Linear Regression

6.1 Introduction

6.2 Explanatory vs. Predictive Modeling

6.3 Estimating the Regression Equation and Prediction

Example: Predicting the Price of Used Toyota Corolla Cars

6.4 Variable Selection in Linear Regression

Reducing the Number of Predictors

How to Reduce the Number of Predictors

Problems

CHAPTER 7 k-Nearest Neighbors (kNN)

7.1 The k-NN Classifier (Categorical Outcome)

Determining Neighbors

Classification Rule

Example: Riding Mowers

Choosing k

Setting the Cutoff Value

k-NN with More Than Two Classes

Converting Categorical Variables to Binary Dummies

7.2 k-NN for a Numerical Outcome

7.3 Advantages and Shortcomings of k-NN Algorithms

Problems

CHAPTER 8 The Naive Bayes Classifier

8.1 Introduction

Cutoff Probability Method

Conditional Probability

Example 1: Predicting Fraudulent Financial Reporting

8.2 Applying the Full (Exact) Bayesian Classifier

Using the "Assign to the Most Probable Class" Method

Using the Cutoff Probability Method

Practical Difficulty with the Complete (Exact) Bayes Procedure

Solution: Naive Bayes

The Naive Bayes Assumption of Conditional Independence

Using the Cutoff Probability Method

Example 2: Predicting Fraudulent Financial Reports, Two Predictors

Example 3: Predicting Delayed Flights

8.3 Advantages and Shortcomings of the Naive Bayes Classifier

Problems

CHAPTER 9 Classification and Regression Trees

9.1 Introduction

9.2 Classification Trees

Recursive Partitioning

Example 1: Riding Mowers

Measures of Impurity

Tree Structure

Classifying a New Record

9.3 Evaluating the Performance of a Classification Tree

Example 2: Acceptance of Personal Loan

9.4 Avoiding Overfitting

Stopping Tree Growth: Conditional Inference Trees

Pruning the Tree

Cross-Validation

Best-Pruned Tree

9.5 Classification Rules from Trees

9.6 Classification Trees for More Than Two Classes

9.7 Regression Trees

Prediction

Measuring Impurity

Evaluating Performance

9.8 Improving Prediction: Random Forests and Boosted Trees

Random Forests

Boosted Trees

9.9 Advantages and Weaknesses of a Tree

Problems

CHAPTER 10 Logistic Regression

10.1 Introduction

10.2 The Logistic Regression Model

10.3 Example: Acceptance of Personal Loan

Model with a Single Predictor

Estimating the Logistic Model from Data: Computing Parameter Estimates

Interpreting Results in Terms of Odds (for a Profiling Goal)

10.4 Evaluating Classification Performance

Variable Selection

10.5 Example of Complete Analysis: Predicting Delayed Flights

Data Preprocessing

Model-Fitting and Estimation

Model Interpretation

Model Performance

Variable Selection

10.6 Appendix: Logistic Regression for Profiling

Appendix A: Why Linear Regression Is Problematic for a Categorical Outcome

Appendix B: Evaluating Explanatory Power

Appendix C: Logistic Regression for More Than Two Classes

Problems

CHAPTER 11 Neural Nets

11.1 Introduction

11.2 Concept and Structure of a Neural Network

11.3 Fitting a Network to Data

Example 1: Tiny Dataset

Computing Output of Nodes

Preprocessing the Data

Training the Model

Example 2: Classifying Accident Severity

Avoiding Overfitting

Using the Output for Prediction and Classification

11.4 Required User Input

11.5 Exploring the Relationship Between Predictors and Outcome

11.6 Advantages and Weaknesses of Neural Networks

Problems

CHAPTER 12 Discriminant Analysis

12.1 Introduction

Example 1: Riding Mowers

Example 2: Personal Loan Acceptance

12.2 Distance of a Record from a Class

12.3 Fisher's Linear Classification Functions

12.4 Classification Performance of Discriminant Analysis

12.5 Prior Probabilities

12.6 Unequal Misclassification Costs

12.7 Classifying More Than Two Classes

Example 3: Medical Dispatch to Accident Scenes

12.8 Advantages and Weaknesses

Problems

CHAPTER 13 Combining Methods: Ensembles and Uplift Modeling

13.1 Ensembles

Why Ensembles Can Improve Predictive Power

Simple Averaging

Bagging

Boosting

Bagging and Boosting in R

Advantages and Weaknesses of Ensembles

13.2 Uplift (Persuasion) Modeling

A-B Testing

Uplift

Gathering the Data

A Simple Model

Modeling Individual Uplift

Computing Uplift with R

Using the Results of an Uplift Model

13.3 Summary

Problems

PART V MINING RELATIONSHIPS AMONG RECORDS

CHAPTER 14 Association Rules and Collaborative Filtering

14.1 Association Rules

Discovering Association Rules in Transaction Databases

Example 1: Synthetic Data on Purchases of Phone Faceplates

Generating Candidate Rules

The Apriori Algorithm

Selecting Strong Rules

Data Format

The Process of Rule Selection

Interpreting the Results

Rules and Chance

Example 2: Rules for Similar Book Purchases

14.2 Collaborative Filtering

Data Type and Format

Example 3: Netflix Prize Contest

User-Based Collaborative Filtering: "People Like You"

Item-Based Collaborative Filtering

Advantages and Weaknesses of Collaborative Filtering

Collaborative Filtering vs. Association Rules

14.3 Summary

Problems

CHAPTER 15 Cluster Analysis

15.1 Introduction

Example: Public Utilities

15.2 Measuring Distance Between Two Records

Euclidean Distance

Normalizing Numerical Measurements

Other Distance Measures for Numerical Data

Distance Measures for Categorical Data

Distance Measures for Mixed Data

15.3 Measuring Distance Between Two Clusters

Minimum Distance

Maximum Distance

Average Distance

Centroid Distance

15.4 Hierarchical (Agglomerative) Clustering

Single Linkage

Complete Linkage

Average Linkage

Centroid Linkage

Ward's Method

Dendrograms: Displaying Clustering Process and Results

Validating Clusters

Limitations of Hierarchical Clustering

15.5 Non-Hierarchical Clustering: The k-Means Algorithm

Choosing the Number of Clusters (k)

Problems

PART VI FORECASTING TIME SERIES

CHAPTER 16 Handling Time Series

16.1 Introduction

16.2 Descriptive vs. Predictive Modeling

16.3 Popular Forecasting Methods in Business

Combining Methods

16.4 Time Series Components

Example: Ridership on Amtrak Trains

16.5 Data-Partitioning and Performance Evaluation

Benchmark Performance: Naive Forecasts

Generating Future Forecasts

Problems

CHAPTER 17 Regression-Based Forecasting

17.1 A Model with Trend

Linear Trend

Exponential Trend

Polynomial Trend

17.2 A Model with Seasonality

17.3 A Model with Trend and Seasonality

17.4 Autocorrelation and ARIMA Models

Computing Autocorrelation

Improving Forecasts by Integrating Autocorrelation Information

Evaluating Predictability

Problems

CHAPTER 18 Smoothing Methods

18.1 Introduction

18.2 Moving Average

Centered Moving Average for Visualization

Trailing Moving Average for Forecasting

Choosing Window Width (w)

18.3 Simple Exponential Smoothing

Choosing Smoothing Parameter

Relation Between Moving Average and Simple Exponential Smoothing

18.4 Advanced Exponential Smoothing

Series with a Trend

Series with a Trend and Seasonality

Series with Seasonality (No Trend)

Problems

PART VII DATA ANALYTICS

CHAPTER 19 Social Network Analytics

19.1 Introduction

19.2 Directed vs. Undirected Networks

19.3 Visualizing and Analyzing Networks

Graph Layout

Edge List

Adjacency Matrix

Using Network Data in Classification and Prediction

19.4 Social Data Metrics and Taxonomy

Node-Level Centrality Metrics

Egocentric Network

Network Metrics

19.5 Using Network Metrics in Prediction and Classification

Link Prediction

Entity Resolution

Collaborative Filtering

19.6 Collecting Social Network Data with R

19.7 Advantages and Disadvantages

Problems

CHAPTER 20 Text Mining

20.1 Introduction

20.2 The Tabular Representation of Text: Term-Document Matrix and "Bag-of-Words"

20.3 Bag-of-Words vs. Meaning Extraction at Document Level

20.4 Preprocessing the Text

Tokenization

Text Reduction

Presence/Absence vs. Frequency

Term Frequency-Inverse Document Frequency (TF-IDF)

From Terms to Concepts: Latent Semantic Indexing

Extracting Meaning

20.5 Implementing Data Mining Methods

20.6 Example: Online Discussions on Autos and Electronics

Importing and Labeling the Records

Text Preprocessing in R

Producing a Concept Matrix

Fitting a Predictive Model

Prediction

20.7 Summary

Problems

PART VIII CASES

CHAPTER 21 Cases

21.1 Charles Book Club

The Book Industry

Database Marketing at Charles

Data Mining Techniques

Assignment

21.2 German Credit

Background

Data

Assignment

21.3 Tayko Software Cataloger

Background

The Mailing Experiment

Data

Assignment

21.4 Political Persuasion

Background

Predictive Analytics Arrives in US Politics

Political Targeting

Uplift

Data

Assignment

21.5 Taxi Cancellations

Business Situation

Assignment

21.6 Segmenting Consumers of Bath Soap

Business Situation

Key Problems

Data

Measuring Brand Loyalty

Assignment

21.7 Direct-Mail Fundraising

Background

Data

Assignment

21.8 Catalog Cross-Selling

Background

Assignment

21.9 Predicting Bankruptcy

Predicting Corporate Bankruptcy

Assignment

21.10 Time Series Case: Forecasting Public Transportation Demand

Background

Problem Description

Available Data

Assignment Goal

Assignment

Tips and Suggested Steps

References

Data Files Used in the Book

Index
