# Data Mining

## Concepts, Models, Methods, and Algorithms

3rd Edition, December 2019

672 Pages, Hardcover, *Handbook/Reference Book*

**978-1-119-51604-0**

Presents the latest techniques for analyzing and extracting information from large amounts of data in high-dimensional data spaces

The revised and updated third edition of Data Mining contains in one volume an introduction to a systematic approach to the analysis of large data sets that integrates results from disciplines such as statistics, artificial intelligence, databases, pattern recognition, and computer visualization. Advances in deep learning technology have opened an entirely new spectrum of applications. The author, a noted expert on the topic, explains the basic concepts, models, and methodologies that have been developed in recent years.

This new edition introduces and expands on many topics, and provides revised sections on software tools and data mining applications. Additional changes include an updated list of references for further study and an extended list of problems and questions that relate to each chapter.

This third edition presents new and expanded information that:

* Explores big data and cloud computing

* Examines deep learning

* Includes information on convolutional neural networks (CNN)

* Covers reinforcement learning

* Covers semi-supervised learning and semi-supervised support vector machines (S3VM)

* Reviews model evaluation for unbalanced data

Written for graduate students in computer science, computer engineers, and computer information systems professionals, the updated third edition of Data Mining continues to provide an essential guide to the basic principles of the technology and the most recent developments in the field.

Preface to the Second Edition xv

Preface to the First Edition xvii

1 Data-Mining Concepts 1

1.1 Introduction 2

1.2 Data-Mining Roots 4

1.3 Data-Mining Process 6

1.4 From Data Collection to Data Preprocessing 10

1.5 Data Warehouses for Data Mining 15

1.6 From Big Data to Data Science 18

1.7 Business Aspects of Data Mining: Why a Data-Mining Project Fails? 22

1.8 Organization of This Book 26

1.9 Review Questions and Problems 28

1.10 References for Further Study 30

2 Preparing the Data 33

2.1 Representation of Raw Data 34

2.2 Characteristics of Raw Data 38

2.3 Transformation of Raw Data 40

2.4 Missing Data 43

2.5 Time-Dependent Data 44

2.6 Outlier Analysis 49

2.7 Review Questions and Problems 56

2.8 References for Further Study 59

3 Data Reduction 61

3.1 Dimensions of Large Data Sets 62

3.2 Features Reduction 64

3.3 Relief Algorithm 75

3.4 Entropy Measure for Ranking Features 77

3.5 Principal Component Analysis 80

3.6 Value Reduction 83

3.7 Feature Discretization: ChiMerge Technique 86

3.8 Case Reduction 90

3.9 Review Questions and Problems 93

3.10 References for Further Study 95

4 Learning from Data 97

4.1 Learning Machine 99

4.2 Statistical Learning Theory 104

4.3 Types of Learning Methods 110

4.4 Common Learning Tasks 112

4.5 Support Vector Machines 117

4.6 Semi-Supervised Support Vector Machines (S3VM) 131

4.7 kNN: Nearest Neighbor Classifier 134

4.8 Model Selection vs. Generalization 138

4.9 Model Estimation 142

4.10 Imbalanced Data Classification 150

4.11 90% Accuracy ... Now What? 154

4.12 Review Questions and Problems 158

4.13 References for Further Study 161

5 Statistical Methods 165

5.1 Statistical Inference 166

5.2 Assessing Differences in Data Sets 168

5.3 Bayesian Inference 172

5.4 Predictive Regression 175

5.5 Analysis of Variance 181

5.6 Logistic Regression 184

5.7 Log-Linear Models 185

5.8 Linear Discriminant Analysis 189

5.9 Review Questions and Problems 191

5.10 References for Further Study 194

6 Decision Trees and Decision Rules 197

6.1 Decision Trees 199

6.2 C4.5 Algorithm: Generating a Decision Tree 201

6.3 Unknown Attribute Values 209

6.4 Pruning Decision Trees 214

6.5 C4.5 Algorithm: Generating Decision Rules 215

6.6 CART Algorithm and Gini Index 219

6.7 Limitations of Decision Trees and Decision Rules 222

6.8 Review Questions and Problems 225

6.9 References for Further Study 229

7 Artificial Neural Networks 231

7.1 Model of an Artificial Neuron 233

7.2 Architectures of Artificial Neural Networks 237

7.3 Learning Process 239

7.4 Learning Tasks Using ANNs 243

7.5 Multilayer Perceptrons 245

7.6 Competitive Networks and Competitive Learning 255

7.7 Self-Organizing Maps 259

7.8 Deep Learning 264

7.9 Convolutional Neural Networks (CNNs) 270

7.10 Review Questions and Problems 273

7.11 References for Further Study 276

8 Ensemble Learning 279

8.1 Ensemble Learning Methodologies 280

8.2 Combination Schemes for Multiple Learners 285

8.3 Bagging and Boosting 286

8.4 AdaBoost 288

8.5 Review Questions and Problems 290

8.6 References for Further Study 293

9 Cluster Analysis 295

9.1 Clustering Concepts 296

9.2 Similarity Measures 299

9.3 Agglomerative Hierarchical Clustering 306

9.4 Partitional Clustering 310

9.5 Incremental Clustering 313

9.6 DBSCAN Algorithm 317

9.7 BIRCH Algorithm 320

9.8 Clustering Validation 323

9.9 Review Questions and Problems 328

9.10 References for Further Study 333

10 Association Rules 335

10.1 Market-Basket Analysis 337

10.2 Algorithm Apriori 338

10.3 From Frequent Itemsets to Association Rules 340

10.4 Improving the Efficiency of the Apriori Algorithm 342

10.5 Frequent Pattern Growth Method 344

10.6 Associative-Classification Method 346

10.7 Multidimensional Association Rule Mining 349

10.8 Review Questions and Problems 351

10.9 References for Further Study 355

11 Web Mining and Text Mining 357

11.1 Web Mining 358

11.2 Web Content, Structure, and Usage Mining 360

11.3 HITS and LOGSOM Algorithms 362

11.4 Mining Path-Traversal Patterns 368

11.5 PageRank Algorithm 371

11.6 Recommender Systems 374

11.7 Text Mining 375

11.8 Latent Semantic Analysis 379

11.9 Review Questions and Problems 385

11.10 References for Further Study 388

12 Advances in Data Mining 391

12.1 Graph Mining 392

12.2 Temporal Data Mining 406

12.3 Spatial Data Mining 422

12.4 Distributed Data Mining 426

12.5 Correlation Does Not Imply Causality! 435

12.6 Privacy, Security, and Legal Aspects of Data Mining 442

12.7 Cloud Computing Based on Hadoop and Map/Reduce 449

12.8 Reinforcement Learning 454

12.9 Review Questions and Problems 459

12.10 References for Further Study 461

13 Genetic Algorithms 465

13.1 Fundamentals of Genetic Algorithms 466

13.2 Optimization Using Genetic Algorithms 468

13.3 A Simple Illustration of a Genetic Algorithm 474

13.4 Schemata 480

13.5 Traveling Salesman Problem 483

13.6 Machine Learning Using Genetic Algorithms 485

13.7 Genetic Algorithms for Clustering 490

13.8 Review Questions and Problems 493

13.9 References for Further Study 494

14 Fuzzy Sets and Fuzzy Logic 497

14.1 Fuzzy Sets 498

14.2 Fuzzy Set Operations 504

14.3 Extension Principle and Fuzzy Relations 509

14.4 Fuzzy Logic and Fuzzy Inference Systems 513

14.5 Multifactorial Evaluation 518

14.6 Extracting Fuzzy Models from Data 521

14.7 Data Mining and Fuzzy Sets 526

14.8 Review Questions and Problems 528

14.9 References for Further Study 530

15 Visualization Methods 533

15.1 Perception and Visualization 534

15.2 Scientific Visualization and Information Visualization 535

15.3 Parallel Coordinates 542

15.4 Radial Visualization 544

15.5 Visualization Using Self-Organizing Maps 547

15.6 Visualization Systems for Data Mining 549

15.7 Review Questions and Problems 554

15.8 References for Further Study 555

Appendix A: Information on Data Mining 559

A.1 Data-Mining Journals 559

A.2 Data-Mining Conferences 564

A.3 Data-Mining Forums/Blogs 568

A.4 Data Sets 570

A.5 Commercially and Publicly Available Tools 574

A.6 Web Site Links 583

Appendix B: Data-Mining Applications 589

B.1 Data Mining for Financial Data Analyses 589

B.2 Data Mining for the Telecommunication Industry 593

B.3 Data Mining for the Retail Industry 596

B.4 Data Mining in Healthcare and Biomedical Research 599

B.5 Data Mining in Science and Engineering 602

B.6 Pitfalls of Data Mining 605

Bibliography 607

Index 633