
Big Data

Concepts, Technology, and Architecture

Balusamy, Balamurugan / Abirami R, Nandhini / Kadry, Seifedine / Gandomi, Amir H.


1st edition, June 2021
368 pages, hardcover
Wiley & Sons Ltd

ISBN: 978-1-119-70182-8
John Wiley & Sons


Price: €125.00

Price incl. VAT, plus shipping

Other versions

epub, mobi, pdf

Learn Big Data from the ground up with this complete and up-to-date resource from leaders in the field

Big Data: Concepts, Technology, and Architecture delivers a comprehensive treatment of Big Data tools, terminology, and technology, suited to a wide range of business professionals, academic researchers, and students. Beginning with a thorough overview of what we mean when we say "Big Data," the book moves on to discuss every stage of the Big Data lifecycle.

You'll learn about the creation of structured, unstructured, and semi-structured data, data storage solutions, traditional database solutions like SQL, data processing, data analytics, machine learning, and data mining. You'll also discover how specific technologies like Apache Hadoop, SQOOP, and Flume work.

Big Data also covers the central topic of big data visualization with Tableau, and you'll learn how to create scatter plots, histograms, bar, line, and pie charts with that software.

Accessibly organized, Big Data includes illuminating case studies throughout the material, showing you how the included concepts have been applied in real-world settings. Some of those concepts include:
* The common challenges facing big data technology and technologists, like data heterogeneity and incompleteness, data volume and velocity, storage limitations, and privacy concerns
* Relational and non-relational databases, like RDBMS, NoSQL, and NewSQL databases
* Virtualizing Big Data through encapsulation, partitioning, and isolation, as well as big data server virtualization
* Apache software, including Hadoop, Cassandra, Avro, Pig, Mahout, Oozie, and Hive
* The Big Data analytics lifecycle, including business case evaluation, data preparation, extraction, transformation, analysis, and visualization

Perfect for data scientists, data engineers, and database managers, Big Data also belongs on the bookshelves of business intelligence analysts who are required to make decisions based on large volumes of information. Executives and managers who lead teams responsible for keeping or understanding large datasets will also benefit from this book.

Big Data - Concepts, Technology, and Architecture

Book Description

1.1 Understanding Big Data

1.2 Evolution of Big Data

1.3 Failure of Traditional database in handling Big Data

1.3 (a) Data Mining Vs Big Data

1.4 3 V's of Big Data

1.4.1 Volume

1.4.2 Velocity

1.4.3 Variety

1.5 Sources of Big Data

1.6 Different Types of Data

1.6.1 Structured Data

1.6.2 Unstructured Data

1.6.3 Semi-Structured Data

1.7 Big Data Infrastructure

1.8 Big Data Life Cycle

1.8.1 Big Data Generation

1.8.2 Data Aggregation

1.8.3 Data Preprocessing

1.7.3 Big Data Analytics

1.7.4 Visualizing Big Data

1.8 Big Data Technology

1.8.1 Challenges faced by Big Data technology

1.8.1 Heterogeneity and incompleteness

1.8.2 Volume and velocity of the Data

1.8.3 Data Storage

1.8.4 Data Privacy

1.9 Big Data Applications

1.10 Big Data Use Cases

1.9.1 Healthcare

1.9.2 Telecom

1.9.3 Financial Services

Chapter 1 refresher

Conceptual short Questions with answers

Frequently asked Interview questions

Chapter Objective

Big Data Storage Concepts

2.1 Cluster computing

2.1.1 Types of cluster

2.1.1.1 High availability cluster

2.1.1.2 Load balancing cluster

2.1.2 Cluster structure

2.3 Distribution Models

2.3.1 Sharding

2.3.2 Data Replication

2.3.2.1 Master-Slave model

2.3.2.2 Peer-to-Peer model

2.3.3 Sharding and Replication

2.4 Distributed file system

2.5 Relational and Non Relational Databases

CoursesOffered

Figure 2.12 Data divided across multiple related tables

2.4.2 RDBMS Databases

2.4.3 NoSQL Databases

2.4.4 NewSQL Databases

2.5 Scaling Up and Scaling Out Storage

Chapter 2 refresher

Conceptual short questions with answers

Chapter Objective

3.1 Introduction to NoSQL

3.2 Why NoSQL

3.3 CAP theorem

3.4 ACID

3.5 BASE

3.6 Schemaless Database

3.7 NoSQL (Not Only SQL)

3.7.1 NoSQL Vs RDBMS

3.7.2 Features of NoSQL database

3.7.3 Types of NoSQL Technologies

3.7.3.1 Key-Value store database

3.7.3.2 Column-store database

3.7.3.3 Document Oriented Database

3.7.3.4 Graph-oriented Database

3.7.4 NoSQL Operations

3.9 Migrating from RDBMS to NoSQL

Chapter 3 refresher

Conceptual short questions with answers

Chapter Objective

4.1 Data Processing

4.2 Shared Everything Architecture

4.2.1 Symmetric multiprocessing architecture

4.2.2 Distributed Shared memory

4.3 Shared nothing architecture

4.4 Batch Processing

4.5 Real-Time Data Processing

4.6 Parallel Computing

4.7 Distributed Computing

4.8 Big Data Virtualization

4.8.1 Attributes of Virtualization

4.8.1.1 Encapsulation

4.8.1.2 Partitioning

4.8.1.3 Isolation

4.8.2 Big Data Server Virtualization

4.9 Introduction

4.10 Cloud computing types

4.11 Cloud Services

4.12 Cloud Storage

4.12.1 Architecture of GFS

4.12.1.1 Master

4.12.1.2 Client

4.13 Cloud Architecture

Cloud Challenges

Chapter 4 Refresher

Conceptual short questions with answers

Chapter Objective

5.1 Apache Hadoop

5.1.1 Architecture of Apache Hadoop

5.1.2 Hadoop Ecosystem Components Overview

5.2 Hadoop Storage

5.2.1 HDFS (Hadoop Distributed File System)

5.2.2 Why HDFS?

5.2.3 HDFS Architecture

5.2.4 HDFS Read/Write Operation

5.2.5 Rack Awareness

5.2.6 Features of HDFS

5.2.6.1 Cost-effective

5.2.6.2 Distributed storage

5.2.6.3 Data Replication

5.3 Hadoop Computation

5.3.1 MapReduce

5.3.1.1 Mapper

5.3.1.2 Combiner

5.3.1.3 Reducer

5.3.1.4 JobTracker and TaskTracker

5.3.2 MapReduce Input Formats

5.3.3 MapReduce Example

5.3.4 MapReduce Processing

5.3.5 MapReduce Algorithm

5.3.6 Limitations of MapReduce

5.4 Hadoop 2.0

5.4.1 Hadoop 1.0 limitations

5.4.2 Features of Hadoop 2.0

5.4.3 Yet Another Resource Negotiator (YARN)

5.4.3 Core components of YARN

5.4.3.1 ResourceManager

5.4.3.2 NodeManager

5.4.4 YARN Scheduler

5.4.4.1 FIFO scheduler

5.4.4.2 Capacity Scheduler

5.4.4.3 Fair Scheduler

5.4.5 Failures in YARN

5.4.5.1 ResourceManager failure

5.4.5.2 ApplicationMaster failure

5.4.5.3 NodeManager Failure

5.4.5.4 Container Failure

5.3 HBASE

5.4 Apache Cassandra

5.5 SQOOP

5.6 Flume

5.6.1 Flume Architecture

5.6.1.1 Event

5.6.1.2 Agent

5.7 Apache Avro

5.8 Apache Pig

5.9 Apache Mahout

5.10 Apache Oozie

5.10.1 Oozie Workflow

5.10.2 Oozie Coordinators

5.10.3 Oozie Bundles

5.11 Apache Hive

Hive Architecture

Hadoop Distributions

Chapter 5 refresher

Conceptual short questions with answers

Frequently asked Interview Questions

Chapter Objective

6.1 Terminologies of Big Data Analytics

Data Warehouse

Business Intelligence

Analytics

6.2 Big Data Analytics

6.2.1 Descriptive Analytics

6.2.2 Diagnostic Analytics

6.2.3 Predictive Analytics

6.2.4 Prescriptive Analytics

6.3 Data Analytics Lifecycle

6.3.1 Business case evaluation and Identify the source data

6.3.2 Data preparation

6.3.3 Data Extraction and Transformation

6.3.4 Data Analysis and visualization

6.3.5 Analytics application

6.4 Big Data Analytics Techniques

6.4.1 Quantitative Analysis

6.4.3 Statistical analysis

6.4.3.1 A/B testing

6.4.3.2 Correlation

6.4.3.3 Regression

6.5 Semantic Analysis

6.5.1 Natural Language Processing

6.5.2 Text Analytics

6.7 Big Data Business Intelligence

6.7.1 Online Transaction Processing (OLTP)

6.7.2 Online Analytical Processing (OLAP)

6.7.3 Real-Time Analytics Platform (RTAP)

6.6 Big Data Real Time Analytics Processing

6.7 Enterprise Data Warehouse

Chapter 6 Refresher

Conceptual short questions with answers

Chapter Objective

7.1 Introduction to Machine learning

7.2 Machine learning use cases

7.3 Types of Machine learning

7.3.1 Supervised machine learning algorithm

7.3.1.1 Classification

7.3.1.2 Regression

Support vector machines (SVM)

Big Data Analytics Practical Application

Chapter 7 Refresher

Conceptual short questions with answers

Chapter Objective

8.1 Itemset Mining

8.2 Association Rules

8.3 Frequent itemset generation

8.4 Itemset Mining Algorithms

8.4.1 Apriori Algorithm

8.4.1.2 Frequent Itemset generation using Apriori Algorithm

8.4.2 Eclat Algorithm - Equivalence Class Transformation Algorithm

8.4.3 FP growth algorithm

8.5 Maximal and Closed Frequent Itemset

Mining Closed Frequent Itemsets: Charm Algorithm

CHARM Algorithm implementation

Data Mining Methods

8.8 Prediction

8.8.2 Classification techniques

8.8.2.1 Bayesian Network

8.8.2.2 K-Nearest Neighbor Algorithm

8.8.2.2.1 The Distance metric

8.8.2.2.2 The parameter selection - cross validation

8.8.2.3 Decision tree classifier

Density based clustering algorithm

DBSCAN

Kernel Density Estimation

8.9.3 Artificial Neural Network

The Biological Neural Network

8.11 Mining Data Streams

Time Series Forecasting

9.1 Clustering

Application of Hierarchical methods

Kernel k-means clustering

Expectation Maximization Clustering Algorithm

Methods of determining the Number of clusters

Outlier detection

Types of Outliers

Outlier detection techniques

Training dataset based outlier detection

Assumption based outlier detection

Applications of outlier detection

9.6.3 Optimization Algorithm

Choosing the Number of Clusters

Bayesian Analysis of Mixtures

Fuzzy Clustering

10.1 Big Data Visualization

10.2 Conventional Data Visualization Techniques

10.2.1 Line Chart

10.2.2 Bar Chart

10.2.3 Pie Chart

10.2.4 Scatter Plot

10.2.5 Bubble plot

Tableau

Connecting to data

Connecting to data in Cloud

Connect to a file

Scatter plot in Tableau

Histogram using Tableau

Bar chart in Tableau

Line Chart

Pie chart

Bubble chart

Box Plot

Tableau Use Cases

Airlines

Office Supplies

Sports

Science - Earthquake Analysis

Tableau is used to analyze the magnitude of earthquakes and their frequency of occurrence over the years.

Installing R and Getting Ready

R Basic commands

Assigning value to a variable

Data Structures in R

Vector

Coercion

Length, Mean and median

Matrix

Arrays

Data frames

Lists

Importing data from a file

Importing data from a delimited text file

Control Structures in R

If-else

Nested if-else

for loops

Example

while loops

Break

Basic Graphs in R

Pie Charts

3D - Pie Charts

Bar Charts

Boxplots

Histograms

Line charts

Scatter plots

BALAMURUGAN BALUSAMY, PhD, is a Professor with the School of Computing Science and Engineering at Galgotias University, Greater Noida, India.

NANDHINI ABIRAMI R is an IT Consultant and Research Scholar at VIT University in Vellore.

SEIFEDINE KADRY, PhD, is a Professor of Data Science at the Faculty of Applied Computing and Technology at Noroff University College, Kristiansand, Norway.

AMIR H. GANDOMI, PhD, is a Professor of Data Science at the Faculty of Engineering & Information Technology, University of Technology Sydney, Australia.