Designing Big Data Platforms

How to Use, Deploy, and Maintain Big Data Systems

Aytas, Yusuf

1st edition, August 2021
336 pages, hardcover
Wiley & Sons Ltd

ISBN: 978-1-119-69092-4
John Wiley & Sons

Price: €139.00

Price incl. VAT, plus shipping

Other formats

ePub, Mobi, PDF

DESIGNING BIG DATA PLATFORMS

Provides expert guidance and valuable insights on getting the most out of Big Data systems

A wide array of tools is currently available for managing and processing data: some are ready-to-go solutions that can be deployed immediately, while others require complex and time-intensive setups. With such a vast range of options, choosing the right tool to build a solution can be complicated, as can determining which tools work well together. Designing Big Data Platforms provides clear and authoritative guidance on the critical decisions necessary for successfully deploying, operating, and maintaining Big Data systems.

This highly practical guide helps readers understand how to process large amounts of data with well-known Linux tools and database solutions, use effective techniques to collect and manage data from multiple sources, transform data into meaningful business insights, and much more. Author Yusuf Aytas, a software engineer with extensive big data experience, discusses the design of the ideal Big Data platform: one that meets the needs of data analysts, data engineers, data scientists, software engineers, and a spectrum of other stakeholders across an organization. Detailed yet accessible chapters cover key topics such as stream data processing, data analytics, data science, data discovery, and data security. This real-world manual for Big Data technologies:
* Provides up-to-date coverage of the tools currently used in Big Data processing and management
* Offers step-by-step guidance on building a data pipeline, from basic scripting to distributed systems
* Highlights and explains how data is processed at scale
* Includes an introduction to the foundation of a modern data platform

Designing Big Data Platforms: How to Use, Deploy, and Maintain Big Data Systems is a must-have for all professionals working with Big Data, as well as researchers and students in computer science and related fields.
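
As a taste of the "basic scripting" end of that spectrum, here is a minimal sketch (not taken from the book) of the kind of task its early chapters tackle: streaming a large, newline-delimited access log and counting page views per URL without loading the whole file into memory. The log format and file name are assumptions made purely for illustration.

# count_views.py -- illustrative sketch, not code from the book.
# Assumes each log line is "timestamp<TAB>user_id<TAB>url"; adjust to your data.
import sys
from collections import Counter

def count_page_views(path):
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:                      # streams one line at a time
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:               # skip malformed records
                counts[parts[2]] += 1
    return counts

if __name__ == "__main__":
    # usage: python count_views.py access.log
    for url, views in count_page_views(sys.argv[1]).most_common(10):
        print(f"{views}\t{url}")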

List of Contributors xvii

Preface xix

Acknowledgments xxi

Acronyms xxiii

Introduction xxv

1 An Introduction: What's a Modern Big Data Platform 1

1.1 Defining Modern Big Data Platform 1

1.2 Fundamentals of a Modern Big Data Platform 2

1.2.1 Expectations from Data 2

1.2.1.1 Ease of Access 2

1.2.1.2 Security 2

1.2.1.3 Quality 3

1.2.1.4 Extensibility 3

1.2.2 Expectations from Platform 3

1.2.2.1 Storage Layer 4

1.2.2.2 Resource Management 4

1.2.2.3 ETL 5

1.2.2.4 Discovery 6

1.2.2.5 Reporting 7

1.2.2.6 Monitoring 7

1.2.2.7 Testing 8

1.2.2.8 Lifecycle Management 9

2 A Bird's Eye View on Big Data 11

2.1 A Bit of History 11

2.1.1 Early Uses of Big Data Term 11

2.1.2 A New Era 12

2.1.2.1 Word Count Problem 12

2.1.2.2 Execution Steps 13

2.1.3 An Open-Source Alternative 15

2.1.3.1 Hadoop Distributed File System 15

2.1.3.2 Hadoop MapReduce 17

2.2 What Makes Big Data 20

2.2.1 Volume 20

2.2.2 Velocity 21

2.2.3 Variety 21

2.2.4 Complexity 21

2.3 Components of Big Data Architecture 22

2.3.1 Ingestion 22

2.3.2 Storage 23

2.3.3 Computation 23

2.3.4 Presentation 24

2.4 Making Use of Big Data 24

2.4.1 Querying 24

2.4.2 Reporting 25

2.4.3 Alerting 25

2.4.4 Searching 25

2.4.5 Exploring 25

2.4.6 Mining 25

2.4.7 Modeling 26

3 A Minimal Data Processing and Management System 27

3.1 Problem Definition 27

3.1.1 Online Book Store 27

3.1.2 User Flow Optimization 28

3.2 Processing Large Data with Linux Commands 28

3.2.1 Understand the Data 28

3.2.2 Sample the Data 28

3.2.3 Building the Shell Command 29

3.2.4 Executing the Shell Command 30

3.2.5 Analyzing the Results 31

3.2.6 Reporting the Findings 32

3.2.7 Automating the Process 33

3.2.8 A Brief Review 33

3.3 Processing Large Data with PostgreSQL 34

3.3.1 Data Modeling 34

3.3.2 Copying Data 35

3.3.3 Sharding in PostgreSQL 37

3.3.3.1 Setting up Foreign Data Wrapper 37

3.3.3.2 Sharding Data over Multiple Nodes 38

3.4 Cost of Big Data 39

4 Big Data Storage 41

4.1 Big Data Storage Patterns 41

4.1.1 Data Lakes 41

4.1.2 Data Warehouses 42

4.1.3 Data Marts 43

4.1.4 Comparison of Storage Patterns 43

4.2 On-Premise Storage Solutions 44

4.2.1 Choosing Hardware 44

4.2.1.1 DataNodes 44

4.2.1.2 NameNodes 45

4.2.1.3 Resource Managers 45

4.2.1.4 Network Equipment 45

4.2.2 Capacity Planning 46

4.2.2.1 Overall Cluster 46

4.2.2.2 Resource Sharing 47

4.2.2.3 Doing the Math 47

4.2.3 Deploying Hadoop Cluster 48

4.2.3.1 Networking 48

4.2.3.2 Operating System 48

4.2.3.3 Management Tools 49

4.2.3.4 Hadoop Ecosystem 49

4.2.3.5 A Humble Deployment 49

4.3 Cloud Storage Solutions 53

4.3.1 Object Storage 54

4.3.2 Data Warehouses 55

4.3.2.1 Columnar Storage 55

4.3.2.2 Provisioned Data Warehouses 56

4.3.2.3 Serverless Data Warehouses 56

4.3.2.4 Virtual Data Warehouses 57

4.3.3 Archiving 58

4.4 Hybrid Storage Solutions 59

4.4.1 Making Use of Object Store 59

4.4.1.1 Additional Capacity 59

4.4.1.2 Batch Processing 59

4.4.1.3 Hot Backup 60

4.4.2 Making Use of Data Warehouse 60

4.4.2.1 Primary Data Warehouse 60

4.4.2.2 Shared Data Mart 61

4.4.3 Making Use of Archiving 61

5 Offline Big Data Processing 63

5.1 Defining Offline Data Processing 63

5.2 MapReduce Technologies 65

5.2.1 Apache Pig 65

5.2.1.1 Pig Latin Overview 66

5.2.1.2 Compilation To MapReduce 66

5.2.2 Apache Hive 67

5.2.2.1 Hive Database 68

5.2.2.2 Hive Architecture 69

5.3 Apache Spark 70

5.3.1 What's Spark 71

5.3.2 Spark Constructs and Components 71

5.3.2.1 Resilient Distributed Datasets 71

5.3.2.2 Distributed Shared Variables 73

5.3.2.3 Datasets and DataFrames 74

5.3.2.4 Spark Libraries and Connectors 75

5.3.3 Execution Plan 76

5.3.3.1 The Logical Plan 77

5.3.3.2 The Physical Plan 77

5.3.4 Spark Architecture 77

5.3.4.1 Inside of Spark Application 78

5.3.4.2 Outside of Spark Application 79

5.4 Apache Flink 81

5.5 Presto 83

5.5.1 Presto Architecture 83

5.5.2 Presto System Design 84

5.5.2.1 Execution Plan 84

5.5.2.2 Scheduling 86

5.5.2.3 Resource Management 86

5.5.2.4 Fault Tolerance 87

6 Stream Big Data Processing 89

6.1 The Need for Stream Processing 89

6.2 Defining Stream Data Processing 90

6.3 Streams via Message Brokers 92

6.3.1 Apache Kafka 92

6.3.1.1 Apache Samza 93

6.3.1.2 Kafka Streams 98

6.3.2 Apache Pulsar 100

6.3.2.1 Pulsar Functions 102

6.3.3 AMQP Based Brokers 105

6.4 Streams via Stream Engines 106

6.4.1 Apache Flink 106

6.4.1.1 Flink Architecture 107

6.4.1.2 System Design 109

6.4.2 Apache Storm 111

6.4.2.1 Storm Architecture 114

6.4.2.2 System Design 115

6.4.3 Apache Heron 116

6.4.3.1 Storm Limitations 116

6.4.3.2 Heron Architecture 117

6.4.4 Spark Streaming 118

6.4.4.1 Discretized Streams 119

6.4.4.2 Fault-tolerance 120

7 Data Analytics 121

7.1 Log Collection 121

7.1.1 Apache Flume 122

7.1.2 Fluentd 122

7.1.2.1 Data Pipeline 123

7.1.2.2 Fluent Bit 124

7.1.2.3 Fluentd Deployment 124

7.2 Transferring Big Data Sets 125

7.2.1 Reloading 126

7.2.2 Partition Loading 126

7.2.3 Streaming 127

7.2.4 Timestamping 127

7.2.5 Tools 128

7.2.5.1 Sqoop 128

7.2.5.2 Embulk 128

7.2.5.3 Spark 129

7.2.5.4 Apache Gobblin 130

7.3 Aggregating Big Data Sets 132

7.3.1 Data Cleansing 132

7.3.2 Data Transformation 134

7.3.2.1 Transformation Functions 134

7.3.2.2 Transformation Stages 135

7.3.3 Data Retention 135

7.3.4 Data Reconciliation 136

7.4 Data Pipeline Scheduler 136

7.4.1 Jenkins 137

7.4.2 Azkaban 138

7.4.2.1 Projects 139

7.4.2.2 Execution Modes 139

7.4.3 Airflow 139

7.4.3.1 Task Execution 140

7.4.3.2 Scheduling 141

7.4.3.3 Executor 141

7.4.3.4 Security and Monitoring 142

7.4.4 Cloud 143

7.5 Patterns and Practices 143

7.5.1 Patterns 143

7.5.1.1 Data Centralization 143

7.5.1.2 Single Source of Truth 144

7.5.1.3 Domain Driven Data Sets 145

7.5.2 Anti-Patterns 146

7.5.2.1 Data Monolith 146

7.5.2.2 Data Swamp 147

7.5.2.3 Technology Pollution 147

7.5.3 Best Practices 148

7.5.3.1 Business-Driven Approach 148

7.5.3.2 Cost of Maintenance 148

7.5.3.3 Avoiding Modeling Mistakes 149

7.5.3.4 Choosing the Right Tool for the Job 150

7.5.4 Detecting Anomalies 150

7.5.4.1 Manual Anomaly Detection 151

7.5.4.2 Automated Anomaly Detection 151

7.6 Exploring Data Visually 152

7.6.1 Metabase 152

7.6.2 Apache Superset 153

8 Data Science 155

8.1 Data Science Applications 155

8.1.1 Recommendation 156

8.1.2 Predictive Analytics 156

8.1.3 Pattern Discovery 157

8.2 Data Science Life Cycle 158

8.2.1 Business Objective 158

8.2.2 Data Understanding 159

8.2.3 Data Ingestion 159

8.2.4 Data Preparation 160

8.2.5 Data Exploration 160

8.2.6 Feature Engineering 161

8.2.7 Modeling 161

8.2.8 Model Evaluation 162

8.2.9 Model Deployment 163

8.2.10 Operationalizing 163

8.3 Data Science Toolbox 164

8.3.1 R 164

8.3.2 Python 165

8.3.3 SQL 167

8.3.4 TensorFlow 167

8.3.4.1 Execution Model 169

8.3.4.2 Architecture 170

8.3.5 Spark MLlib 171

8.4 Productionalizing Data Science 173

8.4.1 Apache PredictionIO 173

8.4.1.1 Architecture Overview 174

8.4.1.2 Machine Learning Templates 174

8.4.2 Seldon 175

8.4.3 MLflow 175

8.4.3.1 MLflow Tracking 175

8.4.3.2 MLflow Projects 176

8.4.3.3 MLflow Models 176

8.4.3.4 MLflow Model Registry 177

8.4.4 Kubeflow 177

8.4.4.1 Kubeflow Pipelines 177

8.4.4.2 Kubeflow Metadata 178

8.4.4.3 Kubeflow Katib 178

9 Data Discovery 179

9.1 Need for Data Discovery 179

9.1.1 Single Source of Metadata 180

9.1.2 Searching 181

9.1.3 Data Lineage 182

9.1.4 Data Ownership 182

9.1.5 Data Metrics 183

9.1.6 Data Grouping 183

9.1.7 Data Clustering 184

9.1.8 Data Classification 184

9.1.9 Data Glossary 185

9.1.10 Data Update Notification 185

9.1.11 Data Presentation 186

9.2 Data Governance 186

9.2.1 Data Governance Overview 186

9.2.1.1 Data Quality 186

9.2.1.2 Metadata 187

9.2.1.3 Data Access 187

9.2.1.4 Data Life Cycle 188

9.2.2 Big Data Governance 188

9.2.2.1 Data Architecture 188

9.2.2.2 Data Source Integration 189

9.3 Data Discovery Tools 189

9.3.1 Metacat 189

9.3.1.1 Data Abstraction and Interoperability 190

9.3.1.2 Metadata Enrichment 190

9.3.1.3 Searching and Indexing 190

9.3.1.4 Update Notifications 191

9.3.2 Amundsen 191

9.3.2.1 Discovery Capabilities 191

9.3.2.2 Integration Points 192

9.3.2.3 Architecture Overview 192

9.3.3 Apache Atlas 193

9.3.3.1 Searching 194

9.3.3.2 Glossary 194

9.3.3.3 Type System 194

9.3.3.4 Lineage 195

9.3.3.5 Notifications 196

9.3.3.6 Bridges and Hooks 196

9.3.3.7 Architecture Overview 196

10 Data Security 199

10.1 Infrastructure Security 199

10.1.1 Computing 200

10.1.1.1 Auditing 200

10.1.1.2 Operating System 200

10.1.1.3 Network 200

10.1.2 Identity and Access Management 201

10.1.2.1 Authentication 201

10.1.2.2 Authorization 201

10.1.3 Data Transfers 202

10.2 Data Privacy 202

10.2.1 Data Encryption 202

10.2.1.1 File System Layer Encryption 203

10.2.1.2 Database Layer Encryption 203

10.2.1.3 Transport Layer Encryption 203

10.2.1.4 Application Layer Encryption 203

10.2.2 Data Anonymization 204

10.2.2.1 k-Anonymity 204

10.2.2.2 l-Diversity 204

10.2.3 Data Perturbation 204

10.3 Law Enforcement 205

10.3.1 PII 205

10.3.1.1 Identifying PII Tables/Columns 205

10.3.1.2 Segregating Tables Containing PII 205

10.3.1.3 Protecting PII Tables via Access Control 206

10.3.1.4 Masking and Anonymizing PII Data 206

10.3.2 Privacy Regulations/Acts 207

10.3.2.1 Collecting Data 207

10.3.2.2 Erasing Data 207

10.4 Data Security Tools 208

10.4.1 Apache Ranger 208

10.4.1.1 Ranger Policies 208

10.4.1.2 Managed Components 209

10.4.1.3 Architecture Overview 211

10.4.2 Apache Sentry 212

10.4.2.1 Managed Components 212

10.4.2.2 Architecture Overview 213

10.4.3 Apache Knox 214

10.4.3.1 Authentication and Authorization 215

10.4.3.2 Supported Hadoop Services 215

10.4.3.3 Client Services 216

10.4.3.4 Architecture Overview 217

10.4.3.5 Audit 218

11 Putting All Together 219

11.1 Platforms 219

11.1.1 In-house Solutions 220

11.1.1.1 Cloud Provisioning 220

11.1.1.2 On-premise Provisioning 221

11.1.2 Cloud Providers 221

11.1.2.1 Vendor Lock-in 222

11.1.2.2 Outages 223

11.1.3 Hybrid Solutions 223

11.1.3.1 Kubernetes 224

11.2 Big Data Systems and Tools 224

11.2.1 Storage 224

11.2.1.1 File-Based Storage 225

11.2.1.2 NoSQL 225

11.2.2 Processing 226

11.2.2.1 Batch Processing 226

11.2.2.2 Stream Processing 227

11.2.2.3 Combining Batch and Streaming 227

11.2.3 Model Training 228

11.2.4 A Holistic View 228

11.3 Challenges 229

11.3.1 Growth 229

11.3.2 SLA 230

11.3.3 Versioning 231

11.3.4 Maintenance 232

11.3.5 Deprecation 233

11.3.6 Monitoring 233

11.3.7 Trends 234

11.3.8 Security 235

11.3.9 Testing 235

11.3.10 Organization 236

11.3.11 Talent 237

11.3.12 Budget 238

12 An Ideal Platform 239

12.1 Event Sourcing 240

12.1.1 How It Works 240

12.1.2 Messaging Middleware 241

12.1.3 Why Use Event Sourcing 242

12.2 Kappa Architecture 242

12.2.1 How It Works 243

12.2.2 Limitations 244

12.3 Data Mesh 245

12.3.1 Domain-Driven Design 245

12.3.2 Self-Serving Infrastructure 247

12.3.3 Data as Product Approach 247

12.4 Data Reservoirs 248

12.4.1 Data Organization 248

12.4.2 Data Standards 249

12.4.3 Data Policies 249

12.4.4 Multiple Data Reservoirs 250

12.5 Data Catalog 250

12.5.1 Data Feedback Loop 251

12.5.2 Data Synthesis 252

12.6 Self-service Platform 252

12.6.1 Data Publishing 253

12.6.2 Data Processing 254

12.6.3 Data Monitoring 254

12.7 Abstraction 254

12.7.1 Abstractions via User Interface 255

12.7.2 Abstractions via Wrappers 256

12.8 Data Guild 256

12.9 Trade-offs 257

12.9.1 Quality vs Efficiency 258

12.9.2 Real time vs Offline 258

12.9.3 Performance vs Cost 258

12.9.4 Consistency vs Availability 259

12.9.5 Disruption vs Reliability 259

12.9.6 Build vs Buy 259

12.10 Data Ethics 260

Appendix A Further Systems and Patterns 261

A.1 Lambda Architecture 261

A.1.1 Batch Layer 262

A.1.2 Speed Layer 262

A.1.3 Serving Layer 262

A.2 Apache Cassandra 263

A.2.1 Cassandra Data Modeling 263

A.2.2 Cassandra Architecture 265

A.2.2.1 Cassandra Components 265

A.2.2.2 Storage Engine 267

A.3 Apache Beam 267

A.3.1 Programming Overview 268

A.3.2 Execution Model 270

Appendix B Recipes 271

B.1 Activity Tracking Recipe 271

B.1.1 Problem Statement 271

B.1.2 Design Approach 271

B.1.2.1 Data Ingestion 271

B.1.2.2 Computation 272

B.2 Data Quality Assurance 273

B.2.1 Problem Statement 273

B.2.2 Design Approach 273

B.2.2.1 Ingredients 273

B.2.2.2 Preparation 275

B.2.2.3 The Menu 277

B.3 Estimating Time to Delivery 277

B.3.1 Problem Definition 278

B.3.2 Design Approach 278

B.3.2.1 Streaming Reference Architecture 278

B.4 Incident Response Recipe 283

B.4.1 Problem Definition 283

B.4.2 Design Approach 283

B.4.2.1 Minimizing Disruptions 284

B.4.2.2 Categorizing Disruptions 284

B.4.2.3 Handling Disruptions 284

B.5 Leveraging Spark SQL Metrics 286

B.5.1 Problem Statement 286

B.5.2 Design Approach 286

B.5.2.1 Spark SQL Metrics Pipeline 286

B.5.2.2 Integrating SQL Metrics 288

B.5.2.3 The Menu 289

B.6 Airbnb Price Prediction 289

B.6.1 Problem Statement 290

B.6.2 Design Approach 290

B.6.2.1 Tech Stack 290

B.6.2.2 Data Understanding and Ingestion 291

B.6.2.3 Data Preparation 291

B.6.2.4 Modeling and Evaluation 293

B.6.2.5 Monitor and Iterate 294

Bibliography 295

Index 301

Yusuf Aytas is an experienced software engineer. He holds master's and bachelor's degrees in Computer Science from Bilkent University, Ankara, Turkey. Yusuf has experience building highly scalable, optimized analytics platforms and is proficient in agile methodologies, continuous delivery, and software development best practices.