Data Analytics in the AWS Cloud

Building a Data Platform for BI and Predictive Analytics on AWS

Minichino, Joe

1st edition, May 2023
416 pages, softcover
Wiley & Sons Ltd

ISBN: 978-1-119-90924-8

John Wiley & Sons

Price: €57.90

Price includes VAT; shipping is extra

A comprehensive and accessible roadmap to performing data analytics in the AWS cloud

In Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS, accomplished software engineer and data architect Joe Minichino delivers an expert blueprint for storing, processing, and analyzing data on the Amazon Web Services cloud platform. In the book, you'll explore every relevant aspect of data analytics--from data engineering to analysis, business intelligence, DevOps, and MLOps--as you discover how to integrate machine learning predictions with analytics engines and visualization tools.

You'll also find:
* Real-world use cases of AWS architectures that demystify the applications of data analytics
* Accessible introductions to data acquisition, importation, storage, visualization, and reporting
* Expert insights into serverless data engineering and how to use it to reduce overhead and costs, improve stability, and simplify maintenance

A can't-miss for data architects, analysts, engineers and technical professionals, Data Analytics in the AWS Cloud will also earn a place on the bookshelves of business leaders seeking a better understanding of data analytics on the AWS cloud platform.

Introduction xxiii

Chapter 1 AWS Data Lakes and Analytics Technology Overview 1

Why AWS? 1

What Does a Data Lake Look Like in AWS? 2

Analytics on AWS 3

Skills Required to Build and Maintain an AWS Analytics Pipeline 3

Chapter 2 The Path to Analytics: Setting Up a Data and Analytics Team 5

The Data Vision 6

Support 6

DA Team Roles 7

Early Stage Roles 7

Team Lead 8

Data Architect 8

Data Engineer 8

Data Analyst 9

Maturity Stage Roles 9

Data Scientist 9

Cloud Engineer 10

Business Intelligence (BI) Developer 10

Machine Learning Engineer 10

Business Analyst 11

Niche Roles 11

Analytics Flow at a Process Level 12

Workflow Methodology 12

The DA Team Mantra: "Automate Everything" 14

Analytics Models in the Wild: Centralized, Distributed, Center of Excellence 15

Centralized 15

Distributed 16

Center of Excellence 16

Summary 17

Chapter 3 Working on AWS 19

Accessing AWS 20

Everything Is a Resource 21

S3: An Important Exception 21

IAM: Policies, Roles, and Users 22

Policies 22

Identity-Based Policies 24

Resource-Based Policies 25

Roles 25

Users and User Groups 25

Summarizing IAM 26

Working with the Web Console 26

The AWS Command-Line Interface 29

Installing AWS cli 29

Linux Installation 30

macOS Installation 30

Windows 31

Configuring AWS cli 31

A Note on Region 33

Setting Individual Parameters 33

Using Profiles and Configuration Files 33

Final Notes on Configuration 36

Using the AWS cli 36

Using Skeletons and File Inputs 39

Cleaning Up! 43

Infrastructure-as-Code: CloudFormation and Terraform 44

CloudFormation 44

CloudFormation Stacks 46

CloudFormation Template Anatomy 47

CloudFormation Changesets 52

Getting Stack Information 55

Cleaning Up Again 57

CloudFormation Conclusions 58

Terraform 58

Coding Style 58

Modularity 59

Limitations 59

Terraform vs. CloudFormation 60

Infrastructure-as-Code: CDK, Pulumi, Cloudcraft, and Other Solutions 60

AWS CDK 60

Pulumi 62

Cloudcraft 62

Infrastructure Management Conclusions 63

Chapter 4 Serverless Computing and Data Engineering 65

Serverless vs. Fully Managed 65

AWS Serverless Technologies 66

AWS Lambda 67

Pricing Model 67

Laser Focus on Code 68

The Lambda Paradigm Shift 69

Virtually Infinite Scalability 70

Geographical Distribution 70

A Lambda Hello World 71

Lambda Configuration 74

Runtime 74

Container-Based Lambdas 75

Architectures 75

Memory 75

Networking 76

Execution Role 76

Environment Variables 76

AWS EventBridge 77

AWS Fargate 77

AWS DynamoDB 77

AWS SNS 77

Amazon SQS 78

AWS CloudWatch 78

Amazon QuickSight 78

AWS Step Functions 78

Amazon API Gateway 79

Amazon Cognito 79

AWS Serverless Application Model (SAM) 79

Ephemeral Infrastructure 80

AWS SAM Installation 80

Configuration 80

Creating Your First AWS SAM Project 81

Application Structure 83

SAM Resource Types 85

SAM Lambda Template 86

!! Recursive Lambda Invocation !! 88

Function Metadata 88

Outputs 89

Implicitly Generated Resources 89

Other Template Sections 90

Lambda Code 90

Building Your First SAM Application 93

Testing the AWS SAM Application Locally 96

Deployment 99

Cleaning Up 104

Summary 104

Chapter 5 Data Ingestion 105

AWS Data Lake Architecture 106

Serverless Data Lake Architecture Structure 106

Ingestion 106

Storage and Processing 108

Cataloging, Governance, and Search 108

Security and Monitoring 109

Consumption 109

Sample Processing Architecture: Cataloging Images into DynamoDB 109

Use Case Description 109

SAM Application Creation 110

S3-Triggered Lambda 111

Adding DynamoDB 119

Lambda Execution Context 121

Inserting into DynamoDB 121

Cleaning Up 123

Serverless Ingestion 124

AWS Fargate 124

AWS Lambda 124

Example Architecture: Fargate-Based Periodic Batch Import 125

The Basic Importer 125

ECS CLI 128

AWS Copilot cli 128

Clean Up 136

AWS Kinesis Ingestion 136

Example Architecture: Two-Pronged Delivery 137

Fully Managed Ingestion with AppFlow 146

Operational Data Ingestion with Database Migration Service 151

DMS Concepts 151

DMS Instance 151

DMS Endpoints 152

DMS Tasks 152

Summary of the Workflow 152

Common Use of DMS 153

Example Architecture: DMS to S3 154

DMS Instance 154

DMS Endpoints 156

DMS Task 162

Summary 167

Chapter 6 Processing Data 169

Phases of Data Preparation 170

What Is ETL? Why Should I Care? 170

ETL Job vs. Streaming Job 171

Overview of ETL in AWS 172

ETL with AWS Glue 172

ETL with Lambda Functions 172

ETL with Hadoop/EMR 173

Other Ways to Perform ETL 173

ETL Job Design Concepts 173

Source Identification 174

Destination Identification 174

Mappings 174

Validation 174

Filter 175

Join, Denormalization, Relationalization 175

AWS Glue for ETL 176

Really, It's Just Spark 176

Visual 176

Spark Script Editor 177

Python Shell Script Editor 177

Jupyter Notebook 177

Connectors 177

Creating Connections 178

Creating Connections with the Web Console 178

Creating Connections with the AWS cli 179

Creating ETL Jobs with AWS Glue Visual Editor 184

ETL Example: Format Switch from Raw (JSON) to Cleaned (Parquet) 184

Job Bookmarks 187

Transformations 188

Apply Mapping 189

Filter 189

Other Available Transforms 190

Run the Edited Job 191

Visual Editor with Source and Target Conclusions 192

Creating ETL Jobs with AWS Glue Visual Editor (without Source and Target) 192

Creating ETL Jobs with the Spark Script Editor 192

Developing ETL Jobs with AWS Glue Notebooks 193

What Is a Notebook? 194

Notebook Structure 194

Step 1: Load Code into a DynamicFrame 196

Step 2: Apply Field Mapping 197

Step 3: Apply the Filter 197

Step 4: Write to S3 in Parquet Format 198

Example: Joining and Denormalizing Data from Two S3 Locations 199

Conclusions for Manually Authored Jobs with Notebooks 203

Creating ETL Jobs with AWS Glue Interactive Sessions 204

It's Magic 205

Development Workflow 206

Streaming Jobs 207

Differences with a Standard ETL Job 208

Streaming Sources 208

Example: Process Kinesis Streams with a Streaming Job 208

Streaming ETL Jobs Conclusions 217

Summary 217

Chapter 7 Cataloging, Governance, and Search 219

Cataloging with AWS Glue 219

AWS Glue and the AWS Glue Data Catalog 219

Glue Databases and Tables 220

Databases 220

The Idea of Schema-on-Read 221

Tables 222

Create Table Manually 223

Creating a Table from an Existing Schema 225

Creating a Table with a Crawler 225

Summary on Databases and Tables 226

Crawlers 226

Updating or Not Updating? 230

Running the Crawler 231

Creating a Crawler from the AWS CLI 231

Retrieving Table Information from the CLI 233

Classifiers 235

Classifier Example 236

Crawlers and Classifiers Summary 237

Search with Amazon Athena: The Heart of Analytics in AWS 238

A Bit of History 238

Interface Overview 238

Creating Tables Manually 239

Athena Data Types 240

Complex Types 241

Running a Query 242

Connecting with JDBC and ODBC 243

Query Stats 243

Recent Queries and Saved Queries 243

The Power of Partitions 244

Athena Pricing Model 244

Automatic Naming 245

Athena Query Output 246

Athena Peculiarities (SQL and Not) 246

Computed Fields Gotcha and WITH Statement Workaround 246

Lowercase! 247

Query Explain 248

Deduplicating Records 249

Working with JSON, Flattening, and Unnesting 250

Athena Views 251

Create Table as Select (CTAS) 252

Saving Queries and Reusing Saved Queries 253

Running Parameterized Queries 254

Athena Federated Queries 254

Athena Lambda Connectors 255

Note on Connection Errors 256

Performing Federated Queries 257

Creating a View from a Federated Query 258

Governing: Athena Workgroups, Lake Formation, and More 258

Athena Workgroups 259

Fine-Grained Athena Access with IAM 262

Recap of Athena-Based Governance 264

AWS Lake Formation 265

Registering a Location in Lake Formation 266

Creating a Database in Lake Formation 268

Assigning Permissions in Lake Formation 269

LF-Tags and Permissions in Lake Formation 271

Data Filters 277

Governance Conclusions 279

Summary 280

Chapter 8 Data Consumption: BI, Visualization, and Reporting 283

QuickSight 283

Signing Up for QuickSight 284

Standard Plan 284

Enterprise Plan 284

Users and User Groups 285

Managing Users and Groups 285

Managing QuickSight 286

Users and Groups 287

Your Subscriptions 287

SPICE Capacity 287

Account Settings 287

Security and Permissions 287

VPC Connections 288

Mobile Settings 289

Domains and Embedding 289

Single Sign-On 289

Data Sources and Datasets 289

Creating an Athena Data Source 291

Creating Other Data Sources 292

Creating a Data Source from the AWS cli 292

Creating a Dataset from a Table 294

Creating a Dataset from a SQL Query 295

Duplicating Datasets 296

Note on Creating Datasets 297

QuickSight Favorites, Recent, and Folders 297

SPICE 298

Manage SPICE Capacity 298

Refresh Schedule 299

QuickSight Data Editor 299

QuickSight Data Types 302

Change Data Types 302

Calculated Fields 303

Joining Data 305

Excluding Fields 309

Filtering Data 309

Removing Data 310

Geospatial Hierarchies and Adding Fields to Hierarchies 310

Unsupported Format Dates 311

Visualizing Data: QuickSight Analysis 312

Adding a Title and a Description to Your Analysis 313

Renaming the Sheet 314

Your First Visual with AutoGraph 314

Field Wells 314

Visuals Types 315

Saving and Autosaving 316

A First Example: Pie Chart 316

Renaming a Visual 317

Filtering Data 318

Adding Drill-Downs 320

Parameters 321

Actions 324

Insights 328

ML-Powered Insights 330

Sharing an Analysis 335

Dashboards 335

Dashboard Layouts and Themes 335

Publishing a Dashboard 336

Embedding Visuals and Dashboards 337

Data Consumption: Not Only Dashboards 337

Summary 338

Chapter 9 Machine Learning at Scale 339

Machine Learning and Artificial Intelligence 339

What Are ML/AI Use Cases? 340

Types of ML Models 340

Overview of ML/AI AWS Solutions 341

Amazon SageMaker 341

SageMaker Domains 342

Adding a User to the Domain 344

SageMaker Studio 344

SageMaker Example Notebook 346

Step 1: Prerequisites and Preprocessing 346

Step 2: Data Ingestion 347

Step 3: Data Inspection 348

Step 4: Data Conversion 349

Step 5: Upload Training Data 349

Step 6: Train the Model 349

Step 7: Set Up Hosting and Deploy the Model 351

Step 8: Validate the Model 352

Step 9: Use the Model 353

Inference 353

Real Time 354

Asynchronous 354

Serverless 354

Batch Transform 354

Data Wrangler 356

SageMaker Canvas 357

Summary 358

Appendix Example Data Architectures in AWS 359

Modern Data Lake Architecture 360

ETL in a Lake House 361

Consuming Data in the Lake House 361

The Modern Data Lake Architecture 362

Batch Processing 362

Stream Processing 363

Architecture Design Recommendations 364

Automate Everything 365

Build on Events 365

Performance = Cost Savings 365

AWS Glue Catalog and Athena-Centric Workflow 365

Design Flexible 365

Pick Your Battles 365

Parquet 366

Summary 366

Index 367

GIONATA "JOE" MINICHINO is Principal Software Engineer and Data Architect on the Data & Analytics Team at Teamwork. He specializes in cloud computing, machine/deep learning, and artificial intelligence and designs end-to-end Amazon Web Services pipelines that move large quantities of diverse data for analysis and visualization.