CHF35.65
Download available immediately
PEEK "UNDER THE HOOD" OF BIG DATA ANALYTICS
The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance.
The authors discuss foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You'll learn how to recognize problems in your applications resulting in performance and distributed operation issues, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within.
Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system.
Ideal for data scientists, data architects, dev-ops engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to:
Identify the foundations of large-scale, distributed data processing systems
Make major software design decisions that optimize performance
Diagnose performance problems and distributed operation issues
Understand state-of-the-art research in big data
Explain and use the major big data frameworks and understand what underpins them
Use big data analytics in the real world to solve practical problems
Authors
SUPUN KAMBURUGAMUVE, PhD, is a computer scientist researching and designing large-scale data analytics tools. He received his doctorate in Computer Science from Indiana University, Bloomington, and architected the data processing systems Twister2 and Cylon.
SALIYA EKANAYAKE, PhD, is a Senior Software Engineer at Microsoft working at the intersection of scaling deep learning systems and parallel computing. He is also a research affiliate at Berkeley Lab. He received his doctorate in Computer Science from Indiana University, Bloomington.
Contents
Introduction
Chapter 1 Data Intensive Applications
Anatomy of a Data-Intensive Application
A Histogram Example
Program
Process Management
Communication
Execution
Data Structures
Putting It Together
Application
Resource Management
Messaging
Data Structures
Tasks and Execution
Fault Tolerance
Remote Execution
Parallel Applications
Serial Applications
Lloyd's K-Means Algorithm
Parallelizing Algorithms
Decomposition
Task Assignment
Orchestration
Mapping
K-Means Algorithm
Parallel and Distributed Computing
Memory Abstractions
Shared Memory
Distributed Memory
Hybrid (Shared + Distributed) Memory
Partitioned Global Address Space Memory
Application Classes and Frameworks
Parallel Interaction Patterns
Pleasingly Parallel
Dataflow
Iterative
Irregular
Data Abstractions
Data-Intensive Frameworks
Components
Workflows
An Example
What Makes It Difficult?
Developing Applications
Concurrency
Data Partitioning
Debugging
Diverse Environments
Computer Networks
Synchronization
Thread Synchronization
Data Synchronization
Ordering of Events
Faults
Consensus
Summary
References
Chapter 2 Data and Storage
Storage Systems
Storage for Distributed Systems
Direct-Attached Storage
Storage Area Network
Network-Attached Storage
DAS or SAN or NAS?
Storage Abstractions
Block Storage
File Systems
Object Storage
Data Formats
XML
JSON
CSV
Apache Parquet
Apache Avro
Avro Data Definitions (Schema)
Code Generation
Without Code Generation
Avro File
Schema Evolution
Protocol Buffers, Flat Buffers, and Thrift
Data Replication
Synchronous and Asynchronous Replication
Single-Leader and Multileader Replication
Data Locality
Disadvantages of Replication
Data Partitioning
Vertical Partitioning
Horizontal Partitioning (Sharding)
Hybrid Partitioning
Considerations for Partitioning
NoSQL Databases
Data Models
Key-Value Databases
Document Databases
Wide Column Databases
Graph Databases
CAP Theorem
Message Queuing
Message Processing Guarantees
Durability of Messages
Acknowledgments
Storage First Brokers and Transient Brokers
Summary
References
Chapter 3 Computing Resources
A Demonstration
Computer Clusters
Anatomy of a Computer Cluster
Data Analytics in Clusters
Dedicated Clusters
Classic Parallel Systems
Big Data Systems
Shared Clusters
OpenMPI on a Slurm Cluster
Spark on a Yarn Cluster
Distributed Application Life Cycle
Life Cycle Steps
...