Fundamentals of Data Engineering: Plan and Build Robust Data Systems

Length: 424 pages
Edition: 1
Language: English
Publisher: O'Reilly Media
Publication Date: 2022-08-02
ISBN-10: 1098108302
ISBN-13: 9781098108304
Sales Rank: #69072

Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you will learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available in the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You will understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, governance, and deployment that are critical in any data environment regardless of the underlying technology.

This book will help you:

Assess data engineering problems using an end-to-end data framework of best practices
Cut through marketing hype when choosing data technologies, architecture, and processes
Use the data engineering lifecycle to design and build a robust architecture
Incorporate data governance and security across the data engineering lifecycle

Preface
    What This Book Isn’t
    What This Book Is About
    Who Should Read This Book
    Prerequisites
    What You’ll Learn and How It Will Improve Your Abilities
    The Book Outline
    Conventions Used in This Book
    How to Contact Us
    Acknowledgments
I. Foundation and Building Blocks
1. Data Engineering Described
    What Is Data Engineering?
        Data Engineering Defined
        The Data Engineering Lifecycle
        Evolution of the Data Engineer
        Data Engineering and Data Science
    Data Engineering Skills and Activities
        Data Maturity and the Data Engineer
        The Background and Skills of a Data Engineer
        Business Responsibilities
        Technical Responsibilities
        The Continuum of Data Engineering Roles, from A to B
    Data Engineers Inside an Organization
        Internal-Facing Versus External-Facing Data Engineers
        Data Engineers and Other Technical Roles
        Data Engineers and Business Leadership
    Conclusion
    Additional Resources
2. The Data Engineering Lifecycle
    What Is the Data Engineering Lifecycle?
        The Data Lifecycle Versus the Data Engineering Lifecycle
        Generation: Source Systems
        Storage
        Ingestion
        Transformation
        Serving Data
    Major Undercurrents Across the Data Engineering Lifecycle
        Security
        Data Management
        Orchestration
        DataOps
        Data Architecture
        Software Engineering
    Conclusion
    Additional Resources
3. Designing Good Data Architecture
    What Is Data Architecture?
        Enterprise Architecture, Defined
        Data Architecture Defined
        “Good” Data Architecture
    Principles of Good Data Architecture
        Principle 1: Choose Common Components Wisely
        Principle 2: Plan for Failure
        Principle 3: Architect for Scalability
        Principle 4: Architecture Is Leadership
        Principle 5: Always Be Architecting
        Principle 6: Build Loosely Coupled Systems
        Principle 7: Make Reversible Decisions
        Principle 8: Prioritize Security
        Principle 9: Embrace FinOps
    Major Architecture Concepts
        Domains and Services
        Distributed Systems, Scalability, and Designing for Failure
        Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
        User Access: Single Versus Multitenant
        Event-Driven Architecture
        Brownfield Versus Greenfield Projects
    Examples and Types of Data Architecture
        Data Warehouse
        Data Lake
        Convergence, Next-Generation Data Lakes, and the Data Platform
        Modern Data Stack
        Lambda Architecture
        Kappa Architecture
        The Dataflow Model and Unified Batch and Streaming
        Architecture for IoT
        Data Mesh
        Other Data Architecture Examples
    Who’s Involved with Designing a Data Architecture?
    Conclusion
    Additional Resources
4. Choosing Technologies Across the Data Engineering Lifecycle
    Team Size and Capabilities
    Speed to Market
    Interoperability
    Cost Optimization and Business Value
        Total Cost of Ownership
        Total Opportunity Cost of Ownership
        FinOps
    Today Versus the Future: Immutable Versus Transitory Technologies
        Our Advice
    Location
        On Premises
        Cloud
        Hybrid Cloud
        Multicloud
        Decentralized: Blockchain and the Edge
        Our Advice
        Cloud Repatriation Arguments
    Build Versus Buy
        Open Source Software
        Proprietary Walled Gardens
        Our Advice
    Monolith Versus Modular
        Monolith
        Modularity
        The Distributed Monolith Pattern
        Our Advice
    Serverless Versus Servers
        Serverless
        Containers
        When Infrastructure Makes Sense
        Our Advice
    Optimization, Performance, and the Benchmark Wars
        Big Data...for the 1990s
        Nonsensical Cost Comparisons
        Asymmetric Optimization
        Caveat Emptor
    Undercurrents and Their Impacts on Choosing Technologies
        Data Management
        DataOps
        Data Architecture
        Orchestration Example: Airflow
        Software Engineering
    Conclusion
II. The Data Engineering Lifecycle in Depth
5. Data Generation in Source Systems
    Sources of Data: How Is Data Created?
    Source Systems: Main Ideas
        Files and Unstructured Data
        APIs
        Application Databases (OLTP systems)
        Online Analytical Processing System
        Change Data Capture
        Logs
        Database Logs
        CRUD
        Insert-Only
        Messages and Streams
        Types of Time
    Source System Practical Details
        Databases
        APIs
        Data Sharing
        Third-Party Data Sources
        Message Queues and Event-Streaming Platforms
    Whom You’ll Work With
    Undercurrents and Their Impact on Source Systems
        Security
        Data Management
        DataOps
        Data Architecture
        Orchestration
        Software Engineering
    Conclusion
    Additional Resources
6. Storage
    Raw Ingredients of Data Storage
        Magnetic Disk Drive
        Solid-State Drive
        Random Access Memory
        Networking and CPU
        Serialization
        Compression
        Caching
    Data Storage Systems
        Single Machine Versus Distributed Storage
        Eventual Versus Strong Consistency
        File Storage
        Block Storage
        Object Storage
        Cache and Memory-Based Storage Systems
        The Hadoop Distributed File System
        Streaming Storage
        Indexes, Partitioning, and Clustering
    Data Engineering Storage Abstractions
        The Data Warehouse
        The Data Lake
        The Data Lakehouse
        Data Platforms
        Stream-to-Batch Storage Architecture
    Big Ideas and Trends in Storage
        Data Catalog
        Data Sharing
        Schema
        Separation of Compute from Storage
        Data Storage Lifecycle and Data Retention
        Single-Tenant Versus Multitenant Storage
    Whom You’ll Work With
    Undercurrents
        Security
        Data Management
        DataOps
        Data Architecture
        Orchestration
        Software Engineering
    Conclusion
    Additional Resources
7. Ingestion
    What Is Data Ingestion?
    Key Engineering Considerations for the Ingestion Phase
        Bounded Versus Unbounded
        Frequency
        Synchronous Versus Asynchronous Ingestion
        Serialization and Deserialization
        Throughput and Scalability
        Reliability and Durability
        Payload
        Push Versus Pull Versus Poll Patterns
    Batch Ingestion Considerations
        Snapshot or Differential Extraction
        File-Based Export and Ingestion
        ETL Versus ELT
        Inserts, Updates, and Batch Size
        Data Migration
    Message and Stream Ingestion Considerations
        Schema Evolution
        Late-Arriving Data
        Ordering and Multiple Delivery
        Replay
        Time to Live
        Message Size
        Error Handling and Dead-Letter Queues
        Consumer Pull and Push
        Location
    Ways to Ingest Data
        Direct Database Connection
        Change Data Capture
        APIs
        Message Queues and Event-Streaming Platforms
        Managed Data Connectors
        Moving Data with Object Storage
        EDI
        Databases and File Export
        Practical Issues with Common File Formats
        Shell
        SSH
        SFTP and SCP
        Webhooks
        Web Interface
        Web Scraping
        Transfer Appliances for Data Migration
        Data Sharing
    Whom You’ll Work With
        Upstream Stakeholders
        Downstream Stakeholders
    Undercurrents
        Security
        Data Management
        DataOps
        Orchestration
        Software Engineering
    Conclusion
    Additional Resources
8. Queries, Modeling, and Transformation
    Queries
        What Is a Query?
        The Life of a Query
        The Query Optimizer
        Improving Query Performance
        Queries on Streaming Data
    Data Modeling
        What Is a Data Model?
        Conceptual, Logical, and Physical Data Models
        Normalization
        Techniques for Modeling Batch Analytical Data
        Modeling Streaming Data
    Transformations
        Batch Transformations
        Materialized Views, Federation, and Query Virtualization
        Streaming Transformations and Processing
    Whom You’ll Work With
        Upstream Stakeholders
        Downstream Stakeholders
    Undercurrents
        Security
        Data Management
        DataOps
        Data Architecture
        Orchestration
        Software Engineering
    Conclusion
    Additional Resources
9. Serving Data for Analytics, Machine Learning, and Reverse ETL
    General Considerations for Serving Data
        Trust
        What’s the Use Case, and Who’s the User?
        Data Products
        Self-Service or Not?
        Data Definitions and Logic
        Data Mesh
    Analytics
        Business Analytics
        Operational Analytics
        Embedded Analytics
    Machine Learning
        What a Data Engineer Should Know About ML
    Ways to Serve Data for Analytics and ML
        File Exchange
        Databases
        Streaming Systems
        Query Federation
        Data Sharing
        Semantic and Metrics Layers
        Serving Data in Notebooks
    Reverse ETL
        Ways to Serve Data with Reverse ETL
    Whom You’ll Work With
    Undercurrents
        Security
        Data Management
        DataOps
        Data Architecture
        Orchestration
        Software Engineering
    Conclusion
    Additional Resources
III. Security, Privacy, and the Future of Data Engineering
10. Security and Privacy
    People
        The Power of Negative Thinking
        Always Be Paranoid
    Processes
        Security Theater Versus Security Habit
        Active Security
        The Principle of Least Privilege
        Shared Responsibility in the Cloud
        Always Back Up Your Data
        An Example Security Policy
    Technology
        Patch and Update Systems
        Encryption
        Logging, Monitoring, and Alerting
        Network Access
        Security for Low-Level Data Engineering
    Conclusion
    Additional Resources
11. The Future of Data Engineering
    The Data Engineering Lifecycle Isn’t Going Away
    The Decline of Complexity and the Rise of Easy-to-Use Data Tools
    The Cloud-Scale Data OS and Improved Interoperability
    “Enterprisey” Data Engineering
    Titles and Responsibilities Will Morph...
    Moving Beyond the Modern Data Stack, Toward the Live Data Stack
        The Live Data Stack
        Streaming Pipelines and Real-Time Analytical Databases
        The Fusion of Data with Applications
        The Tight Feedback Between Applications and ML
        Dark Matter Data and the Rise of...Spreadsheets?!
    Conclusion
A. Serialization and Compression Technical Details
    Serialization Formats
        Row-Based Serialization
        Columnar Serialization
        Hybrid Serialization
    Database Storage Engines
    Compression: gzip, bzip2, Snappy, etc.
B. Cloud Networking
    Cloud Network Topology
        Data Egress Charges
        Availability Zones
        Regions
        GCP-Specific Networking and Multiregional Redundancy
        Direct Network Connections to the Clouds
    CDNs
    The Future of Data Egress Fees
Index
About the Authors

How to download source code?

1. Go to https://www.oreilly.com/

2. Search for the book title, Fundamentals of Data Engineering: Plan and Build Robust Data Systems. If the full title returns no results, search for the main title only.

3. Click the book title in the search results.

4. In the Publisher resources section, click Download Example Code.
