Designing Data-Intensive Applications

This is a summary of the book


 

Part I - Foundations of Data Systems

Chapter 1 - Reliable, Scalable, and Maintainable Applications

Key Concept

  • Data-Intensive vs. Compute-Intensive Applications

    • Modern applications need to handle large amounts of data rather than heavy computation. The challenge is to manage and balance data volume, complexity, and changes in both the data and the business requirements.

  • Building Blocks of Data Systems

    • Databases

    • Caches

    • Search Indexes (for querying)

    • Streaming

    • Batch Processing

  • Choosing the Right Tools:

    • There is no one-size-fits-all tool; each application has its own unique challenges.

  • Designing Data Systems:

    • Databases, caches, and queues are individual components, yet they are connected. The lines between these tools can blur, which makes integrating them complex but essential.

  • Reliability:

    • Fault tolerance (hardware faults, bugs)

    • Human Error 

      • Minimising the chances for human error

      • Testing

      • Monitoring

      • Recovery methods (rollback)

  • Scalability

    • Load Parameters

      • Requests per second and data volume help quantify load.

    • Vertical Scaling (adding more CPU/RAM to a single machine)

    • Horizontal Scaling (adding more nodes)

  • Maintainability:

    • Maintainability is about making life easier for the engineers who operate, understand, and evolve the system over time.

    • Operability

      • Good docs, monitoring, and operational tooling

    • Simplicity

      • Keep the system as simple as possible with good abstractions

    • Evolvability

      • Easily adaptable to change
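Load parameters are often summarized as response-time percentiles rather than averages, since the average hides slow outliers. A minimal nearest-rank sketch with hypothetical numbers:

```python
def percentile(sorted_times, p):
    """Nearest-rank percentile of an already-sorted list."""
    idx = max(0, int(round(p / 100 * len(sorted_times))) - 1)
    return sorted_times[idx]

# hypothetical response times in milliseconds
response_times_ms = sorted([12, 15, 11, 120, 14, 13, 16, 900, 15, 14])
p50 = percentile(response_times_ms, 50)  # median: typical user experience
p99 = percentile(response_times_ms, 99)  # tail latency: the slowest requests
```

The median describes the typical case; a high p99 reveals the tail latencies that dominate user-perceived slowness.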

 

Chapter 2 - Data Models and Query Languages

Key Concept

  • Importance of Data Models

    • The data model shapes how we write code and how we think about the problem

    • Apps often have multiple layers, each abstracting the complexity of the one below

  • Data Representation Layers

    • App level (data structures specific to the application)

    • Data model level (JSON, relational tables, graph models)

    • Database level (how the database represents data in memory and on disk)

  • Relational Model vs. Document Model

    • Relational databases support complex queries and joins, which can be cumbersome in document databases

    • Document databases are flexible and offer better data locality for reads, which is useful for self-contained, document-like data

  • NoSQL and Schema Flexibility

    • NoSQL

      • Greater flexibility and scalability, sometimes with specialised query operations

      • Supports dynamic and expressive data models, reducing schema rigidity compared to SQL

    • Schema Flexibility

      • Document databases are often schemaless allowing for a schema-on-read rather than schema-on-write

      • Great choice for evolving data structures

  • Graph-Like Data Models

    • Efficiently handle many-to-many relationships using vertices (nodes) and edges

    • Great choice for interconnected data like social media, road networks
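The graph model above can be sketched as an adjacency list; the names and edges below are hypothetical, with a minimal breadth-first search to check whether two people are connected:

```python
from collections import deque

# vertices are people, edges are "follows" relationships (hypothetical data)
follows = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": [],
    "dave": ["alice"],
}

def connected(graph, start, goal):
    """Breadth-first search over the adjacency list."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Traversals like this are exactly the queries that become awkward joins in a relational model but are natural in a graph model.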

Chapter 3 - Storage and Retrieval

Key Concept

  • Fundamentals of Database Storage and Retrieval

  • Storage Engines 

    • Transactional: Optimised for high write throughput and fast key-value lookups

    • Analytical: Optimised for read-heavy operations, managing large datasets and complex queries

  • Data Structures and Indexing

    • Log-Structured Storage Engines

      • Data is appended to a log file, with updates creating new entries.

      • Advantages: Efficient sequential writes, simple crash recovery, reduced fragmentation.

      • Limitations: Limited by in-memory index size, less efficient for range queries.

      • Variants:

        • SSTables (Sorted String Tables): Ensure data is stored in a sorted manner for efficient merging.

        • LSM-Trees (Log-Structured Merge-Trees): Combine multiple sorted segments for high write efficiency and effective merging.

    • B-trees

      • Widely used in relational databases.

      • Store data in fixed-size blocks (pages) organized in a tree structure with hierarchical references.

      • Advantages: Efficient key-value lookups, balanced structure, strong transactional support.

      • Challenges: Managing concurrent updates and ensuring crash recovery is complex.

  • Storage Engines Comparison

    • LSM-Trees

      • High write throughput and efficient sequential writes, but slower reads (multiple data structures must be checked) and a complex compaction process.

    • B-Trees

      • Great for quickly finding and managing data, but writes can be more resource-intensive

  • Analytical Systems and Data Warehousing

    • Data Warehousing (ETL)

      • Specialized system for storing and analyzing large amounts of data

    • Column-Oriented Storage

      • Stores data by column instead of by row, making it ideal for read-heavy analytical queries. It enhances data compression and speeds up query processing by loading only the columns needed for a query

  • Advanced Topics

    • Secondary Indexes

      • Improve query performance by indexing additional columns

    • Full-Text Search and Fuzzy indexes

      • Handle imprecise queries and textual data

    • In-Memory Databases

      • Keep all data in RAM for fast access
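The log-structured idea above (append every write, keep an in-memory hash index of byte offsets) can be sketched in a few lines. This is a toy illustration, not a real storage engine; a file is stood in for by an in-memory buffer:

```python
import io

class LogKV:
    """Toy log-structured key-value store: appends writes, indexes offsets."""
    def __init__(self):
        self.log = io.BytesIO()   # stands in for an append-only file
        self.index = {}           # key -> byte offset of latest value

    def set(self, key, value):
        offset = self.log.tell()
        self.log.write(f"{key},{value}\n".encode())
        self.index[key] = offset  # newer entries shadow older ones

    def get(self, key):
        self.log.seek(self.index[key])
        _, _, v = self.log.readline().decode().rstrip("\n").partition(",")
        self.log.seek(0, io.SEEK_END)  # restore the append position
        return v

db = LogKV()
db.set("name", "alice")
db.set("name", "bob")   # an update appends a new entry; the old one becomes garbage
```

Old entries are never overwritten in place, which is why real engines need compaction to reclaim the shadowed garbage.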

 

Chapter 4 - Encoding and Evolution

Adapting to Change

  • Applications evolve as features are added or business needs change, and this often requires modifying the stored data. The ability to handle this evolution smoothly is very important.

  • Backward Compatibility: New code must be able to read data written by older versions.

  • Forward Compatibility: Older code should gracefully handle data written by newer versions.

Data Encoding Formats

Data must be transformed between in-memory representations and byte sequences for storage or transmission. This process is called encoding (serialization), and the reverse is decoding (deserialization).

  1. Language-Specific Formats

    • Some programming languages, like Java or Python, offer built-in serialization libraries, but they pose several challenges, including:

      • Lack of compatibility across languages.

      • Security vulnerabilities from arbitrary class instantiation.

      • Poor handling of versioning and performance inefficiencies.

  2. Textual Formats (JSON, XML, CSV)

    • Widely supported and human-readable, these formats are popular for data interchange but come with issues like:

      • Ambiguous handling of numbers and binary data.

      • Schema complexities and inefficiencies for large datasets.

  3. Binary Formats (Thrift, Protocol Buffers, Avro)

    • These formats are more compact and efficient than JSON or XML, particularly useful for large datasets.

    • They require schemas, making them powerful for documentation and ensuring compatibility.

    • Avro stands out for its approach to schema evolution, enabling compatibility between different schema versions by resolving differences at read time.
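In practice, backward compatibility often just means new code fills in a default when it reads a record written before a field existed. A minimal JSON sketch (the field names are hypothetical):

```python
import json

# record written by an older version of the code, before "email" existed
old_record = '{"id": 1, "name": "alice"}'

def decode_user(raw):
    """New code reads old data: missing fields get defaults (backward compat)."""
    data = json.loads(raw)
    return {
        "id": data["id"],
        "name": data["name"],
        "email": data.get("email"),  # absent in old data -> defaults to None
    }

user = decode_user(old_record)
```

Schema-based formats like Avro make this resolution explicit: defaults live in the schema rather than being scattered through decoding code.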

Schema Evolution

  • When a schema changes, backward and forward compatibility is key to ensuring that systems continue to function smoothly during rolling upgrades or when different parts of an application are running different versions.

  • Thrift, Protocol Buffers, and Avro manage schema evolution well, allowing systems to maintain compatibility while evolving.

Modes of Dataflow

  1. Dataflow Through Databases

    • Databases must ensure both backward and forward compatibility, as new code may read old data, and vice versa.

    • Schema evolution allows databases to avoid costly data migrations, and Avro is often used in systems like LinkedIn’s Espresso to manage evolving schemas.

  2. Dataflow Through Services (REST and RPC)

    • REST services use JSON or other formats, and compatibility is maintained by adding new fields or optional parameters.

    • RPC frameworks like gRPC, Thrift, and Avro provide more control over data encoding, offering performance improvements but requiring careful handling of backward and forward compatibility.

  3. Message-Passing Systems

    • Message brokers like RabbitMQ and Kafka enable asynchronous communication between services, allowing greater decoupling between services and improved system reliability.

    • Actor frameworks, such as Akka and Orleans, use message-passing but require careful schema management to support rolling upgrades.

Summary

Managing how data is encoded, stored, and transmitted is crucial for maintaining an application's evolvability. By leveraging schema-based formats and designing for backward and forward compatibility, developers can ensure their systems remain adaptable, even as features evolve and deployments become more frequent. Whether working with databases, services, or message-passing systems, encoding formats play a central role in making change manageable.

 

Part II - Distributed Data

 

Part Summary: Distributed Data

In this section, the focus shifts from data stored on a single machine to data distributed across multiple machines. There are several reasons for distributing data, and each comes with unique challenges and trade-offs:

Reasons for Distributing Data

  1. Scalability: As the amount of data grows, distributing it across multiple machines helps balance the load and improve performance.

  2. Fault Tolerance/High Availability: Distributing data across machines increases redundancy, ensuring the system remains operational even when some nodes fail.

  3. Latency: Distributing data to multiple geographic locations reduces the time it takes for users across the globe to access the data.

Scaling Approaches

  1. Vertical Scaling (Scaling Up): Adding more resources (CPU, RAM, disk) to a single machine. This approach has limits due to costs and bottlenecks.

  2. Horizontal Scaling (Scaling Out): Distributing data across multiple machines (shared-nothing architecture), allowing independent machines to manage parts of the workload. This approach is more cost-effective and allows for redundancy across regions.

Replication vs. Partitioning

  1. Replication: Data is duplicated across multiple nodes, improving availability and performance. If one node fails, another can serve the data.

  2. Partitioning (Sharding): Data is split into subsets called partitions, with each partition stored on a different node. Partitioning helps distribute the load across multiple nodes and is often used alongside replication.

Shared-Nothing Architecture

  • Each node manages its own resources (CPU, RAM, disk) independently, coordinating over a network.

  • This architecture is popular due to its scalability and cost-effectiveness, but it also introduces more complexity for the developer, particularly in handling distributed systems' trade-offs and constraints.

 

Chapter 5 - Replication

Replication involves keeping copies of the same data on multiple machines connected via a network. This is crucial for:

  • Latency Reduction: Keeping data geographically close to users for faster access.

  • Fault Tolerance/High Availability: Ensuring the system continues to function even if some nodes fail.

  • Read Scalability: Distributing read queries across multiple replicas to increase throughput.

Key Concepts in Replication:

  1. Leaders and Followers (Single-Leader Replication):

    • Leader: The primary node that handles all write requests.

    • Followers: Nodes that replicate the leader's data changes and handle read requests.

    • Operation:

      • Clients send writes to the leader.

      • The leader writes to its local storage and sends changes to followers.

      • Followers update their data to reflect the leader's state.

      • Clients can read from either the leader or followers.

  2. Synchronous vs. Asynchronous Replication:

    • Synchronous Replication:

      • The leader waits for followers to acknowledge writes.

      • Guarantees followers have up-to-date data.

      • Slower and can become unavailable if followers don't respond.

    • Asynchronous Replication:

      • The leader doesn't wait for followers.

      • Faster and more available.

      • Risk of followers lagging behind the leader.

  3. Replication Lag Problems:

    • Eventual Consistency: Followers may not have the latest data immediately.

    • Read-After-Write Consistency: Users might not see their own recent writes when reading from followers.

    • Monotonic Reads: Users might see data moving backward in time when reading from different replicas.

    • Consistent Prefix Reads: Causally related writes might be read out of order, leading to confusion.

  4. Handling Replication Lag:

    • Read from Leader: For recent writes, read from the leader to ensure consistency.

    • Monotonic Reads: Ensure reads are always from the same replica.

    • Consistent Prefix Reads: Use mechanisms to ensure causal ordering of operations.

  5. Multi-Leader Replication:

    • Use Cases:

      • Multi-datacenter operation for better latency and fault tolerance.

      • Clients with offline operation needing local writes.

      • Collaborative editing applications.

    • Challenges:

      • Conflict Resolution: Concurrent writes to the same data need to be resolved.

      • Conflict Avoidance: Routing all writes for a record to the same leader.

  6. Conflict Resolution Strategies:

    • Last Write Wins: Accept the write with the latest timestamp (risking data loss).

    • Merge Operations: Combine concurrent writes intelligently.

    • Custom Logic: Application-specific rules to resolve conflicts.

  7. Leaderless Replication:

    • Operation:

      • Clients send writes to multiple replicas.

      • Reads are made from multiple replicas in parallel.

    • Quorums:

      • Write Quorum (w): Minimum number of replicas that must acknowledge a write.

      • Read Quorum (r): Minimum number of replicas to read from.

      • Ensures that reads and writes overlap in at least one replica to achieve consistency.

    • Challenges:

      • Sloppy Quorums: Writes may be accepted by nodes not in the designated replica set, leading to potential inconsistencies.

      • Concurrent Writes: Handling writes that occur simultaneously on different replicas.

  8. Detecting and Resolving Conflicts in Leaderless Replication:

    • Version Vectors: Tracking versions from different replicas to detect concurrency.

    • Merging Concurrent Writes: Clients may need to merge conflicting data.

    • Automatic Conflict Resolution: Using data structures like CRDTs (Conflict-Free Replicated Data Types) to automate merging.
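The quorum condition described above is w + r > n: it guarantees that every read quorum overlaps every write quorum in at least one replica. A small check of the arithmetic:

```python
def quorum_overlaps(n, w, r):
    """True if any read set of r replicas must intersect any write set of w."""
    return w + r > n

# typical configuration: n=3 replicas, write to 2, read from 2
assert quorum_overlaps(3, 2, 2)       # 2 + 2 > 3: overlap guaranteed
assert not quorum_overlaps(3, 1, 1)   # 1 + 1 <= 3: stale reads possible
```

Note this guarantee only holds for strict quorums; with sloppy quorums, writes may land on nodes outside the designated replica set.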

Summary of Trade-offs in Replication Methods:

  • Single-Leader Replication:

    • Simplifies conflict resolution.

    • Risk of leader failure causing unavailability.

    • Replication lag can affect read consistency.

  • Multi-Leader Replication:

    • Improves availability and latency in distributed environments.

    • Complexity in conflict resolution.

    • Potential for data inconsistencies if not managed properly.

  • Leaderless Replication:

    • High availability and fault tolerance.

    • Clients handle more complexity.

    • Requires careful configuration of quorum sizes.

Chapter Summary:

Replication is essential for building reliable, scalable, and performant distributed systems. However, it introduces complexities, especially around ensuring data consistency and handling failures. Understanding the different replication strategies and their trade-offs allows system architects and developers to make informed decisions that align with their application's requirements.

 

Chapter 6 - Partitioning

Partitioning, also called sharding, divides a large dataset across multiple nodes to improve scalability and query performance. It allows databases to scale horizontally by spreading data and load across a distributed system.

Types of Partitioning:

  1. Key Range Partitioning:
    Data is divided into ranges of keys, like an encyclopedia, where each volume contains a range of topics. This allows efficient range queries but risks hot spots when frequently accessed keys fall in the same partition.

  2. Hash Partitioning:
    A hash function is applied to keys, distributing data evenly across partitions to avoid hot spots. This breaks key ordering, making range queries inefficient. It's common in distributed systems like Cassandra and MongoDB.
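Hash partitioning can be sketched with a stable digest (Python's built-in hash() is randomised per process, so hashlib is used instead); the partition count is an arbitrary assumption:

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical cluster configuration

def partition_for(key: str) -> int:
    """Map a key to a partition via a stable hash of its bytes."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

p = partition_for("user:42")
```

The same key always lands on the same partition, but adjacent keys scatter, which is exactly why range queries become inefficient.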

Replication with Partitioning:
Partitioning is often combined with replication for fault tolerance. Each partition may have a leader node responsible for handling writes, while follower nodes maintain backup copies.

Challenges of Partitioning:

  • Hot Spots: When one partition receives a disproportionate amount of load.

  • Skewed Workloads: Specific use cases, like a celebrity on social media, can create skewed workloads where too many requests target the same partition.

  • Secondary Indexes: Complicates partitioning since secondary indexes (used for searching non-primary key fields) need to be distributed differently from primary data.

Strategies for Rebalancing Partitions:

  1. Fixed Number of Partitions:
    A static number of partitions is created initially, and partitions are reassigned to nodes during rebalancing without changing the partition boundaries.

  2. Dynamic Partitioning:
    Partitions are created and split dynamically as the dataset grows. This approach is common in key-range partitioned databases like HBase.

Rebalancing Methods:

  • Manual vs. Automatic Rebalancing:
    While automatic rebalancing is convenient, it can cause performance issues during failure detection. Manual rebalancing gives administrators more control to prevent unexpected issues.

Request Routing: The challenge is ensuring queries are sent to the correct node as partition assignments change. Solutions include using ZooKeeper to track partition locations or employing a gossip protocol between nodes.

Parallel Query Execution:
In more advanced systems, queries can be broken into smaller tasks and executed in parallel across multiple nodes, improving performance for complex analytical queries.

The chapter concludes that partitioning is essential for scaling databases, but requires careful planning around query patterns, load balancing, and partitioning schemes to ensure consistent performance and scalability.

 

Chapter 7 - Transactions

  1. Challenges in Databases:

    • Databases face various issues such as software/hardware failure, application crashes, network interruptions, and race conditions between clients that can lead to corrupted data or errors.

    • Implementing fault-tolerant mechanisms is crucial but complex, requiring careful handling and testing.

  2. Role of Transactions:

    • Transactions are the primary mechanism for handling such complexities, simplifying error handling by ensuring operations either fully complete (commit) or are undone (abort).

    • Transactions abstract away concurrency issues, allowing application developers to focus less on potential failure scenarios.

  3. ACID Properties:

    • Atomicity: Ensures that either all the operations in a transaction are executed or none are, preventing partial updates.

    • Consistency: Transactions must leave the database in a valid state, ensuring data validity through integrity constraints.

    • Isolation: Concurrent transactions must not interfere with each other, providing a semblance of serial execution even when transactions run in parallel.

    • Durability: Once a transaction is committed, the changes must be permanent, even in the event of system failure.

  4. The Two-Phase Commit Problem:

    • Expense of Two-Phase Commit: Two-phase commit protocols used in distributed systems are often cited as expensive, affecting performance and availability. Despite these challenges, the author argues that it’s preferable for application developers to manage performance bottlenecks rather than always avoid transactions.

  5. Non-relational and Distributed Databases:

    • With the rise of non-relational (NoSQL) databases, many systems have weakened or abandoned transactional guarantees to achieve higher performance and scalability.

    • There's an ongoing debate whether large-scale systems should avoid transactions for better performance or retain them for ensuring correctness.

  6. Concurrency Control:

    • The excerpt explores various mechanisms for handling concurrency in databases, particularly focusing on:

      • Snapshot Isolation: Allows transactions to operate on a consistent snapshot of the database without seeing uncommitted changes made by others, but still leaves room for anomalies like write skew.

      • Write Skew: A form of concurrency anomaly where two transactions read the same data but write to different parts, causing conflicts.

      • Lost Updates: Occur when two clients concurrently modify the same data, resulting in one client’s changes being overwritten.

  7. Performance Implications:

    • Stronger guarantees like serializability (where transactions behave as if executed one at a time) have significant performance costs, especially in terms of latency and throughput.

    • Databases typically compromise between full isolation (serializability) and weaker forms of isolation to balance performance and consistency.

Conclusion:

The text explains that while transactions, particularly in the ACID model, offer essential guarantees for handling concurrency and failure, the trade-offs in performance (especially with mechanisms like two-phase commit) require developers to carefully weigh the need for strict transactional guarantees versus the performance benefits of relaxing those guarantees. Understanding these trade-offs is crucial in designing reliable and performant distributed systems.

 

Chapter 8 - The Trouble with Distributed Systems

Working with distributed systems introduces unique challenges, including many ways for things to go wrong. This chapter explores the complexities of distributed systems, focusing on network faults, clock issues, and partial failures, while also providing insights into how to reason about system states.

Faults and Partial Failures

In distributed systems, nodes are connected by unreliable networks. These systems are prone to partial failures, where some parts fail while others continue functioning, creating nondeterministic behavior. Unlike a single computer where failures are typically total, distributed systems can have unpredictable, intermittent failures.

Network Issues

Distributed systems rely on network communication, which is inherently unreliable. Packets can be lost, delayed, or arrive out of order. Detecting failures is difficult because it’s hard to distinguish between a network issue and a node failure. Timeouts are commonly used but can be tricky to calibrate, as a long timeout slows fault detection, and a short timeout may lead to false positives, where a node is incorrectly declared dead.
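Timeout-based failure detection boils down to: declare a node suspect if no heartbeat arrived within the timeout. The timeout value below is an arbitrary assumption, and as the text notes, this cannot distinguish a dead node from a slow network:

```python
TIMEOUT_S = 10.0  # too short -> false positives; too long -> slow detection

def is_suspect(last_heartbeat_s: float, now_s: float,
               timeout_s: float = TIMEOUT_S) -> bool:
    """A node is suspected dead if its last heartbeat is older than the timeout."""
    return now_s - last_heartbeat_s > timeout_s

within = is_suspect(last_heartbeat_s=100.0, now_s=105.0)  # heartbeat recent enough
late = is_suspect(last_heartbeat_s=100.0, now_s=120.0)    # heartbeat overdue
```

Choosing timeout_s is the calibration trade-off described above: shorter detects faults faster but wrongly declares slow nodes dead more often.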

Clock and Time Issues

Each node in a distributed system has its own clock, which may drift over time due to hardware inaccuracies. This makes synchronizing actions between nodes complex, as network delays further complicate time-based event ordering. Time-of-day clocks (used for recording the actual time) and monotonic clocks (used for measuring durations) both play a role in system behavior, but neither can be fully trusted without synchronization mechanisms like NTP (Network Time Protocol).

Cloud vs Supercomputers

In contrast to supercomputers that rely on highly reliable hardware and often stop when faults occur, distributed systems in the cloud must tolerate frequent component failures and continue operating. Systems built from commodity hardware are cheaper but have higher failure rates, making fault-tolerance a key design requirement.

Building Reliable Systems from Unreliable Components

Despite unreliable networks and nodes, reliable systems can be built using mechanisms like error-correcting codes and protocols like TCP, which handle packet loss and reordering. However, no system is perfectly reliable, and it’s crucial to understand the limits of fault-tolerance mechanisms.

Handling Network Partitions and Faults

A network partition occurs when part of the system becomes disconnected from the rest. Systems must be designed to detect and recover from such faults. Techniques like redundancy and quorum-based decision-making help ensure system availability even in the face of failures.

Process Pauses and Garbage Collection

Nodes may experience pauses due to garbage collection or virtualization, where they stop processing temporarily. This can cause issues if a paused node believes it is still the leader in a system but has already been replaced. Handling such pauses is essential to maintaining system consistency and avoiding conflicting actions.

Summary

Distributed systems are inherently complex due to their reliance on unreliable networks and independent nodes. Partial failures, unreliable clocks, and process pauses make it difficult to reason about system states. Despite these challenges, fault-tolerance mechanisms, careful failure detection, and strategies for handling partitions help engineers build resilient systems that can operate even when individual components fail.

 

Chapter 9 - Consistency and Consensus

1. Consensus Problem:

  • The goal of consensus is to get multiple nodes in a distributed system to agree on something (e.g., leader election, atomic commit of transactions).

  • Although consensus seems straightforward, it is one of the most fundamental and difficult problems in distributed systems. Misunderstandings and incorrect assumptions about consensus led to broken systems in the past.

  • The FLP result proves that consensus is impossible to achieve in a completely asynchronous model if there is any risk of a node crashing. However, in practice, consensus algorithms are achievable by relaxing some assumptions, like using timeouts or random numbers.

2. Atomic Commit Problem:

  • In transactions that span multiple nodes or partitions, nodes must agree on the outcome (commit or abort) to maintain atomicity.

  • If a transaction succeeds on some nodes but fails on others, consistency is compromised. The system must ensure either all nodes commit or all nodes abort the transaction.

3. Two-Phase Commit (2PC):

  • 2PC is the most common way to achieve atomic commit. It consists of two phases:

    • Phase 1 (Prepare phase): The coordinator asks all participants if they are ready to commit.

    • Phase 2 (Commit phase): If all participants agree to commit, the transaction is committed. If any participant votes to abort, the coordinator sends an abort request.

  • While 2PC ensures atomicity, it is not very fault-tolerant. If the coordinator crashes, participants can be left in an uncertain state, waiting indefinitely for a decision.
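The two phases can be sketched with in-process participants (the class and method names are hypothetical). Note that the real difficulty, a coordinator crash between the phases leaving participants in doubt, is exactly what this toy version cannot exhibit:

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):  # phase 1: vote yes/no
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):   # phase 2: make the change permanent
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Commit only if every participant votes yes in the prepare phase."""
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

outcome = two_phase_commit([Participant(), Participant(can_commit=False)])
```

A single "no" vote aborts the whole transaction, which is how 2PC preserves atomicity across nodes.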

4. Consensus Algorithms:

  • Consensus can be achieved by algorithms like Paxos, Raft, and Zab (used in ZooKeeper). These algorithms make distributed systems fault-tolerant by ensuring all nodes agree on a sequence of values.

  • Consensus algorithms are fault-tolerant even in cases where some nodes fail. They can still reach agreement as long as a majority of nodes are available.

5. Three-Phase Commit (3PC):

  • 3PC is an alternative to 2PC that attempts to be non-blocking, ensuring progress even if nodes fail. However, it assumes a network with bounded delays, which is often unrealistic in real-world distributed systems.

  • Non-blocking atomic commit requires a perfect failure detector, which is hard to achieve in practice.

6. Fault-Tolerant Consensus:

  • Consensus protocols use quorums to ensure decisions are safely made even in case of node failures.

  • These protocols define a unique leader within each epoch (term or ballot number), and this leader is responsible for making decisions during that epoch. If the leader fails, a new leader is elected.

The section bridges concepts such as atomic commit, 2PC, and fault-tolerant consensus, explaining how distributed systems ensure consistency and agreement despite failures.

 

Chapter 10 - Batch Processing

 

Batch processing is all about handling large amounts of data offline to produce useful results, typically without the pressure of responding instantly like online systems. It’s widely used to power scalable, reliable, and efficient applications.

 

Types of Systems:

  1. Online Systems: Handle requests immediately (like APIs or web services). Focus is on fast response times.

  2. Batch Processing Systems: Work on large datasets over a period, such as generating daily reports or analyzing logs. Focus is on throughput (how much data can be processed efficiently).

  3. Stream Processing Systems: Sit between the two, processing data soon after it arrives. Focus is on low latency.

 

Key Concepts in Batch Processing:

MapReduce

  • What It Is: A framework for processing massive datasets in a distributed way.

  • How It Works:

    1. Split data into smaller chunks.

    2. Mapper: Extracts key-value pairs from data (e.g., URLs from logs).

    3. Shuffle: Sorts and organizes key-value pairs by keys.

    4. Reducer: Combines or summarizes data for each key (e.g., counting visits per URL).

  • Key Benefit: Scales across many machines, making large-scale processing efficient.
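The four steps above can be sketched in-process; the log lines are hypothetical, counting visits per URL:

```python
from itertools import groupby

log_lines = ["/home", "/about", "/home", "/home", "/about"]

# 1-2. map: emit (key, value) pairs from each input record
mapped = [(url, 1) for url in log_lines]

# 3. shuffle: sort by key so equal keys become adjacent, then group
mapped.sort(key=lambda kv: kv[0])
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(mapped, key=lambda kv: kv[0])}

# 4. reduce: summarise the values for each key
counts = {key: sum(values) for key, values in grouped.items()}
```

In a real cluster, the map and reduce steps run on many machines and the shuffle moves data between them over the network; the dataflow is the same.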

Distributed Filesystems:

  • Tools like HDFS store data across many machines.

  • Files are split into blocks and replicated to ensure reliability.

Intermediate Outputs:

  • MapReduce saves temporary results to files between steps. This ensures fault tolerance but can slow down processing.

Strengths of Batch Processing:

  1. Scalability: Can process terabytes or even petabytes of data.

  2. Fault Tolerance: If a task fails, it’s retried without losing progress.

  3. Historical Context: Inspired by older systems like punch cards and Unix tools.

Modern Batch Processing Improvements:

  1. Dataflow Engines: Tools like Spark and Flink improve on MapReduce:

    • Avoid unnecessary sorting.

    • Use in-memory processing when possible for speed.

    • Allow tasks to start as soon as some input is ready.

  2. Joins and Grouping:

    • Sort-Merge Joins: Combine data by sorting keys.

    • Broadcast Joins: Send small datasets to all tasks for faster processing.

    • Partitioned Joins: Match data by partitioning both datasets similarly.

Best Practices:

  • Immutable Inputs/Outputs: Keeps data clean and easy to debug.

  • Separation of Concerns: Processing logic is distinct from input/output configuration, like Unix pipes.

  • Workflow Tools: Tools like Airflow or Luigi manage complex batch jobs with many steps.

Use Cases:

  1. Analytics: Summarizing logs, generating reports, or calculating trends.

  2. Indexing: Building search engine indexes.

  3. Machine Learning: Training models or preparing recommendation data.

  4. ETL (Extract, Transform, Load): Cleaning and structuring data for use elsewhere.

Philosophy:

Batch processing draws heavily from Unix’s simplicity:

  • Break problems into smaller steps.

  • Use tools that each “do one thing well.”

  • Compose tools together to solve big problems.

 

Chapter 11 - Stream Processing

 

Key Concepts

  1. Streams vs. Batch

    • Batch: Works on a fixed chunk of data (like yesterday's logs).

    • Stream: Data flows in continuously (like tweets appearing live).

  2. Events:
    Tiny records of something that happened, e.g., a user clicks a button or a temperature reading.

  3. Topics:
    Think of them like channels where related events (like "user clicks") are grouped together.

Stream Transport: How Do We Move Streams?

  1. Direct Messaging (e.g., TCP, Webhooks):
    Simple but limited (one sender, one receiver).

  2. Message Brokers (e.g., Kafka):

    • Like a middleman!

    • Producers send messages → Brokers store them → Consumers read when ready.

  3. Log-Based Brokers:
    Keep all messages in order (like a history book). Useful for revisiting old data!

What Can You Do with Streams?

  1. Analyze Trends:

    • Detect fraud by analyzing patterns.

    • See real-time sales performance.

  2. React to Events:

    • Send alerts when something unusual happens.

    • Update dashboards with live data.

  3. Enrich Data:
    Combine multiple streams (e.g., match a user’s profile with their clicks).

Types of Joins in Stream Processing

Joining is like merging two datasets to find related information:

  1. Stream-Stream Join:

    • Example: Match a user’s search query with their clicks.

  2. Stream-Table Join:

    • Example: Add user profile info to each activity event.

  3. Table-Table Join:

    • Example: Update a user’s Twitter timeline when they follow/unfollow.

Handling Time in Streams

  1. Event Time vs. Processing Time:

    • Event Time: When something happened.

    • Processing Time: When the system received it.

    • Example: A tweet posted yesterday but processed today!

  2. Windows:
    Break streams into chunks of time for better analysis:

    • Tumbling Window: Fixed blocks (e.g., 1-minute intervals).

    • Sliding Window: Overlapping blocks (e.g., the last 5 minutes, every minute).
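A tumbling window can be sketched by bucketing each event's timestamp to the start of its window; the window size and events are hypothetical:

```python
WINDOW_S = 60  # one-minute tumbling windows

def window_start(event_time_s: int) -> int:
    """Each event falls into exactly one fixed, non-overlapping window."""
    return event_time_s - (event_time_s % WINDOW_S)

# hypothetical events: (event_time_seconds, value)
events = [(5, "a"), (59, "b"), (60, "c"), (125, "d")]

windows = {}
for t, v in events:
    windows.setdefault(window_start(t), []).append(v)
```

Bucketing by event time (not processing time) means a late-arriving event still lands in the window where it happened.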

Fault Tolerance: Handling Failures

What happens if something breaks? 

  1. Checkpoints:
    Save progress periodically. If it crashes, restart from the last checkpoint.

  2. Idempotence:
    Make sure retries don’t duplicate results (e.g., if a system already processed a message, ignore it).
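Idempotence can be sketched by remembering processed message IDs, so a redelivery after a crash has no effect; the IDs and handler are hypothetical:

```python
processed_ids = set()
total = 0

def handle(message_id: str, amount: int):
    """Process each message at most once, even if the broker redelivers it."""
    global total
    if message_id in processed_ids:
        return  # duplicate delivery: ignore
    processed_ids.add(message_id)
    total += amount

handle("msg-1", 10)
handle("msg-2", 5)
handle("msg-1", 10)  # redelivered after a crash: has no effect
```

In a real system, the processed-IDs set would itself need to be stored durably and atomically alongside the result, otherwise a crash between the two updates reintroduces duplicates.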

Why Stream Processing is Cool 🎉

  • Real-Time: Instant insights and actions!

  • Scalable: Process huge volumes of data without delays.

  • Reliable: Even if something crashes, no data is lost.

Final Thought

Stream processing is like keeping track of live events at a concert. You’re listening to updates as they happen, and you’re ready to act immediately. For developers, it’s the magic behind systems like Netflix recommendations, live sports scores, or ride-sharing apps.

Summary of the book

I think this book should be one of the first books for someone starting a career in software development. It gives broad insights into what is possible when writing scalable, maintainable, and reliable software. The author does not go deep into every topic (that would be impossible), but every chapter ends with references listing many further books.

So far, it is one of the best books I have read on software.