A comprehensive collection of resources for learning distributed systems, performance engineering, and scalable system design.
Books
Distributed Systems & Databases
- Database Internals - Alex Petrov
- Designing Data-Intensive Applications - Martin Kleppmann
- High Performance Browser Networking - Ilya Grigorik
- Just Use Postgres! - Denis Magda
- Latency - Pekka Enberg
Language-Centric
- Fluent Python - Luciano Ramalho
- Rust Atomics and Locks - Mara Bos
- Rust for Rustaceans - Jon Gjengset
AI
I’m not focusing on learning AI skills specifically. For my own learning I plan to avoid AI tools altogether, or at least coding agents.
- Build a Large Language Model (From Scratch) - Sebastian Raschka (Low Priority)
- AI Engineering - Chip Huyen (Low Priority)
- Designing Machine Learning Systems - Chip Huyen (Low Priority)
Python PEPs
I wanted to read some Python Enhancement Proposals that led to improved concurrency in Python. These are what I have so far.
- PEP 563 - Postponed Evaluation of Annotations
- PEP 649 - Deferred Evaluation Of Annotations Using Descriptors
- PEP 703 - Making the Global Interpreter Lock Optional in CPython
- PEP 734 - Multiple Interpreters in the Stdlib
- PEP 744 - JIT Compilation
- PEP 749 - Implementing PEP 649
- PEP 810 - Explicit lazy imports
- PEP 3156 - Asynchronous IO Support Rebooted: the “asyncio” Module
Blog Articles
- Napkin Math - Back-of-the-envelope calculations for systems design
Brendan Gregg - Performance Engineering
- Linux Load Averages: Solving the Mystery
- CPU Utilization is Wrong
- Flame Graphs
- Differential Flame Graphs
- BPF: A New Type of Software
- Give me 15 minutes and I’ll change your view of Linux tracing
- Learn eBPF Tracing: Tutorial and Examples
- Systems Performance: Enterprise and the Cloud, 2nd Edition
- BPF Performance Tools: Linux System and Application Observability
Engineering Blogs
Cloudflare Postmortems
- Cloudflare outage on November 18, 2025
- Cloudflare outage on December 5, 2025
- Code Orange: Fail Small - our resilience plan following recent incidents
Papers
Uncurated List
This list hasn’t been curated yet. I’ll prune or strike through papers that I haven’t found useful or won’t read for my preparation.
Essential Papers (Start Here)
- Time, clocks, and the ordering of events in a distributed system - Leslie Lamport, 1978
- The Byzantine Generals Problem - Leslie Lamport, Robert Shostak, and Marshall Pease, 1982
- Distributed snapshots: determining global states of distributed systems - K. Mani Chandy and Leslie Lamport, 1985
- Impossibility of distributed consensus with one faulty process - Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 1985
- Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems - Brian M. Oki and Barbara H. Liskov, 1988
- The part-time parliament (Paxos) - Leslie Lamport, 1998
- Paxos Made Simple - Leslie Lamport, 2001
- Bitcoin: A Peer-to-Peer Electronic Cash System - Satoshi Nakamoto, 2008
- Conflict-free replicated data types - Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski, 2011
- In search of an understandable consensus algorithm (Raft) - Diego Ongaro and John Ousterhout, 2014
Comprehensive Reading List
System Design Principles
- Harvest, Yield and Scalable Tolerant Systems
- On Designing and Deploying Internet Scale Services
- The Perils of Good Abstractions
- Chaotic Perspectives
- Data on the Outside versus Data on the Inside
- Memories, Guesses and Apologies
- SOA and Newton’s Universe
- Building on Quicksand
- Why Distributed Computing?
- A Note on Distributed Computing
- Stevey’s Google Platforms Rant
Latency
Amazon Systems
- A Conversation with Werner Vogels
- Discipline and Focus
- Vogels on Scalability
- SOA creates order out of chaos @ Amazon
Google Systems
- MapReduce
- Chubby Lock Manager
- Google File System
- BigTable
- Data Management for Internet-Scale Single-Sign-On
- Dremel: Interactive Analysis of Web-Scale Datasets
- Large-scale Incremental Processing Using Distributed Transactions and Notifications
- Megastore: Providing Scalable, Highly Available Storage for Interactive Services
- Spanner
- Photon
- Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
Consistency Models
- CAP Conjecture
- Consistency, Availability, and Convergence
- CAP Twelve Years Later: How the “Rules” Have Changed
- Consistency and Availability
- Eventual Consistency
- Avoiding Two-Phase Commit
- 2PC or not 2PC, Wherefore Art Thou XA?
- Life Beyond Distributed Transactions
- If you have too much data, then ‘good enough’ is good enough
- Starbucks doesn’t do two phase commit
- You Can’t Sacrifice Partition Tolerance
- Optimistic Replication
Theory
- Distributed Computing Economics
- Rules of Thumb in Data Engineering
- Fallacies of Distributed Computing
- Coordinated Attack or Two Generals Problem
- Unreliable Failure Detectors for Reliable Distributed Systems
- Virtual Time and Global States of Distributed Systems
- Practical uses of synchronized clocks in distributed systems
- Lazy Replication: Exploiting the Semantics of Distributed Services
- Scalable Agreement - Towards Ordering as a Service
- Scalable Eventually Consistent Counters over Unreliable Networks
Expository and Tutorial Resources:
- There is No Now - Justin Sheehy, ACM Queue 2015
- Why Logical Clocks are Easy - Carlos Baquero and Nuno Preguiça, ACM Queue 2016
- Hybrid logical clocks - Murat Buffalo blog
- Logical clocks and Vector clocks modeling in TLA+/PlusCal - Murat Buffalo blog
- A Brief Tour of FLP Impossibility - Paper Trail blog
- Paper summary: Perspectives on the CAP theorem - Murat Buffalo blog
Languages and Tools
Infrastructure
Distributed Storage
Consensus and Replication
- Implementing Fault-Tolerant Services Using the State Machine Approach
- How to build a highly available system with consensus
- Paxos Made Live - An Engineering Perspective
- Paxos Made Moderately Complex
- Revisiting the Paxos Algorithm
- Consensus in the Cloud: Paxos Systems Demystified
- Flexible Paxos: Quorum intersection revisited
- Practical Byzantine Fault Tolerance
- Chain Replication for Supporting High Throughput and Availability
- ZooKeeper: Wait-free coordination for Internet-scale systems
- Tango: Distributed Data Structures over a Shared Log
- There is more consensus in Egalitarian parliaments
- Mencius: Building Efficient Replicated State Machines for WANs
- Reconfiguring a State Machine
- WormSpace: A modular foundation for simple, verifiable distributed systems
Expository and Tutorial Resources:
- Modeling Paxos and Flexible Paxos in PlusCal and TLA+ - Murat Buffalo blog
- Dissecting performance bottlenecks of Paxos protocols - Murat Buffalo blog
Gossip Protocols and Epidemic Algorithms
- How robust are gossip-based communication protocols?
- Astrolabe: A Robust and Scalable Technology For Distributed Systems Monitoring, Management, and Data Mining
- Epidemic Computing at Cornell
- Fighting Fire With Fire: Using Randomized Gossip To Combat Stochastic Scalability Limits
- Bi-Modal Multicast
- ACM SIGOPS Operating Systems Review - Gossip-based computer networking
- SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol
Peer-to-Peer Systems
- Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications
- Kademlia: A Peer-to-peer Information System Based on the XOR Metric
- Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems
- PAST: A large-scale, persistent peer-to-peer storage utility
- SCRIBE: A large-scale and decentralised application-level multicast infrastructure
Distributed Algorithms
- Self-stabilizing systems in spite of distributed control
- The Drinking Philosophers Problem
- Sparse partitions
- Distributed reset
- The Arrow Distributed Directory Protocol
Expository and Tutorial Resources:
- Dijkstra’s stabilizing token ring algorithm - Murat Buffalo blog
- Modeling the hygienic dining philosophers algorithm in TLA+ - Murat Buffalo blog
System Design and Architecture
- Hints for computer system design
- The role of distributed state
- SEDA: An Architecture for Well-Conditioned, Scalable Internet Services
- Crash only software
Expository and Tutorial Resources:
- Learning about distributed systems: where to start? - Murat Buffalo blog
Cloud Computing and Big Data
- Lessons from Giant-Scale Services
- Consistency Analysis in Bloom: a CALM and Collected Approach
- Resilient Distributed Datasets (Spark)
- TensorFlow: A System for Large-Scale Machine Learning
- Above the Clouds: A Berkeley View of Cloud Computing
- Cloud Programming Simplified: A Berkeley View on Serverless Computing