Distributed Systems Learning Resources

A comprehensive collection of resources for learning distributed systems, performance engineering, and scalable system design.

Books

Distributed Systems & Databases

Database Internals - Alex Petrov
Designing Data-Intensive Applications - Martin Kleppmann
High Performance Browser Networking - Ilya Grigorik
Just Use Postgres! - Denis Magda
Latency - Pekka Enberg

Language-Centric

Fluent Python - Luciano Ramalho
Rust Atomics and Locks - Mara Bos
Rust for Rustaceans - Jon Gjengset

AI

I’m not focussing on learning AI skills specifically. For my own learning, I plan to avoid AI altogether, or at least coding agents.

Build a Large Language Model (From Scratch) - Sebastian Raschka (Low Priority)
AI Engineering - Chip Huyen (Low Priority)
Designing Machine Learning Systems - Chip Huyen (Low Priority)

Python PEPs

I wanted to read some Python Enhancement Proposals that led to improved concurrency in Python. These are what I have so far.

Blog Articles

Napkin Math - Back-of-the-envelope calculations for systems design

Brendan Gregg - Performance Engineering

Engineering Blogs

Cloudflare Postmortems

Papers

Uncurated List

This list hasn’t been curated yet. I’ll prune or strike through papers that I haven’t found useful or won’t read for my preparation.

Essential Papers (Start Here)

Time, clocks, and the ordering of events in a distributed system - Leslie Lamport, 1978
The Byzantine Generals Problem - Leslie Lamport, Robert Shostak, and Marshall Pease, 1982
Distributed snapshots: determining global states of distributed systems - K. Mani Chandy and Leslie Lamport, 1985
Impossibility of distributed consensus with one faulty process - Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 1985
Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems - Brian M. Oki and Barbara H. Liskov, 1988
The part-time parliament (Paxos) - Leslie Lamport, 1998
Paxos Made Simple - Leslie Lamport, 2001
Bitcoin: A Peer-to-Peer Electronic Cash System - Satoshi Nakamoto, 2008
Conflict-free replicated data types - Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski, 2011
In search of an understandable consensus algorithm (Raft) - Diego Ongaro and John Ousterhout, 2014

Comprehensive Reading List

System Design Principles

Latency

Amazon Systems

Google Systems

Consistency Models

Theory

Expository and Tutorial Resources:

There is No Now - Justin Sheehy, ACM Queue 2015
Why Logical Clocks are Easy - Carlos Baquero and Nuno Preguiça, ACM Queue 2016
Hybrid logical clocks - Murat Buffalo blog
Logical clocks and Vector clocks modeling in TLA+/PlusCal - Murat Buffalo blog
A Brief Tour of FLP Impossibility - Paper Trail blog
Paper summary: Perspectives on the CAP theorem - Murat Buffalo blog

Languages and Tools

Programming Distributed Erlang Applications: Pitfalls and Recipes

Infrastructure

Principles of Robust Timing over the Internet

Distributed Storage

Consensus and Replication

Expository and Tutorial Resources:

Modeling Paxos and Flexible Paxos in PlusCal and TLA+ - Murat Buffalo blog
Dissecting performance bottlenecks of Paxos protocols - Murat Buffalo blog

Gossip Protocols and Epidemic Algorithms

Peer-to-Peer Systems

Distributed Algorithms

Expository and Tutorial Resources:

Dijkstra’s stabilizing token ring algorithm - Murat Buffalo blog
Modeling the hygienic dining philosophers algorithm in TLA+ - Murat Buffalo blog

System Design and Architecture

Expository and Tutorial Resources:

Learning about distributed systems: where to start? - Murat Buffalo blog
