Paxos from Scratch

Josh Segal
11 min read · May 12, 2021
Paxos Islands

Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament’s protocol provides a new way of implementing the state-machine approach to the design of distributed systems.

- The Part-Time Parliament, Leslie Lamport

Lamport describes the parliament of an ancient civilization on the Greek island of Paxos. He describes the incredibly sophisticated system the ancient Paxons used to reach parliamentary consensus even though human messengers often failed and a large part of the parliament couldn’t be present at the same time. The problems the Paxon parliament overcame turned out to be the same problems distributed systems face in the presence of network and hardware unreliability. That was too much of a coincidence, of course: Lamport made up the archaeological discovery to give his paper a stronger analogy. Nevertheless, the Paxos paper is one of the most influential in distributed systems and laid the foundation for many modern systems.

Paxos Goals

The goal of Paxos is to reach consensus when a majority of servers in a cluster are alive and connected. Consensus is reached when

  1. all servers have the ability to figure out the agreed-upon value, and
  2. once one server believes a value is decided, every other server comes to the same conclusion.

Motivation

In a replicated state machine, all servers need to perform the same operations in the same order. Each operation that each server receives is another battle for consensus. There were some good attempts at replicated state machines before Paxos (like the primary/backup paradigm), but they all had shortcomings.

If you have a cluster of servers and a client sends a PUT request to one server, that server could try to sync up with the rest of the servers before responding. Otherwise, you risk the client sending a GET request to another server and not seeing its previous PUT. Syncing up with all the other servers before responding works great in a perfect world without any network partitions or hardware failures, but in the real world, this fails.

In this diagram, S1 will wait indefinitely for the last server to respond.

Another issue is what happens when one client sends a PUT request to one server and another client sends a PUT request to another server. All servers need to agree on the order of the PUT requests, or else the servers will be out of sync. What makes this even harder is that some servers can be down and others can be partitioned from the network. This is the same dilemma the Paxos legislators had to deal with, and they still managed to reach consensus.

Paxos Protocol

Just like the legislators of Paxos, servers need to be able to reach consensus with network partitions and hardware failures. Note that Paxos needs at least a majority of servers to be connected to work. In fact, the Paxos protocol revolves around the theme of having a majority, so keep that in mind.

In the Paxos algorithm, servers have two roles to play: proposers and acceptors. Servers that have something to propose act as proposers, and all servers are acceptors. To continue the parliament analogy, members who have an idea to propose will propose it, but everyone (including the proposers) can listen and accept or reject a proposal.

We’ll explore the Paxos protocol by starting with a naive algorithm and building up to the fully fledged protocol. Here is a rough outline of the variables and functions on each server. Note that some of them will be introduced later and others will have their meaning adjusted as the algorithm evolves.

Server Variables

  • decidedValue: once this is assigned on any server, no other server should ever assign it a different value
  • acceptedValue: an auxiliary variable that helps servers reach a decided value
  • acceptedN: the proposal number associated with acceptedValue
  • promisedN: acceptors won't accept any proposal numbered lower than this
  • serverMajority: ⌊serverTotal/2⌋ + 1
  • serverTotal: the number of servers in the cluster

Server Functions

  • propose: called when a server wants to propose a value or learn the decided value
  • accept: RPC handler that updates acceptor state and replies Accept or Reject
  • prepare: RPC handler that sets up the accept stage
  • proposeValue: calls the accept RPC on all servers and returns the count of how many accepted
  • prepareValue: calls the prepare RPC on all servers and returns their responses
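
To make the outline concrete, here is a minimal sketch of that per-server state in Python. The field names mirror the variables above; everything else (the class name, the peers list, the missing RPC plumbing) is just an assumption for illustration.

class PaxosServer:
    def __init__(self, peers):
        self.peers = peers                  # every server in the cluster, including this one
        self.decided_value = None           # decidedValue
        self.accepted_value = None          # acceptedValue
        self.accepted_n = 0                 # acceptedN
        self.promised_n = 0                 # promisedN

    @property
    def server_total(self):
        return len(self.peers)              # serverTotal

    @property
    def server_majority(self):
        return self.server_total // 2 + 1   # serverMajority = floor(serverTotal/2) + 1

    # The functions below get filled in as the ideas evolve:
    # propose(v), accept(v, n), prepare(n), proposeValue(v, n), prepareValue(n)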

Idea 1

propose(v):
    while acceptedCount != serverTotal:
        acceptedCount = proposeValue(v)
    decidedValue = v

accept(v):
    acceptedValue = v
    reply Accept

The first issue here is that consensus can never be reached under a network partition. In a partition, proposeValue can't reach all servers, so acceptedCount will never equal serverTotal.

Since our only guarantee is that some majority is connected, we should start thinking about what we can do with a majority rather than all the servers.
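
Before moving on, it's worth pinning down the arithmetic: a majority is ⌊serverTotal/2⌋ + 1 servers, and any two majorities of the same cluster must overlap in at least one server, which is the property everything below leans on. A tiny sketch (the helper name is made up):

def majority(server_total):
    # floor(serverTotal / 2) + 1
    return server_total // 2 + 1

# With 5 servers, 3 is a majority; any two groups of 3 out of 5
# must share at least one server (3 + 3 > 5).
assert majority(5) == 3
assert majority(4) == 3   # even-sized clusters still need strictly more than half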

Idea 2

propose(v):
    while acceptedCount < serverMajority:
        acceptedCount = proposeValue(v)
    decidedValue = v

accept(v):
    acceptedValue = v
    reply Accept

This solves the issue of reasonable network partitions, because we only need a majority of servers connected at any time. However, it breaks the rule that once a decidedValue has been assigned, no server can ever decide a different value.

Remember, the server that proposes is also an acceptor, so to reach a majority you call your own accept and call the accept RPC on at least two others in a group of five servers, and you've got yourself a majority.

In the diagram, server A proposes value 1, gets Accepts from a majority of the servers, and assigns decidedValue=1, but server B can do the same thing and end up with a different decidedValue=2.

We need a way for proposers to stop bombarding all the acceptors with their own value. After all, we're just trying to reach consensus on a value, and there's no advantage for a proposer in having its own value become the decidedValue.

Idea 3

propose(v):
    while acceptedCount < serverMajority:
        acceptedValues = prepareValue()
        if acceptedValues is not empty:
            v = arbitrary value in acceptedValues
        acceptedCount = proposeValue(v)
    decidedValue = v

accept(v):
    acceptedValue = v
    reply Accept

prepare():
    reply acceptedValue

Proposers are less selfish now. A proposer first asks at least a majority of servers whether they've accepted a value by calling prepareValue (the name of this function won't make sense until idea 5, hang in there). If any of them have, the proposer drops its own value and proposes one of the already-accepted values instead. But since it picks that value arbitrarily, this can still lead to problems.

Server A proposes 1 but doesn't get a majority to accept it. Server B proposes 2 and does get a majority to accept it. Now server A calls prepare to find all the accepted values, sees a 1 and a 2, and arbitrarily picks the 1 to propose. Thus, the decidedValue is different on servers A and B. It seems like we need a better way to choose which value in acceptedValues to propose.

Idea 4

propose(v):
    while acceptedCount < serverMajority:
        // prepareValue returns a list of (v, n) pairs now
        acceptedValues = prepareValue()
        if acceptedValues is empty:
            n = randInt()
        else:
            v, n = tuple with max n in acceptedValues
        acceptedCount = proposeValue(v, n)
    decidedValue = v

accept(v, n):
    if n > acceptedN:
        acceptedN = n
        acceptedValue = v
        reply Accept
    else:
        reply Reject

prepare():
    reply (acceptedValue, acceptedN)

Now we add a proposal number alongside the proposed value in an attempt to make choices less arbitrary. Proposers will propose the value with the highest proposal number instead of an arbitrary one, and acceptors will reject any proposal with a lower number, which helps all the acceptors eventually converge on the highest-numbered value.
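
Rendered as plain Python (the class and field names here are mine, not part of the protocol), the acceptor side of Idea 4 looks roughly like this:

class Acceptor:
    def __init__(self):
        self.accepted_n = 0          # proposal numbers are assumed to be positive integers
        self.accepted_value = None

    def accept(self, v, n):
        # Only take the new value if its proposal number beats what we've accepted so far.
        if n > self.accepted_n:
            self.accepted_n = n
            self.accepted_value = v
            return "Accept"
        return "Reject"

    def prepare(self):
        return (self.accepted_value, self.accepted_n)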

However, even if a majority of acceptors accept a value, the acceptedValue can still change.

Server A proposed 1 with proposal number 5, but only got it accepted by one other server. Server B proposed 2 with proposal number 1, got a majority of servers to accept it, and assigned decidedValue=2. Then server A looks at what's been accepted, chooses to propose the pair with the highest proposal number, (1,5), and eventually assigns decidedValue=1.

To solve this, we have to make sure that if a majority of servers accept the same value, that value always has the highest proposal number. If the majority-accepted value always has the highest proposal number, then future proposers will always propose that value, since they choose the value with the highest proposal number out of a majority of servers.

Idea 5

propose(v):
    while acceptedCount < serverMajority:
        n = randInt()
        acceptedValues, promisedCount = prepareValue(n)
        if promisedCount < serverMajority:
            continue
        if acceptedValues is not empty:
            v = value with max n in acceptedValues
        acceptedCount = proposeValue(v, n)
    decidedValue = v

accept(v, n):
    if n > acceptedN:
        acceptedN = n
        acceptedValue = v
        reply Accept
    else:
        reply Reject

prepare(n):
    if n > promisedN:
        promisedN = n
        reply (Accept, acceptedValue, acceptedN)
    else:
        reply (Reject, acceptedValue, acceptedN)

All these changes are there to make sure that if a majority of acceptors accept the same value, that value has the highest acceptedN. We've introduced promisedN, which makes sure the proposer has chosen an n larger than the promisedN of a majority of servers before it continues. Once we know our n is large enough, we can propose. We're basically preparing the other servers for our proposal, and we've broken the process down into two stages: prepare and accept. Once a proposer knows a majority will reject anything lower, it can propose its value with that proposal number.

Now server B will keep trying to prepare until it comes up with a higher proposal number than it has seen. Then, when its value gets accepted by a majority, that value has the highest proposal number. Any future proposer will see that it has the highest acceptedN and will always choose to propose value 2 as well.

However, this still fails when the two proposers overlap a bit more.

Proposer A got past the prepare stage, then proposer B got through the prepare stage with a higher proposal number. Proposer A gets its value accepted by a majority, so decidedValue=1. Then proposer B also gets its value accepted by a majority, so its decidedValue=2. How can we stop A from getting its value through, since A had the smaller proposal number?

Idea 6

propose(v):
    while acceptedCount < serverMajority:
        n = randInt()
        acceptedValues, promisedCount = prepareValue(n)
        if promisedCount < serverMajority:
            continue
        if acceptedValues is not empty:
            v = value with max n in acceptedValues
        acceptedCount = proposeValue(v, n)
    decidedValue = v

accept(v, n):
    if n >= promisedN:
        promisedN = n
        acceptedN = n
        acceptedValue = v
        reply Accept
    else:
        reply Reject

prepare(n):
    if n > promisedN:
        promisedN = n
        reply (Accept, acceptedValue, acceptedN)
    else:
        reply (Reject, acceptedValue, acceptedN)

In accept we now check n against promisedN instead of acceptedN, rejecting anything numbered below the highest promise, so that newer prepares can stop older accepts. Now, even after a proposer gets through the prepare stage, promisedN can still be raised by someone else's prepare and stop it from getting through the accept stage.

Server B raises the promisedN to 6, so when server A sends its accept to server D it gets rejected. This was the final bug.

At last, consensus can be reached. We've created a two-stage system where you come up with a number and, if it's higher than what a majority has promised, you can move on. You also stop any other proposer that already passed the prepare stage, because the promisedN guard in accept rejects its lower-numbered proposal. Once you've stopped them, you can get your own value accepted. If your value is accepted by a majority, we know the majority's acceptedValue has the highest acceptedN. And if the largest acceptedN belongs to the majority's acceptedValue, then any other proposer will also propose that value. Thus, everyone ends up with the same decidedValue.
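
If it helps to see the whole thing in one place, here is a compact, single-process Python sketch of Idea 6. The names mirror the pseudocode, the "RPCs" are plain method calls, there are no failures or retries, and proposal numbers come from randInt() as above, so treat it as an illustration under those assumptions rather than a real implementation (real systems use unique, monotonically increasing proposal numbers, e.g. a counter combined with the server ID).

import random

class Server:
    def __init__(self):
        self.peers = []            # set after construction: every server, including self
        self.promised_n = 0
        self.accepted_n = 0
        self.accepted_value = None
        self.decided_value = None

    # ---- acceptor side (mirrors prepare/accept in Idea 6) ----
    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            return ("Accept", self.accepted_value, self.accepted_n)
        return ("Reject", self.accepted_value, self.accepted_n)

    def accept(self, v, n):
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = v
            return "Accept"
        return "Reject"

    # ---- proposer side (mirrors propose in Idea 6) ----
    def propose(self, v):
        majority = len(self.peers) // 2 + 1
        while self.decided_value is None:
            n = random.randint(1, 1_000_000)
            # Prepare stage: collect promises from everyone we can reach.
            replies = [p.prepare(n) for p in self.peers]
            promises = [r for r in replies if r[0] == "Accept"]
            if len(promises) < majority:
                continue
            # If anything has already been accepted, adopt the value with the highest acceptedN.
            accepted = [(val, an) for _, val, an in promises if val is not None]
            if accepted:
                v = max(accepted, key=lambda pair: pair[1])[0]
            # Accept stage: count how many acceptors take the value.
            accepted_count = sum(1 for p in self.peers if p.accept(v, n) == "Accept")
            if accepted_count >= majority:
                self.decided_value = v
        return self.decided_value

# Three servers reaching consensus on a single value:
servers = [Server() for _ in range(3)]
for s in servers:
    s.peers = servers
print(servers[0].propose("value-1"))   # -> "value-1"
print(servers[1].propose("value-2"))   # -> "value-1" (it learns the already-decided value)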

If you’ve made it this far, congrats! Paxos takes a while to wrap your brain around. Even after I implemented it, I struggled to write out exactly what each part of the protocol was doing.

Paxos RSM

Once we have this magical consensus protocol, what can we do with it? We can build a replicated state machine: a cluster of servers that responds to PUT and GET requests from clients and always stays in sync. We can make a linearizable system. Linearizability means that once a write completes on one server, all future reads on any server will see it.

We can have each value that Paxos decides be an operation like PUT or GET. So instead of proposing a value like an integer, you propose an Operation struct. Then we keep a log of decided operations; the log is the order of operations. Each slot in the log runs a completely independent instance of the Paxos protocol to reach consensus.
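
As a rough sketch (the Operation type and the list-based log here are one possible representation, not a prescribed format):

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Operation:
    kind: str                    # "PUT" or "GET"
    key: int
    value: Optional[int] = None  # unused for GET

# One independent Paxos instance per log slot; each slot decides one Operation.
# log[i] holds the decided Operation for slot i (None while still undecided).
log = [Operation("PUT", 2, 1), None, None]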

The client reaches out to server C with a PUT(2,1), and server C starts Paxos by preparing and proposing that value. Each slot runs a fresh instance of the Paxos protocol, and once the slot is decided, server C can apply the operation and reply to the client.

One of the key parts in this system is how servers catch up on the log. They will only catch up when they need to.

Server B needs to catch up. B will propose PUT(3,8); however, it will discover that there is already a decidedValue in that slot. It will add the decided value to its log, apply it, then try to propose PUT(3,8) on the next slot in the log.

Again, this slot is already decided, and B discovers that simply by proposing PUT(3,8) again: calling propose lets B learn the decided value.

B finally finds an undecided slot and gets PUT(3,8) accepted there. B is now caught up and even ahead of the other servers. When the other servers get a request, they will find this value is decided, apply it, then try to propose their value on the next slot.
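
That catch-up behaviour falls out of propose itself. A hedged sketch of the loop, where paxos_instance(slot) and apply_op are assumed helpers (one Paxos instance per log slot, and a function that applies a decided operation to the state machine):

def submit(op, log, apply_op, paxos_instance):
    # Keep proposing op on successive slots until it becomes the decided value for some slot.
    slot = len(log)   # first slot this server hasn't learned yet
    while True:
        decided = paxos_instance(slot).propose(op)   # returns whatever value this slot decided
        log.append(decided)
        apply_op(decided)          # apply in log order so all replicas stay in sync
        if decided == op:
            return slot            # our operation made it into this slot
        slot += 1                  # slot was already taken; try the next one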

Conclusion

Paxos is used as the foundation of real-world systems like Google's Chubby lock service, and Paxos-style protocols underpin systems like Apache ZooKeeper. It offers linearizability, which makes it easy to program systems on top of it. However, systems built on Paxos do pay a latency cost, because each request needs a few extra rounds of communication between servers. Regardless, Paxos is still used today, and the paper remains one of the most influential in distributed systems.
