Distributed Systems

ECE419, Winter 2026
University of Toronto
Instructor: Ashvin Goel


Lab 4: Fault-tolerant Key/Value Service

Due date: Apr 5

In this lab you will build a linearizable, fault-tolerant key/value storage service using your Raft library from Lab 3. To clients, the service looks similar to the key/value server of Lab 2. However, instead of a single server, the service consists of a set of servers that each maintain a database of key/value pairs. The service uses Raft to implement a replicated state machine that ensures the databases remain identical. Your key/value service should continue to process client requests under node failures or network partitions as long as a majority of the servers are alive and can communicate.

Overview

Clients will interact with your key/value service in much the same way as in Lab 2. A Clerk manages RPC interactions with the servers and implements the Put and Get methods with the same semantics as Lab 2: Puts are at-most-once and Puts and Gets must form a linearizable history.

Providing linearizability is relatively easy for a single server. It is harder if the service is replicated, since all servers must choose the same execution order for concurrent requests, must avoid replying to clients using state that isn't up to date (i.e., ensure read requests read latest data), and must recover their state after a failure in a way that preserves all acknowledged client updates.

This lab has two parts. In part A, you will implement a replicated-state machine package, rsm, using your raft implementation. The rsm package replicates requests in an application-independent manner. In part B, you will implement a replicated key/value service using rsm.

For this lab, you should review the extended Raft paper, in particular Section 8. After this lab, you will have implemented all parts (Clerk, Service, and Raft) shown in the diagram of Raft interactions except for snapshots.

Starter code

Update your local repository with our starter code.

The starter code for this lab is available in the kvraft1 directory. You only need to modify the kvraft1/rsm/rsm.go, kvraft1/client.go and kvraft1/server.go files for this lab. You may modify any other files for testing, but please run your final tests with the original versions of those other files before submission. Also, do not add new files or remove existing ones.

The kvraft1/rsm/rsm.go file provides the starter code for the replicated state machine. In part A, you will modify this file to implement the replicated state machine using your Raft library. Then, in part B, you will modify the kvraft1/client.go and kvraft1/server.go files to implement the replicated key/value service. Your service will use the replicated state machine by implementing the StateMachine interface defined in kvraft1/rsm/rsm.go.

After you have completed your implementation, you can test your code as follows:

cd kvraft1/rsm
go test -run 4A
cd kvraft1
go test -run 4B

Implementation

Each of the servers in your key/value service will be associated with a Raft peer. Each server interacts with its Raft peer in two ways. First, the leader server (the server associated with the Raft leader peer) submits a client operation to Raft using a separate thread. Then the thread waits for the operation to execute so that it can return the result of the operation to the client. Second, all servers receive operations committed by Raft and then execute these operations. At the leader server, the result of the operation is sent to the thread waiting for this operation. In the first part of this lab, you will implement a replicated state machine that encapsulates these interactions in an application-independent manner. In the second part, your key-value service will use the replicated state machine to implement a fault-tolerant service.

Part A: replicated state machine

The rsm package is a layer between the service (e.g. a key/value database) and Raft. In rsm/rsm.go you will need to implement the rsm.Submit() function. This function should invoke raft.Start(command) to initiate the process of appending command to the replicated log. Then it should wait for the result of the execution of the command.

You will also need to implement a Reader() goroutine that reads ApplyMsg messages containing committed commands from Raft's applyCh channel and executes the commands. Then, it should send the results of executing a command to the Submit() call waiting for this command on the leader server.

Your key/value service should use the rsm package and appear to the Reader() goroutine as a StateMachine object that provides a DoOp() method. The Reader() goroutine should hand each committed operation to DoOp(), which will implement the service operations (e.g., Get and Put). The return value of DoOp() should be sent to the corresponding Submit() call. The argument and return value of DoOp() have type any; the actual values should have the same types as the argument and return values that the service passes to Submit().
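The interface might look roughly like the sketch below (check kvraft1/rsm/rsm.go for the exact definition, which may include additional methods). The Counter type here is purely illustrative; it mirrors the simple integer-incrementing service the rsm tester uses.

```go
package main

import "fmt"

// Sketch of the StateMachine interface described in the handout.
// The starter code's actual definition may differ in detail.
type StateMachine interface {
	// DoOp applies one committed operation and returns its result.
	// Both the request and the result are `any`; the concrete types
	// are whatever the service passed to Submit().
	DoOp(req any) any
}

// A toy state machine: a counter that DoOp increments, similar to
// the simple service the rsm tester acts as.
type Counter struct{ n int }

func (c *Counter) DoOp(req any) any {
	c.n += req.(int)
	return c.n
}

func main() {
	var sm StateMachine = &Counter{}
	fmt.Println(sm.DoOp(1), sm.DoOp(1)) // each committed op bumps the counter
}
```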

Your key/value service should pass each client operation to the Submit() function. To help the Reader() goroutine match commands with the corresponding waiting Submit() call, Submit() should wrap each client operation in an Op structure and add a unique operation identifier. Submit() should then wait until the operation has committed and been executed, and return the result of execution (the value returned by DoOp()).
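One common way to match a committed command with its waiting Submit() call is a map from operation id to a result channel. The runnable sketch below illustrates the pattern with a plain channel standing in for Raft's applyCh; the RSM struct, field names, and the buffered-channel choices are illustrative assumptions, not the starter code's layout.

```go
package main

import (
	"fmt"
	"sync"
)

// Op wraps a client operation with a unique identifier so that the
// Reader() goroutine can find the Submit() call waiting for it.
type Op struct {
	Id  int64 // unique operation identifier
	Req any   // the client operation being replicated
}

type RSM struct {
	mu      sync.Mutex
	nextId  int64
	pending map[int64]chan any // Submit() calls waiting for results
	applyCh chan Op            // stands in for Raft's applyCh
}

// Submit wraps the request in an Op with a fresh id, registers a
// result channel, hands the op off for replication, and waits for
// its execution. In the real rsm, the hand-off is raft.Start(op).
func (r *RSM) Submit(req any) any {
	r.mu.Lock()
	r.nextId++
	op := Op{Id: r.nextId, Req: req}
	done := make(chan any, 1)
	r.pending[op.Id] = done
	r.mu.Unlock()

	r.applyCh <- op // stand-in for raft.Start(op)
	return <-done   // wait for Reader() to execute the op
}

// Reader executes committed ops in order and wakes the waiting Submit.
func (r *RSM) Reader(doOp func(any) any) {
	for op := range r.applyCh {
		result := doOp(op.Req)
		r.mu.Lock()
		if done, ok := r.pending[op.Id]; ok {
			done <- result
			delete(r.pending, op.Id)
		}
		r.mu.Unlock()
	}
}

func main() {
	r := &RSM{pending: make(map[int64]chan any), applyCh: make(chan Op, 16)}
	sum := 0
	go r.Reader(func(req any) any { sum += req.(int); return sum })
	fmt.Println(r.Submit(3)) // 3
	fmt.Println(r.Submit(4)) // 7
}
```

In your real implementation, Submit() must additionally cope with ops that never commit (see the leader-change discussion below), so the wait cannot be an unconditional channel receive on the path where leadership is lost.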

If raft.Start() indicates that the current peer is not the Raft leader, Submit() should return an rpc.ErrWrongLeader error.

The Submit() function should detect and handle the case where the Raft leader changes after the call to raft.Start() and before the operation is committed, causing the operation to be lost (never committed). A server can determine that it has lost leadership when it notices that Raft's term has changed or a different request has appeared at the index returned by raft.Start(). In these cases, Submit() should return rpc.ErrWrongLeader as well. If the superseded leader is partitioned by itself, it won't know about new leaders. However, a client in the same partition won't be able to talk to the new leader either, so it's OK in this case for the server to wait indefinitely until the partition heals.
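The lost-leadership check can be boiled down to the two conditions above. In this sketch, raftState is an illustrative stand-in for the small slice of your Lab 3 Raft API that the check needs (GetState is the usual method name; adjust to your own API), and the id comparison stands in for "a different request appeared at the index returned by raft.Start()".

```go
package main

import "fmt"

// raftState is an assumed stand-in for the part of the Raft API the
// check needs; your Lab 3 library's method names may differ.
type raftState interface {
	GetState() (term int, isLeader bool)
}

// lostLeadership reports whether an op started in startTerm can no
// longer be assumed to commit: either the term has moved on, or a
// different op was executed at the log index raft.Start() returned.
func lostLeadership(rf raftState, startTerm int, startedId, executedId int64) bool {
	term, _ := rf.GetState()
	return term != startTerm || startedId != executedId
}

type fakeRaft struct{ term int }

func (f *fakeRaft) GetState() (int, bool) { return f.term, true }

func main() {
	rf := &fakeRaft{term: 3}
	fmt.Println(lostLeadership(rf, 3, 7, 7)) // false: same term, same op
	rf.term = 4
	fmt.Println(lostLeadership(rf, 3, 7, 7)) // true: a new election happened
}
```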

Here is a summary of the sequence of events for a client request:

1. The Clerk sends the request to the server it believes is the leader.
2. That server's RPC handler passes the operation to rsm.Submit(), which wraps it in an Op with a unique identifier and calls raft.Start().
3. Raft replicates the operation and commits it; each server's Reader() goroutine receives it on the applyCh channel.
4. Every server executes the operation by calling DoOp() on its state machine.
5. On the leader, Reader() hands the DoOp() result to the waiting Submit() call, which returns it to the RPC handler, which in turn replies to the Clerk.

Your servers should not directly communicate; they should only interact with each other through Raft.

Next, skip over the description of Part B below and read our advice and the sections that follow it in this handout before starting your implementation.

Part B: replicated key/value service

Now you will use the rsm package to replicate a key/value server. Clerks should send Get() and Put() RPCs to the leader server. The leader server should submit the Get and Put operations to rsm, which sends them to Raft (see above). After Raft replicates and commits these operations, rsm invokes the key/value server's DoOp() function at all servers, which should apply the operations in the same order to their key/value databases. The intent is for the servers to maintain identical replicas of the key/value database. Then, the leader should report the result of the operation to the Clerk by responding to its RPC.

If an operation fails to commit (for example, if the leader was replaced), the rsm package returns an error (see above) that the server should report to the client so that the Clerk can retry with a different server.

A Clerk sometimes doesn't know which server is the Raft leader. If the Clerk sends an RPC to the wrong server, or if it cannot reach the server, the Clerk should re-try by sending to a different server.

Your servers should not directly communicate; they should only interact with each other through rsm/Raft.

Replicated key/value server with reliable network and no server failures

Your first task is to implement a replicated key/value server that works over a reliable network and with no failed servers.

Feel free to start by copying your client code from Lab 2 (kvsrv1/client.go) into kvraft1/client.go. You will need to add logic to determine the leader server to which RPCs should be sent.

You'll also need to implement Get() and Put() RPC handlers in kvraft1/server.go. These handlers should submit their request to Raft using rsm.Submit(). As the rsm package reads commands from the applyCh channel, it should invoke the DoOp() method, which you will have to implement in kvraft1/server.go.
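A handler's job is thin: package the RPC arguments into a request, hand it to rsm.Submit(), and copy the result into the reply. The runnable sketch below shows the shape for Put(); the submitter interface, localSubmit stand-in, and the args/reply structs are illustrative assumptions (use the rpc types from your Lab 2 code), and a single-server stub replaces rsm so the example is self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// errWrongLeader stands in for rpc.ErrWrongLeader.
var errWrongLeader = errors.New("wrong leader")

// submitter is an assumed stand-in for the rsm package's Submit API.
type submitter interface {
	Submit(req any) (any, error)
}

// Args/reply shapes are assumptions; use your Lab 2 rpc types.
type PutArgs struct{ Key, Value string }
type PutReply struct{ Err error }

type KVServer struct {
	rsm submitter
	db  map[string]string
}

// Put submits the whole request through the replicated log; the
// result comes back only after DoOp has executed it on this server.
func (kv *KVServer) Put(args *PutArgs, reply *PutReply) {
	_, err := kv.rsm.Submit(*args)
	reply.Err = err // ErrWrongLeader tells the Clerk to try elsewhere
}

// DoOp is invoked by rsm's Reader() on every server, in log order.
func (kv *KVServer) DoOp(req any) any {
	switch a := req.(type) {
	case PutArgs:
		kv.db[a.Key] = a.Value
		return PutReply{}
	}
	return nil
}

// localSubmit is a single-server stand-in for rsm: it "commits"
// immediately by calling DoOp directly, with no replication.
type localSubmit struct{ kv *KVServer }

func (l localSubmit) Submit(req any) (any, error) { return l.kv.DoOp(req), nil }

func main() {
	kv := &KVServer{db: map[string]string{}}
	kv.rsm = localSubmit{kv}
	var reply PutReply
	kv.Put(&PutArgs{Key: "x", Value: "1"}, &reply)
	fmt.Println(kv.db["x"], reply.Err) // 1 <nil>
}
```

Note that Get() should follow the same path through Submit(), so that a deposed leader cannot answer reads from stale state.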

A server should not complete a Get() RPC if it is not part of a majority (so that it does not serve stale data). A simple solution is to enter every Get() (as well as each Put()) in the Raft log using rsm.Submit(). You don't have to implement the optimization for read-only operations that is described in Section 8.

You have completed this task when you reliably pass the first test (TestBasic4A) in the test suite.

Replicated key/value server with unreliable network and server failures

Now you should modify your solution to continue in the face of network and server failures.

One problem you'll face is that a Clerk may have to send an RPC multiple times until it finds a server that replies positively. If a leader fails after receiving an operation, the Clerk may not receive a reply and thus should resend the request to another server in search of the new leader. A leader could fail before or after committing the operation. Each call to Clerk.Put() should result in just a single execution for a particular version number.

Add code to handle failures. Your Clerk should use a retry strategy similar to the one in Lab 2, including returning ErrMaybe if a response to a retried Put() RPC is lost.

You will probably have to modify your Clerk to remember which server turned out to be the leader for the last RPC and send the next RPC to that server first. This will avoid wasting time searching for the leader on every RPC, which may help you pass some of the tests.
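The leader-caching retry loop can be as simple as the runnable sketch below. Everything here is illustrative: the per-server function stubs stand in for the real RPC calls, which in your Clerk would also have to distinguish ErrWrongLeader, lost replies, and ErrMaybe.

```go
package main

import "fmt"

// Sketch of a Clerk that remembers the last known leader. Each entry
// in servers is a stub for "send this Put RPC to server i"; it
// returns false if that server is not the leader (or the RPC failed).
type Clerk struct {
	servers []func(key, value string) bool
	leader  int // index of the last server that answered as leader
}

// Put tries the remembered leader first, then cycles through the
// other servers until one accepts the request.
func (ck *Clerk) Put(key, value string) {
	for {
		if ck.servers[ck.leader](key, value) {
			return // ck.leader stays cached for the next RPC
		}
		ck.leader = (ck.leader + 1) % len(ck.servers)
	}
}

func main() {
	db := map[string]string{}
	notLeader := func(string, string) bool { return false }
	leader := func(k, v string) bool { db[k] = v; return true }
	ck := &Clerk{servers: []func(string, string) bool{notLeader, notLeader, leader}}
	ck.Put("x", "1")
	fmt.Println(db["x"], ck.leader) // 1 2
}
```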

Advice

Testing

Testing Part A

You have completed part A of this lab when your code passes the 4A tests in the test suite.

$ cd rsm
$ go test -run 4A
Test RSM basic (reliable network)...
  ... Passed --  time  1.1s #peers 3 #RPCs    44 #Ops    0
Test concurrent submit (reliable network)...
  ... Passed --  time  0.2s #peers 3 #RPCs   106 #Ops    0
Test Leader Failure (reliable network)...
  ... Passed --  time  0.9s #peers 3 #RPCs    30 #Ops    0
Test Leader Partition (reliable network)...
  ... Passed --  time  2.1s #peers 3 #RPCs   274 #Ops    0
Test Restart (reliable network)...
  ... Passed --  time 12.3s #peers 3 #RPCs   434 #Ops    0
Test Shutdown (reliable network)...
  ... Passed --  time 10.1s #peers 3 #RPCs     6 #Ops    0
Test Restart and submit (reliable network)...
  ... Passed --  time 22.6s #peers 3 #RPCs   646 #Ops    0
PASS
ok      ece419/kvraft1/rsm  49.385s

The rsm tester acts as a simple service, submitting operations that increment the state of a single integer.

Testing Part B

You have completed part B of this lab when your code passes the 4B tests in the test suite.

$ go test -run 4B
Test: one client (4B basic) (reliable network)...
  ... Passed --  time  3.0s #peers 5 #RPCs  2944 #Ops  493
Test: one client (4B speed) (reliable network)...
  ... Passed --  time 11.2s #peers 3 #RPCs  3683 #Ops    0
Test: many clients (4B many clients) (reliable network)...
  ... Passed --  time  3.6s #peers 5 #RPCs  5209 #Ops  631
Test: many clients (4B many clients) (unreliable network)...
  ... Passed --  time  5.6s #peers 5 #RPCs  1482 #Ops  175
Test: one client (4B progress in majority) (unreliable network)...
  ... Passed --  time  2.0s #peers 5 #RPCs   163 #Ops    3
Test: no progress in minority (4B) (unreliable network)...
  ... Passed --  time  1.3s #peers 5 #RPCs   125 #Ops    3
Test: completion after heal (4B) (unreliable network)...
  ... Passed --  time  1.2s #peers 5 #RPCs    82 #Ops    4
Test: partitions, one client (4B partitions, one client) (reliable network)...
  ... Passed --  time  9.3s #peers 5 #RPCs  2988 #Ops  491
Test: partitions, many clients (4B partitions, many clients (4B)) (reliable network)...
  ... Passed --  time  9.2s #peers 5 #RPCs  5142 #Ops  635
Test: restarts, one client (4B restarts, one client 4B ) (reliable network)...
  ... Passed --  time  6.1s #peers 5 #RPCs  4420 #Ops  479
Test: restarts, many clients (4B restarts, many clients) (reliable network)...
  ... Passed --  time  6.7s #peers 5 #RPCs 10112 #Ops  617
Test: restarts, many clients (4B restarts, many clients ) (unreliable network)...
  ... Passed --  time  9.1s #peers 5 #RPCs  1501 #Ops  167
Test: restarts, partitions, many clients (4B restarts, partitions, many clients) (reliable network)...
  ... Passed --  time 14.7s #peers 5 #RPCs 24367 #Ops  615
Test: restarts, partitions, many clients (4B restarts, partitions, many clients) (unreliable network)...
  ... Passed --  time 18.9s #peers 5 #RPCs  2167 #Ops  131
Test: restarts, partitions, random keys, many clients (4B restarts, partitions, random keys, many clients) (unreliable network)...
  ... Passed --  time 13.3s #peers 7 #RPCs  4992 #Ops  418
PASS
ok      ece419/kvraft1  116.221s

The numbers on each Passed line are the real time the test took in seconds, the number of Raft peers, the number of RPCs sent during the test (including client RPCs), and the number of key/value operations executed (Clerk Get and Put calls).

Checking submission

We have provided you a tool that allows you to check your lab implementation and estimate your lab grade. After setting your path variable, you can run the tool in the raft1 directory as follows:

ece419-check lab4

You can check the output of the tool in the test.log file. Note that an individual test will fail if it takes more than 120 seconds.

Submission

We will use the following files from your code repository for grading: raft1/raft.go, kvraft1/rsm/rsm.go, kvraft1/client.go, and kvraft1/server.go.

Please see lab submission.

Acknowledgements

Prof. Robert Morris, Prof. Frans Kaashoek, and Prof. Nickolai Zeldovich at MIT developed these labs. We have their permission to use them for this course.