Computer Systems Programming

ECE454, Fall 2025
University of Toronto
Instructors: Ashvin Goel, Ding Yuan


Lab 4: Pthreads and Synchronization

Assigned: Oct 30, Due: Nov 16, 11:59 PM

The TAs for this lab are: Shafin Haque, Ao Li

Introduction

OptsRus has been contracted to parallelize a customer's utility program called randtrack that is used to characterize the quality of random numbers generated by the rand_r() function in the C library. The randtrack program uses a hash table to output a distribution (frequency) of the generated random numbers. You are expected to parallelize this program using Pthreads.

Be sure to answer the numbered questions that are embedded in this handout in a report.txt file.

Setup

Start by copying the lab4.tar.gz file from the shared directory /cad2/ece454f/hw4/ on the UG machines into a protected directory in your UG home directory. Then run the command:

tar xzvf lab4.tar.gz

In this lab, we provide C++ sources. The main C++ file is randtrack.cc. In this file, please fill in your identifying information in the team structure shown below. Do this right away so you don't forget.

team_t team = {
    "Team Name",    /* Team name */
    "AAA BBB",      /* Member full name */
    "99999999",     /* Member student number */
    "AAABBB@CCC",   /* Member email address */
};

Understanding randtrack.cc

The randtrack program processes 8 streams of random numbers. Each stream, which we call a seed stream, is initialized using a unique seed value. Each seed stream generates random numbers and samples them every sampling_interval numbers, collecting 10 million samples. For example, with a sampling_interval of 10, each stream generates 100 million numbers from which it selects 10 million samples.

Each sample is an integer mapped into the range 0...99,999 using the modulo operator. For each sample, the program checks whether the sample already exists in the hash table, inserts it if it does not, and then increments the counter for that sample. After all the samples have been processed, the program prints all 100,000 sample values (in an arbitrary order) and their counts.

The command-line parameters of randtrack.cc are num_threads, the number of threads to create, and sampling_interval, the interval at which each seed stream is sampled:

randtrack <num_threads> <sampling_interval>

Your task is to implement a parallel hash table to speed up the randtrack program. While this program is relatively simple, hash tables are a commonly used data structure, so a parallel hash table can improve the performance of many real-world applications, such as key-value stores and databases.

Understanding the Hash Table Implementation

The hash table is implemented as a one-dimensional array, with each array slot pointing to a linked list. A key is inserted in the hash table by first using a hash function to map (hash) the key into a hash code that indexes an array slot. The key is inserted in the linked list associated with the array slot. The linked list enables handling collisions (multiple keys that map to the same array slot).

The hash.h and list.h files implement the hash table and linked-list structures. The implementation uses template-based hash and list classes. These classes take two template parameters, the type of the list node and the type of the key that is stored in the list node. The randtrack program instantiates these two types with the sample class type and the unsigned type. The sample class stores a number (key for hash table) and the count associated with the number. The C++ template syntax is a bit cumbersome. If you have never programmed with C++ templates, feel free to ask us if you need help understanding them.
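For illustration, here is a minimal sketch of what the instantiation might look like (the member names here are assumptions; check hash.h, list.h, and randtrack.cc for the real definitions):

class sample {
    unsigned my_key;    /* the sampled number, used as the hash key */
    unsigned count;     /* how many times the number was sampled */
public:
    sample(unsigned k) : my_key(k), count(0) {}
    unsigned key() { return my_key; }
};

/* A hash table of sample nodes, keyed by unsigned values. */
hash<sample, unsigned> h;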

Compiling

In this assignment you will be designing and evaluating four different parallel versions of randtrack called randtrack_global_lock, randtrack_list_lock, randtrack_element_lock, and randtrack_reduction. Please modify the Makefile so that make builds all these programs.

You will be using the Pthreads library to create threads. To compile programs that use this library, you may need to add the -lpthread flag to the compilation command in the Makefile.

It is up to you how you write the code for all these versions. You can use separate files for each version. Alternatively, you can code all the versions in the same file and use the C preprocessor to compile the correct code for each version. To do so, you will need to use conditional compilation (e.g., #ifdef GLOBAL_LOCK in the source files and -D GLOBAL_LOCK in the compilation command for randtrack_global_lock in the Makefile). The benefit of conditional compilation is that code that is common to different versions is shared and so any changes there do not need to be duplicated. However, if you use conditional compilation, make sure that each implementation only defines the data structures and methods it uses. For example, if randtrack_global_lock defines a global lock, the other implementations that do not need this global lock should not define or accidentally use it.
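For example, the guard and the corresponding Makefile rule might look as follows (a sketch only; the exact compiler flags are up to you):

#ifdef GLOBAL_LOCK
/* Only randtrack_global_lock defines and uses this lock. */
pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
#endif

and, in the Makefile:

randtrack_global_lock: randtrack.cc
	g++ -O3 -D GLOBAL_LOCK randtrack.cc -o randtrack_global_lock -lpthread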

Q1. Why is it important to #ifdef out methods and data structures that aren't used by different versions of randtrack?

Debugging

You can use gdb for debugging pthreads programs. The gdb command info threads shows all the threads in the program. The gdb command thread N switches execution to thread N.

For debugging, you will need to add the -g compilation flag in the Makefile. However, you should use the -O3 optimization flag for evaluation.

Parallelization

You should start by parallelizing the randtrack program using pthreads, following the example code shown in class. For more details about pthreads, see the introductory tutorial.

The command-line argument num_threads specifies the number of threads that should be created. With 1 thread, the thread should process all 8 streams. With 2 threads, each thread should process 4 streams, and so on.
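As a starting point, here is a minimal sketch of the thread creation and stream division (thread_arg and process_streams are hypothetical names, and the sketch assumes num_threads evenly divides the 8 streams, as in the cases above):

#include <pthread.h>

#define NUM_SEED_STREAMS 8

struct thread_arg {
    int first_stream;    /* index of this thread's first seed stream */
    int num_streams;     /* how many consecutive streams it processes */
};

void *process_streams(void *arg) {
    struct thread_arg *t = (struct thread_arg *)arg;
    for (int i = t->first_stream; i < t->first_stream + t->num_streams; i++) {
        /* generate and count the samples of seed stream i */
    }
    return NULL;
}

/* In main(), after parsing num_threads: */
pthread_t threads[NUM_SEED_STREAMS];
struct thread_arg args[NUM_SEED_STREAMS];
int per_thread = NUM_SEED_STREAMS / num_threads;
for (int i = 0; i < num_threads; i++) {
    args[i].first_stream = i * per_thread;
    args[i].num_streams = per_thread;
    pthread_create(&threads[i], NULL, process_streams, &args[i]);
}
for (int i = 0; i < num_threads; i++)
    pthread_join(threads[i], NULL);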

For fair comparison, do not change the size of the hash table array in the different implementations.

You can check the correctness of the output produced by the parallel version by comparing it with the output produced by the randtrack program that we have provided as follows:

randtrack 1 10 | sort > baseline.out
randtrack_global_lock 2 10 | sort > global_lock.out
cmp baseline.out global_lock.out

In the example above, randtrack_global_lock runs with 2 threads and both randtrack and randtrack_global_lock run with a sampling interval of 10.

The cmp program generates no output if the files match. Otherwise, it tells you the first character and line number at which the files differ, in which case you can use the diff program to see the differences between the files.

Once you have created the threaded version of the program and assigned different streams to the threads as described above, you will get incorrect results with multiple threads since accesses to the shared hash table are not synchronized.

Your task is to synchronize accesses to the hash table from the different threads so that the program generates correct results.

The first challenge is to identify the critical sections in the code, i.e., the parts of the code that access shared data and therefore need to be executed atomically. The rest of the code can run in parallel without synchronization. Then, use the following four synchronization methods to create four versions of the randtrack program.

Single Global Lock

Create a version of the program called randtrack_global_lock that performs synchronization using a single mutex lock around the critical sections in the code. Test your code to ensure that it produces the same output as the randtrack program, as described above.
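A minimal sketch of the locked critical section (it assumes the hash class provides lookup() and insert(), and that the sample class has a count member; check the provided sources):

pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* The whole lookup-or-insert-then-increment sequence must be atomic: */
pthread_mutex_lock(&global_lock);
sample *s = h.lookup(key);
if (s == NULL) {
    s = new sample(key);
    h.insert(s);
}
s->count++;
pthread_mutex_unlock(&global_lock);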

List-Level Locks

While a single global lock is easy to implement, its performance can degrade with an increasing number of threads due to contention. For performance, it is generally better to have multiple locks that protect smaller data structures. Such locks are called fine-grained locks.

Create a version of the program called randtrack_list_lock that creates a mutex lock for each slot in the hash array (i.e., one for each list) and synchronizes accesses using those locks. You will find that this synchronization method requires some thought.
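One possible shape, as a sketch only (HASH_SIZE and hash_index() are assumed names, and where this logic should live is for you to decide):

pthread_mutex_t list_locks[HASH_SIZE];

/* Initialize once, before the threads start: */
for (int i = 0; i < HASH_SIZE; i++)
    pthread_mutex_init(&list_locks[i], NULL);

/* Lock only the list that the key hashes to: */
unsigned idx = hash_index(key);    /* the same mapping the hash class uses */
pthread_mutex_lock(&list_locks[idx]);
/* lookup-or-insert-then-increment on the list in slot idx */
pthread_mutex_unlock(&list_locks[idx]);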

Q2. Can you implement this version without modifying the hash class, or without knowing its internal implementation? Explain.

Q3. Can you implement this version by solely modifying the hash class methods lookup() and insert()? Explain.

Q4. Can you implement this version by solely adding to the hash class a new method called lookup_and_insert_if_absent()? Explain.

Q5. Can you implement this version by solely adding new methods to the hash class called lock_list() and unlock_list()? Explain.

Element-Level Locks

An even finer-grained locking approach than list-level locks is element-level locks, where each element can be individually locked. Unlike list-level locks, element-level locks allow reader and writer threads to traverse the same list concurrently.

Create a version of the program called randtrack_element_lock that creates a mutex lock for every element (i.e., sample) and synchronizes accesses using those locks.

This implementation will require more sophisticated synchronization than the previous schemes. Think carefully about how your implementation handles the case when the key already exists in the hash table versus when it does not exist in the hash table. Also, think about the case when the linked list is empty versus non-empty.
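As a hedged starting point, each element can carry its own lock (the member names are assumptions, and this alone does not handle the insertion cases described above):

class sample {
    unsigned my_key;
    unsigned count;
    pthread_mutex_t lock;    /* protects this element's count */
public:
    sample(unsigned k) : my_key(k), count(0) {
        pthread_mutex_init(&lock, NULL);
    }
    void increment() {
        pthread_mutex_lock(&lock);
        count++;
        pthread_mutex_unlock(&lock);
    }
};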

Reduction

The previous parallel versions share the hash table across threads and synchronize on accesses to the hash table. Another approach is to assign each thread its own hash table in which it counts samples for its streams. Then, after all the samples have been processed, the counts from the per-thread hash tables are summed. This last step is called a reduction. Create a version of the program called randtrack_reduction that uses this approach.
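A minimal sketch of the structure (how to iterate over a hash table for the final merge depends on the hash class API, which you may need to extend):

/* One private table per thread; no locking is needed while counting. */
hash<sample, unsigned> tables[NUM_SEED_STREAMS];    /* at most 8 threads */

/* Thread i counts the samples from its streams into tables[i]. */

/* After all threads are joined, merge sequentially: for every sample
 * in tables[1] through tables[num_threads - 1], find or insert its key
 * in tables[0] and add its count. */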

Q6. What are the pros and cons of this approach?

Evaluation

For the lab, evaluate the runtime of each version of the program using the /usr/bin/time program. To get a reliable measurement, the machine should not be heavily loaded. You can check the load on a machine with the w command. The lab machines have 8 cores, so the load average can reach 8 or more when all the cores are busy. You should be able to get a relatively clean measurement when the load average is less than 1.0.
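For example, to time a single run while discarding the program's output:

/usr/bin/time randtrack_global_lock 4 10 > /dev/null

Here randtrack_global_lock runs with 4 threads and a sampling interval of 10.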

We have provided a script called run-experiment that you should use for your final evaluation. This script runs all the program versions with different numbers of threads and with sampling intervals of 10 and 100. It records the output and the timing of each run in a corresponding .out file and .time file. This script should take roughly 7-10 minutes to run.

After that, run the plot-experiment script. This script will generate two PDF files, plot-10.pdf and plot-100.pdf, that show the runtime of the different versions of randtrack for different numbers of threads. The plots also show the ideal performance curve based on scaling the performance of the randtrack program.

Initially, when you have not implemented any of the parallel versions, the run-experiment script will only run the randtrack program and the plot-experiment script will only show the ideal performance curve based on running the randtrack program. These scripts generate output for all the parallel versions you have implemented so far. You can also use variants of the run-experiment script during testing.

Q7. When the sampling_interval is 10, what is the overhead for each parallelization approach? Report this number as the ratio of the runtime of the parallel version when it uses one thread to the runtime of the single-threaded randtrack program.

Q8. When the sampling_interval is 10, explain how the different approaches compare to each other. If performance gets worse for a certain case, explain why that may have happened.

Q9. What are the most significant differences between the results when the sampling_interval is 10 versus 100? Can you explain these differences?

Q10. Which approach should OptsRus ship? Keep in mind that some customers might be using multi-core machines with more than 8 cores, while others might have fewer cores.

FAQ

For more details about the lab, read the Lab 4 FAQ.

Submission

Create a report that answers all of the questions above in a report.txt file. Be sure to submit all of the files necessary for building your solutions. Submit your assignment on one of the UG machines as follows:

submitece454f 4 *.cc
submitece454f 4 *.h
submitece454f 4 Makefile
submitece454f 4 report.txt
submitece454f 4 plot*.pdf

Your grade will be calculated as follows: