project_presentations_reports_source_code [ECE1724S: Programming Massively Parallel Graphics Processors with CUDA]
Programming Massively Parallel Processors Using CUDA Site Home Page
Writing /fs1/eecg/moshovos/a/a3/moshovos/public_www/CUDA08/data/cache/6/6823abca23497483cfb88eba0a477179.i failed
Unable to save cache file. Hint: disk full; file permissions; safe_mode setting.
Writing /fs1/eecg/moshovos/a/a3/moshovos/public_www/CUDA08/data/cache/6/6823abca23497483cfb88eba0a477179.i failed
Unable to save cache file. Hint: disk full; file permissions; safe_mode setting.
Writing /fs1/eecg/moshovos/a/a3/moshovos/public_www/CUDA08/data/cache/6/6823abca23497483cfb88eba0a477179.xhtml failed

FOREX Analysis Using GPUs

John Pazelli, Tijmen Tieleman
Electrical Engineering, Computer Science
john.pazelli AT, tijmen AT

Abstract We will describe our project in 250 words or less: We used a GPU to efficiently optimize the parameters to various technical indicators often used on chart analysis for currency trading. Issues encountered include: the limited amount of shared memory; how to precompute some values using parallelism; how to distribute tasks over the multiprocessors; and figuring out where the GPU is needed and where it isn't.

  • Presentation: None available.
  • Report: (pdf).
  • Source code: (zip).

Force Directed Placement: GPU Implementation

Bernice Chan, Ivan So, Mark Teper
University of Toronto
bernice521 AT, AT, mark.teper AT

Abstract Automated graph drawing remains a difficult placement and layout problem. This problem is difficult in part due to the complexity of formulating good algorithms to draw graphs which are aesthetically pleasing for human visualization. In this project, we implemented a force-directed placement algorithm using CUDA that will solve the graph layout problem by using an energy minimization technique. This work aims to demonstrate the performance advantage of a GPU implementation as compared to a CPU implementation. Taking advantage of the inherent parallel computational power in the GPU and with some additional memory bandwidth optimizations for our algorithm, the GPU implementation was able to achieve up to a 58x speed-up as compared to the CPU.

Neural Network on CUDA

Daniel L. Ly, Volodymyr Paprotski, Danny Yen
Department of Electrical and Computer Engineering, University of Toronto
lyd AT, paprots AT, danny.yen AT

Abstract This project focuses on how GPUs can be used to take advantage of the inherent parallelism in neural networks to provide a better implementation in terms of performance. We focused on the Restricted Boltzmann machine, a popular type of neural network. The implemented algorithm resulted in a computational speed of 672 million connections-per-second and a speed-up of 66-fold over an optimized C++ program running on a 2.83GHz Intel processor.

Implementing a Speech Recognition System on a GPU using CUDA

Omid Talakoub, Astrid Yi
University of Toronto, University of Toronto
omid.talakoub AT, astridyi AT

Abstract Speech recognition engines can be used in many different applications. An example of such an application is the execution of uttered commands by aircraft pilots to alleviate their workload. All the practical applications rely on the achievement of very high recognition accuracy. Two of the prominent methods include dynamic time warping and hidden Markov model. This paper examines an implementation of the dynamic time warping algorithm with mel scale frequency cepstral coefficients on a graphics processing unit (GPU) using CUDA. By offloading some of the work from the CPU to the GPU, it is shown that this approach is feasible and yields a speed increase of a factor of 5.8.

Speeded-Up Speeded-Up Robust Features

Paul Furgale, Chi Hay Tong, Gaetan Kenway
University of Toronto Institute for Aerospace Studies
paul.furgale AT, ctong AT, kenway AT

Abstract Feature detection and matching is one of the fundamental problems in low-level computer vision. Given a series of images of the same object, feature detection and matching algorithms try to (1) repeatably detect the same point of interest in every image, regardless of the scale and orientation of the object, (2) match each point of interest from one image with the same point in another image. Feature detection and matching forms the basis for many higher-level algorithms such as object recognition, robot navigation, panorama stitching, and three-dimensional scene modeling. The Scale Invariant Feature Transform (SIFT) has become the standard algorithm for this task. However, while its performance is impressive on difficult matching tasks, it is too computationally expensive to use in online or real-time applications. In response to this, Bay et al. created the Speeded-Up Robust Feature (SURF), which uses an integral image (essentially a 2D prefix sum) to approximate many of the calculations in the SIFT algorithm. The SURF algorithm is much faster than SIFT but it is still not fast enough to use in real-time applications. Our interest is the use of feature detection an matching to enable autonomous robot navigation. To this end, we have implemented a pure CUDA version of the SURF algorithm. Our implementation exhibits a 3 to 40 times speedup (depending on the compute capability of the hardware) and we have verified its correctness through rigorous comparison with a parallel CPU implementation.

  • Presentation: (pdf).
  • Report: (pdf).
  • Source code: Please contact the authors (

Dense Image Over-segmentation on a GPU

Alex Rodionov
arod at

Abstract I implemented an image processing algorithm that takes an image and partitions it into segments called superpixels. It was done using CUDA on an NVIDIA GTX280 and runs at 2-3 frames per second on a 640×480 image, with room for future performance enhancements.

* Presentation: (ppt), (pdf). * Report: (pdf). * Source code: (zip).

On GPU-based Parallel Peer-to-Peer Simulation

Di Niu, Zhengjun Feng
ECE University of Toronto, ECE University of Toronto
dniu at, zhengjun.feng at

Abstract Peer-to-peer content distribution systems have become enormously popular on today's Internet. In this project, we explore the feasibility of using GPU to expdite parallel P2P simulation performed in rounds. We consider the problem of disseminating k data blocks from a single source to all the nodes in a network by letting each peer transmit a block to one its neighbors in each round. Two commonly used disseminating algorithms are considered: 1) Random Useful Block (RUB), where each peer chooses a random neighbor to transmit a random useful block in each round, and 2) Global Rarest First (GRF), where each peer tries to transmit the rarest block to a random neighbor. To optimize the performance, we devised 2 algorithms for GPU-based RUB simulation either with or without using shared memory, and 3 algorithms for GRF performing parallelization either at the peer level or the data block level. Using sequential CPU code as a benchmark, we can obtain up to 8x speedup for RUB simulation with a small error, and up to 25x speedup for GRF.

Convolutional Neural Networks for Object Classification

Alex Krizhevsky
Department of Computer Science, University of Toronto
akrizhevsky at

Abstract I implemented a convolutional neural network with one layer of convolution. I tested it on the CIFAR-10 dataset, which consists of 6000 32×32 colour images in each of 10 classes. The convolutional net does well on the classification task and takes roughly 140x less time to train than a CPU implementation.


Michael Kipper, Joshua Slavkin, Dmitry Denisenko
University of Toronto
michael at, jslavkin at, dmitry.denisenko at

Abstract We ported C implementation of AES encryption algorithm to run on Cuda GPU. This resulted in the speedup of up to 14.5x.

Quantum Computer Simulation Using CUDA

Alexander Smith, Khashayar Khavari
University of Toronto Department of Electrical and Computer Engineering, University of Toronto Institute for Aerospace Studies
khashayar.khavari AT, alexander.smith AT


Quantum computers promise immense improvements in computational power over existing classical computers. Due to quantum superposition of states, a quantum computer allows for the simultaneous manipulation of all possible combinations of a set of bits in a single operation, giving some algorithms an exponential speed-up as compared to their classical versions. Currently, the physical realization of a practical quantum computer has yet to be achieved. Until then, simulations are used to validate quantum algorithms.

Classical simulations of quantum algorithms are time-consuming because of the need to simulate an exponential number of states in parallel. This inherent parallelism makes the problem suitable for implementation on a graphics processor. In this project, we have selected the Quantum Fourier Transform, which lies at the heart of many other well-known quantum algorithms, for implementation with CUDA. We have achieved a 160-fold speedup over “libquantum”, an open-source CPU quantum computer simulation toolkit.

Transient Stability of Power System

Amirhassan Asgari Kamiabad
Electrical Engineering, Energy System Group
amirhassan.asgarikamiabad AT

Abstract Transient stability analysis examines the dynamic behavior of a power system for as much as several seconds following a power disturbance.The computational complexity of transient stability analysis has limited the ability to run this type of analysis to support decision making at the time of a disturbance and prevent cascading failures. This project investigates the feasibility of performing transient stability analysis on graphics multiprocessors, an emerging general purpose parallel platform. The transient stability analysis involves solution of extremely large systems of differential and algebraic equations. Various integration techniques are available to transform the differential equations to a set of nonlinear equations. Typically, these equations are solved with iterative methods such as the Newton-Raphson algorithm, which in turn requires the solution of large sparse linear systems. It has been shown in prior research that direct methods (e.g., LU decomposition) perform poorly on parallel platforms, particularly on massively parallel processors akin to the GPU; therefore, iterative Conjugate Gradient method is implemented for the linear solver. The fact that iterative methods involve just matrix addition and multiplication, which are highly parallelizable, suggest significant speed-up potential. In order to improve the convergence rate of the conjugate gradient method, Chebychev Polynomial Preconditionner is implemented on GPU. The preconditionner effects the condition number of jacobian matrix and reduce the total number of CG iterations. Number of iterations before and after preconditioning, execution time and speed up for CG and preconditionner process is reported.

  • Presentation: (ppt),(pdf).
  • Report: (pdf).
  • Source code: Please contact the author amirhassan.asgarikamiabad AT .

Parallelizing an Analytical Placer

Bryce Leung, David Goldman, Jungmoo Oh
Dept of Computer Engineering, University of Toronto
leungbry at, david.goldman at, moo.oh at

Abstract Analytical Placement (AP) is a CAD algorithm for ASIC placement. This algorithm generates a system of linear equations, which we solve using Gaussian elimination. We demonstrate how the specific sub-problem of Gaussian elimination on AP input matrices has the unique property where the pivot row is always in place, and show how this greatly enhances our ability to parallelize the algorithm. We demonstrate why a sparse-matrix implementation of the elimination algorithm would be difficult if not impossible to implement on a GPU. We then parallelize a full-matrix implementation of the placer on a GPU using CUDA, and show good speedups versus an optimized CPU implementation.

Help Conquer Cancer: Using GPUs to Accelerate Protein Crystallography Image Analysis

Roy Bryant, Adin Scannell, Olga Irzak, Christian A. Cumbaa
Dept of Computer Science, University of Toronto
rbryant at

Abstract As part of the greater effort to find a cure for cancer, the Ontario Cancer Institute (OCI) is working to improve the throughput of protein crystallography, which is used to study protein structure. Since it's difficult to predict under which conditions crystals will form, many automated experiments with hundreds or thousands of different solutions are photographed over the course of days or weeks. OCI has written software to extract image features that can be used to identify successful experiments, but the task is computationally intensive. Processing the current image backlog will require over 100,000 years of CPU time, so even running on many thousands of cores, the CPU-based software won't finish until 2015. In our project, we have implemented a CUDA-enabled version of the image analysis software that runs 22 times faster, which we hope will help cancer researchers determine the structure of cancer-related proteins faster.

  • Presentation: (pdf).
  • Report: (pdf).
  • Source code: Contact the authors.


Pranit Patel, Jeff Wong, Tatikonda Manisha, Jarek Marczewski
???, AMD
Manisha.Tatikonda at

Abstract The goal of this project was to explore the potential performance improvements that could be gained through the use GPU processing techniques within the CUDA architecture for JPEG compression algorithm. The choice of compression algorithms as the focus was based on examples of data level parallelism found within the algorithms and a desire to explore the effectiveness of cooperative algorithm management between the system CPU and an available GPU. In our project we ported JPEG Compression algorithm in CUDA and achieved xtimes performance.

Micro-benchmarking the GT200 GPU

Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, Henry Wong
Computer Group, ECE, University of Toronto,,

Abstract Graphics processors (GPU) are interesting for non-graphics parallel computation because of the potential for more than an order of magnitude of speedup over CPUs. Because the GPU is often presented as a C-like abstraction like Nvidia’s CUDA, little is known about the hardware architecture of the GPU beyond the high-level descriptions documented by the manufacturer.

We develop a suite of micro-benchmarks to measure the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. We measure properties of the arithmetic pipelines, the stack-based handling of branch divergence, and the warp-granularity operation of the barrier synchronization instruction. We confirm that global memory is uncached with ~441 clock cycles of latency, and measure parameters of the three levels of instruction and constant caches and three levels of TLBs.

We succeed in revealing more detail about the GT200 architecture than previously disclosed.

* Presentation: (odp), (pdf). * Report: (pdf). * Source code: (zip).


project_presentations_reports_source_code.txt · Last modified: 2009/06/30 12:50 by yperxristis