Problem: Find a way to parallelize this code:

    for (i = 0; i < 100000; i++)
        a[i + 1000] = a[i] + 1;

Sample Solution: Joseph Garvey

Here are my two solutions to example 16 from the lecture:

1.) You can apply the same technique we used in example 13: have one thread compute a[0], a[1000], a[2000], ...; another thread compute a[1], a[1001], a[2001], ...; and so on. This allows us to use up to 1000 threads to parallelize the loop.

2.) My second idea involves some code transformation but allows us to fully parallelize the loop. Replace the recursive definition of a[i+1000] with an analytical one, taking advantage of the fact that a[i+1000] = a[i%1000] + i/1000 + 1 (integer division). Now there are no loop-carried dependences, since the only values of a that are ever read are a[0]-a[999], which are never written to. Thus the entire loop can be parallelized. To optimize further, the division can be removed completely by breaking the code into 2 loops like this (in pseudocode):

    for i = 0 to 99
        for j = 0 to 999
            a[(i+1)*1000 + j] = a[j] + i + 1

In this case all 100,000 iterations of "a[(i+1)*1000 + j] = a[j] + i + 1" (i.e. both loops together) can be fully parallelized because there are no dependences.

-------------------------------------------------------------------
Comments for Joe Garvey: excellent explanation, fully detailed and clear; offers 2 solutions, the "obvious" one parallelizing the chunks of 1000 items at a time, as well as the loop-transformation one, where he rewrites the formula to fully eliminate dependences and allow full parallelism.
-------------------------------------------------------------------
Comments for Peter Goodman: inner-loop parallelism; solution is fine.
-------------------------------------------------------------------
Comments for Remi Dufour: it looks like he got the first half of the idea right.
-------------------------------------------------------------------
Comments for Yunan Zhao: inner-loop parallelism; solution is fine.
-------------------------------------------------------------------
Comments for Reza Nakhjavani:

First submitted solution: Some really weird indices (98000) that don't make much sense. It is also not clear whether inner- or outer-loop parallelism is intended, though that may be because of incorrect syntax; I'm not picking on his syntax, it's just not clear which is meant, and outer-loop parallelism would not work with those indices.

Later updated solution: This is correct; it is the analytical solution, whereby he fully rewrites the code into something completely different that achieves the same result. All dependences are eliminated, and thus the code is fully parallelizable.
-------------------------------------------------------------------
Comments for Ehsan Nasiri: good approach; performs a loop transformation such that the whole thing is fully parallelizable (eliminates dependences). Good out-of-the-box thinking.
-------------------------------------------------------------------
Comments for Abderahmane Allalou: I think it's ok, but the explanation is pretty vague. He has the overall idea, even if he didn't express it very clearly.
-------------------------------------------------------------------
Comments for Steve Chun-Hao Hu: solution is correct; runs the 1000-element chunks all in parallel.
-------------------------------------------------------------------
Comments for Akshay Kumar: correct, the standard inner-loop parallelism solution.
-------------------------------------------------------------------
Comments for Alan Lu: Yes, this is the analytical loop-transformation solution, where he rewrites the code to do the same thing.
-------------------------------------------------------------------
Comments for Bozhidar Lenchov: This looks like the standard inner-loop parallelism solution; the difference from the solutions proposed above is that instead of incrementing the outer loop index by 1000 and using that in the inner loop, he multiplies the inner loop index by 1000 explicitly, achieving the same effect. So, as far as I can tell, it's the same inner-loop parallelism solution with a slight variation in how the loop indices are computed (but with the same effect). In summary, it looks ok.
-------------------------------------------------------------------
Comments for Mohammad Zahedi: Solution given in a pdf file with a nice graphical diagram. It uses a separate temp array to store elements, but it is basically the same idea as the code-refactoring solution: he rewrites the code to produce the same outcome by directly placing the final value in each element. The code transformation eliminates dependences, so the code is fully parallelizable.
-------------------------------------------------------------------
Comments for Wei Ye: Good solution, good explanation, but poor cache locality and not full parallelism.
-------------------------------------------------------------------
Comments for Venkatesh Nandakumar: Correct solution, good explanation, but poor cache locality and not full parallelism.
-------------------------------------------------------------------
Comments for Mike Dai Wang: the code-refactoring solution; very good, no problems whatsoever.
-------------------------------------------------------------------
Comments for Alton Chiu: good solution - both the inner-loop parallelism approach and a variation of the analytical code-refactoring solution.