Andreas Moshovos

Spring 2008

 

Subroutines

 

Structured programming relies on subroutines. Restricting our attention to C, we could write a subroutine that accepts three numerical arguments and returns their sum. This subroutine can then be used (or called) from many different places in our program. For example, we could write the following in C:

 

int add3 (int a, int b, int c)
{
   return a + b + c;
}

int sum = 0;

int main (void)
{
   sum += add3 (1, 2, 3);
   sum += 10;
   sum += add3 (10, 20, 30);
   return 0;
}

 

For C subroutines to work as intended we need the following functionality:

 

We should be able to call a subroutine from anywhere in our program. By “call” we mean being able to change control flow so that the routine is executed.

We should be able to pass parameters that may take different values across different calls.

A subroutine must be able to return a value.

A subroutine must be able to change control flow so that execution continues immediately after the point where it was called. Since a subroutine can be called from many different places this suggests that the routine should be able to differentiate between them and “return” to the right spot depending on where it was called from.

 

We can see these requirements at work in our example:

 

The add3 subroutine is called from two different places. The place a routine is called from is also referred to as a call site. The three parameters that add3 accepts take different values at each call site: (1,2,3) and (10,20,30). add3 “returns” a value, which is the sum of its three arguments. So, every time add3 is called, it calculates a result, and that result is returned as the subroutine’s value. This value can then be used in an expression (e.g., add3(10,20,30) evaluates to 60). The first time add3 is called we expect execution to resume immediately after the call: once add3(1,2,3) returns, execution resumes by adding the return value to sum. Then the “sum += 10” statement is executed. The second time add3 is called, execution again resumes immediately after that particular call.

 

Let’s review the way control flows in time with subroutines with the aid of the add3 example:

This diagram also shows a call site, the caller and the callee. The call site is the place a subroutine is called. The caller is the subroutine that is making the call. The callee is the subroutine that is called.

 

We can now look at the list of required functionalities and, since we are interested in how to implement subroutines in assembly, add concerns/requirements that relate to machine-level constructs. Thus we need to provide answers to the following questions:

 

  1. How does the subroutine return to the point immediately after the call site?
  2. Where and how does it return a value?
  3. Where and how are we passing arguments to a subroutine?
  4. Where and how are we allocating storage for any local variables (i.e., variables that belong to the subroutine)?
  5. What happens to register values once a subroutine is called? Do we require that the subroutine preserve their values, or is it OK to overwrite some registers?

 

We will address each of these issues in turn. For most machines there is a set of rules that all valid subroutines must follow. This set of rules is called the calling convention. A given calling convention is not the only viable way to implement subroutines; it is simply the particular solution that someone decided on at some point. If we want a subroutine to interoperate correctly with other subroutines (possibly written by others), we have to follow this set of rules. This way someone else can also use our subroutines. We will be describing the calling convention used by gcc for the NIOS II family of processors. Other conventions exist for NIOS II, but all the code in a program must follow the same one; otherwise special code must be devised to translate from one form to the other (and we do not want that).

 

There are a few different options for providing the aforementioned functionality. We will present the solution used in NIOS II and once this is understood we will discuss some of the other options. Key to supporting subroutines in NIOS II is the use of a stack. This stack is used to provide the functionality explained in points 1 through 5 above. We first explain how a stack can be implemented in NIOS II machine code and then explain how the stack is used to support subroutines.

 

STACK:

Let’s review what a stack is. A stack is a last-in first-out (LIFO) queue. In more detail, the stack is a data structure for which three operations are defined:

 

  1. Push value
  2. Pop
  3. Top (distance)

 

The first operation adds a new element onto the stack. The order in which elements are added onto the stack is important. Internally, the elements are placed in a queue following exactly the order in which they were inserted. Push takes a single argument, which is the value we will insert onto the stack. After a push, the number of stack elements increases by one. The pop operation removes the most recently inserted element from the stack. After a pop, the number of items in the stack is reduced by one. If the stack is empty then pop is not a valid operation.

 

Top returns the value of a stack element without removing it from the stack. Top accepts a single argument, which specifies the distance of the element we are interested in from the most recently inserted element. So, top(0) returns the value of the most recently inserted element (this is also called the top of the stack, which corresponds to viewing the stack as a vertical queue with elements being placed on top of each other). Top(3) returns the value of the 4th element in the stack (as measured starting from the top).

 

For example, assuming that initially the stack is empty here’s an example of how the stack operates:

 

  1. Push 3 → (3)
  2. Push 4 → (4, 3), where 4 is the top element
  3. Push 10 → (10, 4, 3)
  4. Top(2) → returns 3
  5. Top(1) → returns 4
  6. Pop → returns 10 and then the stack becomes (4, 3)

 

The NIOS II STACK:

 

By convention the NIOS II uses a stack to support subroutines (the stack is also used to support interrupts and in other machines to support OS calls). The stack is implemented in memory with register r27 pointing to the top of the stack. That is, r27 (also called sp) contains a value which corresponds to the address of the top element of the stack in memory. The NIOS II stack (very much like most stacks for other CPU families) grows towards lower addresses. The initial value of the stack pointer typically depends on the machine and the operating system running. On the DE2, there is initialization code that sets the stack pointer to 0x17fff80 before your code is called. This initialization code makes a call to your main() function. This code is in file crt0.o, which is linked with your program. In most machines and OSes today a program is laid out in memory so that the program instructions appear first. Then follows the statically defined data (global variables), and then comes the heap (see note below), which grows towards higher addresses. The stack starts at the very end of memory and grows downwards:

 

0x1000000    Instructions (.text section)

             Statically defined data (global variables in C)
             (.data and .bss sections)

             Heap: Dynamically allocated data

0x17fff80    Stack

 

In NIOS II there is no way to represent an empty stack. Register r27 always has a value and this always corresponds to a valid address (they are both 32-bit numbers). If you use r27 to read from memory you will read the element at the top of the stack. If you did not push an element, you are just reading the value that happens to be where r27 points. It may be garbage, but you will still be able to read a value. So, whether the stack is empty and how many elements it holds are concepts that only the programmer understands; it’s all a matter of convention and of using the appropriate operations in sequence to achieve the effect we desire.

 

On another related note, an element in memory is within the stack if its address is equal to or higher than the value of r27. Finally, we can use sp instead of r27. “sp” is an alias for “r27”.

 

(*) the heap is used for dynamic memory allocation in several programming languages. For example, in C when you use malloc() this is where the memory comes from. Similarly, in C++ you use “new” to get memory in the heap. In Java, objects are allocated within the heap. There is no need for hardware support for the heap. It’s all done with appropriate manipulation of register or memory variables. That is, the heap is a software construct. But for our purposes we can safely ignore this topic for the time being.

 

PUSH value: Here’s how push can be implemented in NIOS II assembly. Let’s assume that the value we want to push onto the stack is in register r9:

            subi sp, sp, 4    # grow the stack by four bytes (one word)

            stwio r9, 0(sp)   # save the value of r9 at the top of the stack

 

POP: Let’s assume that in addition to removing an element from the top of the stack pop also returns its value into register r9.

            ldwio r9, 0(sp)   # read the top value

            addi sp, sp, 4    # increment the stack pointer, thus removing the top element

 

Top (index): Assuming that all elements in the stack are words, we can access the element at index i (counting from 0 at the top) using a short sequence of instructions. For example, if i is in register r9, then we can use:

            add r9, r9, r9    # r9 = 2 * i (r9 holds i on entry)

            add r9, r9, r9    # r9 = 4 * i

            add r9, sp, r9    # the address of the element in the stack is sp + 4*i

            ldwio r9, 0(r9)   # load the value of the element at index i into r9

 

If “i” is not a variable but a constant we can use just a load to read the value from the stack. For example:

 

            ldwio r9, 16(sp)

 

reads the element at index 4, that is Top(4), the fifth element from the top of the stack, assuming that all elements are words.

 

To avoid writing overly long hexadecimal numbers in the example that follows, we assume that the stack started at 0x70000. Let’s assume that the stack has the following values (all values in hexadecimal):

 

       Address    +0  +1  +2  +3

Sp --> 0x6fff0    01  02  03  04

       0x6fff4    10  20  30  40

       0x6fff8    11  22  33  44

       0x6fffc    55  66  77  88

 

Here are a few examples:

 

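ldwio r9, 8(sp) → r9 = mem[sp + 8] = mem[0x6fff0 + 8] = mem[0x6fff8] = 0x44332211 (NIOS II is little-endian, so the byte at the lowest address, 0x11, is the least significant byte of the word).

ldbio r10, 0xd(sp) → r10 = mem[sp + 0xd] = mem[0x6fffd] = 0x66 (this reads a single byte and sign-extends it to 32 bits prior to writing it into r10).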

 

Requirement 1: Calling a subroutine and having it return to the caller

 

Having explained the stack we are now ready to explain how the first requirement can be satisfied using the stack. Let’s focus only on control flow for the time being and thus use subroutines that do not accept arguments and do not return values. Let’s use the following C code as our example:

 

void coo (void);

void boo (void)
{
            coo ();
}

void coo (void)
{
            return;
}

 

In this example function boo calls function coo. Here’s the assembly code:

 

            .text

 boo:

            call coo

           

coo:

            ret

 

There are two new instructions: “call coo” and “ret”. “Call coo” does:

           

r31 = PC + 4

            PC = coo

 

“Ret” is equivalent to “jmp r31” as it does:

            PC = r31

 

(It would seem that “ret” is superfluous. Functionally it is. However, it facilitates performance-enhancing techniques where the processor exploits the fact that it knows in advance that this instruction returns from a subroutine. This will be explained if you take the Computer Architecture course.)

 

So, boo(), prior to calling coo(), saves in register r31, the address of the instruction that it wants coo() to return to. This address is the address of the call plus four since every instruction is four bytes long. When coo returns, it simply uses “jmp r31”. In the assembler you can use the alias “ra” (for return address) instead of r31.

 

The aforementioned example works without needing a stack because we have only one call. Let’s see what happens if coo() were to make a few calls also. Let’s look at the following code:

 

void coo (void);
void doo (void);

void boo (void)
{
            coo ();
            doo ();
            return;
}

void coo (void)
{
            doo ();
            return;
}

void doo (void)
{
            return;
}

 

 

Notice that boo calls coo, which then calls doo. After doo returns to coo and coo returns to boo, boo calls doo directly. Focusing on doo, we see an example where a function is called from two different places and is supposed to return to a different spot for each of those calls.

 

The “trick” is to apply the following convention: if a function will be calling another function, it has to save the ra value on the stack at the beginning and restore it from the stack prior to returning. The NIOS II code is as follows:

 

            .text

boo:                         # boo will be making calls, so it first pushes the ra value on the stack   

            subi sp, sp, 4  

            stwio ra,0(sp)   # push the return address onto the stack

 

            call coo         # resume execution at coo, ra = PC + 4 = boo_ret1

boo_ret1:

 

            call doo         # continue execution at doo, ra = PC + 4 = boo_ret2

boo_ret2:                                        

            ldwio ra, 0(sp)   # pop return address from the stack

            addi sp, sp, 4

            ret               # resume execution there

 

coo:

            subi sp, sp, 4  

            stwio ra,0(sp)    # push the return address onto the stack

            call     doo      # continue execution at doo, ra = PC + 4 = coo_ret

coo_ret:

           

            ldwio ra, 0(sp)   # pop coo's return address (boo_ret1) from the stack

            addi sp, sp, 4

            ret              # resume execution there

 

doo:                         # doo will not be making any calls, no need to save ra on the stack

            ret              # just return to whoever called

 

 

The first two instructions of boo push boo’s own return address (the value ra holds on entry to boo) onto the stack. The “call coo” that follows then overwrites ra with the return address for that call, which is the address of the instruction that follows the call in the calling function. We use the label “boo_ret1” to refer to this address, which is the PC of the “call coo” plus 4.

 

            subi sp, sp, 4                            # make space on the stack for a word

            stwio ra, 0(sp)                           # save return address for boo onto stack

            call coo                                  # ra = PC + 4, PC = coo, continue execution at coo

 

After these instructions have been executed, the PC points to “coo” and the stack contains a single word whose value is the address of the instruction that boo should eventually return to. In coo, the first two instructions push onto the stack the return address for coo, which in this case is boo_ret1; when coo returns, execution should continue immediately after the “call coo”. So, at this stage, the stack contains two words. At the top of the stack is the return address for coo. Below it is the return address for boo.

 

Since doo does not call any other routines, it does not need to save “ra” on the stack. It can simply return to whoever called it. To do this, “doo” uses “ret” (equivalently, “jmp ra”) to resume execution at that address.

Once doo returns, the PC will point to coo_ret. The instruction sequence starting at coo_ret pops the saved return address for coo and then returns to it. Specifically:

 

ldwio ra, 0(sp)             # restore ra from the stack, ra becomes boo_ret1

addi sp, sp, 4              # adjust the stack (value popped)

jmp ra                      # return to boo at address boo_ret1

 

Back in boo, we then call doo. This changes the PC to point to doo while ra becomes boo_ret2, which is the address of the “call doo” plus four. Doo again uses ra to return to whoever called it. After doo returns, we go to boo_ret2. Finally, boo pops the saved return address from the stack and returns to whichever function called it.

 

Notice that while the first time doo was called the return address was pointing in coo, during the second call to doo the return is in boo. Also note that while boo called coo, coo was able to call doo. The stack thus allowed us to implement nested calls and have each one of them return appropriately.

 

Here’s how the particular code compiles with the DE2 compiler (note that here I used “jmp ra” instead of “ret”, which is equivalent; in retrospect it would have been better to stick to ret, but I realized that only after I had put in the work of creating the images):

 

01000000 <boo>:

 1000000:       deffff04        addi    sp,sp,-4

 1000004:       dfc00035        stwio   ra,0(sp)

 1000008:       100001c0        call    100001c <coo>

 

0100000c <boo_ret1>:

 100000c:       10000340        call    1000034 <doo>

 

01000010 <boo_ret2>:

 1000010:       dfc00037        ldwio   ra,0(sp)

 1000014:       dec00104        addi    sp,sp,4

 1000018:       f800683a        jmp     ra

 

0100001c <coo>:

 100001c:       deffff04        addi    sp,sp,-4

 1000020:       dfc00035        stwio   ra,0(sp)

 1000024:       10000340        call    1000034 <doo>

 

01000028 <coo_ret>:

 1000028:       dfc00037        ldwio   ra,0(sp)

 100002c:       dec00104        addi    sp,sp,4

 1000030:       f800683a        jmp     ra

 

01000034 <doo>:

 1000034:       f800683a        jmp     ra

 

Here’s how the code executes step-by-step on DE2. In this example, we assume that when boo is called ra holds the value 0x1000068, this is the return address for boo(). We also assume that the stack initially points to 0x20000c. This is not the actual value used on DE2 but it does not matter. The point is that as long as the stack does not overflow into .data or .text the code will work independently of where the stack actually is.

 

Initial State:

 

After the first instruction executes:

After the third instruction executes.

 

Limitations of the call instruction

 

In general, the “call” instruction accepts a label as its argument. There are limitations on how far away the called function can be. The label is encoded in the call instruction using a 26-bit immediate. The actual target is calculated by multiplying that immediate by 4 (since each instruction is four bytes long) and concatenating the result with the upper 4 bits (bits 31 through 28) of the call’s PC. Here’s the encoding of the call instruction: