Home Up
Home Teaching Glossary ARM Processors Supplements Prof issues About

Performance – Some Examples


Example 1


A computer executes 2.5 x 1010 instructions in 12 seconds. All instructions except loads take 1 cycle, and loads take 4 cycles. The clock rate is 3,000 MHz. What fraction of the instruction workload consists of register loads?


Solution 1


We need to find the average number of cycles per instruction (which must be greater than 1) and then work out what fraction of loads are needed to achieve this figure.


In 12 seconds 2.5 x 1010 instructions are executed. This period corresponds to 12 x 3,000 x 106 = 3.6 x 1010 cycles.


Consequently, an instruction is executed in 3.6 x 1010/2.5 x 1010 = 1.44 cycles/instruction.


If the fraction of load instructions is f, the fraction of non-load instructions is (1- f). The average cycle time per instruction is 4 x f + 1 x (1 – f) = 3f + 1.


If 3f + 1 = 1.44, then 3f = 0.44 and f = 0.44/3 = 0.137. That is, 13.7% of instructions are load operations.



Example 2


Consider the following loop:


int sum = 0, int x[64]

for (j = 0; j < max, j++){

    sum+= x[j]/max;

}


This can be encoded in an ARM-like language as:


     ADR   r0,x       ; r0 points to array X

     MOV   r1,#max    ; r1 contains max

     MOV   r2,#0      ; r2 is loop variable j initialized to 0

     MOV   r3,#0      ; r3 is sum  initialized to 0

Next  LDR   r4,[r0]    ; get x[j]

     DIV   r4,r4,#max ; calculate r3/max (not ARM code)

     ADD   r3,r4,r4   ; add new element to running total

     ADD   r2,r2,#1   ; increment loop variable j

      ADD   r0,r0,#4   ; point to next element in X

     CMP   r2,r1      ; test for end of loop

     BNE   Next       ; repeat until all done


Note that the integers of array X are expressed as 32-bit values although the largest value is 0x000FFFFF. These instructions have the following characteristics


ADR                                    2 CPI

LDR                                     4 CPI

MOV, ADD,CMP, BNE       1 CPI

DIV                                    15 CPI


  1. What is the average CPI for this code if it is considered as static code?
  2. How many instructions does this code execute in total if the value of max is 50?
  3. What is the average dynamic CPI for this code
  4. If the clock frequency is 1 GHz, what is the execution time?
  5. How can this code be improved to reduce its execution time?
  6. How many cycles does the optimized code take to execute?

Solution 2


  1. There are 11 instructions and these take: 2 + 1 + 1 + 1 + 4 + 15 + 1 + 1 + 1 + 1 + 1 = 29 cycles.

The average CPI is 29/11 = 2.64.

  1. The total number of instructions executed is 4 + 50 x 7 = 354.
  2. The average CPI is the total number of cycles divided by the total number of instructions. That is:

[(2 + 1 + 1 + 1 ) + 50(4 + 15 + 1 + 1 + 1 + 1 + 1)]/(4 + 50 x 7) = 1205/354 = 3.40

  1. The execution takes 1,205 cycles. A cycle is 1 ns at a clock rate of 1 GHz. The total time is 1,205 ns or 1.205 ms.
  2. The code can be improved in several ways. The main improvement is to take the division out of the loop and divide at the end,  since p/a + q/a + r/a is (p+q+r)/a.  We also use two indices, the element address and the loop count, whereas we need only one.  Since the element address goes up in increments of 4, the last address is first + max x 4. We can now write:

     ADR   r0,x            ; r0 points to array X

     ADD   r1,r0,#4*max    ; r1 contains base + 4 x max (multiplication by 4 at compile time)

     MOV   r3,#0           ; r3 is sum  initialized to 0

Next  LDR   r4,[r0]         ; get x[j]

     ADD   r3,r4,r4        ; add new element to running total

     ADD   r0,r0,#4        ; point to next element in X

     CMP   r0,r1           ; test for end of loop

     BNE   Next            ; repeat until all done

     DIV   r3,r3,#max      ; calculate sum/max (take the division out of the loop)

 

  1. The code now takes 2 + 1 + 1 +  50 x (2 + 1 + 1 + 1) + 15 cycles = 4 + 250 + 15 = 269 cycles.