Home Up
Home Teaching Glossary ARM Processors Supplements Prof issues About

VLIW example


Consider a VLIW computer that can execute pairs of instructions. Each instruction is executed in one clock cycle. However, a load instruction and a multiply each take two clock cycles. When a 2-cycle instruction is encountered, the slot vertically below it is a  stall slot. The slot below it and adjacent can still be used. Note that the code is ARM-like and the multiply instruction has no restriction on the use of its registers and can also take a literal operand.


Write a suitable fragment of code to implement the following construct. A, B, and C are memory locations that are pointed at by [r8], [r8,#4], and [r8,#8], respectively.


18(A2 + B2) + 12C2 + 17C + 25


Solution


We can write down the simple serial code as follows.


LDR  r0,[r8]       ;Get A

MUL  r0,r0,r0      ;A*A

LDR  r1,[r8,#4]    ;Get B

MUL  r1,r1,r1      ;B*B

ADD  r0,r0,r1      ;A*A + B*B

MUL  r0,r0,#18     ;18(A*A + B*B)

LDR  r1,[r8,#8]    ;Get C reuse r1

MUL  r2,r1,r1      ;C*C

MUL  r3,r1,#12     ;12C*C

ADD  r2,r2,r3      ;18(A*A + B*B) + 12C*C

MUL  r1,r1,#17     ;17C

ADD  r2,r2,r10     ;18(A*A + B*B) + 12C*C + 17C

ADD  r2,r2,#25     ;18(A*A + B*B) + 12C*C + 17C + 25


There are no data hazards in this code. If it were executed, the total time would be 22 cycles (allowing for load and multiply stalls).


The next step is to convert it into VLIW code.


Execution unit 1                         Execution unit 2

LDR  r0,[r8]                     LDR  r1,[r8,#4]  

Stall load                                    Stall load  

MUL  r0,r0,r0                   MUL  r1,r1,r1  

Stall mult                                    Stall mult  

ADD  r0,r0,r1                   LDR  r1,[r8,#8]  

Stall – no code available         Stall load  

MUL  r3,r1,#12                 MUL  r2,r1,r1  

Stall mult                                     Stall mult  

ADD  r2,r2,r3                   MUL  r1,r1,#17  

ADD  r2,r2,r10                 Stall mult  

ADD  r2,r2,#25   

In this case the cycle time is 11 and one slot is lost due to a data hazard.

VLIW Example 2


Consider a VLIW with three operation slots per instruction. One slot is dedicated to load, store and branch operations. The other two slots perform all other operations. Load and multiply operations both have a 1 cycle penalty.


Consider the vector operation  zi = 6xi + yi + 12, where index i is from 0 to m 0 - 1. Write the code to perform this operation using loop unrolling with three iterations per loop. The processor is a pseudo ARM with a multiply without register restrictions.


We can write the following simple inline code. Blue is the setup portion. Black is the loop and red is the actual data processing part of the code. This non-VLIW codes would require 14 cycles per iteration (11 instructions and 3 delay slots).


    ADR   r0,Z

    ADR   r1,X

    ADR   r2,Y

    MOV   r3,#N     ;N is the loop count divided by 3

Loop LDR   r4,[r1]   ;get x

    LDR   r5,[r2]   ;get y

    ADD   r1,r1,#4  ;increment x pointer

    ADD   r2,r2,#4  ;increment y pointer

    MUL   r4,r4,#6  ;6x

    ADD   r4,r4,r5  ;6x + y

    ADD   r4,r4,#12 ;6x + y + 12

    STR   r4,[r0]   ;store z

    ADD   r0,r0,#4  ;increment z pointer

    SUBS  r3,r3,#1  ;decrement loop counter

    BNE   Loop


The following demonstrates the code running on a hypothetical VLIW computer. The blue stalls indicate a delay cycle and the red stall are due to an empty slot because no instruction is available.


Load Store Branch                                             Data processing                                             Data processing

Stall no operation          ADR   r0,Z                 ADR   r1,X

Stall no operation          ADR   r2,Y                 MOV   r3,#N  

LDR   r4,[r1]               ADD   r1,r1,#4             Stall no operation

Stall load                  Stall no operation         Stall no operation

LDR   r5,[r2]               MUL   r4,r4,#6             ADD   r2,r2,#4

Stall load                  Stall multiply             Stall no operation

Stall no operation          ADD   r4,r4,r5             Stall no operation

Stall no operation          ADD   r4,r4,#12            Stall no operation

STR   r4,[r0]               ADD   r0,r0,#4             SUBS  r3,r3,#1

BNE   Loop                  Stall no operation         Stall no operation


This code takes 8 cycles to execute one pass round the loop. Out of the 8 x 3 execution slots, only 11 are filled. That is, the efficiency is 11/24 = 46%. One iteration is performed in 8 cycles.


Now consider the same system but with loop unrolling. We will perform two cycles of iteration each time we pass round the loop. Below is the linear code. To simplify register naming, we’ve used r4 and rr4, etc., to indicate the same operation at cycle i and at cycle i + 1. Similarly, we’ll use x and xx and y and yy for the pairs of variables.


    ADR   r0,Z        ;r0 points to array z

    ADR   r1,X        ;r1 points to array x

    ADR   r2,Y        ;r2 points to array y

    MOV   r3,#N       ;N is the loop count divided by 2


Loop LDR   r4,[r1]     ;get x

    LDR   rr4,[r1,#4] ;get xx

     LDR   r5,[r2]     ;get y

    LDR   rr5,[r2,#4] ;get yy     

    ADD   r1,r1,#8    ;increment x pointer by 2 places

    ADD   r2,r2,#8    ;increment y pointer by 2 places

    MUL   r4,r4,#6    ;6x

    MUL   rr4,rr4,#6  ;6xx

    ADD   r4,r4,r5    ;6x + y

    ADD   rr4,rr4,rr5 ;6xx + yy

    ADD   r4,r4,#12   ;6x + y + 12

    ADD   rr4,rr4,#12 ;6xx + yy + 12

    STR   r4,[r0]     ;store z

    STR   rr4,[r0,#4] ;store zz

    ADD   r0,r0,#8    ;increment z pointer by 2 places

    SUBS  r3,r3,#1    ;decrement loop counter

    BNE   Loop


We can now write the VLIW code.


Load Store Branch                                                       Data processing                                                    Data processing

Stall no operation               ADR   r0,Z                     ADR   r1,X

Stall no operation               ADR   r2,Y                     MOV   r3,#N  

LDR   r4,[r1]                    Stall no operation             Stall no operation

Stall load                       Stall no operation             Stall no operation

LDR   r5,[r2]                    MUL   r4,r4,#6                 Stall no operation

Stall load                       Stall multiply                 Stall no operation

LDR   rr4,[r1,#4]                ADD   r4,r4,r5                 ADD   r1,r1,#8

Stall load                       ADD   r4,r4,#12                Stall no operation

LDR   rr5,[r2,#4]                ADD   r2,r2,#8                 MUL   rr4,rr4,#6

Stall load                       ADD   rr4,rr4,#12              Stall multiply

STR   r4,[r0]                    ADD   rr4,rr4,rr5              SUBS  r3,r3,#1

STR   rr4,[r0,#4]                ADD   r0,r0,#8                 BNE   Loop


Note that the code order has been slightly rearranged. The instruction in red, ADD rr4,rr4,#12, is out of sequence but can be executed while we are waiting for the next value of yy.


The total number of cycles per trip is 10 and the number of slots used is 30. Of these 17 are filled giving a slot efficiency of 17/30 = 57%. Since two iterations are performed per trip, the average number of cycles per iteration is 10/2 = 5.


Let’s modify the problem by allowing two slots to perform load and store operations.


Load Store Branch Load Store Data processing Data processing

Stall no operation               ADR   r0,Z                     ADR   r1,X

Stall no operation               ADR   r2,Y                     MOV   r3,#N  

LDR   r4,[r1]                    LDR   r5,[r2]                  Stall no operation

Stall load                       Stall load                     Stall no operation

LDR   rr4,[r1,#4]                MUL   r4,r4,#6                 MUL   rr4,rr4,#6

Stall load                       Stall multiply                 Stall multiply

LDR   rr5,[r2,#4]                ADD   r4,r4,r5                 ADD   r1,r1,#8  

Stall load                       ADD   r4,r4,#12                ADD   r2,r2,#8

STR   r4,[r0,#4]                 ADD   rr4,rr4,rr5              Stall no operation

Stall no operation               ADD   rr4,rr4,#12              SUBS  r3,r3,#1

STR   rr4,[r0,#4]                ADD   r0,r0,#8                 BNE   Loop


In this case we’ve shaved a cycle off a loop. Now a trip takes 9 cycles (27 slots). The occupancy is 17/27 = 63%.


Now consider the case when all three slots can handle any type of instruction.

Operation1                                                   Operation 2                                              Operation 3

ADR   r0,Z                       ADR   r1,X                     ADR   r2,Y

MOV   r3,#N                      Stall no operation             Stall no operation

LDR   r4,[r1]                    LDR   r5,[r2]                  LDR   rr4,[r1,#4]       

Stall load                       Stall load                     Stall load

MUL   r4,r4,#6                   MUL   rr4,rr4,#6               LDR   rr5,[r2,#4]

Stall multiply                   Stall multiply                 Stall load

ADD   rr4,rr4,rr5                ADD   r4,r4,r5                 ADD   r1,r1,#8

ADD   rr4,rr4,#12                ADD   r4,r4,#12                ADD   r2,r2,#8

STR   rr4,[r0,#4]                STR   r4,[r0,#4]               SUBS  r3,r3,#1

ADD   r0,r0,#8                   BNE   Loop                     Stall no operation


Now we have 8 cycles per trip using 24 slots of which 17 are filled. The occupancy is 17/24 = 71%. Only one slot per trip is unused because of no available data. Two iterations are performed per trip which means that one iteration takes 8/2 = 4 cycles.

Click here for more detail of the instruction sequencing