DEC(HP) Motion Video Instruction Extensions for Alpha MIPS MIPS 3D SUN Visual Instruction Set Motorola Velocity Engine (AltiVec) Intel MMX,SSE,SSE2,SSE3Currently, utilisation of SIMD instructions through compilers are very limited. In many cases, programmers must resort to use assembly languages or intrinsic routines when he is to exploit SIMD instruction sets.
In this sub theme, we develop program code analysis and transformation method which enables the compiler to generate efficient code including SIMD instructions for those applications shown above, and implement it in our infrastructure.
We have made survey of SIMD type instruction set extensions of recent processors in their basic features of SIMD and specific features more than SIMD[1].
Following is the list of instruction sets which we have surveyed.
| Vendor | Instruction Set | Availability |
|---|---|---|
| SUN | VIS(TM) | after UltraSPARC-I |
| Intel | MMX(R) Technology | after Pentium-II |
| Intel | Streaming SIMD Extensions | after Pentium-III |
| Intel | Streaming SIMD Extensions 2 | after Pentium 4 |
| AMD | 3DNow!(TM) Technology | after K6-2 |
| AMD | Enhanced 3DNow!(TM) Technology | Athlon, Dulon |
| Motorola | AltiVec(TM) Technology | MPC7400, MPC7410 (PowerPC G4) |
| MIPS | MIPS V ISA Extension(currently a part of MIPS_64) | R5000 |
| MIPS | MDMX (Mips Digital Media Extension) | MIPS64 5kc |
| MIPS | MIPS-4D(TM) ASE (Application Specific Extension) | MIPS64 420K, MIPS64 20Kc |
SIMD instructions including complex and data size conscious instructions such as saturation arithmetics and rounding operations. It is impossible to make use of those instructions with simplified instruction description manner that only associates operators in higher-level languages and their corresponding instructions.
To exploit features of SIMD instruction sets which have generally low parallelism but low execution cost, schedulings of instructions including conventional ones are needed. And the compiler must analyse dependencies between instructions precisely.
Then, we made precise and systematic description of functions of the instructions including conventional ones for following instruction set architecture. Really, they describe precise behaviour of instruction execution such as change of condition flags and behaviour on overflow.
To realize following three features
We decided to place the SIMD optimisation on the last phase of LIR level optimisation in the compiler path.
1. Virtual resister assignments
2.1 1st LIR level optimisation
2.2 2nd LIR level optimisation
:
2.n-1 n-1th LIR level optimisation
2.n SIMD optimisation
IF conversion with logical operations
:
SIMD register assignment
3. Frame assignment
4. Pattern matching
5. Physical register assignment
6. Machine code level optimisation
7. Assembly code generation
In this sub theme, we use SIR(S-expression Intermediate Representation) on the implementation. The SIR has some additional features to the 1st version of LIR which is required for the SIMD optimisation, and its' implementation is different from 1st version of LIR. Then, we link it to the infrastructure with a data format conversion.
We developed a code analysis method based on "data size inference".
It guarantees the same result as the integral promotion rule is
strictly obeyed even when narrower processing data sizes than the integral
type are selected.
This analysis scans tree for an right part of an assignment statement at
bottom-up manner and top-down manner in the order.
In the bottom-up scan, it infers a value range for each node, and in the
top-down scan, it infers an available bit set for each node.
Code generator in the back end can select optimal processing data size
for SIMD instructions with matching of instruction template and the available
bit set of each node.
As the result, the parallelism of the SIMD processing can be increased,
useless data size conversions are omitted, matching to instructions
with more complex processing is enabled, and the execution efficiency
can be increased.
Currently, it is implemented in the 2nd version of back end as a prototype.
We tested the prototype shown above with a part of movie compression program. Then we got an about 5 times of speedup.