SIMD Parallelisation

1.The Target

This sub theme belongs to the category of processor level parallelisation. Media processing applications often treat homogeneous data with relatively small size such as PCM audio and bitmaps of pictures. Then, recent processors have equipped with SIMD (Single Instruction Multiple Data stream) type media processing instruction (SIMD instruction for short) sets extensions, which is aimed to accelerate the processing speed for those data. The followings are examples of such SIMD instruction sets:
DEC(HP)	Motion Video Instruction Extensions for Alpha
MIPS	MIPS 3D
SUN	Visual Instruction Set
Motorola Velocity Engine (AltiVec)
Intel	MMX,SSE,SSE2,SSE3
Currently, utilisation of SIMD instructions through compilers are very limited. In many cases, programmers must resort to use assembly languages or intrinsic routines when he is to exploit SIMD instruction sets.

In this sub theme, we develop program code analysis and transformation method which enables the compiler to generate efficient code including SIMD instructions for those applications shown above, and implement it in our infrastructure.

2. Load map

3. Method

While SIMD parallelisation can share some techniques with vector processor parallelisation, it needs new techniques for specific features of the SIMD instruction sets shown below: Then we decide to implement following three optimisation method.

4. Current Result

  1. Survey of SIMD instruction sets

    We have made survey of SIMD type instruction set extensions of recent processors in their basic features of SIMD and specific features more than SIMD[1].

    Following is the list of instruction sets which we have surveyed.
    VendorInstruction Set Availability
    SUN VIS(TM) after UltraSPARC-I
    Intel MMX(R) Technologyafter Pentium-II
    Intel Streaming SIMD Extensionsafter Pentium-III
    Intel Streaming SIMD Extensions 2after Pentium 4
    AMD 3DNow!(TM) Technology after K6-2
    AMD Enhanced 3DNow!(TM) TechnologyAthlon, Dulon
    MotorolaAltiVec(TM) TechnologyMPC7400, MPC7410 (PowerPC G4)
    MIPS MIPS V ISA Extension(currently a part of MIPS_64)R5000
    MIPS MDMX (Mips Digital Media Extension)MIPS64 5kc
    MIPS MIPS-4D(TM) ASE (Application Specific Extension)MIPS64 420K, MIPS64 20Kc

  2. Systematic description of functions of the instructions

    SIMD instructions including complex and data size conscious instructions such as saturation arithmetics and rounding operations. It is impossible to make use of those instructions with simplified instruction description manner that only associates operators in higher-level languages and their corresponding instructions.

    To exploit features of SIMD instruction sets which have generally low parallelism but low execution cost, schedulings of instructions including conventional ones are needed. And the compiler must analyse dependencies between instructions precisely.

    Then, we made precise and systematic description of functions of the instructions including conventional ones for following instruction set architecture. Really, they describe precise behaviour of instruction execution such as change of condition flags and behaviour on overflow.

  3. Development of SIMD optimisation prototype

    To realize following three features

    Implementation of the prototype is divided into following three phases.

    We decided to place the SIMD optimisation on the last phase of LIR level optimisation in the compiler path.

      1. Virtual resister assignments
      2.1 1st LIR level optimisation
      2.2 2nd LIR level optimisation
            :
      2.n-1 n-1th LIR level optimisation
      2.n SIMD optimisation
        IF conversion with logical operations
            :
        SIMD register assignment
      3. Frame assignment
      4. Pattern matching
      5. Physical register assignment
      6. Machine code level optimisation
      7. Assembly code generation
    

  4. Linking to the Infrastructure

    In this sub theme, we use SIR(S-expression Intermediate Representation) on the implementation. The SIR has some additional features to the 1st version of LIR which is required for the SIMD optimisation, and its' implementation is different from 1st version of LIR. Then, we link it to the infrastructure with a data format conversion.

  5. Data Size Inference

    We developed a code analysis method based on "data size inference". It guarantees the same result as the integral promotion rule is strictly obeyed even when narrower processing data sizes than the integral type are selected. This analysis scans tree for an right part of an assignment statement at bottom-up manner and top-down manner in the order. In the bottom-up scan, it infers a value range for each node, and in the top-down scan, it infers an available bit set for each node. Code generator in the back end can select optimal processing data size for SIMD instructions with matching of instruction template and the available bit set of each node. As the result, the parallelism of the SIMD processing can be increased, useless data size conversions are omitted, matching to instructions with more complex processing is enabled, and the execution efficiency can be increased.
    Currently, it is implemented in the 2nd version of back end as a prototype.

  6. Preliminary evaluation of the SIMD parallelisation

    We tested the prototype shown above with a part of movie compression program. Then we got an about 5 times of speedup.