gcc - Manually vectorized code 10x slower than auto optimized - what I did wrong?

March 15, 2012

I'm trying to learn how to exploit vectorization with gcc. I followed the tutorial by Erik Holk (source code here).

I modified it to use double. I used this dot product to compute the multiplication of randomly generated square matrices of 1200x1200 doubles (300x300 double4). I checked that the results are the same. What surprised me is that the simple dot product is 10x faster than my manually vectorized one.

Maybe double4 is too big for SSE (it would need AVX2?). But I would expect that even when gcc cannot find a suitable instruction for dealing with a double4 at once, it should still be able to exploit the explicit information that the data comes in big chunks for auto-vectorization.

Details:

The results were:

dot_simple: time elapsed 1.90000 [s] 1.728000e+09 evaluations => 9.094737e+08 [ops/s]
dot_sse:    time elapsed 15.78000 [s] 1.728000e+09 evaluations => 1.095057e+08 [ops/s]

I used gcc 4.6.3 on an Intel® Core™ i5 CPU 750 @ 2.67GHz × 4 with these options: -std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math, or with -O2 (the result is the same).

I did this using python/scipy.weave() for convenience; I hope it doesn't change anything.

The code:

double dot_simple( int n, double *a, double *b ){
    double dot = 0;
    for (int i=0; i<n; i++){
        dot += a[i]*b[i];
    }
    return dot;
}

and one explicitly using gcc vector extensions:

double dot_sse( int n, double *a, double *b ){
    const int vector_size = 4;
    typedef double double4 __attribute__ ((vector_size (sizeof(double) * vector_size)));
    double4 sum4 = {0};
    double4* a4 = (double4 *)a;
    double4* b4 = (double4 *)b;
    for (int i=0; i<n; i++){
        sum4 += *a4 * *b4 ;
        a4++; b4++;
        //sum4 += a4[i] * b4[i];
    }
    union { double4 sum4_; double sum[vector_size]; };
    sum4_ = sum4;
    return sum[0]+sum[1]+sum[2]+sum[3];
}

Then I used the multiplication of 300x300 random matrices to measure performance:

void mmul( int n, double* a, double* b, double* c ){
    int n4 = n*4;
    for (int i=0; i<n4; i++){
        for (int j=0; j<n4; j++){
            double* ai = a + n4*i;
            double* bj = b + n4*j;
            c[ i*n4 + j ] = dot_sse( n, ai, bj );
            //c[ i*n4 + j ] = dot_simple( n4, ai, bj );
            ijsum++;  // evaluation counter, declared outside this snippet
        }
    }
}

The scipy weave code:

from scipy import weave

def mmul_2(a, b, c, __force__=0):
    # Na is the shape array that weave generates for the numpy array a
    code = r'''
    mmul( Na[0]/4, a, b, c );
    '''
    weave_options = {
        'extra_compile_args': ['-std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math'],
        'compiler': 'gcc',
        'force': __force__,
    }
    return weave.inline(code, ['a','b','c'], verbose=3, headers=['"vectortest.h"'], include_dirs=['.'], **weave_options)

One of the main problems is that in your function dot_sse you loop over n items when you should only loop over n/2 items (or n/4 with AVX).

To fix this with GCC's vector extensions you can do this:

double dot_double2(int n, double *a, double *b ) {
    typedef double double2 __attribute__ ((vector_size (16)));
    double2 sum2 = {};
    int i;
    double2* a2 = (double2*)a;
    double2* b2 = (double2*)b;
    for(i=0; i<n/2; i++) {
        sum2 += a2[i]*b2[i];
    }
    double dot = sum2[0] + sum2[1];
    for(i*=2; i<n; i++) dot += a[i]*b[i];  // scalar tail for odd n
    return dot;
}

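The other problem with your code is that it has a dependency chain: every addition into the accumulator has to wait for the previous one to finish. Your CPU can do an SSE addition and an SSE multiplication simultaneously, but only on independent data paths. To fix this you need to unroll the loop with separate accumulators. The following code unrolls the loop by 2 (but you probably need to unroll by 3 for the best results).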

double dot_double2_unroll2(int n, double *a, double *b ) {
    typedef double double2 __attribute__ ((vector_size (16)));
    double2 sum2_v1 = {};
    double2 sum2_v2 = {};
    int i;
    double2* a2 = (double2*)a;
    double2* b2 = (double2*)b;
    for(i=0; i<n/4; i++) {
        sum2_v1 += a2[2*i+0]*b2[2*i+0];
        sum2_v2 += a2[2*i+1]*b2[2*i+1];
    }
    double dot = sum2_v1[0] + sum2_v1[1] + sum2_v2[0] + sum2_v2[1];
    for(i*=4; i<n; i++) dot += a[i]*b[i];  // scalar tail
    return dot;
}
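As a rough sketch of what unrolling by 3 could look like, here is a variant with three independent accumulators. The name dot_double2_unroll3 and the load2 helper are made up for this illustration and are not part of the tested code. The sketch also swaps the pointer casts for memcpy loads so that it does not assume 16-byte aligned input; that is a choice of this sketch, not something the functions above do.

```c
#include <string.h>

typedef double double2 __attribute__ ((vector_size (16)));

/* Hypothetical helper: load two doubles without assuming 16-byte alignment. */
static inline double2 load2(const double *p) {
    double2 v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Dot product with three independent accumulator chains (unroll by 3). */
double dot_double2_unroll3(int n, const double *a, const double *b) {
    double2 s0 = {0.0, 0.0};
    double2 s1 = {0.0, 0.0};
    double2 s2 = {0.0, 0.0};
    int i = 0;
    /* Each iteration consumes 6 doubles: 3 vectors of 2 lanes. */
    for (; i + 6 <= n; i += 6) {
        s0 += load2(a + i)     * load2(b + i);
        s1 += load2(a + i + 2) * load2(b + i + 2);
        s2 += load2(a + i + 4) * load2(b + i + 4);
    }
    /* Combine the three chains, then reduce the two lanes. */
    double2 s = s0 + s1 + s2;
    double dot = s[0] + s[1];
    /* Scalar tail for the remaining 0..5 elements. */
    for (; i < n; i++) dot += a[i] * b[i];
    return dot;
}
```

Whether three chains actually beat two depends on the add and multiply latencies of the particular CPU, so it is worth timing both.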

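Here is a version using double4, which I think is really what you wanted with your original dot_sse function. It's ideal for AVX (though it still needs to be unrolled), but it will still work with SSE2 as well. In fact, with SSE it seems GCC breaks it into two chains, which effectively unrolls the loop by 2.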

double dot_double4(int n, double *a, double *b ) {
    typedef double double4 __attribute__ ((vector_size (32)));
    double4 sum4 = {};
    int i;
    double4* a4 = (double4*)a;
    double4* b4 = (double4*)b;
    for(i=0; i<n/4; i++) {
        sum4 += a4[i]*b4[i];
    }
    double dot = sum4[0] + sum4[1] + sum4[2] + sum4[3];
    for(i*=4; i<n; i++) dot += a[i]*b[i];  // scalar tail
    return dot;
}

If you compile this with FMA it will generate FMA3 instructions. I tested all these functions here (you can edit and compile the code yourself as well): http://coliru.stacked-crooked.com/a/273268902c76b116
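Which instructions GCC can pick for double2/double4 depends on the target flags the file was built with (for example -msse2, -mavx, -mfma). A small sketch for checking that from inside the code, using the predefined macros __SSE2__, __AVX__ and __FMA__; the function names native_double_lanes and fma_enabled are invented for this illustration:

```c
/* Widest native double-precision vector, in lanes, for this build.
 * 4 lanes: AVX (256-bit), 2 lanes: SSE2 (128-bit), 1: scalar only. */
int native_double_lanes(void) {
#if defined(__AVX__)
    return 4;
#elif defined(__SSE2__)
    return 2;
#else
    return 1;
#endif
}

/* 1 if the compiler is allowed to emit FMA3 instructions
 * (e.g. the file was built with -mfma), 0 otherwise. */
int fma_enabled(void) {
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}
```

If native_double_lanes() returns 2, a double4 operation still compiles, but GCC has to split it into two 128-bit operations.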

Note that using SSE/AVX for a single dot product in matrix multiplication is not the optimal use of SIMD. You should do two (four) dot products at once with SSE (AVX) for double-precision floating point.
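A minimal sketch of the "two dot products at once" idea with SSE-sized vectors: each lane of the accumulator holds a different element of c, so no horizontal sum is needed at the end. It assumes plain row-major n x n matrices with n even and b not transposed, which differs from the mmul in the question (that one reads b row by row). The name mmul_double2 is made up for this illustration, and the inner loop is not unrolled or blocked, so treat it as a starting point only.

```c
#include <assert.h>
#include <string.h>

typedef double double2 __attribute__ ((vector_size (16)));

/* c = a * b for row-major n x n matrices; n is assumed to be even. */
void mmul_double2(int n, const double *a, const double *b, double *c) {
    assert(n % 2 == 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j += 2) {
            /* Lane 0 accumulates c[i][j], lane 1 accumulates c[i][j+1]. */
            double2 acc = {0.0, 0.0};
            for (int k = 0; k < n; k++) {
                /* Two neighbouring elements of row k of b (unaligned load). */
                double2 bk;
                memcpy(&bk, b + k*n + j, sizeof bk);
                /* Broadcast a[i][k] to both lanes. */
                double aik = a[i*n + k];
                double2 av = {aik, aik};
                acc += av * bk;
            }
            /* Store both results with one unaligned store. */
            memcpy(c + i*n + j, &acc, sizeof acc);
        }
    }
}
```

With AVX the same pattern applies with a 4-lane vector and j advancing by 4.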
