gcc - Manually vectorized code 10x slower than auto-optimized - what did I do wrong?
I'm trying to learn how to exploit vectorization with gcc. I followed the tutorial by Erik Holk (source code here).
I modified it to use double. I used this dot product to compute the multiplication of randomly generated square matrices of 1200x1200 doubles (300x300 double4). I checked that the results are the same. What surprised me is that the simple dot product is 10x faster than my manually vectorized one.
Maybe double4 is too big for SSE (would it need AVX2?). But I would expect that even when gcc cannot find a suitable instruction for dealing with a double4 at once, it would still be able to exploit the explicit information that the data comes in big chunks for auto-vectorization.
Details:

The results were:

    dot_simple: time elapsed 1.90000 [s] 1.728000e+09 evaluations => 9.094737e+08 [ops/s]
    dot_sse: time elapsed 15.78000 [s] 1.728000e+09 evaluations => 1.095057e+08 [ops/s]

I used gcc 4.6.3 on an Intel® Core™ i5 CPU 750 @ 2.67GHz × 4 with these options:

    -std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math

or with -O2 (the result is the same).

I did it using python/scipy.weave() for convenience; I hope that doesn't change anything.

The code:

    double dot_simple( int n, double *a, double *b ){
        double dot = 0;
        for (int i=0; i<n; i++){
            dot += a[i]*b[i];
        }
        return dot;
    }

and one explicitly using gcc vector extensions:

    double dot_sse( int n, double *a, double *b ){
        const int vector_size = 4;
        typedef double double4 __attribute__ ((vector_size (sizeof(double) * vector_size)));
        double4 sum4 = {0};
        double4* a4 = (double4 *)a;
        double4* b4 = (double4 *)b;
        for (int i=0; i<n; i++){
            sum4 += *a4 * *b4 ;
            a4++; b4++;
            //sum4 += a4[i] * b4[i];
        }
        union { double4 sum4_; double sum[vector_size]; };
        sum4_ = sum4;
        return sum[0]+sum[1]+sum[2]+sum[3];
    }

Then I used the multiplication of a 300x300 random matrix to measure performance:

    void mmul( int n, double* a, double* b, double* c ){
        int n4 = n*4;
        for (int i=0; i<n4; i++){
            for (int j=0; j<n4; j++){
                double* ai = a + n4*i;
                double* bj = b + n4*j;
                c[ i*n4 + j ] = dot_sse( n, ai, bj );
                //c[ i*n4 + j ] = dot_simple( n4, ai, bj );
                ijsum++;
            }
        }
    }

The scipy weave code:

    from scipy import weave

    def mmul_2(a, b, c, __force__=0 ):
        code = r''' mmul( Na[0]/4, a, b, c ); '''
        weave_options = {
            'extra_compile_args': ['-std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math'],
            'compiler' : 'gcc',
            'force' : __force__ }
        return weave.inline(code, ['a','b','c'], verbose=3, headers=['"vectortest.h"'], include_dirs=['.'], **weave_options )

One of the main problems is that your function dot_sse loops over n items when it should only loop over n/2 items (or n/4 with AVX).
To fix this with GCC's vector extensions you can do this:

    double dot_double2(int n, double *a, double *b ) {
        typedef double double2 __attribute__ ((vector_size (16)));
        double2 sum2 = {};
        int i;
        double2* a2 = (double2*)a;
        double2* b2 = (double2*)b;
        for(i=0; i<n/2; i++) {
            sum2 += a2[i]*b2[i];
        }
        double dot = sum2[0] + sum2[1];
        for(i*=2;i<n; i++) dot +=a[i]*b[i];
        return dot;
    }

The other problem with your code is that it has a dependency chain. Your CPU can do an SSE addition and an SSE multiplication simultaneously, but only on independent data paths. To fix this you need to unroll the loop. The following code unrolls the loop by 2 (but you need to unroll by 3 for the best results).

    double dot_double2_unroll2(int n, double *a, double *b ) {
        typedef double double2 __attribute__ ((vector_size (16)));
        double2 sum2_v1 = {};
        double2 sum2_v2 = {};
        int i;
        double2* a2 = (double2*)a;
        double2* b2 = (double2*)b;
        for(i=0; i<n/4; i++) {
            sum2_v1 += a2[2*i+0]*b2[2*i+0];
            sum2_v2 += a2[2*i+1]*b2[2*i+1];
        }
        double dot = sum2_v1[0] + sum2_v1[1] + sum2_v2[0] + sum2_v2[1];
        for(i*=4;i<n; i++) dot +=a[i]*b[i];
        return dot;
    }

Here is a version using double4, which I think is what you wanted with your original dot_sse function. It's ideal for AVX (though it still needs to be unrolled), but it will still work with SSE2 as well. In fact, with SSE it seems GCC breaks it into two chains, which effectively unrolls the loop by 2.

    double dot_double4(int n, double *a, double *b ) {
        typedef double double4 __attribute__ ((vector_size (32)));
        double4 sum4 = {};
        int i;
        double4* a4 = (double4*)a;
        double4* b4 = (double4*)b;
        for(i=0; i<n/4; i++) {
            sum4 += a4[i]*b4[i];
        }
        double dot = sum4[0] + sum4[1] + sum4[2] + sum4[3];
        for(i*=4;i<n; i++) dot +=a[i]*b[i];
        return dot;
    }

If you compile this with FMA enabled it will generate FMA3 instructions. I tested all of these functions here (you can edit and compile the code yourself as well): http://coliru.stacked-crooked.com/a/273268902c76b116
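
The double4 version above keeps a single accumulator, so each addition still waits on the previous one. As a rough sketch of what unrolling it could look like, here is a variant with two independent accumulators. The name dot_double4_unroll2 is made up for illustration, and the memcpy loads are my own assumption: they are there so the function does not depend on the inputs being 32-byte aligned, which the pointer-cast versions above implicitly require. I have not benchmarked this variant.

```c
#include <string.h>

typedef double double4 __attribute__ ((vector_size (32)));

/* Hypothetical variant of dot_double4 with two independent accumulators,
   so consecutive vector additions do not depend on each other. */
double dot_double4_unroll2(int n, const double *a, const double *b)
{
    double4 sum4_v1 = {0.0, 0.0, 0.0, 0.0};
    double4 sum4_v2 = {0.0, 0.0, 0.0, 0.0};
    int i;

    /* Process 8 doubles (two double4 vectors) per iteration. */
    for (i = 0; i + 8 <= n; i += 8) {
        double4 a1, a2, b1, b2;
        /* memcpy loads avoid any alignment requirement on a and b
           (assumption: the caller's arrays may be unaligned). */
        memcpy(&a1, a + i,     sizeof a1);
        memcpy(&b1, b + i,     sizeof b1);
        memcpy(&a2, a + i + 4, sizeof a2);
        memcpy(&b2, b + i + 4, sizeof b2);
        sum4_v1 += a1 * b1;
        sum4_v2 += a2 * b2;
    }

    /* Combine the two chains, then reduce horizontally. */
    double4 sum4 = sum4_v1 + sum4_v2;
    double dot = sum4[0] + sum4[1] + sum4[2] + sum4[3];

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++) dot += a[i] * b[i];
    return dot;
}
```

It is called the same way as dot_double4, e.g. dot_double4_unroll2(n, a, b). Whether two chains are enough depends on the add latency and throughput of the target CPU, so treat the unroll factor as something to measure.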
Note that using SSE/AVX for a single dot product in matrix multiplication is not the optimal use of SIMD. You should do two (four) dot products at once with SSE (AVX) for double-precision floating point.
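
To make "two dot products at once" concrete, here is a minimal sketch with SSE-sized vectors in which each SIMD lane accumulates a different dot product, so no horizontal sum is needed at the end. The function name dot2_columns, the parameter ldb, and the memory layout are assumptions of mine and not part of the code above: b is taken to be row-major with row stride ldb, so that b[k*ldb + 0] and b[k*ldb + 1] are the k-th elements of two adjacent columns.

```c
#include <stddef.h>

typedef double double2 __attribute__ ((vector_size (16)));

/* Hypothetical helper: computes two dot products at once,
     out[0] = sum over k of a[k] * b[k*ldb + 0]
     out[1] = sum over k of a[k] * b[k*ldb + 1]
   Assumes b is row-major with row stride ldb (an assumption of this
   sketch; the mmul in the question walks rows of both a and b). */
void dot2_columns(int n, const double *a, const double *b, int ldb, double *out)
{
    double2 sum2 = {0.0, 0.0};
    for (int k = 0; k < n; k++) {
        /* Broadcast a[k] into both lanes. */
        double2 ak = { a[k], a[k] };
        /* Two adjacent column entries of row k; building the vector from
           scalars avoids any alignment requirement on b. */
        double2 bk = { b[(size_t)k * ldb], b[(size_t)k * ldb + 1] };
        sum2 += ak * bk;
    }
    /* Each lane already holds a finished dot product. */
    out[0] = sum2[0];
    out[1] = sum2[1];
}
```

In a matrix multiplication this would fill c[i][j] and c[i][j+1] in one pass over k. The AVX analogue would use a four-wide vector and produce four results per pass. As with the single dot product, several independent accumulators would likely still be needed to hide the add latency.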