c++ - MASM beats unoptimised .cpp but not unoptimised .c using VS

I have a simple function that transforms a vector (float*) using a row-major matrix (float**):

    int vector_by_matrix(float** m, float* v, float* out, int size)
    {
        int i, j;
        float temp;

        if (!m || !v || !out) return -1;

        for (i = 0; i < size; i++)
        {
            temp = 0;

            for (j = 0; j < size; j++)
            {
                temp += m[i][j] * v[j];
            }

            //out[i] = temp * v[i]; mistake during copying - should've been...
            out[i] = temp;
        }

        return 0;
    }

The code is compiled as C++ (x64) using the Visual Studio (2013) C++ compiler and, without optimisation, it is quite slow (the function is called hundreds/thousands of times during a run, and the size of the system is typically large, c. size = 10000). With optimisation set high (O2) and the floating point mode set to fast, the performance gain is huge (x20). However, I decided to convert the file to a .c source file and compile it as C using VS again - it is simple procedural code anyway. Performance improved again (over the optimised C++ compilation), with or without optimisation. In fact, the optimisation settings had little effect on performance.

I don't understand why the C code is faster (optimised/unoptimised). I disassembled the output of the C(/C++) compiler and it looks horrendous - I wrote the same function in MASM and, at a fifth of the code, it couldn't compete in terms of speed. Does VS optimise the compiled C code anyway? Looking at the disassembled code, I can't be sure. Here is the MASM code if it helps:

    mul_vector_by_martix proc
        mov r10, r9
        sub rsp, 8
        mov qword ptr[rsp], r11

        li:
            mov rbx, qword ptr[r10*8+rcx[0]-8]
            xorps xmm0, xmm0
            mov r11, r9

            lj:
                movss xmm1, dword ptr[r11*4+rbx[0]-4]
                mulss xmm1, dword ptr[r11*4+rdx[0]-4]
                addss xmm0, xmm1
                sub r11, 1
            jnz lj

            movss dword ptr[r10*4+r8[0]-4], xmm0
            sub r10, 1
        jnz li

        mov r11, qword ptr[rsp]
        add rsp, 8
        ret
    mul_vector_by_martix endp

I won't supply the disassembled code - the question is long enough ;)

Thanks in advance for any help.

Update

I got round to this again today. I have implemented packed instructions (the current implementation only works if the system size is a multiple of 4, otherwise you'll crash):

    mul_opt_vector_by_martix proc
        sub rsp, 8
        mov qword ptr[rsp], r12
        sub rsp, 8
        mov qword ptr[rsp], r13

        ; copy rdx arithmetic operations
        mov r10, rdx

        ; init static global
        mov r12, lstep
        cmp vsize, r9
        je loops

        ; sizeof(vector)
        mov rax, 4
        mul r9
        mov r12, rax

        ; number of steps in inner loop
        mov r11, 16
        mov rax, r12
        div r11
        mov r11, rax
        mov r12, r11
        mov rax, 16
        mul r12
        mov r12, rax
        sub r12, 16

        mov vsize, r9
        mov lstep, r12

    loops:
        li:
            mov rbx, qword ptr[r9*8+rcx[0]-8]
            xorps xmm0, xmm0
            mov r13, r12

            lj:
                movaps xmm1, xmmword ptr[r13+rbx[0]]
                mulps xmm1, xmmword ptr[r13+r10[0]]

                ; add packed single floating point numbers
                movhlps xmm2, xmm1
                addps xmm2, xmm1
                movaps xmm1, xmm2
                shufps xmm2, xmm2, 1 ; imm8 = 00 00 00 01
                addss xmm2, xmm1
                addss xmm0, xmm2

                sub r13, 16
                cmp r13, 0
            jge lj

            movss dword ptr[r9*4+r8[0]-4], xmm0
            sub r9, 1
        jnz li

        mov r13, qword ptr[rsp]
        add rsp, 8
        mov r12, qword ptr[rsp]
        add rsp, 8
        ret
    mul_opt_vector_by_martix endp

It improves things by 20-30%, but again it can't compete with the unoptimised compiled C code. The disassembled code for the inner loop:

                sum += v[j] * m[i][j];
    movsxd      rax,r8d
    add         rdx,8
    movups      xmm0,xmmword ptr [rbx+rax*4]
    movups      xmm1,xmmword ptr [r10+rax*4]
    lea         eax,[r8+4]
    movsxd      rcx,eax
    add         r8d,8
    mulps       xmm1,xmm0
    movups      xmm0,xmmword ptr [rbx+rcx*4]
    addps       xmm2,xmm1
    movups      xmm1,xmmword ptr [r10+rcx*4]
    mulps       xmm1,xmm0
    addps       xmm3,xmm1
    cmp         r8d,r9d
    jl          vector_by_matrix+90h (07fedd321440h)
    addps       xmm2,xmm3
    movaps      xmm1,xmm2
    movhlps     xmm1,xmm2
    addps       xmm1,xmm2
    movaps      xmm0,xmm1
    shufps      xmm0,xmm1,0f5h
    addss       xmm1,xmm0

At this point I have to concede that I can't see where the gains are. I haven't bothered rebuilding the code as C++ to see if the assembly is different, but I suspect that in unoptimised mode C++ doesn't lend itself to fast code as well as C does with the VS compiler. Perhaps frankie_c's point is pertinent. It worries me, though, if the compiler is doing something it shouldn't - I can't see anything wrong, though. In my experience half-decent hand-written assembly will outperform unoptimised C, but not here with this compiler. Floating point operations need strict control over issues of precision, otherwise results can vary from one machine to another, and methods that need to converge can fail on one machine but not on another due to instabilities.
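To make the precision concern concrete, here is a minimal sketch (not taken from the code above) showing that float addition is not associative, which is why a compiler or hand-written loop that reorders a sum can change the result. The function names are hypothetical, and the behaviour described assumes IEEE 754 single precision with strict (not fast) floating point settings.

```cpp
// Adds three floats with the grouping (a + b) + c.
float sum_left_first(float a, float b, float c) {
    volatile float t = a + b;  // volatile forces the intermediate to be stored as a float
    return t + c;
}

// Adds the same three floats with the grouping a + (b + c).
float sum_right_first(float a, float b, float c) {
    volatile float t = b + c;  // 1.0f is lost here if b is large in magnitude
    return a + t;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f, the first grouping gives 1 and the second gives 0, because -1e8f + 1.0f rounds back to -1e8f in single precision. A packed (four-lane) reduction groups the terms differently from the scalar loop, so small differences between the two are expected.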

Update 2

It seems things have gone quiet, so I thought I'd let you know that I got some more improvement. I can match the compiler by rearranging the operations in the loops shown in the last update. It is quite obvious: move the packed shuffling and addition outside the inner loop. Again, due to the implicit size of the "vectorisation", the size of the system has to be a multiple of 4 (crash otherwise).

    loops:
        li:
            mov rbx, qword ptr[r9*8+rcx[0]-8]
            xorps xmm0, xmm0
            mov r13, r12

            lj:
                movaps xmm1, xmmword ptr[r13+rbx[0]]
                mulps xmm1, xmmword ptr[r13+r10[0]]

                ; add , accrue
                addps xmm0, xmm1

                sub r13, 16
                cmp r13, 0
            jge lj

            ;------------ moved block outside --------------;
            ; add packed single floating point numbers
            movhlps xmm1, xmm0
            addps xmm1, xmm0
            movaps xmm0, xmm1
            shufps xmm1, xmm1, 1 ; imm8 = 00 00 00 01
            addss xmm0, xmm1
            ;--------------------end block---------------------------

            movss dword ptr[r9*4+r8[0]-4], xmm0
            sub r9, 1
        jnz li
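The same rearrangement can be sketched in portable C++ for anyone who finds the assembly hard to follow. This is only an illustration under stated assumptions: dot4 and vector_by_matrix_lanes are hypothetical names, the four-element array stands in for one XMM register, and the size must be a multiple of 4 as in the assembly above.

```cpp
#include <cassert>
#include <cstddef>

// Dot product of one matrix row with the vector. Four partial sums (one per
// "lane") are accumulated in the loop and reduced once after it.
// Assumption: n is a multiple of 4.
float dot4(const float* row, const float* v, std::size_t n) {
    assert(n % 4 == 0);
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t j = 0; j < n; j += 4) {
        for (int k = 0; k < 4; ++k) {
            lane[k] += row[j + k] * v[j + k];  // corresponds to mulps + addps
        }
    }
    // Horizontal add, done once per row instead of once per iteration.
    // Same grouping as the movhlps/addps/shufps/addss sequence.
    return (lane[0] + lane[2]) + (lane[1] + lane[3]);
}

// Hypothetical wrapper with the same shape as vector_by_matrix.
// Returns -1 on a null pointer or a size that is not a multiple of 4.
int vector_by_matrix_lanes(float** m, const float* v, float* out, int size) {
    if (!m || !v || !out || size % 4 != 0) return -1;
    for (int i = 0; i < size; ++i) {
        out[i] = dot4(m[i], v, static_cast<std::size_t>(size));
    }
    return 0;
}
```

Whether a given compiler turns the inner k loop into packed instructions depends on its settings, so treat this as a description of the data flow rather than a guarantee about the generated code.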

I still can't beat the compiler, but I am getting close to equalling it. I suppose the conclusion is that it is hard to beat the VS compiler when it comes to unoptimised C - not my experience (with unoptimised code) of other compilers such as gcc. I can out-perform the compiler by unrolling loops and using SIMD instructions with more XMM registers. I can supply this on request, but it is self-explanatory.

Benchmarking is a little bit more tricky than that.

For example, using clang, the following code compiles down to exactly the same code in main, regardless of whether the call to vector_by_matrix is commented out.

    #include <algorithm>
    #include <numeric>

    int main()
    {
        using namespace std;

        auto constexpr n = 512;
        float* m[n];
        generate_n(m, n, []{return new float[n];});

        float v[n], out[n];

        float start = 0.0;
        for(auto& col : m) iota(col, col+n, start += 0.1);
        iota(begin(v), end(v), -1.0f);

        //vector_by_matrix(m, v, out, n);

        for_each(begin(m), end(m), [](float*p) { delete[] p; });
    }

The compiler recognizes that no observable behaviour is changed, so it can leave the whole thing out.

Of course, as long as you inspect the assembly, things should be fine. (Although, had the vector_by_matrix function been marked file-static, it would not even appear in the listing :)).

However, if you're doing measurements, make sure you use statistically sound analysis and that you are measuring what you think you are measuring.

See the assembly:
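As a rough illustration of both points, the sketch below times the call several times, reports the median, and writes a checksum of the output to a volatile sink so the result counts as observable. The helper median_time_us and its parameters are my own hypothetical names; this is a starting point under those assumptions, not a rigorous benchmark harness.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// The function under test, as given in the question.
int vector_by_matrix(float** m, float* v, float* out, int size) {
    if (!m || !v || !out) return -1;
    for (int i = 0; i < size; i++) {
        float temp = 0;
        for (int j = 0; j < size; j++) temp += m[i][j] * v[j];
        out[i] = temp;
    }
    return 0;
}

// Hypothetical helper: runs the call `repeats` times and returns the median
// wall-clock time in microseconds (the upper median for even counts).
// A checksum of `out` is written to `sink` so the work cannot be discarded.
double median_time_us(float** m, float* v, float* out, int size,
                      int repeats, volatile float& sink) {
    if (repeats <= 0) return 0.0;
    using steady = std::chrono::steady_clock;
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(repeats));
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = steady::now();
        vector_by_matrix(m, v, out, size);
        const auto t1 = steady::now();
        float sum = 0.0f;
        for (int i = 0; i < size; ++i) sum += out[i];
        sink = sum;  // volatile write: the output is now observable
        samples.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

The median is used here because a single run, or a mean over a few runs, is easily skewed by cache warm-up and scheduling noise. It is still worth checking the generated assembly to confirm the timed call is actually there.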

gcc 5.3: https://goo.gl/wivwse
gcc 5.3 call commented: https://goo.gl/z9hlsz
clang 3.7: https://goo.gl/xidrs6
clang 3.7 call commented: https://goo.gl/guc4ux

Full listing for reference

    int vector_by_matrix(float** m, float *const v, float *out, int size)
    {
        int i, j;
        float temp;

        if (!m || !v || !out)
            return -1;

        for (i = 0; i < size; i++) {
            temp = 0;

            for (j = 0; j < size; j++) {
                temp += m[i][j] * v[j];
            }

            out[i] = temp * v[i];
        }

        return 0;
    }

    #include <algorithm>
    #include <numeric>

    int main()
    {
        using namespace std;

        auto constexpr n = 512;
        float* m[n];
        generate_n(m, n, []{return new float[n];});

        float v[n], out[n];

        float start = 0.0;
        for(auto& col : m) iota(col, col+n, start += 0.1);
        iota(begin(v), end(v), -1.0f);

        vector_by_matrix(m, v, out, n); // no difference if commented

        for_each(begin(m), end(m), [](float*p) { delete[] p; });
    }
