c++ - MASM beats unoptimised .cpp but not unoptimised .c using VS
I have a simple function that transforms a vector (float*) using a row-major matrix (float**):
    int vector_by_matrix(float** m, float* v, float* out, int size)
    {
        int i, j;
        float temp;
        if (!m || !v || !out) return -1;
        for (i = 0; i < size; i++)
        {
            temp = 0;
            for (j = 0; j < size; j++)
            {
                temp += m[i][j] * v[j];
            }
            //out[i] = temp * v[i]; mistake during copying - should've been...
            out[i] = temp;
        }
        return 0;
    }

The code is being compiled as C++ (x64) using the Visual Studio (2013) C++ compiler and, without optimisation, it is quite slow (the function is called hundreds/thousands of times during a run and the size of the system is typically large, c. size = 10000). With optimisation set high (O2) and the floating point mode set to fast, the performance gain is huge (x20). However, I decided to convert the file to a .c source file and compile it as C using VS again - it's simple procedural code anyway. The performance improved again (over the optimised C++ compilation), with or without optimisation. In fact, the optimisation settings had little effect on the performance.
I don't understand why the C code is faster (optimised/unoptimised). I disassembled the output of the C(/C++) compiler and it looks horrendous - so I wrote the same function in MASM and, with a fifth of the code, it couldn't compete in terms of speed. Does VS optimise the compiled C code anyway? It looks like it from the disassembled code, but I can't be sure. The MASM code, if it helps:
    mul_vector_by_martix proc
        mov r10, r9
        sub rsp, 8
        mov qword ptr[rsp], r11
    li:
        mov rbx, qword ptr[r10*8+rcx[0]-8]
        xorps xmm0, xmm0
        mov r11, r9
    lj:
        movss xmm1, dword ptr[r11*4+rbx[0]-4]
        mulss xmm1, dword ptr[r11*4+rdx[0]-4]
        addss xmm0, xmm1
        sub r11, 1
        jnz lj
        movss dword ptr[r10*4+r8[0]-4], xmm0
        sub r10, 1
        jnz li
        mov r11, qword ptr[rsp]
        add rsp, 8
        ret
    mul_vector_by_martix endp

I won't supply the disassembled code - the question is long enough ;)
Thanks in advance for any help.
Update
I got round to this again today. I have implemented packed instructions (the current implementation only works if the system size is a multiple of 4, otherwise you'll crash):
    mul_opt_vector_by_martix proc
        sub rsp, 8
        mov qword ptr[rsp], r12
        sub rsp, 8
        mov qword ptr[rsp], r13
        ; copy rdx for arithmetic operations
        mov r10, rdx
        ; init static global
        mov r12, lstep
        cmp vsize, r9
        je loops
        ; sizeof(vector)
        mov rax, 4
        mul r9
        mov r12, rax
        ; number of steps in inner loop
        mov r11, 16
        mov rax, r12
        div r11
        mov r11, rax
        mov r12, r11
        mov rax, 16
        mul r12
        mov r12, rax
        sub r12, 16
        mov vsize, r9
        mov lstep, r12
    loops:
    li:
        mov rbx, qword ptr[r9*8+rcx[0]-8]
        xorps xmm0, xmm0
        mov r13, r12
    lj:
        movaps xmm1, xmmword ptr[r13+rbx[0]]
        mulps xmm1, xmmword ptr[r13+r10[0]]
        ; add packed single floating point numbers
        movhlps xmm2, xmm1
        addps xmm2, xmm1
        movaps xmm1, xmm2
        shufps xmm2, xmm2, 1 ; imm8 = 00 00 00 01
        addss xmm2, xmm1
        addss xmm0, xmm2
        sub r13, 16
        cmp r13, 0
        jge lj
        movss dword ptr[r9*4+r8[0]-4], xmm0
        sub r9, 1
        jnz li
        mov r13, qword ptr[rsp]
        add rsp, 8
        mov r12, qword ptr[rsp]
        add rsp, 8
        ret
    mul_opt_vector_by_martix endp

It improves things by 20-30%, but again it can't compete with the unoptimised compiled C code. The disassembled code for the inner loop:
    sum += v[j] * m[i][j];
        movsxd rax,r8d
        add rdx,8
        movups xmm0,xmmword ptr [rbx+rax*4]
        movups xmm1,xmmword ptr [r10+rax*4]
        lea eax,[r8+4]
        movsxd rcx,eax
        add r8d,8
        mulps xmm1,xmm0
        movups xmm0,xmmword ptr [rbx+rcx*4]
        addps xmm2,xmm1
        movups xmm1,xmmword ptr [r10+rcx*4]
        mulps xmm1,xmm0
        addps xmm3,xmm1
        cmp r8d,r9d
        jl vector_by_matrix+90h (07fedd321440h)
        addps xmm2,xmm3
        movaps xmm1,xmm2
        movhlps xmm1,xmm2
        addps xmm1,xmm2
        movaps xmm0,xmm1
        shufps xmm0,xmm1,0f5h
        addss xmm1,xmm0

At this point I have to concede that I can't see where the gains are. I haven't bothered rebuilding the code as C++ to see if the assembly is different, but I suspect that in unoptimised mode C++ doesn't lend itself to code as fast as C with the VS compiler. Perhaps frankie_c's point is pertinent. It worries me, though, if the compiler is doing something it shouldn't - I can't see anything wrong, though. In my experience half-decent hand-written assembly will outperform unoptimised C, but not here with this compiler. With floating point operations you need strict control over issues of precision, otherwise results can vary from one machine to another, and methods that need to converge can fail on one machine and not on another due to instabilities.
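To illustrate the precision point: the compiler's loop above keeps eight partial sums (four lanes each in xmm2 and xmm3) and only combines them after the loop, so the additions happen in a different order from the one the scalar source spells out. Below is a rough scalar model of that ordering. The names dot_sequential and dot_lanes are hypothetical, and the scalar tail for sizes that are not a multiple of 8 is an assumption, since the listing above only shows the main loop.

```cpp
#include <cassert>
#include <cstddef>

// Left-to-right sum: the order the scalar source code specifies.
float dot_sequential(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    float sum = 0.0f;
    for (std::size_t j = 0; j < n; j++) sum += a[j] * b[j];
    return sum;
}

// Scalar model of the vectorised loop: eight independent partial sums,
// combined only after the main loop has finished.
float dot_lanes(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    float p[8] = {0.0f};  // p[0..3] models xmm2, p[4..7] models xmm3
    std::size_t j = 0;
    for (; j + 8 <= n; j += 8)
        for (std::size_t k = 0; k < 8; k++)
            p[k] += a[j + k] * b[j + k];

    float q[4];
    for (std::size_t k = 0; k < 4; k++)
        q[k] = p[k] + p[k + 4];                 // addps xmm2,xmm3

    float sum = (q[0] + q[2]) + (q[1] + q[3]);  // movhlps/addps, then shufps/addss

    for (; j < n; j++) sum += a[j] * b[j];      // scalar tail (assumption)
    return sum;
}
```

Because float addition is not associative, the two functions can differ in the last bits for general inputs, even though they agree exactly when every intermediate value is exactly representable (small integers, for example). As far as I understand it, this kind of reordering is what the fast floating point mode permits.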
Update 2
It seems things have gone quiet, so I thought I'd let you know that I got some more improvement. I can match the compiler by rearranging the operations in the loops shown in the last update. It's quite obvious: move the packed shuffling and addition outside the inner loop. Again, due to the implicit size of the "vectorisation", the size of the system has to be a multiple of 4 (crash otherwise).
    loops:
    li:
        mov rbx, qword ptr[r9*8+rcx[0]-8]
        xorps xmm0, xmm0
        mov r13, r12
    lj:
        movaps xmm1, xmmword ptr[r13+rbx[0]]
        mulps xmm1, xmmword ptr[r13+r10[0]]
        ; add and accrue
        addps xmm0, xmm1
        sub r13, 16
        cmp r13, 0
        jge lj
        ;------------ moved block outside --------------;
        ; add packed single floating point numbers
        movhlps xmm1, xmm0
        addps xmm1, xmm0
        movaps xmm0, xmm1
        shufps xmm1, xmm1, 1 ; imm8 = 00 00 00 01
        addss xmm0, xmm1
        ;--------------------end block---------------------------
        movss dword ptr[r9*4+r8[0]-4], xmm0
        sub r9, 1
        jnz li

I still can't beat the compiler, but I'm getting close to equalling it. I suppose the conclusion is that it is hard to beat the VS compiler when it comes to unoptimised C - that is not my experience (with unoptimised code) with other compilers such as gcc. I can out-perform the compiler by unrolling the loops and using SIMD instructions with more xmm registers. I can supply that on request, but it is self-explanatory.
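For comparison, the same idea (accumulate packed products in the inner loop, do the horizontal add once per row) can be sketched in C++ with SSE intrinsics. This is not the code that was measured above. The name vector_by_matrix_sse is hypothetical, and two things are deliberately different from the assembly: unaligned loads are used instead of movaps, and a scalar tail handles sizes that are not a multiple of 4, so neither alignment nor the size restriction should cause a crash.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86/x64 only)

// Adds the four lanes of v with the movhlps/addps/shufps/addss pattern.
static float horizontal_sum(__m128 v) {
    __m128 hi   = _mm_movehl_ps(v, v);            // [v2, v3, v2, v3]
    __m128 pair = _mm_add_ps(v, hi);              // lane0 = v0+v2, lane1 = v1+v3
    __m128 odd  = _mm_shuffle_ps(pair, pair, 1);  // lane0 = pair lane 1
    return _mm_cvtss_f32(_mm_add_ss(pair, odd));  // (v0+v2) + (v1+v3)
}

// Hypothetical SSE variant of vector_by_matrix from the question.
int vector_by_matrix_sse(float** m, const float* v, float* out, int size) {
    if (!m || !v || !out) return -1;
    assert(size >= 0);
    for (int i = 0; i < size; i++) {
        const float* row = m[i];
        __m128 acc = _mm_setzero_ps();
        int j = 0;
        // Packed multiply and accumulate, four floats per step.
        for (; j + 4 <= size; j += 4) {
            __m128 prod = _mm_mul_ps(_mm_loadu_ps(row + j), _mm_loadu_ps(v + j));
            acc = _mm_add_ps(acc, prod);
        }
        // Horizontal add once per row, outside the inner loop.
        float sum = horizontal_sum(acc);
        // Scalar tail for the remaining 0-3 elements.
        for (; j < size; j++) sum += row[j] * v[j];
        out[i] = sum;
    }
    return 0;
}
```

Unrolling further with a second accumulator, as the compiler does with xmm2 and xmm3, would follow the same pattern with two acc variables that are added together before the horizontal sum.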
Benchmarking is a little bit more tricky than that.
For example, using clang, the following code compiles down to exactly the same code in main, regardless of whether the call to vector_by_matrix is commented out.
    #include <algorithm>
    #include <iterator>
    #include <numeric>

    int main() {
        using namespace std;
        auto constexpr n = 512;

        float* m[n];
        generate_n(m, n, []{return new float[n];});

        float v[n], out[n];
        float start = 0.0;

        for(auto& col : m)
            iota(col, col+n, start += 0.1);

        iota(begin(v), end(v), -1.0f);

        //vector_by_matrix(m, v, out, n);

        for_each(begin(m), end(m), [](float*p) { delete[] p; });
    }

The compiler recognizes that no observable behaviour is changed, so it can leave the whole thing out.
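One common way to keep such a call from being removed is to make its result observable, for instance by printing a value derived from the output vector. A minimal sketch, where checksum and report are hypothetical helper names:

```cpp
#include <cassert>
#include <cstdio>

// Folds the output vector into a single value.
float checksum(const float* out, int n) {
    assert(out != nullptr && n >= 0);
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += out[i];
    return s;
}

// Printing is observable behaviour, so the values feeding the
// checksum have to be produced somehow.
void report(const float* out, int n) {
    std::printf("checksum = %g\n", checksum(out, n));
}
```

Calling report(out, n) after vector_by_matrix(m, v, out, n) should stop the call from being dropped as dead code. It does not guarantee the work happens at run time: if every input is known at compile time, an optimiser may still precompute the result, so reading the inputs or the size at run time is the safer setup.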
Of course, as long as you inspect the assembly, things should be fine. (Although, had the vector_by_matrix function been marked file-static, it would not even appear in the listing :)).
However, if you're doing measurements, make sure you use statistically sound analysis and that you are measuring what you think you are measuring.
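As a rough illustration of what a more careful measurement could look like, the sketch below times a piece of work several times and reports the median, which is less sensitive to outliers than a single run or the mean. The names median and time_runs are hypothetical, and this is only a starting point, not a complete benchmarking methodology.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of the samples (taken by value so the caller's copy stays unsorted).
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 ? samples[mid]
                              : 0.5 * (samples[mid - 1] + samples[mid]);
}

// Runs `work` `runs` times and returns the wall-clock time of each run in ms.
template <typename Work>
std::vector<double> time_runs(Work work, int runs) {
    assert(runs >= 0);
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(runs));
    for (int r = 0; r < runs; r++) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}
```

Usage would be along the lines of median(time_runs([&]{ vector_by_matrix(m, v, out, n); }, 25)), with the result of each run consumed in some observable way so the timed work is not optimised away.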
See the assembly:
- gcc 5.3: https://goo.gl/wivwse
- gcc 5.3 call commented: https://goo.gl/z9hlsz
- clang 3.7: https://goo.gl/xidrs6
- clang 3.7 call commented: https://goo.gl/guc4ux
Full listing for reference
    int vector_by_matrix(float** m, float *const v, float *out, int size)
    {
        int i, j;
        float temp;
        if (!m || !v || !out) return -1;
        for (i = 0; i < size; i++)
        {
            temp = 0;
            for (j = 0; j < size; j++)
            {
                temp += m[i][j] * v[j];
            }
            out[i] = temp * v[i];
        }
        return 0;
    }

    #include <algorithm>
    #include <iterator>
    #include <numeric>

    int main() {
        using namespace std;
        auto constexpr n = 512;

        float* m[n];
        generate_n(m, n, []{return new float[n];});

        float v[n], out[n];
        float start = 0.0;

        for(auto& col : m)
            iota(col, col+n, start += 0.1);

        iota(begin(v), end(v), -1.0f);

        vector_by_matrix(m, v, out, n); // no difference if commented

        for_each(begin(m), end(m), [](float*p) { delete[] p; });
    }