Which software do you use and would you perhaps mind elaborating on the bolded part? Thanks
I typically write this type of code in C or C++ and compile it. I then use something like objdump, or compile with the -S option, to see the assembly that was generated, and read through it looking for inefficiencies.
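To make that workflow concrete, here is a minimal sketch, assuming GCC or Clang on a Unix-like system. The file name dot.c and the function dot are made up for illustration, and the commands in the comment are the usual ones rather than the only way to do it.

```c
/* dot.c: hypothetical example file for inspecting generated assembly.
 *
 * Typical commands (adjust compiler and flags for your toolchain):
 *   gcc -O2 -S dot.c -o dot.s     emit assembly instead of an object file
 *   gcc -O2 -c dot.c -o dot.o     build an object file
 *   objdump -d dot.o              disassemble the object file
 */
#include <stddef.h>

/* A small hot loop: the kind of function whose assembly is worth reading. */
long dot(const int *a, const int *b, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (long)a[i] * b[i];   /* widen before multiplying */
    return acc;
}
```

In the output I would check whether the loop was vectorized, how much remainder handling surrounds it, and whether acc stays in a register for the whole loop.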
The kind of thing I would look for may be:
- Short pieces of code not being inlined
- The compiler can't tell that there is no overwrite between two reads of a global variable, so it gets read every time
- The compiler runs out of registers within a loop and spills to the stack
- The compiler generates many jumps to early-out of a compound condition that I know will almost always be true
- The compiler may make incorrect assumptions about the likelihood of a condition being true or false
- The compiler may correctly choose the likelihood of a condition being true or false, but the unlikely case may be the speed-sensitive one (and the one I want the CPU to speculate on)
- Compares and jumps can perhaps be avoided by conditional moves
- Vectorization may have unintended side effects
- Hand vectorization may accelerate a given piece of code
- The compiler may not be able to make assumptions about loop trip counts, so it generates unnecessary remainder-handling code that bloats the instruction cache as it gets prefetched
- etc.
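A few of the items above can be sketched in C. This is only an illustration, not a recipe. All names here (scale, scale_all_reload, scale_all_cached, UNLIKELY, checked_div, count_below) are hypothetical, and __builtin_expect is a GCC/Clang extension, so this assumes one of those compilers. Whether any of it helps has to be confirmed by looking at the generated assembly and measuring.

```c
#include <stddef.h>

int scale = 3;  /* hypothetical global, for illustration only */

/* Global read in a loop: a store through 'out' could alias 'scale',
 * so the compiler may have to reload 'scale' on every iteration. */
void scale_all_reload(int *out, const int *in, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * scale;
}

/* Same loop with the global copied to a local once, which lets the
 * compiler keep the value in a register. */
void scale_all_cached(int *out, const int *in, size_t n)
{
    const int s = scale;
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * s;
}

/* Branch likelihood hint (GCC/Clang extension). */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int checked_div(int a, int b)
{
    if (UNLIKELY(b == 0))   /* tell the compiler this path is rare */
        return 0;
    return a / b;
}

/* Avoiding a compare-and-jump: add the 0/1 result of the comparison
 * instead of branching on it. */
size_t count_below(const int *v, size_t n, int threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        count += (size_t)(v[i] < threshold);
    return count;
}
```

Comparing the assembly of scale_all_reload and scale_all_cached is a quick way to see whether the reload actually happens on a given compiler and optimization level.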