I don't know if sophistication is the word I'd use. It just means they need very specific, in-depth knowledge of C/C++ and how to get the best out of them. Those questions are not relevant anywhere else. And to be fair, they are also less relevant if your constraint is money and time rather than algorithm speed. I mean, the second one might shave off a few milliseconds, or seconds perhaps, if N is really large, but what if it takes you a day to optimize that particular piece of code?
These questions are extremely relevant if code optimization is important to one's business, and they actually have less to do with C/C++ than with understanding how to optimize code for a particular environment (hardware, OS, libraries, etc.), which is a problem that crops up almost everywhere to one degree or another. It may not be the exact problem above, but you will find that people who have dealt with optimization a lot, and have a head for that kind of work, tend to do well on this question.
Kudos for mentioning that there is a tradeoff between money and time and optimization. Premature optimization is a poor use of both time and money - a lot of otherwise rational people just don't get this and spend an uncalled-for amount of effort optimizing the wrong code, rather than finding their real hotspots or thinking of a better way to do things.
Another observation about this particular piece of code is that it is a dot product. Very often this (or something very similar) is the inner loop of an O(n^2) or ~O(n^3) algorithm (matrix-vector multiplication and matrix-matrix multiplication respectively). This is really important for a lot of statistical analysis, for solving systems of linear equations, and, at relatively small N, for 4x4 matrix operations in computer graphics. Most apps that depend on linear algebra will benefit immensely from a more efficient implementation, and a lot of apps are entirely limited by these operations. (FYI, there are libraries that already do a lot of this stuff (BLAS); however, people who use them extensively often find the need to customize them further, or to optimize for particular special cases that the library does not support.) There are also far more cases where heavy floating point arithmetic is used, and a good understanding of techniques like this can offer a huge benefit.
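To make the "inner loop" point concrete, here is a rough sketch of a matrix-vector multiply built on a dot product. The original snippet isn't reproduced in this thread, so the `dot` function, the `matvec` name and the row-major layout are all my assumptions, not the interviewer's code.

```c
#include <assert.h>
#include <stddef.h>

/* Plain dot product; assumed to be roughly the shape of the loop under discussion. */
float dot(const float *a, const float *b, size_t n)
{
    float r = 0.0f;
    for (size_t i = 0; i < n; ++i)
        r += a[i] * b[i];
    return r;
}

/* y = M * x for a row-major rows x cols matrix (layout is an assumption).
   One dot product per row, so the total work is O(rows * cols). */
void matvec(const float *m, const float *x, float *y,
            size_t rows, size_t cols)
{
    assert(m != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < rows; ++i)
        y[i] = dot(m + i * cols, x, cols);
}
```

Every row costs one full dot product, so any speedup in `dot` carries over more or less directly to the whole multiply.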
Algorithm design is something I find personally interesting, but not everyone feels the same.
EDIT: Alg. design is just as important to me as a requirement. I want my hires to have both high level software and algorithm design skills, and a good understanding of how the code will actually run in practice.
EDIT: Incidentally, would you mind sharing the answer to the second one? The only obvious thing I can see is that it should be ++x, not x++.
Sure (FYI, I don't expect everyone to get all these points, but I usually expect 1+3 for a good junior hire, 1+2+3 for a good senior hire, and at least 2 out of the 3 points 4+5+6 for someone who considers themselves a specialist):
1] Some high level optimizations would be that for very large N, you can multi-thread or distribute the computation. There will be a trade-off between computation and transfer time, which will also have to account for the fact that data transfers can be asynchronous.
2] WRT a single thread, the operations performed for each iteration are 1 add, 1 multiply, 2 loads. There is some loop counter+check overhead, but this can be amortized by simple loop unrolling. The limiter at this point is actually the add operation. The reason is that the multiplies and loads are independent, and assuming the data fits into cache (or the higher level operations (matrix ops) have been blocked to fit into cache), the load+multiply pipelines (separate from each other and from add on most contemporary architectures) will be tightly packed, whereas there will be bubbles in the add pipeline due to the dependency between the adds ('r' is the only accumulator). Unrolling and adding multiple accumulators (e.g., r0, r1, r2 and r3) will break this dependency long enough to allow the individual adds to complete before their target accumulator is required again. On current AMD/Intel processors this can give you a huge performance boost (~2-3x).
3] If you are using a reasonably new PC (last decade or so), you can use SIMD ops (available as intrinsics) to parallelize the computation at a finer level (possible since the multiplies are all independent, the data is contiguous and the accumulators can be parallelized accordingly), assuming each execution of size N is not limited by cache throughput or latency. SSE can get you up to 4x perf, and AVX (on Sandy Bridge and above) can get you up to 8x theoretically (in practice it's more like 6x, since the L1 cache can't keep up with the math given a dot product like this, which was written oblivious to the higher level algorithm). PowerPC, CELL and ARM (when configured with NEON) processors all have SIMD units that can give similar benefits. Extra points for mentioning alignment concerns and restrictions.
4] On future Haswell parts, the dual FMA (Fused Multiply Add) pipelines should give you yet another 2x boost (cache allowing).
5] If you can generate and consume this data on a GPU, a GPU can chew through it much faster. If not, the PCIe data transfer costs will kill your benefit. GPUs also benefit from FMA.
6] Extra points for delving into cache details - e.g., the HW prefetcher should hide most latency, but if structured incorrectly in an outer algorithm/loop the dot product operations can be dominated by bandwidth restrictions (mem-to-L3, L3->L2, L2->L1, L1->reg).
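For anyone who wants to see point 2] in code, here is a minimal sketch of the multiple-accumulator idea. The function names and the `float` element type are my own choices for illustration; the accumulator names r0..r3 follow the ones mentioned above. The original interview snippet may look different.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: each add has to wait for the previous value of r. */
float dot_naive(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float r = 0.0f;
    for (size_t i = 0; i < n; ++i)
        r += a[i] * b[i];
    return r;
}

/* Unrolled by 4 with four independent accumulators, so consecutive
   adds do not depend on each other's results. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float r0 = 0.0f, r1 = 0.0f, r2 = 0.0f, r3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        r0 += a[i]     * b[i];
        r1 += a[i + 1] * b[i + 1];
        r2 += a[i + 2] * b[i + 2];
        r3 += a[i + 3] * b[i + 3];
    }
    /* Combine the partial sums, then handle the 0-3 leftover elements. */
    float r = (r0 + r1) + (r2 + r3);
    for (; i < n; ++i)
        r += a[i] * b[i];
    return r;
}
```

One caveat: floating point addition is not associative, so the unrolled version can differ from the naive one in the last bits. That is also why a compiler generally won't make this transformation for you unless you relax the floating point rules (e.g. with something like -ffast-math). Whether it actually pays off on a given machine is something to measure rather than assume.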