That's 147ns, so roughly 500 clocks on a Xeon Skylake. 4 loads should take 5 clocks (from L1 cache). 4 vector multiply-adds should take 20 clocks, so the whole thing should take roughly 25 clocks, or about 8ns. There is also capacity to do more than one of these in parallel if need be. If one of the matrices needs to be transposed, add probably another 4 clocks or so (so 9-10ns). So what you have there is roughly an order of magnitude slower. As the matrix gets bigger I would expect the hand-rolled and vDSP implementations to converge.

The peripheral code is probably the biggest lag. Best not to bother with 2D structures; just leave the data in a 1D array, which avoids all the flatMap and map computes, and also avoids the inverse cost of the stride and map on the way back. In either case, in my world that compute would have been more than adequate; anybody chasing improvement over that would be seen as mucking around. Different world, different priorities.
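To show what I mean by leaving the data in 1D, here is a minimal sketch. It assumes square n x n matrices stored row-major in flat `[Double]` arrays; `multiplyFlat` is a hypothetical helper name, not something from the code being discussed. Element (i, j) lives at index `i * n + j`, so there is no flatMap going in and no stride/map rebuild coming out.

```swift
// Hypothetical helper (an assumption for illustration):
// multiply two n x n matrices stored row-major in flat arrays.
func multiplyFlat(_ a: [Double], _ b: [Double], n: Int) -> [Double] {
    precondition(a.count == n * n && b.count == n * n, "expected n*n elements")
    var c = [Double](repeating: 0, count: n * n)
    for i in 0..<n {
        for k in 0..<n {
            // Hoist a[i][k] so the inner loop walks b and c contiguously.
            let aik = a[i * n + k]
            for j in 0..<n {
                c[i * n + j] += aik * b[k * n + j]
            }
        }
    }
    return c
}

// Usage: a 2 x 2 example kept entirely in flat storage.
let left: [Double] = [1, 2, 3, 4]
let right: [Double] = [5, 6, 7, 8]
let product = multiplyFlat(left, right, n: 2)
```

The loop order (i, k, j) is just one reasonable choice to keep the inner loop on contiguous memory; it says nothing about how vDSP does it internally.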
Edit: surprisingly the peripheral code wasn't that heavy, but it did knock off another 0. From here it's probably a bit of the cost for the Swift Double type versus a more optimised SIMD type.
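For the SIMD angle, here is a rough sketch of what a SIMD-typed version could look like, using the standard library's `SIMD4<Double>`. The assumptions are mine: a 4x4 matrix held as four column vectors, multiplied by a 4-element vector, and `multiply(columns:by:)` is a hypothetical name. That shape lines up with the "4 loads, 4 vector multiply-adds" estimate, but whether the compiler actually emits four fused multiply-adds depends on the target and optimisation settings.

```swift
// Hypothetical helper (an assumption for illustration):
// y = M * v for a 4x4 matrix stored as four SIMD4<Double> columns.
func multiply(columns: [SIMD4<Double>], by v: SIMD4<Double>) -> SIMD4<Double> {
    precondition(columns.count == 4, "expected four columns")
    var acc = SIMD4<Double>(repeating: 0)
    for j in 0..<4 {
        // One vector multiply-add per column: acc += column_j * v[j].
        acc += columns[j] * v[j]
    }
    return acc
}

// Usage: a diagonal matrix scales each lane of the input vector.
let diagonal: [SIMD4<Double>] = [
    SIMD4(1, 0, 0, 0),
    SIMD4(0, 2, 0, 0),
    SIMD4(0, 0, 3, 0),
    SIMD4(0, 0, 0, 4),
]
let scaled = multiply(columns: diagonal, by: SIMD4(1, 1, 1, 1))
```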