How do I achieve the theoretical maximum of 4 FLOPs per cycle?
How can the theoretical peak performance of 4 floating point operations (double precision) per cycle be achieved on a modern x86-64 Intel CPU? As far as I understand it takes three cycles for an SSE add and five cycles for a mul to complete on most of the modern Intel CPUs (see for example Agner … Read more