Why can lambdas be better optimized by the compiler than plain functions?

In his book The C++ Standard Library (Second Edition) Nicolai Josuttis states that lambdas can be better optimized by the compiler than plain functions. In addition, C++ compilers optimize lambdas better than they do ordinary functions. (Page 213) Why is that? I thought when it comes to inlining there shouldn’t be any difference any more. … Read more

How to compile Tensorflow with SSE4.2 and AVX instructions?

This is the message received from running a script to check if Tensorflow is working: I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally … Read more

Why does the Rust compiler not optimize code assuming that two mutable references cannot alias?

As far as I know, reference/pointer aliasing can hinder the compiler’s ability to generate optimized code, since they must ensure the generated binary behaves correctly in the case where the two references/pointers indeed alias. For instance, in the following C code, void adds(int *a, int *b) { *a += *b; *a += *b; } when … Read more

Why does GCC generate 15-20% faster code if I optimize for size instead of speed?

I first noticed in 2009 that GCC (at least on my projects and on my machines) have the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why. I have managed to create (rather silly) code that shows this … Read more

Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs

I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: Changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC. The Benchmark #include <iostream> #include <chrono> #include <x86intrin.h> int main(int argc, char* argv[]) { using namespace std; if (argc … Read more

Why doesn’t GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down the performance. (In contrast, Intel C++ Compiler, executable icc, … Read more