> This is surprising, shouldn't the number of comparisons be identical to the standard implementation?
No, to make it branchless (in an efficient way, it can be done regardless of course), you always take ceil(n / 2) as your remaining size, whereas the branchy version chooses either floor(n / 2) or ceil(n / 2) as appropriate.
And in fact, you don't want the number of comparisons to be identical to the standard implementation, for a branchless implementation. Because this makes the branch on loop exit unpredictable.
I did find that by changing the implementation slightly:
template<typename It, typename T, typename Cmp>
It lower_bound(It begin, It end, const T& value, Cmp comp) {
size_t n = 1 + end - begin;
size_t i = -1;
while (n > 1) {
size_t half = n / 2;
size_t m = i + half;
if (comp(begin[m], value)) i = m;
n -= half;
}
return begin + i + 1;
}
that clang gives slightly better code on the Apple M1 (avoiding an unnecessary add in the tight loop):
which results in the classic algorithm beating the overlap method on smaller sizes, but not on larger sizes, on Apple M1: https://i.imgur.com/VjD5EK7.png
> It turns out that it performs exactly the same number of comparisons as my lower_bound_overlap.
This is surprising, shouldn't the number of comparisons be identical to the standard implementation?