Why sorted arrays run faster: branch prediction, cache
Why processing a sorted array is faster: fewer branch mispredictions and easier vectorization, with cache locality only a secondary factor. C++/Java benchmarks, measurement tips, and practical fixes.
Why is processing a sorted array faster than processing an unsorted array?
In the following C++ code, sorting the data before the timed region makes the primary loop approximately 6x faster:
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {   // Primary loop.
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}
Performance results:
- Without the std::sort(data, data + arraySize); call, the code runs in 11.54 seconds.
- With the sorted data, the code runs in 1.93 seconds.
(Sorting itself takes more time than this one pass over the array, so it’s not actually worth doing if we needed to calculate this for an unknown array.)
Similar results were observed in Java:
import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];
        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 100000; ++i)
        {
            for (int c = 0; c < arraySize; ++c)
            {   // Primary loop.
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}
My first thought was that sorting brings the data into the CPU cache, but that’s unlikely since the array was just generated.
What is the explanation for this performance difference? Why is processing a sorted array faster than processing an unsorted array when the code is summing independent terms, where the order should not matter?
Sorting the array makes the conditional (data[c] >= 128) highly predictable, so the CPU’s branch predictor stops mispredicting and the pipeline avoids costly flushes; that is the dominant cause of the roughly 6× speedup. Sorted data also creates long runs of identical predicate results, which helps auto-vectorization and reduces stalls; cache locality plays only a secondary role here because the access pattern is already sequential. The benchmark you posted shows the effect clearly and is consistent with reduced branch-miss activity and easier branchless/vectorized execution (see Sources).
Contents
- Reproduced benchmark
- Why processing a sorted array is faster
- Branch prediction and predictable comparisons
- Cache locality and memory-access patterns
- Vectorization and compiler/JIT optimization
- How to measure and verify
- Practical mitigations and recommendations
- Sources
- Conclusion
Reproduced benchmark
You run a tight inner loop that checks a predicate on every element:
- arraySize = 32,768, looped 100,000 times → 3,276,800,000 conditional evaluations.
- Values are in 0…255 and you test data[c] >= 128, so for uniformly random data you get about a 50/50 split of true/false.
- Measured timings from your test: unsorted 11.54 s, sorted 1.93 s (sorting the 32K array itself costs more than a single pass, so sorting just to speed one pass usually isn’t worth it).
The benchmark you shared is the primary evidence for this discussion; I’ll refer to it below when describing expected counter results and mitigations (the user-submitted benchmark listed in Sources).
Why processing a sorted array is faster
At first glance the loop is doing independent work — the order really “shouldn’t” matter. But under the hood the CPU executes that conditional as a branch (a decision point). Modern processors avoid waiting for the branch outcome by predicting it and speculatively executing the following instructions. If the prediction is wrong the CPU must roll back and re-run along the correct path; that rollback costs many cycles.
Sorting moves all values that produce the same predicate result together (all <128, then all ≥128 if ascending). That turns a sea of alternating true/false outcomes into a few long runs of identical results. Long runs are exactly what branch predictors like: they quickly converge to a stable prediction and then rarely mispredict. The net result is an enormous reduction in branch-misprediction penalties across billions of iterations, which explains the roughly 6× speedup you observed.
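To make the “long runs” claim concrete, here is a minimal side sketch (not part of the original benchmark) that counts how many times the predicate data[c] >= 128 flips between consecutive elements, i.e. roughly how many “surprises” a simple predictor has to absorb per pass:

#include <algorithm>
#include <cstdlib>
#include <iostream>

int main()
{
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    auto countFlips = [&] {
        unsigned flips = 0;
        for (unsigned c = 1; c < arraySize; ++c)
            flips += (data[c] >= 128) != (data[c - 1] >= 128);
        return flips;
    };

    std::cout << "unsorted flips: " << countFlips() << '\n'; // expected around 16 000: the outcome changes about every other element
    std::sort(data, data + arraySize);
    std::cout << "sorted flips:   " << countFlips() << '\n'; // 1: a single boundary between the <128 and >=128 runs
}

Each “flip” is a place where the previous outcome is a bad guide for the next one; the sorted case has essentially one such place per pass instead of thousands.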
Branch prediction and predictable comparisons
Why does predictability matter so much?
- Every if(…) in your inner loop becomes a branch instruction. The CPU tries to guess its direction before the compare completes.
- Mispredictions cause pipeline flushes and cost dozens of cycles each on modern CPUs. When you execute billions of branches, even a small misprediction rate adds up.
- Random (unsorted) data produces a roughly unpredictable outcome stream (close to 50% true/false). Predictors struggle here and misprediction rates are high.
- Sorted data produces very long homogeneous runs. Predictors (2-bit saturating counters, history tables, etc.) saturate quickly and mispredictions fall to near zero except at the boundary between runs.
Quick back-of-the-envelope to show the scale: the loop evaluates about 3.28e9 branches. On near-random data a predictor gets roughly half of them wrong, and each misprediction costs on the order of 10–20 cycles, so 1.6e9 mispredictions × ~15 cycles ≈ 2.5e10 wasted cycles, i.e. several seconds on a ~3 GHz core. In short: branch-miss overhead across billions of iterations dwarfs cost differences due to minor cache effects.
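If you want to watch the predictor dynamics without touching hardware counters, a toy model is enough. The sketch below is only an illustration (real predictors use branch history and many counters, not a single one); it simulates one 2-bit saturating counter on the outcome stream of data[c] >= 128 and reports the misprediction rate for unsorted versus sorted data:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>

// Misprediction rate of a single 2-bit saturating counter:
// states 0,1 predict "not taken", states 2,3 predict "taken".
double mispredictRate(const std::vector<int>& data)
{
    int state = 1;                           // start weakly "not taken"
    long long misses = 0;
    for (int v : data)
    {
        bool taken = (v >= 128);             // the branch outcome in the benchmark loop
        bool predicted = (state >= 2);
        if (predicted != taken)
            ++misses;
        if (taken)  state = std::min(state + 1, 3);  // strengthen toward "taken"
        else        state = std::max(state - 1, 0);  // strengthen toward "not taken"
    }
    return static_cast<double>(misses) / data.size();
}

int main()
{
    std::vector<int> data(32768);
    for (int& v : data) v = std::rand() % 256;

    std::cout << "unsorted miss rate: " << mispredictRate(data) << '\n'; // close to 0.5 for random data
    std::sort(data.begin(), data.end());
    std::cout << "sorted miss rate:   " << mispredictRate(data) << '\n'; // near zero: one transition in the whole array
}

On real hardware the same story shows up as the branch-misses counter described in the measurement section below.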
(If you want to look the topic up, the English keyword is “branch prediction”; the Russian-language search term tracked by Yandex Wordstat is “предсказание ветвлений”; see Sources.)
Cache locality and memory-access patterns
People often jump to cache as the explanation. It’s true cache effects can dominate many programs, but in this particular test cache is not the main reason:
- Both the sorted and unsorted runs access the array in the same linear order: c = 0…arraySize-1. That sequential access pattern is ideal for hardware prefetchers and gives similar cache-line streaming in both cases.
- Sorting changes the values but not the access order. So the hardware will prefetch the same addresses in the same order; cache-miss counts should be similar once the dataset warms up.
- If the array were much smaller and fits entirely in L1, the sort could leave it resident and make the timed loop faster because of fewer L1 misses. But with 32,768 ints (≈128 KB) you exceed typical L1 sizes; the big timing difference you saw is still best explained by branch behavior, not by simple L1 warming.
Measure counters to be sure — you’ll usually find branch-misses are the dominant delta, not cache-misses.
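If you want to convince yourself that the cache cares about access order rather than value order, a hypothetical side experiment (not part of the original benchmark) is to keep the data untouched and walk it through a shuffled index array: the predicate statistics stay the same, but the memory traffic becomes random. Both variants pay the same indirection cost, so whatever gap you measure is attributable to cache/prefetching; with only ~128 KB of data the gap will be modest, and it grows as the working set outgrows the caches.

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

long long sumOver(const std::vector<int>& data, const std::vector<unsigned>& order)
{
    long long sum = 0;
    for (unsigned idx : order)          // identical work; only the traversal order differs
        if (data[idx] >= 128)
            sum += data[idx];
    return sum;
}

int main()
{
    std::vector<int> data(32768);
    for (int& v : data) v = std::rand() % 256;

    std::vector<unsigned> sequential(data.size());
    std::iota(sequential.begin(), sequential.end(), 0u);       // 0, 1, 2, ...
    std::vector<unsigned> shuffled = sequential;
    std::shuffle(shuffled.begin(), shuffled.end(), std::mt19937{42});

    for (const auto& order : {sequential, shuffled})
    {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int rep = 0; rep < 10000; ++rep)
            sum += sumOver(data, order);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << std::chrono::duration<double>(t1 - t0).count()
                  << " s  (sum = " << sum << ")\n";
    }
}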
Vectorization and compiler/JIT optimization
Sorting can also indirectly improve throughput by letting the CPU and compiler/JIT keep execution flowing:
- Fewer branch mispredictions means fewer pipeline stalls, which increases instruction-level parallelism and allows execution units (including SIMD units) to be busier.
- Compilers or JITs sometimes transform branches into masked/vector operations or conditional moves. When the predicate is predictable, these transforms (or runtime optimizations) become more effective and throughput rises.
- Practical consequence: if you remove the branch entirely (see branchless transforms below), the unsorted case often performs similarly to the sorted case because you’ve eliminated the predictor’s role.
Branchless example (safe when values are 0…255):
long long sum = 0;
for (unsigned c = 0; c < arraySize; ++c) {
    unsigned u = static_cast<unsigned>(data[c]); // 0..255
    sum += static_cast<long long>(u) * (u >> 7); // (u >> 7) is 0 for values < 128, 1 for values >= 128
}
This replaces a conditional with an arithmetic mask; the compiler can vectorize it more easily, and there’s no branch misprediction.
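Another low-effort variant is the ternary form below. Compilers frequently lower it to a conditional move or a SIMD blend at -O2/-O3, which also removes the unpredictable branch; but that lowering is a compiler decision, not a guarantee, so check the generated code or the counters if it matters:

long long sum = 0;
for (unsigned c = 0; c < arraySize; ++c)
    sum += (data[c] >= 128) ? data[c] : 0;   // often compiled branch-free (cmov / vector blend)

If the compiler does keep a branch here, the counters from the next section will tell you.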
How to measure and verify
Don’t guess — measure. On Linux use perf to compare branch and cache behavior in the two cases:
- Example perf stat command:
perf stat -e branches,branch-misses,cache-references,cache-misses,cycles,instructions ./a.out
Run that on unsorted and sorted variants (or on the branchless/histogram variants). Expected outcome:
- Unsorted: lots of branch-misses; higher cycles/instruction penalty.
- Sorted: very few branch-misses; better IPC (instructions per cycle).
For Java, run your JVM under perf the same way (e.g., perf stat ... java -jar yourapp.jar) or use platform profilers (VTune, JFR, etc.). The user benchmark is a ready test case: run it while collecting branch-misses and cache-misses to confirm the hypothesis.
Practical mitigations and recommendations
You shouldn’t sort data just to speed a single threshold test — sort cost usually outweighs benefit. Instead pick one of these lower-cost strategies:
- Branchless arithmetic (as above). Good when value domain is known and you can express the predicate as a cheap mask.
- Histogram / bucket approach (best for small value domains like 0…255):
int hist[256] = {0};
for (unsigned c = 0; c < arraySize; ++c) ++hist[static_cast<unsigned>(data[c])];
long long sum = 0;
for (int v = 128; v < 256; ++v) sum += static_cast<long long>(hist[v]) * v;
This is O(N) with tiny inner loops and almost no branching in the hot path.
- Partition once if you repeatedly apply the same predicate: move items into two arrays (or index lists), then process the “true” group only (see the std::partition sketch after this list).
- Use SIMD intrinsics (AVX2/AVX-512) to do eight or more compares in parallel and blend results, which removes scalar branch overhead (an AVX2 sketch also follows this list).
- Use profiling: measure branch-misses first. You might be surprised how often a branchless or histogram rewrite wins.
- Use compiler hints with caution: __builtin_expect can help in one-sided cases, but it doesn’t fix truly unpredictable patterns.
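For the partition idea, a minimal sketch using std::partition, assuming the same data array and arraySize as in the benchmark above (note that partitioning reorders the array, so only do this if element order doesn’t matter elsewhere):

#include <algorithm>

// One-time O(N) pass: move every element >= 128 to the front.
int* split = std::partition(data, data + arraySize,
                            [](int v) { return v >= 128; });

long long sum = 0;
for (int* p = data; p != split; ++p) // hot loop touches only the "true" group; no data-dependent branch
    sum += *p;

For the SIMD route, a hedged AVX2 sketch (x86-only, compile with -mavx2 or equivalent; the function name is just an illustration). It compares eight ints at a time against 127 and masks out the lanes below the threshold, so the hot loop contains no data-dependent branch:

#include <immintrin.h>
#include <cstddef>

long long sumGe128Avx2(const int* data, std::size_t n)
{
    const __m256i threshold = _mm256_set1_epi32(127);   // v > 127 is the same as v >= 128 for ints
    __m256i acc = _mm256_setzero_si256();

    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
    {
        __m256i v    = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        __m256i mask = _mm256_cmpgt_epi32(v, threshold);        // all-ones in lanes where v >= 128
        acc = _mm256_add_epi32(acc, _mm256_and_si256(v, mask)); // add only the qualifying lanes
    }

    alignas(32) int lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    long long sum = 0;
    for (int k = 0; k < 8; ++k)      // horizontal sum of the 8 partial sums
        sum += lanes[k];

    for (; i < n; ++i)               // scalar tail if n isn't a multiple of 8
        if (data[i] >= 128) sum += data[i];
    return sum;
}

The 32-bit lane accumulators are safe for one pass over 32,768 values of at most 255, but not if you accumulate across the 100,000 outer iterations inside the vector register; call the function once per pass and add the result into the long long total, as the original loop does.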
If you do try any of these, re-run the perf counters to confirm the improvement.
Sources
- User-submitted benchmark and code (original test data)
- Yandex Wordstat top requests (keyword analysis: “предсказание ветвлений” / “branch prediction”)
Conclusion
The big speed difference comes from branch prediction: sorting turns a near-random stream of branch outcomes into long runs that predictors handle almost perfectly, removing expensive misprediction penalties. That improvement, together with easier vectorization/predication and fewer pipeline stalls, explains the ~6× speedup; raw cache-line streaming is not the primary cause here because both cases access memory sequentially. If you need the same throughput on unsorted data, use branchless code, a histogram, partitioning, or SIMD, and always validate with hardware counters (branch-misses, cache-misses) before trusting an assumed bottleneck.