Why sorted arrays run faster: branch prediction, cache
Why processing a sorted array is faster: fewer branch mispredictions and easier vectorization, with cache locality only a secondary factor. C++/Java benchmarks, measurement tips, and practical fixes.
Why is processing a sorted array faster than processing an unsorted array?
In the following C++ code, sorting the data before the timed region makes the primary loop approximately 6x faster:
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {   // Primary loop.
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}
Performance results:
- Without the std::sort(data, data + arraySize); call, the code runs in 11.54 seconds.
- With the sorted data, the code runs in 1.93 seconds.
(Sorting itself takes more time than this one pass over the array, so it’s not actually worth doing if we needed to calculate this for an unknown array.)
Similar results were observed in Java:
import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];
        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 100000; ++i)
        {
            for (int c = 0; c < arraySize; ++c)
            {   // Primary loop.
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}
My first thought was that sorting brings the data into the CPU cache, but that’s unlikely since the array was just generated.
What is the explanation for this performance difference? Why is processing a sorted array faster than processing an unsorted array when the code is summing independent terms, where the order should not matter?
Sorting the array makes the conditional (data[c] >= 128) highly predictable, so the CPU’s branch predictor stops mispredicting and the pipeline avoids costly flushes; that is the dominant cause of the roughly 6× speedup. Sorted data also creates long runs of identical predicate results, which helps auto-vectorization and reduces stalls; cache locality plays only a secondary role here because the access pattern is already sequential. The benchmark you posted shows the effect clearly and is consistent with reduced branch-miss activity and easier branchless/vectorized execution (see Sources).
Contents
- Reproduced benchmark
- Why processing a sorted array is faster
- Branch prediction and predictable comparisons
- Cache locality and memory-access patterns
- Vectorization and compiler/JIT optimization
- How to measure and verify
- Practical mitigations and recommendations
- Sources
- Conclusion
Reproduced benchmark
You run a tight inner loop that checks a predicate on every element:
- arraySize = 32,768, looped 100,000 times → 3,276,800,000 conditional evaluations.
- Values are in 0…255 and you test data[c] >= 128, so for uniformly random data you get about a 50/50 split of true/false.
- Measured timings from your test: unsorted 11.54 s, sorted 1.93 s (sorting the 32K array itself costs more than a single pass, so sorting just to speed one pass usually isn’t worth it).
The benchmark you shared is the primary evidence for this discussion; I’ll refer to it below when describing expected counter results and mitigations (the user-submitted benchmark listed in Sources).
Why processing a sorted array is faster
At first glance the loop is doing independent work — the order really “shouldn’t” matter. But under the hood the CPU executes that conditional as a branch (a decision point). Modern processors avoid waiting for the branch outcome by predicting it and speculatively executing the following instructions. If the prediction is wrong the CPU must roll back and re-run along the correct path; that rollback costs many cycles.
Sorting moves all values that produce the same predicate result together (all <128, then all ≥128 if ascending). That turns a sea of alternating true/false outcomes into a few long runs of identical results. Long runs are exactly what branch predictors like: they quickly converge to a stable prediction and then rarely mispredict. The net result is an enormous reduction in branch-misprediction penalties across billions of iterations, which explains the roughly 6× speedup you observed.
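To make the “long runs” claim concrete, here is a minimal side sketch (not part of the original benchmark) that counts how many times the predicate data[c] >= 128 flips between consecutive elements, i.e. roughly how many “surprises” a simple predictor has to absorb per pass:

#include <algorithm>
#include <cstdlib>
#include <iostream>

int main()
{
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    auto countFlips = [&] {
        unsigned flips = 0;
        for (unsigned c = 1; c < arraySize; ++c)
            flips += (data[c] >= 128) != (data[c - 1] >= 128);
        return flips;
    };

    std::cout << "unsorted flips: " << countFlips() << '\n'; // expected around 16 000: the outcome changes about every other element
    std::sort(data, data + arraySize);
    std::cout << "sorted flips:   " << countFlips() << '\n'; // 1: a single boundary between the <128 and >=128 runs
}

Each “flip” is a place where the previous outcome is a bad guide for the next one; the sorted case has essentially one such place per pass instead of thousands.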
Branch prediction and predictable comparisons
Why does predictability matter so much?
- Every if(…) in your inner loop becomes a branch instruction. The CPU tries to guess its direction before the compare completes.
- Mispredictions cause pipeline flushes and cost dozens of cycles each on modern CPUs. When you execute billions of branches, even a small misprediction rate adds up.
- Random (unsorted) data produces a roughly unpredictable outcome stream (close to 50% true/false). Predictors struggle here and misprediction rates are high.
- Sorted data produces very long homogeneous runs. Predictors (2-bit saturating counters, history tables, etc.) saturate quickly and mispredictions fall to near zero except at the boundary between runs.
Quick back-of-the-envelope to show the scale: the loop evaluates about 3.28e9 branches. On near-random data a predictor gets roughly half of them wrong, and each misprediction costs on the order of 10–20 cycles, so 1.6e9 mispredictions × ~15 cycles ≈ 2.5e10 wasted cycles, i.e. several seconds on a ~3 GHz core. In short: branch-miss overhead across billions of iterations dwarfs cost differences due to minor cache effects.
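If you want to watch the predictor dynamics without touching hardware counters, a toy model is enough. The sketch below is only an illustration (real predictors use branch history and many counters, not a single one); it simulates one 2-bit saturating counter on the outcome stream of data[c] >= 128 and reports the misprediction rate for unsorted versus sorted data:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>

// Misprediction rate of a single 2-bit saturating counter:
// states 0,1 predict "not taken", states 2,3 predict "taken".
double mispredictRate(const std::vector<int>& data)
{
    int state = 1;                           // start weakly "not taken"
    long long misses = 0;
    for (int v : data)
    {
        bool taken = (v >= 128);             // the branch outcome in the benchmark loop
        bool predicted = (state >= 2);
        if (predicted != taken)
            ++misses;
        if (taken)  state = std::min(state + 1, 3);  // strengthen toward "taken"
        else        state = std::max(state - 1, 0);  // strengthen toward "not taken"
    }
    return static_cast<double>(misses) / data.size();
}

int main()
{
    std::vector<int> data(32768);
    for (int& v : data) v = std::rand() % 256;

    std::cout << "unsorted miss rate: " << mispredictRate(data) << '\n'; // close to 0.5 for random data
    std::sort(data.begin(), data.end());
    std::cout << "sorted miss rate:   " << mispredictRate(data) << '\n'; // near zero: one transition in the whole array
}

On real hardware the same story shows up as the branch-misses counter described in the measurement section below.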
(If you want to look the topic up, the English keyword is “branch prediction”; the Russian-language search term tracked by Yandex Wordstat is “предсказание ветвлений”; see Sources.)
Cache locality and memory-access patterns
People often jump to cache as the explanation. It’s true cache effects can dominate many programs, but in this particular test cache is not the main reason:
- Both the sorted and unsorted runs access the array in the same linear order: c = 0…arraySize-1. That sequential access pattern is ideal for hardware prefetchers and gives similar cache-line streaming in both cases.
- Sorting changes the values but not the access order. So the hardware will prefetch the same addresses in the same order; cache-miss counts should be similar once the dataset warms up.
- If the array were much smaller and fits entirely in L1, the sort could leave it resident and make the timed loop faster because of fewer L1 misses. But with 32,768 ints (≈128 KB) you exceed typical L1 sizes; the big timing difference you saw is still best explained by branch behavior, not by simple L1 warming.
Measure counters to be sure — you’ll usually find branch-misses are the dominant delta, not cache-misses.
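If you want to convince yourself that the cache cares about access order rather than value order, a hypothetical side experiment (not part of the original benchmark) is to keep the data untouched and walk it through a shuffled index array: the predicate statistics stay the same, but the memory traffic becomes random. Both variants pay the same indirection cost, so whatever gap you measure is attributable to cache/prefetching; with only ~128 KB of data the gap will be modest, and it grows as the working set outgrows the caches.

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

long long sumOver(const std::vector<int>& data, const std::vector<unsigned>& order)
{
    long long sum = 0;
    for (unsigned idx : order)          // identical work; only the traversal order differs
        if (data[idx] >= 128)
            sum += data[idx];
    return sum;
}

int main()
{
    std::vector<int> data(32768);
    for (int& v : data) v = std::rand() % 256;

    std::vector<unsigned> sequential(data.size());
    std::iota(sequential.begin(), sequential.end(), 0u);       // 0, 1, 2, ...
    std::vector<unsigned> shuffled = sequential;
    std::shuffle(shuffled.begin(), shuffled.end(), std::mt19937{42});

    for (const auto& order : {sequential, shuffled})
    {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int rep = 0; rep < 10000; ++rep)
            sum += sumOver(data, order);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << std::chrono::duration<double>(t1 - t0).count()
                  << " s  (sum = " << sum << ")\n";
    }
}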
Vectorization and compiler/JIT optimization
Sorting can also indirectly improve throughput by letting the CPU and compiler/JIT keep execution flowing:
- Fewer branch mispredictions means fewer pipeline stalls, which increases instruction-level parallelism and allows execution units (including SIMD units) to be busier.
- Compilers or JITs sometimes transform branches into masked/vector operations or conditional moves. When the predicate is predictable, these transforms (or runtime optimizations) become more effective and throughput rises.
- Practical consequence: if you remove the branch entirely (see branchless transforms below), the unsorted case often performs similarly to the sorted case because you’ve eliminated the predictor’s role.
Branchless example (safe when values are 0…255):
long long sum = 0;
for (unsigned c = 0; c < arraySize; ++c) {
    unsigned u = static_cast<unsigned>(data[c]); // 0..255
    sum += static_cast<long long>(u) * (u >> 7); // (u >> 7) is 0 for values < 128, 1 for values >= 128
}
This replaces a conditional with an arithmetic mask; the compiler can vectorize it more easily, and there’s no branch misprediction.
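Another low-effort variant is the ternary form below. Compilers frequently lower it to a conditional move or a SIMD blend at -O2/-O3, which also removes the unpredictable branch; but that lowering is a compiler decision, not a guarantee, so check the generated code or the counters if it matters:

long long sum = 0;
for (unsigned c = 0; c < arraySize; ++c)
    sum += (data[c] >= 128) ? data[c] : 0;   // often compiled branch-free (cmov / vector blend)

If the compiler does keep a branch here, the counters from the next section will tell you.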
How to measure and verify
Don’t guess — measure. On Linux use perf to compare branch and cache behavior in the two cases:
- Example perf stat command:
perf stat -e branches,branch-misses,cache-references,cache-misses,cycles,instructions ./a.out
Run that on unsorted and sorted variants (or on the branchless/histogram variants). Expected outcome:
- Unsorted: lots of branch-misses; higher cycles/instruction penalty.
- Sorted: very few branch-misses; better IPC (instructions per cycle).
For Java, run your JVM under perf the same way (e.g., perf stat ... java -jar yourapp.jar) or use platform profilers (VTune, JFR, etc.). The user benchmark is a ready test case: run it while collecting branch-misses and cache-misses to confirm the hypothesis.
Practical mitigations and recommendations
You shouldn’t sort data just to speed a single threshold test — sort cost usually outweighs benefit. Instead pick one of these lower-cost strategies:
- Branchless arithmetic (as above). Good when value domain is known and you can express the predicate as a cheap mask.
- Histogram / bucket approach (best for small value domains like 0…255):
int hist[256] = {0};
for (unsigned c = 0; c < arraySize; ++c) ++hist[static_cast<unsigned>(data[c])];
long long sum = 0;
for (int v = 128; v < 256; ++v) sum += static_cast<long long>(hist[v]) * v;
This is O(N) with tiny inner loops and almost no branching in the hot path.
- Partition once if you repeatedly apply the same predicate: move items into two arrays (or index lists), then process the “true” group only (see the std::partition sketch after this list).
- Use SIMD intrinsics (AVX2/AVX-512) to do eight or more compares in parallel and blend results, which removes scalar branch overhead (an AVX2 sketch also follows this list).
- Use profiling: measure branch-misses first. You might be surprised how often a branchless or histogram rewrite wins.
- Use compiler hints with caution: __builtin_expect can help in one-sided cases, but it doesn’t fix truly unpredictable patterns.
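For the partition idea, a minimal sketch using std::partition, assuming the same data array and arraySize as in the benchmark above (note that partitioning reorders the array, so only do this if element order doesn’t matter elsewhere):

#include <algorithm>

// One-time O(N) pass: move every element >= 128 to the front.
int* split = std::partition(data, data + arraySize,
                            [](int v) { return v >= 128; });

long long sum = 0;
for (int* p = data; p != split; ++p) // hot loop touches only the "true" group; no data-dependent branch
    sum += *p;

For the SIMD route, a hedged AVX2 sketch (x86-only, compile with -mavx2 or equivalent; the function name is just an illustration). It compares eight ints at a time against 127 and masks out the lanes below the threshold, so the hot loop contains no data-dependent branch:

#include <immintrin.h>
#include <cstddef>

long long sumGe128Avx2(const int* data, std::size_t n)
{
    const __m256i threshold = _mm256_set1_epi32(127);   // v > 127 is the same as v >= 128 for ints
    __m256i acc = _mm256_setzero_si256();

    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
    {
        __m256i v    = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        __m256i mask = _mm256_cmpgt_epi32(v, threshold);        // all-ones in lanes where v >= 128
        acc = _mm256_add_epi32(acc, _mm256_and_si256(v, mask)); // add only the qualifying lanes
    }

    alignas(32) int lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    long long sum = 0;
    for (int k = 0; k < 8; ++k)      // horizontal sum of the 8 partial sums
        sum += lanes[k];

    for (; i < n; ++i)               // scalar tail if n isn't a multiple of 8
        if (data[i] >= 128) sum += data[i];
    return sum;
}

The 32-bit lane accumulators are safe for one pass over 32,768 values of at most 255, but not if you accumulate across the 100,000 outer iterations inside the vector register; call the function once per pass and add the result into the long long total, as the original loop does.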
If you do try any of these, re-run the perf counters to confirm the improvement.
Sources
- User-submitted benchmark and code (original test data)
- Yandex Wordstat top requests (keyword analysis: “предсказание ветвлений” / “branch prediction”)
Conclusion
The big speed difference comes from branch prediction: sorting turns a near-random stream of branch outcomes into long runs that predictors handle almost perfectly, removing expensive misprediction penalties. That improvement, together with easier vectorization/predication and fewer pipeline stalls, explains the ~6× speedup; raw cache-line streaming is not the primary cause here because both cases access memory sequentially. If you need the same throughput on unsorted data, use branchless code, a histogram, partitioning, or SIMD, and always validate with hardware counters (branch-misses, cache-misses) before trusting an assumed bottleneck.