Algorithmica HPC Notes - Chapter 3
Ch3: Instruction-Level Parallelism
Modern CPUs use pipelining: after an instruction passes through the first stage, they start processing the next one right away, without waiting for the previous one to fully complete.
3.1: Pipeline Hazards
The next instruction might not execute on the following clock cycle:
- A structural hazard happens when two or more instructions need the same part of CPU (e.g., an execution unit).
- A data hazard happens when you have to wait for an operand to be computed from some previous step.
- A control hazard happens when a CPU can’t tell which instructions it needs to execute next.
3.2: The Cost Of Branching
volatile int s;
for (int i = 0; i < N; i++)
if (a[i] < 50)
s += a[i];
~14 CPU Cycles per element
Replacing the hardcoded 50 with a parameter P that effectively sets the probability of the < branch:
for (int i = 0; i < N; i++)
if (a[i] < P)
s += a[i];
Note that it costs almost nothing to check for a condition that never or almost never occurs.
Pattern Detection:
for (int i = 0; i < N; i++)
a[i] = rand() % 100;
std::sort(a, a + n);
Hinting Likeliness of Branches:
for (int i = 0; i < N; i++)
if (a[i] < P) [[likely]]
s += a[i];
Read more: stackoverflow
3.3: Branchless Programming
Predication:
for (int i = 0; i < N; i++)
s += (a[i] < 50) * a[i];
~7 cycles per element instead of the original ~14.
Explanation:
If the expression a[i] - 50 is negative (implying a[i] < 50), then the highest bit of the result will be set to one, which we can then extract using a right-shift.
mov ebx, eax ; t = x
sub ebx, 50 ; t -= 50
sar ebx, 31 ; t >>= 31
imul eax, ebx ; x *= t
Compiler produces:
mov eax, 0
mov ecx, -4000000
loop:
mov esi, dword ptr [rdx + a + 4000000] ; load a[i]
cmp esi, 50
cmovge esi, eax ; esi = (esi >= 50 ? esi : eax=0)
add dword ptr [rsp + 12], esi ; s += esi
add rdx, 4
jnz loop ; "iterate while rdx is not zero"
Using predication eliminates a control hazard but introduces a data hazard. Cheaper pipeline stall: wait for cmov to be resolved and not flush the entire pipeline in case of a mispredict.
It can be more efficient to leave branchy code as it is: the cost of computing both branches instead of just one outweighs the penalty for the potential branch mispredictions.
Example: branchless cpp lower_bound(int x)
int lower_bound(int x) {
int *base = t, len = n;
while (len > 1) {
int half = len / 2;
base += (base[half - 1] < x) * half; // will be replaced with a "cmov"
len -= half;
}
return *base;
}
on small arrays (that fit into cache) it works ~4x faster than the branchy std::lower_bound.
3.4: Instruction Table
Provides latency and throughput numbers for an architecture
3.5: Throughput Computing
int s = 0;
s += a[0];
s += a[1];
s += a[2];
s += a[3];
// ...
1 cycle each iteration to add another value to s.
For example, the throughput of add is 2 on a CPU. The solution is to use two accumulators and just sum up odd and even elements separately:
int s0 = 0, s1 = 0;
s0 += a[0];
s1 += a[1];
s0 += a[2];
s1 += a[3];
// ...
int s = s0 + s1;
The computation no longer has any critical paths that limit the throughput.