A Digression on Divergence

Most people who do serious GPU work understand that shaders and other GPU programs are executed in a SIMD or “SIMT” fashion using “warps” or “wavefronts” that group together multiple program instances (vertices, fragments, work-items, “threads”). It is also widely understood that when the condition computed for a dynamic if/else is different across these instances, the warp must execute “both sides.”

Aside: If you weren’t aware of any of that, you might like to look over this presentation from Kayvon Fatahalian before you proceed.

Terminology

Our terminology for GPU programming is kind of a mess, since every hardware vendor and every API/language seems to invent its own nomenclature. In this post, I’ll refer to individual running instances of a program as invocations (you might know these as vertices, fragments, work-items, or “threads”), and the implementation-specific aggregates of these instances that are actually scheduled as warps (you may know these as wavefronts or threads).

Caveat: For those familiar with OpenCL or Compute Shader, a warp is not the same thing as a “work-group” or “thread-group.”

When the different invocations in a warp disagree on the way to go at a conditional branch, we will say they diverge. This is actually a terrible choice of words, since divergence is already used in CS theory to refer to non-termination, but there aren’t many good alternatives.

The whole topic is closely related to the question of when particular expressions/variables are uniform or varying across a warp.

So Why Does This Matter?

Many people don’t really understand what goes on for such “divergent” branches, beyond the “execute both sides” mantra. The various hardware vendors tend not to talk about it, and the API/language specs are surprisingly mum.

Maybe it doesn’t matter. After all, who cares how it is implemented if it Just Works™, right?

Well, the problem is that ever since we put things like global-memory atomics into our GPU languages, the behavior of divergent branches has been semantically observable. This is perhaps most pronounced if you try to get clever and write a per-pixel mutex, as my colleague Marco Salvi did. The following example is in HLSL, but you can write an equivalent program in GLSL, OpenCL, or CUDA:

// We will use each 'uint' entry in this buffer as a lock
RWStructuredBuffer<uint> lockBuffer;

void FeelingClever()
{
    uint pixelID = ...; // compute which lock to use

    // try to acquire per-pixel lock,
    // using compare-and-swap

    [allow_uav_condition] // need this to appease fxc...
    while (true)
    {
        uint oldValue;
        InterlockedCompareExchange(lockBuffer[pixelID],
                                   0, 1, oldValue);

        // if we successfully swapped 0->1,
        // then we got the lock
        if( oldValue == 0 )
            break;
    }

    DoSomethingWhileHoldingLock();

    // release the lock and give everybody else a shot
    lockBuffer[pixelID] = 0;
}

In a world where every Pixel Shader invocation ran as a truly independent thread, this ought to work. If you try to run code like this on current D3D11 GPUs, though, you will find that it seems to work on some implementations, and breaks on others.

The crux of the problem is that a subset of the shader invocations in warp A might grab their mutex right away, while others fail to grab theirs, forcing the entire warp to spin in the while(true) loop. If the locks that A is waiting for are held by a different warp B, which is also in the same situation, then we have a deadlock.

Now, you might decide that this code just isn’t clever enough (as I initially did, when Marco brought this problem to me). If the warps are stuck spinning in that while(true) loop, then the solution might be as simple as rearranging the code so they never have to leave the loop in the first place:

RWStructuredBuffer<uint> lockBuffer;

void FeelingExtraClever()
{
    uint pixelID = ...;
    [allow_uav_condition]
    while (true)
    {
        uint oldValue;
        InterlockedCompareExchange(lockBuffer[pixelID],
                                   0, 1, oldValue);

        // if we successfully swapped 0->1,
        // then we got the lock
        if( oldValue == 0 )
        {
            // rather than break out of the loop,
            // just do the work here
            DoSomethingWhileHoldingLock();

            // release the lock and give
            // everybody else a shot
            lockBuffer[pixelID] = 0;

            // only bail from the loop after
            // we've released the lock
            break;
        }
    }

}

It turns out that this version didn’t work either. I honestly can’t remember which versions failed on which hardware, but nothing we tried worked across the board.

Stepping Back

At this point it is worth stopping to ask: why might we expect one of these two options to work, and not the other? What kind of assumptions are we making about how the HW handles divergent branches? Are those assumptions valid in practice?

It turns out that we naturally make a lot of very strong intuitive assumptions about how GPUs will execute our programs, and most of these assumptions aren’t actually guaranteed by the programming models or the hardware.

Aside:This strikes me as similar to the situation with memory models, where people have a strong tendency to assume sequential consistency, just because it is so natural. Un-learning “common sense” is one of the hardest parts of parallel programming.

What Do We Know?

Let’s start with the stuff that seems to be common across most GPU architectures.

We have our warp, made up of several invocations. Along with that we have an active mask which tells us which of the invocations are currently live. The warp as a whole has a single program counter, from which it reads and executes instructions for all of the invocations in the active mask.

Aside: Some people will insist that GPUs have a distinct program counter for every invocation in the warp. These people are either (A) confused, (B) trying to confuse you, or (C) work with some really odd hardware.

When the warp executes a divergent branch, the set of active invocations divides into two (or more) subsets. The warp must pick one of these subsets for the new active mask, and defer the other(s). It is easiest to think of the deferred invocations as sitting on a “divergence stack,” although particular implementations may vary a lot between vendors.
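To make the mechanism concrete, here is a toy model in C++ of a warp with a single program counter, an active mask, and a divergence stack. This is only a sketch: the structure, the field names, and the idea of tagging each deferred entry with a compiler-chosen re-convergence point are illustrative assumptions, not a description of any particular GPU.

#include <cstdint>
#include <vector>

struct Deferred {
    uint32_t mask;         // lanes whose execution has been deferred
    uint32_t resumePC;     // where those lanes will resume
    uint32_t reconvergePC; // compiler-chosen point where the paths rejoin
};

struct WarpState {
    uint32_t pc;           // single program counter shared by the whole warp
    uint32_t activeMask;   // one bit per invocation (a 32-wide warp is assumed)
    std::vector<Deferred> divergenceStack;
};

// At a branch: if the active lanes disagree, follow one subset and defer the other.
void executeBranch(WarpState& w, uint32_t takenMask,
                   uint32_t targetPC, uint32_t fallthroughPC, uint32_t reconvergePC)
{
    uint32_t taken    = w.activeMask &  takenMask;
    uint32_t notTaken = w.activeMask & ~takenMask;
    if (notTaken == 0) { w.pc = targetPC;      return; } // everybody took the branch
    if (taken    == 0) { w.pc = fallthroughPC; return; } // nobody took the branch
    // Divergence: defer one subset and keep going with the other.
    // Which side gets followed first is a design choice (discussed below).
    w.divergenceStack.push_back({ notTaken, fallthroughPC, reconvergePC });
    w.activeMask = taken;
    w.pc = targetPC;
}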

What Don’t We Know?

At this point I hope some questions are nagging at you, in particular:

  • When we split the active set at a divergent branch, which invocation(s) get “followed” first, and which deferred?
  • When do the invocations that get deferred get picked up and followed?
  • When do the invocations that have diverged “re-converge”?

Because current GPU programming languages do not specify these behaviors, there is a lot of latitude in how different hardware can answer these questions. I can’t easily comment on what particular vendors do (except where information is public), but I can try to illuminate the breadth of the design space.

A particular hardware platform might support structured control-flow instructions (akin to the control flow operations in the D3D shader bytecode), or unstructured branches (cf. CUDA’s PTX), or both. Things that might seem obvious in the high-level language code can sometimes be subtle when you are looking at a low-level ISA.

Which Path Gets Followed First?

In general, it shouldn’t matter which path gets followed first at a divergent branch, although my per-pixel mutex example clearly depends on the invocation(s) that obtained their lock being followed before those that failed to obtain theirs. There are, however, some cases where an implementer might clearly want to follow one option over another.

If we have a one-sided if:

if (someVaryingCondition)
{
    // inside
}
// after


it seems obvious that we should follow the invocations that go inside the if before those that are just going to skip over it. Of course, remember that “obvious” can be a dangerous word.

Similarly, for the back-edge on a do-while loop:

do
{
    // inside
}
while (someVaryingCondition);


it seems obvious to follow the invocations that continue looping first.

Let’s look at some possible strategies:

Follow the “branch taken” option first

For a simple conditional branch in the hardware ISA, we could prefer to follow the invocations that take the branch over those that continue normally. For the loop example above, this does the obvious thing, but not for the if example.

Follow the “not taken” option first

This option would do the expected thing for the if, but go against intuition for the loop. It might also be better for instruction caching, since the instructions immediately after the branch are likely to be in the cache already.

Follow “taken” first for back edges, “not taken” for forward edges

We can switch between the two strategies above based on whether the destination of the branch is before or after the branch instruction itself. This gives the expected “obvious” behavior for both of the examples above (assuming the compiler lays out basic blocks in the same order as the high-level code).
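As a sketch (assuming a flat instruction address space, and ignoring how a real ISA actually encodes its branches), the heuristic amounts to a one-line comparison:

#include <cstdint>

// Backward branch (target at or before the branch itself): treat it as a loop
// back edge and follow the taken lanes first; otherwise follow the fall-through.
bool followTakenFirst(uint32_t branchPC, uint32_t targetPC)
{
    return targetPC <= branchPC;
}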

Follow whichever path has fewer active invocations

This may sound like an odd design decision, but it is precisely what is described in AMD’s Southern Islands (GCN) ISA documentation.

If we always follow the path that has fewer active invocations, then we know that every divergent branch reduces the number of active invocations by at least half. As a result, the maximum depth of our divergence stack is logarithmic, rather than linear, in the size of the warp (6 rather than 64 entries, in the case of GCN).
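In terms of the earlier toy sketch, the divergent case could pick which side to follow by comparing lane counts. This is only an illustration of the idea, not AMD’s actual control logic:

#include <bit>      // std::popcount (C++20)
#include <cstdint>

// Follow whichever side has fewer live lanes; defer the other.
bool followTakenSideFirst(uint32_t takenMask, uint32_t notTakenMask)
{
    return std::popcount(takenMask) <= std::popcount(notTakenMask);
}

Because the deferred side always has at least as many live lanes as the side being followed, each nested divergence at least halves the followed subset, which is where the logarithmic bound on stack depth comes from.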

Multi-way (switch) and indirect branches (function pointers) add more complexity, but we already have several viable design alternatives given only two-way branching.

Where/when do we re-converge?

After a divergent branch, a warp is executing with reduced SIMD efficiency: every instruction only applies to a subset of the invocations. In order to reclaim that efficiency, we would like to re-converge and continue executing with the original active mask.

This isn’t just a performance concern, though. There are various operations that communicate or synchronize between the invocations in a warp: finite-difference derivatives in fragment shaders, synchronization barriers in compute languages, and horizontal SIMD reduction operations (e.g., CUDA’s warp-granularity “voting” operations). These operations typically require control-flow reconvergence for their semantics.

In practice, the “re-convergence point” for any potentially-divergent branch must be selected at compile time by the driver’s low-level compiler. This ensures that when the micro-architecture encounters a divergent branch at runtime, it already knows where it can expect to re-converge, and can arrange any internal data structures accordingly.

Re-converge at the end of an if or loop

We as programmers tend to assume that after a divergent branch, the invocations re-converge as soon as possible: typically at the end of the associated high-level control-flow construct. Note, though, that current GPU programming systems do not give this as a guarantee, no matter how intuitive it might seem.

The OpenGL Shading Language spec defines a notion of uniform and non-uniform control flow, and specifies that uniform control flow resumes at the end of a high-level control-flow construct. This spec language is meant to capture the essence of divergence and re-convergence, but it should be noted that it is only used to define when certain operations (e.g., partial-difference derivatives in a fragment shader) are valid. No stronger guarantees are made.

When using a language with unstructured control flow (goto in OpenCL or CUDA; PTX), high-level control-flow structures may not be discernible. In these cases, re-convergence rules must be defined on an arbitrary control-flow graph (CFG).

Re-converge at the immediate post-dominator

A simple approach that can be applied to an arbitrary CFG (and is thus applicable to assembly-level interfaces, or in the presence of goto) is to identify the immediate post-dominator of the basic block that has the branch.

Aside: For those unfamiliar with the term, it is simplest to think of the immediate post-dominator as the earliest point in the code that control flow must eventually pass through. For a precise definition, please consult any compiler textbook, or Wikipedia.

On paper this sounds like a good approach, but it fails to match our intuitive expectations for several cases of structured control flow:

while( X )
{
    if( A )
    {
        if( Y )
            break;
    }
    B;
}

In this case, we expect any invocations that diverged at the branch on A, but did not subsequently break out of the loop, to re-converge before executing B. However, this isn’t guaranteed by immediate-post-dominator reconvergence, since the immediate post-dominator of the A branch is at the end of the while loop, and not at the end of the if.
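We can check that claim mechanically. The sketch below (C++, with a hypothetical numbering of the basic blocks in the loop above) computes post-dominator sets by the usual iterative intersection, and reports that the only block post-dominating the branch on A is the block after the loop, not B:

#include <cstdio>
#include <set>
#include <vector>

int main()
{
    // Hypothetical blocks for the example: 0 = loop head (test X), 1 = test A,
    // 2 = test Y (break if true), 3 = B, 4 = exit (after the loop).
    std::vector<std::vector<int>> succ = {
        {1, 4},  // X ? enter the loop : exit
        {2, 3},  // A ? nested if : B
        {4, 3},  // Y ? break : B
        {0},     // B, then back to the loop head
        {},      // exit
    };
    int n = (int)succ.size();
    int exitBlock = 4;

    // postdom[b] = set of blocks that post-dominate b.
    std::set<int> all;
    for (int b = 0; b < n; b++) all.insert(b);
    std::vector<std::set<int>> postdom(n, all);
    postdom[exitBlock] = { exitBlock };

    // Standard iterative dataflow: intersect over successors, then add self.
    for (bool changed = true; changed; ) {
        changed = false;
        for (int b = 0; b < n; b++) {
            if (b == exitBlock) continue;
            std::set<int> acc = all;
            for (int s : succ[b]) {
                std::set<int> tmp;
                for (int x : acc) if (postdom[s].count(x)) tmp.insert(x);
                acc = tmp;
            }
            acc.insert(b);
            if (acc != postdom[b]) { postdom[b] = acc; changed = true; }
        }
    }

    printf("blocks that post-dominate the branch on A:");
    for (int x : postdom[1]) if (x != 1) printf(" %d", x);
    printf("\n"); // prints only "4": re-convergence is after the loop, not at B
    return 0;
}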

The same basic problem arises for switch statements with fall-through:

switch( A )
{
case 0:
    X;
    // fall through
case 1:
    B;
    break;
default:
    break;
}

Here again, an immediate post-dominator rule tells us to re-converge after the whole switch, and does not tell us to re-converge the case 0 invocations with the case 1 invocations before executing B.

As a reminder, re-convergence isn’t just about performance. If B in these examples included a Compute Shader barrier operation, then failure to re-converge at the right place could cause a deadlock.

Re-converge at “thread frontiers”

This paper describes a refined approach to re-convergence that addresses some of the shortcomings of immediate post-dominator reconvergence. They define “thread frontiers” that capture the intuitive notion of re-converging as soon as possible after a branch. Achieving thread-frontier reconvergence can require adding new conservative control flow edges (forcing the warp to visit basic blocks it would otherwise skip). The paper also describes new hardware requirements to implement this scheme efficiently.

Maximal convergence

The ISPC system, which implements a GPU-style programming model on SIMD CPU cores, defines a notion of “maximal convergence.” Intuitively, this can be thought of as re-convergence at each and every sequence point as defined by the C language. If operation X happens before Y from the point of view of one invocation, then X for all invocations appears to happen before Y for any invocation in the warp.

This very strict model of convergence is often what people think of when they talk about “warp-synchronous programming,” or about invocations in a warp executing in “lock-step.” This kind of semantic guarantee can be beneficial to users (it would allow us to write a working per-pixel mutex), but it comes at the cost of flexibility for compiler optimizations. For example, if an implementation wants to emulate 16-wide warps on 8-wide SIMD hardware by doubling up math instructions, it is very restricted in how it can schedule those independent 8-wide operations.
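One way to picture the guarantee is as a scalarized execution. The following is a sketch only, assuming a toy 8-wide warp, but it captures the ordering that maximal convergence promises:

#include <cstdio>

constexpr int kWarpSize = 8; // toy width, purely for illustration

// Under maximal convergence, a source-level sequence "X; Y;" behaves as if
// X ran for every active invocation before Y ran for any of them.
void runXThenY(const bool active[kWarpSize])
{
    for (int lane = 0; lane < kWarpSize; lane++)
        if (active[lane]) printf("X on lane %d\n", lane); // stand-in for X
    for (int lane = 0; lane < kWarpSize; lane++)
        if (active[lane]) printf("Y on lane %d\n", lane); // stand-in for Y
}

int main()
{
    bool active[kWarpSize] = { true, true, false, true, true, true, false, true };
    runXThenY(active);
    return 0;
}

An implementation that wants to interleave or reorder the per-lane work (say, packing a 16-wide gang into pairs of 8-wide operations) has to preserve this apparent ordering wherever it is observable, which is exactly the flexibility it gives up.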

Other than ISPC, I am not aware of any GPU-like programming models that give this strong guarantee. CUDA documentation used to give general advice for programmers attempting to use warp-synchronous approaches, but more recent documentation seems to advise against the practice.

When Are Deferred Paths Followed?

When control flow for a warp arrives at a re-convergence point, it needs to run any invocations for the “other” path(s) that had been deferred before it continues execution past the re-convergence point. The simplest way this might be achieved is by popping those deferred invocations off of the “divergence stack” (or alternative hardware mechanism).
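Continuing the toy model from earlier (again, a sketch with hypothetical structure, not a real design): when the warp’s program counter reaches the re-convergence point recorded for the entry on top of the divergence stack, it either switches over to the deferred lanes or, if they have already arrived, merges them back in.

// Called when w.pc has reached the reconvergePC of the top stack entry.
void atReconvergencePoint(WarpState& w)
{
    Deferred d = w.divergenceStack.back();
    w.divergenceStack.pop_back();
    if (d.resumePC != d.reconvergePC) {
        // The deferred side has not run yet: park the current lanes (already
        // sitting at the re-convergence point) and go execute the other path.
        w.divergenceStack.push_back({ w.activeMask, d.reconvergePC, d.reconvergePC });
        w.activeMask = d.mask;
        w.pc = d.resumePC;
    } else {
        // Both subsets have arrived: restore the combined mask and carry on.
        w.activeMask |= d.mask;
    }
}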

This is one case where it seems like there really is an obvious Right Thing to do, and I’m not aware of a lot of design alternatives. Of course, if you don’t know which path is followed first, or where re-convergence points will be placed, there still isn’t much you can do.

Conclusion

If you find yourself trying to write “clever” GPU code that depends on order of execution or memory operations between parallel invocations within a single warp, please be aware that you are stepping into a minefield of implementation-specific behavior. For those of you targeting fixed hardware platforms, you might be able to find out what your GPU is doing under the hood and exploit this knowledge.

For those of you working on more open platforms, I would invite you to carefully read through the available API and language documentation. As a general rule, if the API/language spec doesn’t explicitly guarantee the behavior you are hoping for, you can bet that some hardware out there violates your assumptions.

I hope I’ve given you a sense of just how deep this semantic rabbit-hole goes.

5 thoughts on “A Digression on Divergence”

  1. Hi, I was wondering whether I got this right? You are saying that ‘re-convergence after high level control flow’ would not guarantee re-convergence at B in the example below (simplified version of yours with nested ifs). If I understand correctly a shader compiler may choose to re-converge at C? If so, I don’t see how you could ever do anything that requires all threads (e.g. derivatives) inside a loop with any other control flow.

    while( X )
    {
        if( Y )
        {
            A;
        }
        B;
    }
    C;

    • Sorry for the slow response.

      The important thing to remember is that “re-convergence” isn’t really a well-defined term in most of these systems. The GLSL spec defines a notion of “uniform” and “non-uniform” control flow, and for your particular example, if the condition X was uniform, and the condition Y was non-uniform, then B would count as “uniform control flow,” and it would be completely valid to use a derivative function like dFdx() there. This would be true even if A contained something like a conditional ‘break’ out of the loop (so long as it was never taken at run-time); the requirements of uniform control flow in GLSL are more strict than what Immediate Postdominator re-convergence guarantees.

      If you didn’t put a derivative or similar operation in B, though, you really wouldn’t have any guarantees that the compiler would re-converge all of your SIMD lanes before executing B. Uniform control flow doesn’t actually guarantee re-convergence, but rather just defines where it is valid to call certain functions (where typical implementations of those functions require re-convergence).

  2. D works on console… oh sorry that’s “(C) work with some really odd hardware.”

    Also, mobile GPUs are quite behind in terms of design but are catching up, just like the PowerVR series.

    Case like:
    if (someVaryingCondition)
    {
        // inside
    }

    It doesn’t need to “branch”; instead we can just rely on the per-invocation execution masks. If-else requires branching instructions. There are quite a few cases of non-divergent branching in Southern Islands: dynamically uniform jumps with VSKIP, per-invocation execution masks, per-warp execution mask jumps (much the same as per-invocation execution masks), non-conditional jumps, conditional jumps and “complex graph” jumps. Divergent cases are only 2 out of the 5 options.

    Someone asked me once how “branching works on a GPU?” as an obvious question. Ermm!

    Your conclusion sounds like the dynamically uniform indexing “definition” in GLSL.

    Great post!

  3. Pingback: The PopcornFX script execution model | PopcornFX

  4. Great article – although as mentioned on Twitter obsolete now with Volta 🙂
    But I personally would still keep it as is and not update it. Just write a new blog post…

