A Digression on Divergence

Most people who do serious GPU work understand that shaders and other GPU programs are executed in a SIMD or “SIMT” fashion using “warps” or “wavefronts” that group together multiple program instances (vertices, fragments, work-items, “threads”). It is also widely understood that when the condition computed for a dynamic ifelse is different across these instances, the warp must execute “both sides.”

Aside:If you weren’t aware of any of that, you might like to look over this presentation from Kayvon Fatahalian before you proceed.


Our terminology for GPU programming is kind of a mess, since every hardware vendor and every API/language seems to invent its own nomenclature. In this post, I’ll refer to individual running instances of a program as invocations (you might know these as vertices, fragments, work-items, or “threads”), and the implementation-specific aggregates of these instances that are actually scheduled as warps (you may know these as wavefronts or threads).

Caveat:For those familiar with OpenCL or Compute Shader, a warp is not the same thing as a “work-group” or “thread-group.”

When a the different invocations in a warp disagree on the way to go at a conditional branch, we will say they diverge. This is actually a terrible choice of words, since divergence is already used in CS theory to refer to non-termination, but there aren’t many good alternatives.

The whole topic is closely related to the question of when particular expressions/variables are uniform or varying across a warp.

So Why Does This Matter?

Many people don’t really understand what goes on for such “divergent” branches, beyond the “execute both sides” mantra. The various hardware vendors tend not to talk about it, and the API/language specs are surprisingly mum.

Maybe it doesn’t matter. After all, who cares how it is implemented if it Just Works™, right?

Well, the problem is that ever since we put things like global-memory atomics into our GPU languages, the behavior of divergent branches has been semantically observable. This is perhaps most pronounced if you try to get clever and write a per-pixel mutex, as my colleague Marco Salvi did. The following example is in HLSL, but you can write an equivalent program in GLSL, OpenCL, or CUDA:

// We will use each 'uint' entry in this buffer as a lock
RWStructuredBuffer lockBuffer;

void FeelingClever()
    uint pixelID = ...; // compute which lock to use

    // try to acquire per-pixel lock,
    // using comapre-and-swap

    [allow_uav_condition] // need this to appease fxc...
    while (true)
        uint oldValue;
                                   0, 1, oldValue);

        // if we successfully swapped 0->1,
        // then we got the lock
        if( oldValue == 0 )


    // release the lock and give everybody else a shot
    lockBuffer[pixelID] = 0;

In a world where every Pixel Shader invocation ran as a truly independent thread, this ought to work. If you try to run code like this on current D3D11 GPUs, though, you will find that it seems to work on some implementations, and breaks on others.

The crux of the problem is that a subset of the shader invocations in warp A might grab their mutex right away, while others fail to grab theirs, forcing the entire warp to spin in the while(true) loop. If the locks that A is waiting for are held by a different warp B, which is also in the same situation, then we have a deadlock.

Now, you might decide that this code just isn’t clever enough (as I initially did, when Marco brought this problem to me). If the warps are stuck spinning in that while(true) loop, then the solution might be as simple as rearranging the code so they never have to leave the loop in the first place:

RWStructuredBuffer lockBuffer;

void FeelingExtraClever()
    uint pixelID = ...;
    while (true)
        uint oldValue;
                                   0, 1, oldValue);

        // if we successfully swapped 0->1,
        // then we got the lock
        if( oldValue == 0 )
            // rather than break out of the loop,
            // just do the work here

            // release the lock and give
            // everybody else a shot
            lockBuffer[pixelID] = 0;

            // only bail from the loop after
            // we've release the lock


It turns out that this version didn’t work either. I honestly can’t remember which versions failed on which hardware, but nothing we tried worked across the board.

Stepping Back

At this point it is worth stopping to ask: why might we expect one of these two options to work, and not the other? What kind of assumptions are we making about how the HW handles divergent branches? Are those assumptions valid in practice?

It turns out that we naturally make a lot of very strong intuitive assumptions about how GPUs will execute our programs, and most of these assumptions aren’t actually guaranteed by the programming models or the hardware.

Aside:This strikes me as similar to the situation with memory models, where people have a strong tendency to assume sequential consistency, just because it is so natural. Un-learning “common sense” is one of the hardest parts of parallel programming.

What Do We Know?

Let’s start with the stuff that seems to be common across most GPU architectures.

We have our warp, made up of several invocations. Along with that we have an active mask which tells us which of the invocations are currently live. The warp as a whole has a single program counter, from which it reads and executes instructions for all of the invocations in the active mask.

Aside: Some people will insist that GPUs have a distinct program counter for every invocation in the warp. These people are either (A) confused, (B) trying to confuse you, or (C) work with some really odd hardware.

When the warp executes a divergent branch, the set of active invocations divide into two (or more) subsets. The warp must pick one of these subsets for the new active mask, and defer the other(s). It is easiest to think of the deferred invocations as sitting on a “divergence stack,” although the particular implementations may vary a lot between vendors.

What Don’t We Know?

At this point I hope some questions are nagging at you, in particular:

  • When we split the active set at a divergent branch, which invocation(s) get “followed” first, and which deferred?
  • When do the invocations that get deferred get picked up and followed?
  • When do the invocations that have diverged “re-converge”?

Because current GPU programming languages do not specify these behaviors, there is a lot of latitude in how different hardware can answer these questions. I can’t easily comment on what particular vendors do (except where information is public), but I can try to illuminate the breadth of the design space.

A particular hardware platform might support structured control-flow instructions (akin to the control flow operations in the D3D shader bytecode), or unstructured branches (c.f. CUDA’s PTX), or both. Things that might seem obvious in the high-level language code can sometimes be subtle when you are looking at a low-level ISA.

Which Path Gets Followed First?

In general, it shouldn’t matter which path gets followed first at a divergent branch, although my per-pixel mutex example clearly depends on the invocation(s) that obtained their lock being followed before those that failed to obtain theirs. There are, however, some cases where an implementer might clearly want to follow one option over another.

If we have a one-sided if:

if (someVaryingCondition)
    // inside
// after

it seems obvious that we should follow the invocations that go inside the if before those that are just going to skip over it. Of course, remember that “obvious” can be a dangerous word.

Similarly, for the back-edge on a dowhile loop:

    // inside
while (someVaryingCondition);

it seems obvious to follow the invocations that continue looping first.

Lets look at some possible strategies:

Follow the “branch taken” option first

For a simple conditional branch in the hardware ISA, we could prefer to follow the invocations that take the branch over those that continue normaly. For the loop example above, this does the obvious thing, but not for the if example.

Follow the “not taken” option first

This option would do the expected thing for the if, but go against intuition for the loop. It might also be better for instruction caching, since the instructions immediately after the branch are likely to be in the cache already.

Follow “taken” first for back edges, “not taken” for forward edges

We can switch between the two strategies above based on whether the destination of the branch is before or after the branch instruction itself. This gives the expected “obvious” behavior for both of the examples above (assuming the compiler lays out basic blocks in the same order as the high-level code).

Follow whichever path has fewer active invocations

This may sound like an odd design decision, but it is precisely what is described in AMD’s Southern Islands (GCN) ISA documentation.

If we always follow the path that has fewer active invocations, then we know that every divergent branch reduces the number of active invocations by at least half. As a result, the maximum depth of our divergence stack is logarithmic, rather than linear, in the size of the warp (6 rather than 64 entries, in the case of GCN).

Multi-way (switch) and indirect branches (function pointers) add more complexity, but we already have several viable designs alternatives given only two-way branching.

Where/when do we re-converge?

After a divergent branch, a warp is executing with reduced SIMD efficiency: every instruction only applies to a subset of the invocations. In order to reclaim that efficiency, we would like to re-converge and continue executing with the original active mask.

This isn’t just a performance concern, though. There are various operations that communicate or synchronize between the invocations in a warp: finite-difference derivatives in fragment shaders, synchronization barriers in compute languages, and horizontal SIMD reduction operations (e.g., CUDA’s warp-granularity “voting” operations). These operations typically require control-flow reconvergence for their semantics.

In practice, the “re-convergence point” for any potentially-divergent branch must be selected at compile time by the driver’s low-level compiler. This ensures that when the micro-architecture encounters a divergent branch at runtime, it already knows where it can expect to re-converge, and can arrange any internal data structures accordingly.

Re-converge at the end of an if or loop

We as programmers tend to assume that after a divergent branch, the invocations re-converge as soon as possible: typically at the end of the associated high-level control-flow construct. Note, though, that current GPU programming systems do not give this as a guarantee, no matter how intuitive it might seem.

The OpenGL Shading Language spec defines a notion of uniform and non-uniform control flow, and specifies that uniform control flow resumes at the end of a high-level control-flow construct. This spec language is meant to capture the essence of divergence and re-convergence, but it should be noted that it is only used to define when certain operations (e.g., partial-difference derivatives in a fragment shader) are valid. No stronger guarantees are made.

When using a language with unstructured control flow (goto in OpenCL or CUDA; PTX), high-level control-flow structures may not be discernible. In these cases, re-convergence rules must be defined on an arbitrary control-flow graph (CFG).

Re-converge at the immediate post-dominator

A simple approach that can be applied to an arbitrary CFG (and is thus applicable to assembly-level interfaces, or in the presence of goto) is to identify the immediate post-dominator of the basic block that has the branch.

Aside: For those unfamiliar with the term, it is simplest to think of the immediate post-dominator as the earliest point in the code that control flow must eventually pass through. For a precise definition, please consult any compiler textbook, or Wikipedia.

On paper this sounds like a good approach, but fails to match our intuitive expectations for several cases of structured control flow:

while( X )
    if( A )
        if( Y )

In this case, we expect any invocations that diverged at the branch on A, but did not subsequently break out of the loop, to re-converge before executing B. However, this isn’t guaranteed by immediate-post-dominator reconvergence, since the immediate post dominator of the A branch is at the end of the while loop, and not at the end of the if.

The same basic problem arises for switch statements with fall-through:

switch( A )
case 0:
    // fall through
case 1:

Here again, an immediate post-dominator rule tells us to re-converge after the whole switch, and does not tell us to re-converge the case 0 invocations with the case 1 invocations before executing B.

As a reminder, re-convergence isn’t just about performance. If B in these examples included a Compute Shader barrier operation, then failure to re-converge at the right place could cause a deadlock.

Re-converge at “thread frontiers”

This paper describes a refined approach to re-convergence that addresses some of the shortcomings of immediate post-dominator reconvergence. They define “thread frontiers” that capture the intuitive notion of re-converging as soon as possible after a branch. Achieving thread-frontier reconvergence can require adding new conservative control flow edges (forcing the warp to visit basic blocks it would otherwise skip). The paper also describes new hardware requirements to implement this scheme efficiently.

Maximal convergence

The ISPC system, which implements a GPU-style programming model on SIMD CPU cores, defines a notion of “maximal convergence.” Intuitively, this can be though of as re-convergence at each and every sequence point as defined by the C language. If operation X happens before Y from the point of view of one invocation, then X for all invocations appears to happen before Y for any invocation in the warp.

This very strict model of convergence is often what people think of when they talk about “warp-synchronous programming,” or about invocations in a warp executing in “lock-step.” This kind of semantic guarantee can be beneficial to users (it would allow us to write a working per-pixel mutex), but it comes at the cost of flexibility for compiler optimizations. For example, if an implementation wants to emulate 16-wide warps on 8 wide SIMD hardware by doubling up math instructions, it is very restricted in how it can schedule those independent 8-wide operations.

Other than ISPC, I am not aware of any GPU-like programming models that give this strong guarantee. CUDA documentation used to give general advice for programmers attempting to use warp-synchronous approaches, but more recent documentation seems to advise against the practice.

When Are Deferred Paths Followed?

When control flow for a warp arrives at a re-convergence point, it needs to run any invocations for the “other” path(s) that had been deferred before it continues execution past the re-convergence point. The simplest way this might be achieved is by popping those deferred invocations off of the “divergence stack” (or alternative hardware mechanism).

This is one case where it seems like there really is an obvious Right Thing to do, and I’m not aware of a lot of design alternatives. Of course, if you don’t know which path is followed first, or where re-convergence points will be placed, there still isn’t much you can do.


If you find yourself trying to write “clever” GPU code that depends on order of execution or memory operations between parallel invocations within a single warp, please be aware that you are stepping into a minefield of implementation-specific behavior. For those of you targeting fixed hardware platforms, you might be able to find out what your GPU is doing under the hood and exploit this knowledge.

For those of you working on more open platforms, I would invite you to carefully read through the available API and language documentation. As a general rule, if the API/language spec doesn’t explicitly guarantee the behavior you are hoping for, you can bet that some hardware out there violates your assumptions.

I hope I’ve given you a sense of just how deep this semantic rabbit-hole goes.

Are Explicit Location Bindings a Good Idea for a Shading Language?

Probably Not.


Both the HLSL and GLSL shading languages support mechanisms for assigning explicit, programmer-selected locations to certain shader parameters. In HLSL, this is done with the regsiter keyword:

// texture/buffer/resource:
Texture2D someTexture          : register(t0); 

// sampler state:
SamplerState linearSampler     : register(s0);

// unordered access view (UAV):
RWStructuredBuffer<T> someData : register(u0);

// constant buffer:
cbuffer PerFrame               : register(b0)    
    // offset for values in buffer:
    float4x4 view              : packoffset(c0);
    float4x4 proj              : packoffset(c4);
    // ...

When setting shader parameters through the Direct3D API, these explicit locations tell us where to bind data for each parameter. For example, since the cbuffer PerFrame is bound to register b0, we will associate data with it by binding an ID3D11Buffer* to constant buffer slot zero (with, say, PSSetConstantBuffers).

The OpenGL Shading Language did not initially support such mechanisms, but subsequent API revisions have added more and more uses of the layout keyword:

// texture/buffer/resource
layout(binding = 0) sampler2D someTexture;

// shader storage buffer (SSB)
layout(binding = 0) buffer T someData[];

// uniform buffer:
layout(binding = 0) PerFrame
    mat4 view;
    mat4 proj;
    // ...

// input and output attributes:
layout(location = 2) in  vec3 normal;
layout(location = 0) out vec4 color;

// default block uniforms (not backed by buffer)
layout(location = 0) uniform mat4 model;
layout(location = 1) uniform mat4 modelInvTranspose;

It is clear that location binding was an afterthought in the design of both languages; the syntax is ugly and obtrusive. Using explicit locations can also be error-prone, since it becomes the programmer’s responsibility to avoid conflicts, etc., and ensure a match between application and shader code. The shading languages “want” you to write your code without any explicit locations.

If you talk to game graphics programmers, though, you will find that they use explicit locations almost exclusively. If you try to give them a shading language without this feature (as GLSL did), they will keep demanding that you add it until you relent (as GLSL did).

Why Do We Need These Things?

If the programmer does not assign explicit locations, then it is up to the shader compiler to do so. Unfortunately, there is no particular scheme that the compiler is required to implement, and in particular:

  • The locations assigned to parameters might not reflect their declared order.
  • A parameter might not be assigned a location at all (if it is statically unreferenced in the shader code).
  • Two different GL implementations might (indeed, will) assign locations differently.
  • A single implementation might assign locations differently for two shaders that share parameters in common.

When an application relies on the shader compiler to assign locations, it must then query the resulting assignment through a “reflection” interface before it can go on to bind shader parameters. What used to be a call like PSSetConstantBuffers(0, ...) must now be somethin glike PSSetConstantBuffers(queriedLocations[0], ...). In the case of Direct3D, these locations can be queried once a shader is compiled to bytecode, after which the relevant meta-data can be stripped, and the overhead of reflection can be avoided at runtime; this is not an option in OpenGL.

Even statically querying the compiler-assigned locations does not help us with the issue that two different shaders with identical (or near-identical) parameter lists may end up with completely different location assignments. This makes it impossible to bind “long-lived” parameters (e.g., per-frame or camera-related uniforms) once and re-use that state across many draw calls. Every time we change shaders, we would need to re-bind everything since the locations might have changed. In the context of OpenGL, this issue means that linkage between separately-compiled vertex and fragment requires an exact signature match (no attributes dropped), unless explicit locations are used.

As it stands today, you can have clean shader code at the cost of messy application logic (and the loss of some useful mix-and-match functionality), or you can have clean application logic at the cost of uglier shader code.

A Brief Digression

On the face of it, the whole situation is a bit silly. When I declare a function in C, I don’t have to specify explicit “locations” for the parameters lest the compiler reorder them behind my back (and eliminate those I’m not using):

int SomeFunction(
    layout(location = 0) float x,
    layout(location = 1) float A[] );

When I declare a struct, I don’t have to declare the byte offset for each field (or, again, worry about unused fields being optimized away):

struct T {
    layout(offset = 0) int32_t x;
    layout(offset = 4) float y;

In practice, most C compilers provide fairly strong guarantees about struct layout, and conform to a platform ABI which guarantees the calling convention for functions, even across binaries generated with different compilers. A high level of interoperability can be achieved, all without the onerous busy-work of assigning locations/offsets manually.

Guarantees, and the Lack Thereof

Why don’t shader compilers provide similar guarantees? For example, why not just assign locations to shader parameters in a well-defined manner based on lexical order: the first texture gets location #0, the next gets #1, and so on? After all, what makes the parameters of a shader any different from the parameters of a C function?

(In fact, the Direct3D system already follows just such an approach for the matching of input and output attributes across stage boundaries. The attributes declared in a shader entry point are assigned locations in the input/output signature in a well-defined fashion based on order of declaration, and unused attributes aren’t skipped.)

Historically, the rationale for not providing guarantees about layout assignment was so that shader compilers could optimize away unreferenced parameters. By assigning locations only to those textures or constants that are actually used, it might be possible to compile shaders that would otherwise fail due to resource limits. In the case of GLSL, different implementations might perform different optimizations, and thus some might do a better job of eliminating parameters than others; the final number of parameters is thus implementation-specific.

This historical rationale breaks down for two reasons. First is the simple fact that on modern graphics hardware, the limits are much harder to reach. The Direct3D 10/11 limits of 128 resources, 16 samplers, and 15 constant buffers is more than enough for most shaders (the limit of only 8 UAVs is a bit more restrictive). Second, and more important, is that if a programmer really cares about staying within certain resource bounds, they will carefully declare only the parameters they intend to use rather than count on implementation-specific optimizations in driver compilers to get them under the limits (at which point they could just as easily use explicit locations).

One wrinkle is that common practice in HLSL is to define several shaders in the same file, and to declare uniform and resource parameters at the global scope. This practice increases the apparent benefit of optimizing away unreferenced parameters. The underlying problem, though, is that the language design forces programmers to use global variables for what are, logically, function parameters. Trying to “fix” this design decision by optimizing away unused parameters is treating the symptoms rather than the disease.

As far as I can tell, there is no particularly compelling reason why a modern shading language should not just assign locations to parameters in a straightforward and deterministic manner. We need an ABI and calling convention for interface between the application and shader code, not a black box.

So Then What About Explicit Locations?

If our compiler assigned locations in a deterministic fashion, would there still be a need for explicit location bindings? Deterministic assignment would serve the same purpose in eliminating the need for a “reflection” API to query parameter bindings (though one could, of course, still be provided).

The remaining benefit of explicit locations is that they allow for us to make the parameter signatures of different shaders “match,” as described above. In the simplest case, a deterministic assignment strategy can ensure that if two shaders share some initial subsequence of parameters in common, then they assign matching locations to those parameters. In the more complex cases (where we want a “gap” in the parameter matching), it seems like the answer is right there in C already: unions allow us to create heterogeneous layouts just fine.

All we need to do, then, is provide a shading language the same kinds of tools that C has for describing layouts (structs and unions), and we should be able to rid ourselves of all these fiddly layout and register declarations.

So What Now?

In the long run, this whole issue is probably moot, as “bindless” resource mechanisms start to supplant the very idea of binding locations as a concept in shading languages and graphics APIs.

In the short term, though, I hope that we can get support for deterministic and predictable layout assignment in near-future versions of HLSL and GLSL. This would allow us to write cleaner and simpler shader declarations, without having to compromise the simplicity of our C/C++ application logic.

Spark is on GitHub

The source code for the Spark shading language is now hosted on GitHub, under a commercial-friendly open-source license:


There is also a binary/source package available for download, and for those who just want more details on the Spark language and Direct3D 11 interface, you can look at the User’s Guide. If you missed the SIGGRAPH paper, though, it is probably the best starting point.

If you’d like to give Spark a try – either as part of a research project, or to get ideas for your next in-house shader system – contact me. I need to gather feedback on Spark from users, as part of my continuing research at Stanford, and I’d be happy to provide support to anybody who uses Spark.

Even if you just want to talk about how to integrate some of the more down-to-earth ideas from Spark into your in-house shader system, I’m game. I’ve already gotten feedback from one game developer, who rearchitected their shader-stiching system based on ideas from Spark, and was very happy with the result.

Emulating Classic Games in HD – Part 2

Last time, we learned the basics of the Game Boy hardware, and tried a simple strategy for rendering games with high-res graphics. Sometimes this simple approach worked well:

but other times it didn’t work at all:

In this part we’ll look at the timing of the Game Boy display hardware, and see how this interferes with our simple scheme.

Horizontal Blank

Computing the appropriate color for a pixel, given the background and sprites, takes time. As such, the graphics hardware does not compute colors for the entire screen simultaneously, but instead render pixels one at a time via raster scanning.

If we imagine how an old-fashioned CRT display scans out an image, there is an “electron gun” that scans each row of pixels from left to right, and proceeds down the screen through the rows. While the GameBoy uses an LCD screen rather than a CRT, it employs a similar notion of how the screen is scanned. Pixels are scanned from left to right in each horizontal line, and lines are scanned from the top down.

After the each line has been scanned, it takes time for the “electron gun” to move back to the start of the next line; this time is called the horizontal blank. At the end of the last line, it takes an even longer time to scan back to the start of the first line; this time is called the vertical blank.

During the two blanking periods, no rendering takes place. Because the rendering hardware is idle during this time, it is possible for software to safely change the values of hardware registers and affect how the next line will be drawn. For example, in Super Mario Land, the level scrolls from right to left as the player progresses, but the status/score display does not. In order to achieve this effect, the game changes the horizontal background-scroll register at the end of line 15.

A Less Simple Renderer

Now it should be clear why our simple rendering approach has problems: it only looks at the values of the hardware registers at the end of a frame (vertical blank), but those registers might have changed on each and every scan line (horizontal blank). Once we realize this, though, it is really quite simple to work around

First, we need to “snapshot” the state of the graphics registers at the end of each scan line. A clever implementation might only take a new snapshot when one or more registers actually change, or might store compact deltas, but my implementation takes a pretty naive snapshot at the end of each line.

Next, instead of rendering a textured quad for every tile, we can simply render a one-scan-line tall textured quad for every tile for every scan-line. On each line we render the tile using the state snapshot for that line.

This approach is somewhat wasteful of CPU and GPU time, but it rectifies the problems we saw before:

Try It Out!

I’ve gone ahead and thrown the code (such as it is) up on GitHub, just to prove I have nothing up my sleeves. You can also download a pre-built binary of the emulator for Mac (I haven’t had time to work on a Windows build yet, sorry).

Keep in mind that this is only a proof of concept; the controls are hard-wired, there is no support for save states or save slots, and it only supports a handful of Game Boy games.

Next Steps

My goal here was to show that this kind of graphics replacement was possible in theory, and I hope I’ve achieved that. I don’t have the kind of time it would take to turn this project into a real, usable emulator, so if you want to fork the code and run with it, feel free.

Scaling this approach up to a more powerful console would bring some interesting challenges, so I’ll throw out a few notes for anybody who might be interested in trying. Readers should feel free to bail out now if none of this makes sense:

  • I dealt with palettes in the easiest possible fashion: I use a 4-channel image to store weights for each of the colors in a 4-color palette (effectively treating the palette as a basis). For 16-color palettes this would involve 16-channel images, which is still pretty tractable. Beyond that, you’ll probably need a new technique.
  • Because the GameBoy is pure grayscale, it was easy to “project” arbitrary grayscale images into this basis by expressing them as a weighted sum of colors from the palette. For more complex palettes this runs the risk of being over- or under-constrained, so you’d likely want a tool to directly edit replacement images in a paletted fashion.
  • More accurate emulation is often required for more interesting consoles. In order to handle mid-scan-line effects on things like the NES, you might need to draw the replacement graphics on a per-pixel basis, rather than the per-line approach I use. In that case, it would help to detect when the graphics registers change, and only do the more expensive rendering when/where required.
  • I’ve ignored relative sprite priorities in my emulator, but I know that subtle priority details often have big effects (e.g., the clipping for item blocks in Mario 3). The key to getting priority right while still using replacement graphics is to cast the pixel-compositing operations usually applied into something like the Porter-Duff algebra, which will naturally scale to images and alpha transparency.
  • Many SNES games use HDMA to make rounded shapes (e.g., the Power Bomb in Super Metroid) or wavy distortion effects (e.g., Phantoon). If these kind of scanline effects get scaled up, they will become blocky. Ideally, an emulator should be able to interpolate between the state of registers at different scanlines to smooth out these shapes. The challenge then is figuring out when not to interpolate.

Emulating Classic 2D Games in HD – Part 1

Many emulators for 3D consoles support rendering games at higher resolutions than supported by the original hardware. Some emulators for older 3D consoles even support replacing the texture maps used in games with higher-resolution versions. Emulators for 2D consoles, such as the venerable Super NES, have no such functionality. Frames are rendered at the original native resolution, and then may be upsampled to higher resolutions using a handful of simple filters.

Why can’t we play classic 2D games in HD? Couldn’t we take the same concept of replacing low-resolution texture maps and apply it to inserting high-resolution sprites or background tiles into 2D games?

I’ve seen comments from emulator creators that this just isn’t possible, but I hadn’t really seen a solid technical explanation. So I decided to find out the answer for myself. Just as a little teaser, though, I’ll say that I was pleasantly surprised by what I could achieve (my own poorly-drawn programmer art aside):

Super Mario Land, now at 4 times the resolution!

The Game Boy

In order to understand both the opportunities and challenges, we need to cover some background on sprite-based 2D graphics hardware. To keep things simple, both for my implementation and for this article, I’ve chosen to focus on the original Nintendo Game Boy.

The Game Boy has a 160×144 pixel LCD screen, and uses 4 grayscale colors (2-bit color). Rather than store the image to be displayed in a frame buffer, the hardware renders the image by compositing three kinds of elements: the background, sprites, and window. For simplicity, I’ll only talk about the first two.

Background Tile Map

The background is a 32×32 grid of tiles, where each tile is an 8×8 pixel image. Thus the background defines a conceptual 256×256 pixel image.. We can represent the background as a 32×32 array of tile indices (the tile map) along with an array of the corresponding tile images.

uint8_t tileMap[32][32];
TileImage tileImages[256];

Since the background is larger than the screen, two 8-bit hardware registers are used to scroll the background and select what part will be visible. As we scroll past the edge of the background, it repeats. A game can replace tile indices that are out of view to implement a continuously-scrolling map.


The sprites are defined by a structure called the Object Attribute Map (OAM). The Game Boy supports 40 sprites, and each sprite has an X and Y position, flags for horizontal and vertical flip, and the index of the tile image to use:

struct Sprite {
    uint8_t x;
    uint8_t y;
    bool flipX;
    bool flipY;
    uint8_t tileIndex;
Sprite oam[40];


Both the background and sprite use tiles to define their image data. Each tile is a 2-bit-per-pixel paletted image (yes, even though the Game Boy only supports 2-bit color, it uses paletted images). An 8×8 pixel tile takes up 8x8x2=128 bits (16 bytes). There are distinct color palettes for sprites and the background:

Color bgPalette[4];
Color spritePalette[4];

In the sprite palette, color index zero is special: it indicates a transparent pixel.


There are, of course, a lot more details to the Game Boy rendering hardware, and other 2D platforms will of course have different designs. Still, these basic concepts – tile maps, sprites, and paletted tile images – are shared by many different 2D consoles.

Tile Replacement

In order to render a Game Boy game at higher resolution, we must find a way to map the existing 8×8 tiles to high-res replacements. A simple approach would be to hash the contents of the tile image, and use the resulting hash as a key for a table of replacements. In the case of the Game Boy, tile images are only 16 bytes, so we can just use the tile data itself as our hash key. Of course we also need the replacement images; for now I’ll be using my own programmer art:

On the left we have an original sprite from Super Mario Land, and on the right we have my replacement. The use of paletted images complicates this process somewhat, but let’s ignore those details for right now.

A Simple Scheme

At first blush it seems like we could accomplish our goal with a relatively simple renderer. Whenever a frame needs to be drawn we can:

  • Iterate over the background tile map, and draw one textured quad per background tile.
  • Iterate over the sprites, and draw one textured quad per sprite.

We need a cache that maps tile images to texture maps, but that can easily be folded into our tile-replacement step.

Let’s try this scheme out on a few games (note, I’m only replacing a subset of tiles right now – even programmer art takes time to create):

Link's Awakening

Those results seem good (my terrible art skills aside). Suspiciously good, in fact, if I’m claiming that this problem merits a whole article. And yes, if we look a little harder we can find examples where the results of our simple renderer are not as satisfying:

Super Mario Land again, not looking as good

Castlevania II looks like a complete mess

What is going on in these images that makes them fail? In Part 2 of this series, we will discuss the timing of sprite hardware, and see how to resolve these graphical issues.

Metroid: Other M

When I find that my feelings towards a game are ambiguous, I like to look at its design decisions (whether gameplay, technology, art direction, or narrative) and categorize them into “what worked” and “what didn’t work.” In the case of Other M, though, I found that my “what worked” pile was pretty thin.

Other M is a competently made game (though not, obviously, without bugs). As with almost any Nintendo-published game, it has that layer of polish and an overall cohesiveness that make it stand out from the pack. But these are not enough to make a game good – competence and cohesiveness should be the least we demand from our games.

The best thing I can say about Other M is that it is very aware that it is a Metroid game. It’s overall structure (both in narrative and world design) is an obvious homage to Metroid Fusion. Samus’ powers are precisely those she acquired in Super Metroid. The morph-ball sequences show a clear influence from the Prime games. The best enemy designs are those lifted from prior games, and the best boss fights are effectively cameos. Other M even acknowledges some of the more silly aspects of the series mythology (e.g., the “evolved” forms of Metroids) which had been quietly ignored by the Prime games (over the course of which series the story has increasingly been sanitized into generic sci-fi pablum).

Put that kind of pastiche before a dedicated Metroid fan, and they might play through the whole thing on auto-pilot. Samus is a fun character to control; her full suite of abilities allow most situations to be tackled in a variety of ways, and so mastering her moveset lends a sense of fluency to interactions that most characters lack.

Once Other M steps out of these narrow confines of emulation and homage, though, things get rockier. The modifications and extensions to the Metroid formula all serve to compromise the fluency of the interface between the player and Samus. Perhaps this was intentional – calling attention to and disrupting the interface between the player and Samus as avatar to serve the narrative goals of portraying Samus as Other.

I could almost believe that if the storytelling on display wasn’t so painfully clumsy and ham-fisted at every term.

I’m definitely in the crowd that doesn’t feel that the Samus we see in Other M is the same woman we saw in the other games. It isn’t that the other games were laden with characterization, but what was there (even in the story-heavy Fusion) is almost entirely contradictory with Other M.

My wife, for her part, sees the portrayal of Samus in Other M as part of an overall downward trend in Nintendo’s handling of the character, that started with the introduction of the “Zero Suit.” Samus has always stood out as gaming’s first, and for a long time, best, female role-model character; now she is portrayed in-canon as walking around in a skin-tight leotard and desperately seeking male approval. One can’t really expect female gamers to respond any better to this than toward Aya Brea’s revamped in-game portrayal.

(I’ll leave aside the issue of the Justin Bailey code and the uncomfortable strip-tease that has become a hallmark of Metroid endings. Maybe objectification of Samus has been with us from the start, but it was easier to overlook when it was restricted to low-fi graphics in an “easter egg” or special ending.)

I’ve played enough games with embarrassing or poorly translated stories in my time that I might have been able to give Other M a pass if the gameplay had held up. Unofortunately, once you get out of the “auto-pilot” situation I describe above and look at things carefully, it doesn’t hold up.

Moment to moment, playing Other M gives you the sense that a really great 2.5D Metroidvania could be built on this engine. The same could be said of Chair’s Shadow Complex. The problem for both games is that the world map and player progression just don’t live up to the promise of the moment-to-moment gameplay.

Like Metroid Fusion, Other M actively thwarts attempts by the player to explore out of the prescribed sequence. Arbitrary locked doors that only open on storyline triggers (the so-called “broken bridge” trope) are a distinctly un-Metroid concept. While Fusion also succumbed to this problem, it also used the design trick of letting the player think they were going “off the rails” even when they were following the designated path (an approach used to great effect in Portal). In contrast, Other M never really tries to give the player this sense of subversion (with one late-game exception: using the power bomb to break out of a stalled elevator), and as a result the player is left with the impression that the designers are leading them on a short leash from one expository cut-scene to another.

Other M has innumerable other gameplay foibles that further limit the potential for enjoyment:

  • The first-person view is tedious and clumsy to use in battle (leading to many cheap hits from enemies while waiting for the Wii remote sensor to register a perspective change), and leads to the most frustrating and horribly un-Metroid gameplay bottlenecks with the mandatory pixel-hunting sequences. The battle sequences that played out in first-person were a frustrating exercise as the player had no way to avoid damage but to expertly use the clumsy controls. That these unnecessary and unusable first-person elements were used to advertise the game in a world where Metroid Prime had already showed just how well a first-person Metroid could play was unbelievable.
  • The fact that most of the hidden items can only be acquired in the late- or post-game as part of a fetch-quest “cleanup” pass is irritating for fans of the classic Metroid formula (although Zero Mission commits this same sin). Putting the last few power-ups behind power-bomb-able panels that trigger pointless button-mashing QTE-laden fights was also a kick in the teeth.
  • For that matter, all of Other M‘s combat “innovations” were pretty lackluster. The idea of giving Samus more melee options might have sounded good on paper, but the implementation basically boiled down to QTE finishers plus a “mash me to not die” button. An even more egregious sin was that the game would sometimes expect QTE input during what looked to the player like a cutscene (without any on-screen prompt), so that a player would never know which threats Samus faced in a cut-scene needed to be dodged, and which were just scripted choreography.
  • This final point leads to the most important realization that led me to conclude that Other M is not a good game (whether interpreted as a Metroid game or not): Other M sets up a contract with the player and then violates its own rules. I’ll get to how in a second, but let’s talk about the “authorization” system:

    The presentation of authorization is patently ridiculous (which negates its purported benefit of “finally” explaining how Samus loses all her powers between games). It also breaks the important link between exploration and discovery that is the core of the Metroidvania genre. Other M doesn’t reward exploration with new abilities for Samus – at best it awards her with a few more missiles, a longer health bar, or shorter charge times – and so exploration is no longer the driving force behind the player’s progression. Instead, abilities are tied to story progression, and so the player’s first priority is (nominally) to advance the story. Unfortunately, the way that new powers are almost always given out right after they would be useful leads players to feel resentful when a new ability is authorized, rather than excited to try out the new tool.

    This is the opposite of how a traditional Metroidvania works. Usually you’d notice ledges you can’t reach or walls you can’t break through, and upon getting the item that would render those impediments surmountable you’d be excited to go back and see where they lead. In Other M you are just annoyed at Adam; his stinginess has forced you to backtrack again.

    For all its faults, though, the authorization system sets up one important and immutable rule – a contract with the player: a new ability can only be used when the requisite “authorization” cut-scene has played, hilighting the new item in Samus’ inventory. As a player you can count on the fact that any puzzle and any battle that you are expected to tackle along the main story-line can be bested with just those abilities that have been explicitly enabled.

    That is, until the very end of the game, where you are suddenly required to make use of Power Bombs, without any prompting, within a limited timeframe, or else Samus will die. Before this point you have probably already seen Samus die rather than use her Varia suit, so why are players expected to know that Samus should now use her Power Bombs without permission rather than die a pointlessly noble death? Once players do figure out this ridiculous “press X to not die” QTE they are rewarded with a post-facto authorization cut scene, but the damage has already been done: the game has violated its contract with the player.

    And for what?

    This scene was probably meant to be a cathartic one for the player: the culmination of the marriage of story and mechanics that the “authorization” system had represented. Unfortunately, the game designers crafted an ending that relied on players feeling a particular way: they wanted players to think that they had nothing to lose and could thus finally cut loose and use the most destructive weapon in their arsenal without concern – Samus finally acting like the independent and resourceful bounty hunter we had always thought her to be.

    Speaking for myself, though, this wasn’t how the game made me feel. I had spent hours of time with the game, in which Other M struck down my every attempt to break out of the strictures of its narrow world design, and had listened to its plodding narrative tell me that Samus was incapable or unwilling to act on her own. Other M had left me feeling powerless and without direction – it had turned me (and Samus) into the kind of person who would sit patiently in the stomach of a monster, waiting for death, not even thinking to use the one weapon in my arsenal that might have saved my life, because the arbitrary fiat of my commanding officer (or rather, of the game designer, finally revealed once the mask of Adam had been cast aside) had not explicitly sanctioned it.

    So there’s another “positive” thing to say about Other M: it would make a good tool for inculcation of recruits into a regimented military organization or totalitarian regime in which independent thought must be supressed, even to the point of death.

    It’s called “Stereo,” not “3D”

    When speaking or writing about the entertainment industry’s current gimmick du jour, please refer to it correctly as “stereoscopic 3D” or “stereo 3D.” If there is no chance of confusion with stereo audio (and honestly, how often do we talk about stereo audio any more?) you may simply use “stereo.”

    What you may not, however, call it is “3D.”

    The term “3D” can, obviously, refer to anything three-dimensional. In the case of films, TV, and video games, it has usually refered to “3D graphics”; the rendering of three-dimensional objects and scenes as viewed through a virtual camera. We’ve had numerous successful 3D animated films (with Toy Story being the first feature-length example), and almost every big-budget live-action movie now uses 3D graphics for special effects. Since the advent of the PSX/N64 generation, a majority of video games have used 3D graphics.

    Thus, when I hear otherwise-intelligent people asking whether we will soon be seeing “3D video games,” I cringe. We already have 3D video games – or are games like Half-Life and Mario Galaxy now “2D” simply by virtue of the movie industry’s latest marketing campaign?

    Those of us who understand what words mean have a responsibility to use them correctly. This is the only way we can hope to stem the tide of those who use words they don’t understand, incorrectly.

    Special Addendum for Pedants

    Some readers might want to split hairs and point out that the “3D” graphics I refer to above are always projected onto a 2D screen. It would seem, then, that stereo graphics is “more 3D” by virtue of adding a sense of depth to an image that already has height and width.

    This argument is entertaining (as most pedantic arguments are), but ultimately misses the point … or rather, two points:

    • The “3D” in 3D graphics doesn’t refer to the final projected image, but to the object or scene that is being rendered. The dimensionality of the content and the final projected image don’t need to match. You can have a 2D projection of 3D content, just as you could have a stereo projection of a 2D image.
    • Projected stereo graphics are not, in the mathematical sense, three dimensional. There are merely a sum of two two-dimensional projections. Adding a depth component to a flat image doesn’t change its dimensionality either. Crumple up a piece of paper and it is still a 2D surface.

    In the end, the only method of projection that could reasonably lay claim to the name “3D” would be some sort of omnidirectional holographic projection (and researchers are working on various experimental display technologies of this sort). Still, if such displays ever became commonplace, we’d be better off calling them “holographic.”

    So there you have it, stereo “3D” apologists (if there are any); your nomenclature doesn’t have a leg to stand on – not even a pedantic and wobbly one.